Commit 75ac236

DOC-5282 restructured pipeline/job config info
1 parent c399067 · commit 75ac236

4 files changed: +159 -126 lines changed
Lines changed: 126 additions & 2 deletions
@@ -1,13 +1,15 @@
 ---
 Title: Data pipelines
-aliases: /integrate/redis-data-integration/ingest/data-pipelines/
+aliases:
+- /integrate/redis-data-integration/ingest/data-pipelines/
+- /integrate/redis-data-integration/data-pipelines/data-pipelines/
 alwaysopen: false
 categories:
 - docs
 - integrate
 - rs
 - rdi
-description: Learn how an RDI pipeline can transform source data before writing
+description: Learn how to configure RDI for data capture and transformation.
 group: di
 hideListLinks: false
 linkTitle: Data pipelines
@@ -16,3 +18,125 @@ summary: Redis Data Integration keeps Redis in sync with the primary database in
 type: integration
 weight: 30
 ---
+
+RDI implements
+[change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC)
+with *pipelines*. (See the
+[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
+for an introduction to pipelines.)
+
+## How a pipeline works
+
+An RDI pipeline captures change data records from the source database and transforms them
+into Redis data structures. It writes each of these new structures to a Redis target
+database under its own key.
+
+By default, RDI transforms the source data into
+[hashes]({{< relref "/develop/data-types/hashes" >}}) or
+[JSON objects]({{< relref "/develop/data-types/json" >}}) for the target with a
+standard data mapping and a standard format for the key.
+However, you can also provide your own custom transformation [jobs](#job-files)
+for each source table, using your own data mapping and key pattern. You specify these
+jobs declaratively with YAML configuration files that require no coding.
+
+The data transformation involves two separate stages:
+
+1. The data ingested during CDC is automatically transformed to an intermediate JSON
+change event format.
+1. This JSON change event data gets passed on to your custom transformation for further
+processing.
+
+The diagram below shows the flow of data through the pipeline:
+
+{{< image filename="/images/rdi/ingest/RDIPipeDataflow.webp" >}}
+
+You can provide a job file for each source table for which you want to specify a custom
+transformation. You can also add a *default job file* for any tables that don't have their own.
+You must specify the full name of the source table in the job file (or the special
+name "*" in the default job). You can also include filtering logic to skip data that
+matches a particular condition.
+As part of the transformation, you can specify any of the following data types
+to store the data in Redis:
+
+- [JSON objects]({{< relref "/develop/data-types/json" >}})
+- [Hashes]({{< relref "/develop/data-types/hashes" >}})
+- [Sets]({{< relref "/develop/data-types/sets" >}})
+- [Streams]({{< relref "/develop/data-types/streams" >}})
+- [Sorted sets]({{< relref "/develop/data-types/sorted-sets" >}})
+- [Strings]({{< relref "/develop/data-types/strings" >}})
+
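To make the job files described above concrete, here is a minimal sketch of the kind of
file you might write for a single table. The table name, column name, key pattern, and
filter expression are hypothetical, and the exact schema is described in the
[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
section:

```yaml
# Hypothetical job file (for example, Jobs/employees.yaml). Names and
# expressions are illustrative only; see the Job files section for the schema.
source:
  table: employees             # full name of the source table
transform:
  - uses: filter               # optional filtering logic
    with:
      expression: country = 'US'
      language: sql
output:
  - uses: redis.write
    with:
      data_type: json          # could also be hash, set, stream, sorted_set, or string
      key:
        expression: concat(['emp:', employee_id])
        language: jmespath
```
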
+### Pipeline lifecycle
+
+After you deploy a pipeline, it goes through the following phases:
+
+1. *Deploy* - When you deploy the pipeline, RDI first validates it before use.
+Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
+1. *Snapshot* - The collector starts the pipeline by creating a snapshot of the full
+dataset. This involves reading all the relevant source data, transforming it and then
+writing it into the Redis target. You should expect this phase to take minutes or
+hours to complete if you have a lot of data.
+1. *CDC* - Once the snapshot is complete, the collector starts listening for updates to
+the source data. Whenever a change is committed to the source, the collector captures
+it and adds it to the target through the pipeline. This phase continues indefinitely
+unless you change the pipeline configuration.
+1. *Update* - If you update the pipeline configuration, the operator applies it
+to the collector and the stream processor. Note that the changes only affect newly captured
+data unless you reset the pipeline completely. Once RDI has accepted the updates, the
+pipeline returns to the CDC phase with the new configuration.
+1. *Reset* - There are circumstances where you might want to rebuild the dataset
+completely. For example, you might want to apply a new transformation to all the source
+data or refresh the dataset if RDI is disconnected from the
+source for a long time. In situations like these, you can *reset* the pipeline back
+to the snapshot phase. When this is complete, the pipeline continues with CDC as usual.
+
+## Using a pipeline
+
+Follow the steps described in the sections below to prepare and run an RDI pipeline.
+
+### 1. Prepare the source database
+
+Before using the pipeline, you must first prepare your source database to use
+the Debezium connector for *change data capture (CDC)*. See the
+[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
+for more information about CDC.
+Each database type has a different set of preparation steps. You can
+find the preparation guides for the databases that RDI supports in the
+[Prepare source databases]({{< relref "/integrate/redis-data-integration/data-pipelines/prepare-dbs" >}})
+section.
+
+### 2. Configure the pipeline
+
+RDI uses a set of [YAML](https://en.wikipedia.org/wiki/YAML)
+files to configure each pipeline. The following diagram shows the folder
+structure of the configuration:
+
+{{< image filename="images/rdi/ingest/ingest-config-folders.webp" width="600px" >}}
+
+The main configuration for the pipeline is in the `config.yaml` file.
+This specifies the connection details for the source database (such
+as host, username, and password) and also the queries that RDI will use
+to extract the required data. You should place job configurations in the `Jobs`
+folder if you want to specify your own data transformations.
+
+See
+[Pipeline configuration file]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config" >}})
+for a full description of the `config.yaml` file and some example configurations.
+
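As a rough, hypothetical sketch of that structure, a `config.yaml` file has top-level
`sources` and `targets` sections (the full example on the pipeline configuration page
also shows a `processors` section), along these lines. The hosts, ports, and source type
below are placeholders rather than a working configuration:

```yaml
# Hypothetical config.yaml skeleton -- placeholder values only.
sources:
  mysql:                            # any unique name you choose for the source
    type: cdc
    connection:
      type: mysql
      host: source-db.example.com   # avoid `localhost` here
      port: 3306
      username: ${SOURCE_DB_USERNAME}   # secrets, set as described in "Set secrets"
      password: ${SOURCE_DB_PASSWORD}
targets:
  target:                           # any unique name you choose for the target
    connection:
      type: redis
      host: redis-target.example.com
      port: 12000
      username: ${TARGET_DB_USERNAME}
      password: ${TARGET_DB_PASSWORD}
```
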
+### 3. Create job files (optional)
+
+You can use one or more job files to configure which fields from the source tables
+you want to use, and which data structure you want to write to the target. You
+can also optionally specify a transformation to apply to the data before writing it
+to the target. See the
+[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
+section for full details of the file format and examples of common tasks for job files.
+
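For the *default job* mentioned earlier, the convention described above is to use the
special name "*" instead of a specific table name. A hypothetical sketch, with the same
caveats as before about the exact schema:

```yaml
# Hypothetical default job (for example, Jobs/default.yaml), applied to any
# table that has no job file of its own.
source:
  table: "*"
output:
  - uses: redis.write
    with:
      data_type: hash              # or json, set, stream, sorted_set, string
```
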
+### 4. Deploy the pipeline
+
+When your configuration is ready, you must deploy it to start using the pipeline. See
+[Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}})
+to learn how to do this.
+
+## More information
+
+See the other pages in this section for more information and examples:

content/integrate/redis-data-integration/data-pipelines/deploy.md

Lines changed: 5 additions & 3 deletions
@@ -17,7 +17,7 @@ weight: 10
 ---

 The sections below explain how to deploy a pipeline after you have created the required
-[configuration]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines" >}}).
+[configuration]({{< relref "/integrate/redis-data-integration/data-pipelines" >}}).

 ## Set secrets

@@ -26,7 +26,9 @@ source and target databases. Each secret has a name that you can pass to the
 [`redis-di set-secret`]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-set-secret" >}})
 command (VM deployment) or the `rdi-secret.sh` script (K8s deployment) to set the secret value.
 You can then refer to these secrets in the `config.yaml` file using the syntax "`${SECRET_NAME}`"
-(the sample [config.yaml file]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines#the-configyaml-file" >}}) shows these secrets in use).
+(the sample
+[config.yaml file]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config#example" >}})
+shows these secrets in use).

 The table below lists all valid secret names. Note that the
 username and password are required for the source and target, but the other
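For instance, the "`${SECRET_NAME}`" references appear in `config.yaml` roughly as in
this hypothetical fragment (the secret names shown here are examples; the deploy page's
table of valid secret names is the authoritative list):

```yaml
sources:
  mysql:
    connection:
      username: ${SOURCE_DB_USERNAME}
      password: ${SOURCE_DB_PASSWORD}
```
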
@@ -252,7 +254,7 @@ Note that the certificate paths contained in the secrets `SOURCE_DB_CACERT`, `SO

 ## Deploy a pipeline

-When you have created your configuration, including the [jobs]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines#job-files" >}}), you are
+When you have created your configuration, including the [jobs]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}}), you are
 ready to deploy. Use [Redis Insight]({{< relref "/develop/tools/insight/rdi-connector" >}})
 to configure and deploy pipelines for both VM and K8s installations.

content/integrate/redis-data-integration/data-pipelines/data-pipelines.md renamed to content/integrate/redis-data-integration/data-pipelines/pipeline-config.md

Lines changed: 27 additions & 120 deletions
@@ -1,74 +1,29 @@
 ---
-Title: Configure data pipelines
-linkTitle: Configure
-description: Learn how to configure ingest pipelines for data transformation
-weight: 1
+Title: Pipeline configuration file
 alwaysopen: false
-categories: ["redis-di"]
-aliases: /integrate/redis-data-integration/ingest/data-pipelines/data-pipelines/
+categories:
+- docs
+- integrate
+- rs
+- rdi
+description: Learn how to specify the main configuration details for an RDI pipeline.
+group: di
+linkTitle: Pipeline configuration file
+summary: Redis Data Integration keeps Redis in sync with the primary database in near
+  real time.
+type: integration
+weight: 3
 ---

-RDI implements
-[change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC)
-with *pipelines*. (See the
-[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
-for an introduction to pipelines.)
+The main configuration details for an RDI pipeline are in the `config.yaml` file.
+This file specifies the connection details for the source and target databases,
+and also the set of tables you want to capture. You can also add one or more
+[job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
+if you want to apply custom transformations to the captured data.

-## Overview
+## Example

-An RDI pipeline captures change data records from the source database, and transforms them
-into Redis data structures. It writes each of these new structures to a Redis target
-database under its own key.
-
-By default, RDI transforms the source data into
-[hashes]({{< relref "/develop/data-types/hashes" >}}) or
-[JSON objects]({{< relref "/develop/data-types/json" >}}) for the target with a
-standard data mapping and a standard format for the key.
-However, you can also provide your own custom transformation [jobs](#job-files)
-for each source table, using your own data mapping and key pattern. You specify these
-jobs declaratively with YAML configuration files that require no coding.
-
-The data tranformation involves two separate stages. First, the data ingested
-during CDC is automatically transformed to a JSON format. Then,
-this JSON data gets passed on to your custom transformation for further processing.
-
-You can provide a job file for each source table you want to transform, but you
-can also add a *default job* for any tables that don't have their own.
-You must specify the full name of the source table in the job file (or the special
-name "*" in the default job) and you
-can also include filtering logic to skip data that matches a particular condition.
-As part of the transformation, you can specify whether you want to store the
-data in Redis as
-[JSON objects]({{< relref "/develop/data-types/json" >}}),
-[hashes]({{< relref "/develop/data-types/hashes" >}}),
-[sets]({{< relref "/develop/data-types/sets" >}}),
-[streams]({{< relref "/develop/data-types/streams" >}}),
-[sorted sets]({{< relref "/develop/data-types/sorted-sets" >}}), or
-[strings]({{< relref "/develop/data-types/strings" >}}).
-
-The diagram below shows the flow of data through the pipeline:
-
-{{< image filename="/images/rdi/ingest/RDIPipeDataflow.webp" >}}
-
-## Pipeline configuration
-
-RDI uses a set of [YAML](https://en.wikipedia.org/wiki/YAML)
-files to configure each pipeline. The following diagram shows the folder
-structure of the configuration:
-
-{{< image filename="images/rdi/ingest/ingest-config-folders.webp" width="600px" >}}
-
-The main configuration for the pipeline is in the `config.yaml` file.
-This specifies the connection details for the source database (such
-as host, username, and password) and also the queries that RDI will use
-to extract the required data. You should place job configurations in the `Jobs`
-folder if you want to specify your own data transformations.
-
-The sections below describe the two types of configuration file in more detail.
-
-## The `config.yaml` file
-
-Here is an example of a `config.yaml` file. Note that the values of the
+Below is an example of a `config.yaml` file. Note that the values of the
 form "`${name}`" refer to secrets that you should set as described in
 [Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}}).
 In particular, you should normally use secrets as shown to set the source
@@ -201,18 +156,20 @@ processors:
 # error_handling: dlq
 ```

+## Sections
+
 The main sections of the file configure [`sources`](#sources) and [`targets`](#targets).

 ### Sources

 The `sources` section has a subsection for the source that
 you need to configure. The source section starts with a unique name
-to identify the source (in the example we have a source
+to identify the source (in the example, there is a source
 called `mysql` but you can choose any name you like). The example
 configuration contains the following data:

 - `type`: The type of collector to use for the pipeline.
-Currently, the only types we support are `cdc` and `external`.
+Currently, the only types RDI supports are `cdc` and `external`.
 If the source type is set to `external`, no collector resources will be created by the operator,
 and all other source sections should be empty or not specified at all.
 - `connection`: The connection details for the source database: `type`, `host`, `port`,
@@ -255,8 +212,8 @@ configuration contains the following data:

 Use this section to provide the connection details for the target Redis
 database(s). As with the sources, you should start each target section
-with a unique name that you are free to choose (here, we have used
-`target` as an example). In the `connection` section, you can specify the
+with a unique name that you are free to choose (here, the example uses the name
+`target`). In the `connection` section, you can specify the
 `type` of the target database, which must be `redis`, along with
 connection details such as `host`, `port`, and credentials (`username` and `password`).
 If you use [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security)/
@@ -269,7 +226,7 @@ that you should set as described in [Set secrets]({{< relref "/integrate/redis-d

 {{< note >}}If you specify `localhost` as the address of either the source or target server during
 installation then the connection will fail if the actual IP address changes for the local
-VM. For this reason, we recommend that you don't use `localhost` for the address. However,
+VM. For this reason, it is recommended that you don't use `localhost` for the address. However,
 if you do encounter this problem, you can fix it using the following commands on the VM
 that is running RDI itself:

@@ -278,53 +235,3 @@ sudo k3s kubectl delete nodes --all
 sudo service k3s restart
 ```
 {{< /note >}}
-
-## Job files
-
-You can use one or more job files to configure which fields from the source tables
-you want to use, and which data structure you want to write to the target. You
-can also optionally specify a transformation to apply to the data before writing it
-to the target. See the
-[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
-section for full details of the file format and examples of common tasks for job files.
-
-## Source preparation
-
-Before using the pipeline you must first prepare your source database to use
-the Debezium connector for *change data capture (CDC)*. See the
-[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
-for more information about CDC.
-Each database type has a different set of preparation steps. You can
-find the preparation guides for the databases that RDI supports in the
-[Prepare source databases]({{< relref "/integrate/redis-data-integration/data-pipelines/prepare-dbs" >}})
-section.
-
-## Deploy a pipeline
-
-When your configuration is ready, you must deploy it to start using the pipeline. See
-[Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}})
-to learn how to do this.
-
-## Pipeline lifecycle
-
-A pipeline goes through the following phases:
-
-1. *Deploy* - when you deploy the pipeline, RDI first validates it before use.
-Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
-1. *Snapshot* - The collector starts the pipeline by creating a snapshot of the full
-dataset. This involves reading all the relevant source data, transforming it and then
-writing it into the Redis target. You should expect this phase to take minutes or
-hours to complete if you have a lot of data.
-1. *CDC* - Once the snapshot is complete, the collector starts listening for updates to
-the source data. Whenever a change is committed to the source, the collector captures
-it and adds it to the target through the pipeline. This phase continues indefinitely
-unless you change the pipeline configuration.
-1. *Update* - If you update the pipeline configuration, the operator applies it
-to the collector and the stream processor. Note that the changes only affect newly-captured
-data unless you reset the pipeline completely. Once RDI has accepted the updates, the
-pipeline returns to the CDC phase with the new configuration.
-1. *Reset* - There are circumstances where you might want to rebuild the dataset
-completely. For example, you might want to apply a new transformation to all the source
-data or refresh the dataset if RDI is disconnected from the
-source for a long time. In situations like these, you can *reset* the pipeline back
-to the snapshot phase. When this is complete, the pipeline continues with CDC as usual.

content/integrate/redis-data-integration/data-pipelines/prepare-dbs/_index.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ linkTitle: Prepare source databases
 summary: Redis Data Integration keeps Redis in sync with the primary database in near
   real time.
 type: integration
-weight: 30
+weight: 1
 ---

 Each database uses a different mechanism to track changes to its data and
