
Commit c3df7c3

Merge pull request #1683 from redis/DOC-5282-rdi-restructure-job-details
DOC-5282 RDI: restructure config/job file details
2 parents 4aa8bd4 + 342159c commit c3df7c3

File tree: 12 files changed (+547, −512 lines)
Lines changed: 127 additions & 2 deletions
@@ -1,13 +1,15 @@
---
Title: Data pipelines
-aliases: /integrate/redis-data-integration/ingest/data-pipelines/
+aliases:
+- /integrate/redis-data-integration/ingest/data-pipelines/
+- /integrate/redis-data-integration/data-pipelines/data-pipelines/
alwaysopen: false
categories:
- docs
- integrate
- rs
- rdi
-description: Learn how an RDI pipeline can transform source data before writing
+description: Learn how to configure RDI for data capture and transformation.
group: di
hideListLinks: false
linkTitle: Data pipelines
@@ -16,3 +18,126 @@ summary: Redis Data Integration keeps Redis in sync with the primary database in
type: integration
weight: 30
---

RDI uses *pipelines* to implement [change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC). (See the [architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}}) for an introduction to pipelines.) The sections below explain how pipelines work and give an overview of how to configure and deploy them.

## How a pipeline works

An RDI pipeline captures change data records from the source database and transforms them into Redis data structures. It writes each of these new structures to a Redis target database under its own key.

By default, RDI transforms the source data into [hashes]({{< relref "/develop/data-types/hashes" >}}) or [JSON objects]({{< relref "/develop/data-types/json" >}}) for the target, with a standard data mapping and a standard format for the key. However, you can also provide your own custom transformation [jobs](#job-files) for each source table, using your own data mapping and key pattern. You specify these jobs declaratively with YAML configuration files that require no coding.

Data transformation involves two stages:

1. The data ingested during CDC is automatically transformed to an intermediate JSON change event format (sketched below).
1. RDI passes this JSON change event data to your custom transformation for further processing.
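
The intermediate change event records the state of a row before and after each operation. The snippet below is only a rough illustration of that shape, written as YAML for readability (the actual event is JSON); the field names follow the common Debezium convention (`before`, `after`, `op`, `source`) and are an assumption rather than the exact RDI schema:

```yaml
# Rough sketch of a change event for an update to a hypothetical "invoice" row.
# Field names follow the Debezium before/after/op/source convention and are
# assumptions, not the exact RDI event schema.
before:              # Row state before the change (null for an insert)
  InvoiceId: 7
  Total: 9.99
after:               # Row state after the change (null for a delete)
  InvoiceId: 7
  Total: 12.49
op: u                # c = create, u = update, d = delete
source:
  table: invoice     # Table the change was captured from
```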

The diagram below shows the flow of data through the pipeline:

{{< image filename="/images/rdi/ingest/RDIPipeDataflow.webp" >}}

You can provide a job file for each source table that needs a custom transformation. You can also add a *default job file* for any tables that don't have their own. You must specify the full name of the source table in the job file (or the special name "*" in the default job), and you can also include filtering logic to skip data that matches a particular condition. As part of the transformation, you can specify any of the following data types to store the data in Redis (a brief job file sketch follows the list):

- [JSON]({{< relref "/develop/data-types/json" >}})
- [Hashes]({{< relref "/develop/data-types/hashes" >}})
- [Sets]({{< relref "/develop/data-types/sets" >}})
- [Streams]({{< relref "/develop/data-types/streams" >}})
- [Sorted sets]({{< relref "/develop/data-types/sorted-sets" >}})
- [Strings]({{< relref "/develop/data-types/strings" >}})
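
For example, a minimal default job might write every row from tables that have no specific job as a JSON object. This is a sketch only; the layout mirrors the general structure described in the Job files section, so check that reference for the exact schema:

```yaml
# Sketch of a default job that applies to any table without its own job file.
# Field names are illustrative - see the Job files reference for the real schema.
source:
  table: "*"            # Special name that matches any otherwise-unmatched table
output:
  - uses: redis.write
    with:
      data_type: json   # Illustrative; other target data types can be chosen here
```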

### Pipeline lifecycle

After you deploy a pipeline, it goes through the following phases:

1. *Deploy* - When you deploy the pipeline, RDI first validates it before use. Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
1. *Snapshot* - The collector starts the pipeline by creating a snapshot of the full dataset. This involves reading all the relevant source data, transforming it, and then writing it into the Redis target. This phase typically takes minutes to hours if you have a lot of data.
1. *CDC* - Once the snapshot is complete, the collector starts listening for updates to the source data. Whenever a change is committed to the source, the collector captures it and adds it to the target through the pipeline. This phase continues indefinitely unless you change the pipeline configuration.
1. *Update* - If you update the pipeline configuration, the operator applies it to the collector and the stream processor. Note that the changes only affect newly captured data unless you reset the pipeline completely. Once RDI has accepted the updates, the pipeline returns to the CDC phase with the new configuration.
1. *Reset* - There are circumstances where you might want to rebuild the dataset completely. For example, you might want to apply a new transformation to all the source data or refresh the dataset if RDI is disconnected from the source for a long time. In situations like these, you can *reset* the pipeline back to the snapshot phase. When this is complete, the pipeline continues with CDC as usual.

## Using a pipeline

Follow the steps described in the sections below to prepare and run an RDI pipeline.

### 1. Prepare the source database

Before using the pipeline, you must first prepare your source database to use the Debezium connector for *change data capture (CDC)*. See the [architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}}) for more information about CDC. Each database type has a different set of preparation steps. You can find the preparation guides for the databases that RDI supports in the [Prepare source databases]({{< relref "/integrate/redis-data-integration/data-pipelines/prepare-dbs" >}}) section.

### 2. Configure the pipeline

RDI uses a set of [YAML](https://en.wikipedia.org/wiki/YAML) files to configure each pipeline. The following diagram shows the folder structure of the configuration:

{{< image filename="images/rdi/ingest/ingest-config-folders.webp" width="600px" >}}

The main configuration for the pipeline is in the `config.yaml` file. This specifies the connection details for the source database (such as host, username, and password) and also the queries that RDI will use to extract the required data. You should place job files in the `Jobs` folder if you want to specify your own data transformations.
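
For orientation, the sketch below shows the general shape of a `config.yaml` file with one source and one target connection. The section and field names are illustrative rather than exact, and real deployments normally supply secrets such as passwords separately:

```yaml
# Illustrative config.yaml sketch - names and fields are examples only;
# see the Pipeline configuration file reference for the exact schema.
sources:
  mysql:
    connection:
      type: mysql
      host: my-source-db.example.com      # Hypothetical hostname
      port: 3306
      database: chinook
      user: rdi_user
      # Passwords are usually provided as secrets rather than stored here.
targets:
  target:
    connection:
      type: redis
      host: my-target-redis.example.com   # Hypothetical hostname
      port: 12000
```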

See [Pipeline configuration file]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config" >}}) for a full description of the `config.yaml` file and some example configurations.

### 3. Create job files (optional)

You can use one or more job files to configure which fields from the source tables you want to use, and which data structure you want to write to the target. You can also optionally specify a transformation to apply to the data before writing it to the target. See the [Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}}) section for full details of the file format and examples of common tasks for job files.
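
As a rough illustration, the job sketch below keeps only rows from a hypothetical `invoice` table that match a filter and writes each one as a hash under a custom key. The table name, fields, and expressions are invented for the example, and the block names should be checked against the Job files reference:

```yaml
# Illustrative job file for a hypothetical "invoice" table.
# Block and field names are assumptions - confirm them in the Job files reference.
source:
  table: invoice
transform:
  - uses: filter
    with:
      language: sql
      expression: Status = 'PAID'          # Keep only rows that match this condition
output:
  - uses: redis.write
    with:
      data_type: hash                      # Write each matching row as a Redis hash
      key:
        expression: concat(['invoice:', InvoiceId])
        language: jmespath
```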

### 4. Deploy the pipeline

When your configuration is ready, you must deploy it to start using the pipeline. See [Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}}) to learn how to do this.

## More information

See the other pages in this section for more information and examples:
