Commit 75ac236

DOC-5282 restructured pipeline/job config info
1 parent c399067 · commit 75ac236

4 files changed: +159 -126 lines changed
Lines changed: 126 additions & 2 deletions
@@ -1,13 +1,15 @@
 ---
 Title: Data pipelines
-aliases: /integrate/redis-data-integration/ingest/data-pipelines/
+aliases:
+- /integrate/redis-data-integration/ingest/data-pipelines/
+- /integrate/redis-data-integration/data-pipelines/data-pipelines/
 alwaysopen: false
 categories:
 - docs
 - integrate
 - rs
 - rdi
-description: Learn how an RDI pipeline can transform source data before writing
+description: Learn how to configure RDI for data capture and transformation.
 group: di
 hideListLinks: false
 linkTitle: Data pipelines
@@ -16,3 +18,125 @@ summary: Redis Data Integration keeps Redis in sync with the primary database in
 type: integration
 weight: 30
 ---
+
+RDI implements
+[change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC)
+with *pipelines*. (See the
+[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
+for an introduction to pipelines.)
+
+## How a pipeline works
+
+An RDI pipeline captures change data records from the source database and transforms them
+into Redis data structures. It writes each of these new structures to a Redis target
+database under its own key.
+
+By default, RDI transforms the source data into
+[hashes]({{< relref "/develop/data-types/hashes" >}}) or
+[JSON objects]({{< relref "/develop/data-types/json" >}}) for the target with a
+standard data mapping and a standard format for the key.
+However, you can also provide your own custom transformation [jobs](#job-files)
+for each source table, using your own data mapping and key pattern. You specify these
+jobs declaratively with YAML configuration files that require no coding.
+
+The data transformation involves two separate stages:
+
+1. The data ingested during CDC is automatically transformed to an intermediate JSON
+change event format.
+1. This JSON change event data gets passed on to your custom transformation for further
+processing.
+
+The diagram below shows the flow of data through the pipeline:
+
+{{< image filename="/images/rdi/ingest/RDIPipeDataflow.webp" >}}
+
+You can provide a job file for each source table for which you want to specify a custom
+transformation. You can also add a *default job file* for any tables that don't have their own.
+You must specify the full name of the source table in the job file (or the special
+name "*" in the default job). You can also include filtering logic to skip data that
+matches a particular condition.
+As part of the transformation, you can specify any of the following data types
+to store the data in Redis:
+
+- [JSON objects]({{< relref "/develop/data-types/json" >}})
+- [Hashes]({{< relref "/develop/data-types/hashes" >}})
+- [Sets]({{< relref "/develop/data-types/sets" >}})
+- [Streams]({{< relref "/develop/data-types/streams" >}})
+- [Sorted sets]({{< relref "/develop/data-types/sorted-sets" >}})
+- [Strings]({{< relref "/develop/data-types/strings" >}})
+
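To make the job files described above concrete, here is a minimal sketch of the kind of
file you might write for a single table. The table name, column name, key pattern, and
filter expression are hypothetical, and the exact schema is described in the
[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
section:

```yaml
# Hypothetical job file (for example, Jobs/employees.yaml). Names and
# expressions are illustrative only; see the Job files section for the schema.
source:
  table: employees             # full name of the source table
transform:
  - uses: filter               # optional filtering logic
    with:
      expression: country = 'US'
      language: sql
output:
  - uses: redis.write
    with:
      data_type: json          # could also be hash, set, stream, sorted_set, or string
      key:
        expression: concat(['emp:', employee_id])
        language: jmespath
```
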
+### Pipeline lifecycle
+
+After you deploy a pipeline, it goes through the following phases:
+
+1. *Deploy* - When you deploy the pipeline, RDI first validates it before use.
+Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
+1. *Snapshot* - The collector starts the pipeline by creating a snapshot of the full
+dataset. This involves reading all the relevant source data, transforming it and then
+writing it into the Redis target. You should expect this phase to take minutes or
+hours to complete if you have a lot of data.
+1. *CDC* - Once the snapshot is complete, the collector starts listening for updates to
+the source data. Whenever a change is committed to the source, the collector captures
+it and adds it to the target through the pipeline. This phase continues indefinitely
+unless you change the pipeline configuration.
+1. *Update* - If you update the pipeline configuration, the operator applies it
+to the collector and the stream processor. Note that the changes only affect newly captured
+data unless you reset the pipeline completely. Once RDI has accepted the updates, the
+pipeline returns to the CDC phase with the new configuration.
+1. *Reset* - There are circumstances where you might want to rebuild the dataset
+completely. For example, you might want to apply a new transformation to all the source
+data or refresh the dataset if RDI is disconnected from the
+source for a long time. In situations like these, you can *reset* the pipeline back
+to the snapshot phase. When this is complete, the pipeline continues with CDC as usual.
+
+## Using a pipeline
+
+Follow the steps described in the sections below to prepare and run an RDI pipeline.
+
+### 1. Prepare the source database
+
+Before using the pipeline, you must first prepare your source database to use
+the Debezium connector for *change data capture (CDC)*. See the
+[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
+for more information about CDC.
+Each database type has a different set of preparation steps. You can
+find the preparation guides for the databases that RDI supports in the
+[Prepare source databases]({{< relref "/integrate/redis-data-integration/data-pipelines/prepare-dbs" >}})
+section.
+
+### 2. Configure the pipeline
+
+RDI uses a set of [YAML](https://en.wikipedia.org/wiki/YAML)
+files to configure each pipeline. The following diagram shows the folder
+structure of the configuration:
+
+{{< image filename="images/rdi/ingest/ingest-config-folders.webp" width="600px" >}}
+
+The main configuration for the pipeline is in the `config.yaml` file.
+This specifies the connection details for the source database (such
+as host, username, and password) and also the queries that RDI will use
+to extract the required data. You should place job configurations in the `Jobs`
+folder if you want to specify your own data transformations.
+
+See
+[Pipeline configuration file]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config" >}})
+for a full description of the `config.yaml` file and some example configurations.
+
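As a rough, hypothetical sketch of that structure, a `config.yaml` file has top-level
`sources` and `targets` sections (the full example on the pipeline configuration page
also shows a `processors` section), along these lines. The hosts, ports, and source type
below are placeholders rather than a working configuration:

```yaml
# Hypothetical config.yaml skeleton -- placeholder values only.
sources:
  mysql:                            # any unique name you choose for the source
    type: cdc
    connection:
      type: mysql
      host: source-db.example.com   # avoid `localhost` here
      port: 3306
      username: ${SOURCE_DB_USERNAME}   # secrets, set as described in "Set secrets"
      password: ${SOURCE_DB_PASSWORD}
targets:
  target:                           # any unique name you choose for the target
    connection:
      type: redis
      host: redis-target.example.com
      port: 12000
      username: ${TARGET_DB_USERNAME}
      password: ${TARGET_DB_PASSWORD}
```
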
+### 3. Create job files (optional)
+
+You can use one or more job files to configure which fields from the source tables
+you want to use, and which data structure you want to write to the target. You
+can also optionally specify a transformation to apply to the data before writing it
+to the target. See the
+[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
+section for full details of the file format and examples of common tasks for job files.
+
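For the *default job* mentioned earlier, the convention described above is to use the
special name "*" instead of a specific table name. A hypothetical sketch, with the same
caveats as before about the exact schema:

```yaml
# Hypothetical default job (for example, Jobs/default.yaml), applied to any
# table that has no job file of its own.
source:
  table: "*"
output:
  - uses: redis.write
    with:
      data_type: hash              # or json, set, stream, sorted_set, string
```
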
+### 4. Deploy the pipeline
+
+When your configuration is ready, you must deploy it to start using the pipeline. See
+[Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}})
+to learn how to do this.
+
+## More information
+
+See the other pages in this section for more information and examples:

content/integrate/redis-data-integration/data-pipelines/deploy.md

Lines changed: 5 additions & 3 deletions
@@ -17,7 +17,7 @@ weight: 10
 ---

 The sections below explain how to deploy a pipeline after you have created the required
-[configuration]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines" >}}).
+[configuration]({{< relref "/integrate/redis-data-integration/data-pipelines" >}}).

 ## Set secrets

@@ -26,7 +26,9 @@ source and target databases. Each secret has a name that you can pass to the
 [`redis-di set-secret`]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-set-secret" >}})
 command (VM deployment) or the `rdi-secret.sh` script (K8s deployment) to set the secret value.
 You can then refer to these secrets in the `config.yaml` file using the syntax "`${SECRET_NAME}`"
-(the sample [config.yaml file]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines#the-configyaml-file" >}}) shows these secrets in use).
+(the sample
+[config.yaml file]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config#example" >}})
+shows these secrets in use).

 The table below lists all valid secret names. Note that the
 username and password are required for the source and target, but the other
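For instance, the "`${SECRET_NAME}`" references appear in `config.yaml` roughly as in
this hypothetical fragment (the secret names shown here are examples; the deploy page's
table of valid secret names is the authoritative list):

```yaml
sources:
  mysql:
    connection:
      username: ${SOURCE_DB_USERNAME}
      password: ${SOURCE_DB_PASSWORD}
```
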
@@ -252,7 +254,7 @@ Note that the certificate paths contained in the secrets `SOURCE_DB_CACERT`, `SO

 ## Deploy a pipeline

-When you have created your configuration, including the [jobs]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines#job-files" >}}), you are
+When you have created your configuration, including the [jobs]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}}), you are
 ready to deploy. Use [Redis Insight]({{< relref "/develop/tools/insight/rdi-connector" >}})
 to configure and deploy pipelines for both VM and K8s installations.

content/integrate/redis-data-integration/data-pipelines/data-pipelines.md renamed to content/integrate/redis-data-integration/data-pipelines/pipeline-config.md

Lines changed: 27 additions & 120 deletions
@@ -1,74 +1,29 @@
 ---
-Title: Configure data pipelines
-linkTitle: Configure
-description: Learn how to configure ingest pipelines for data transformation
-weight: 1
+Title: Pipeline configuration file
 alwaysopen: false
-categories: ["redis-di"]
-aliases: /integrate/redis-data-integration/ingest/data-pipelines/data-pipelines/
+categories:
+- docs
+- integrate
+- rs
+- rdi
+description: Learn how to specify the main configuration details for an RDI pipeline.
+group: di
+linkTitle: Pipeline configuration file
+summary: Redis Data Integration keeps Redis in sync with the primary database in near
+  real time.
+type: integration
+weight: 3
 ---

-RDI implements
-[change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC)
-with *pipelines*. (See the
-[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
-for an introduction to pipelines.)
+The main configuration details for an RDI pipeline are in the `config.yaml` file.
+This file specifies the connection details for the source and target databases,
+and also the set of tables you want to capture. You can also add one or more
+[job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
+if you want to apply custom transformations to the captured data.

-## Overview
+## Example

-An RDI pipeline captures change data records from the source database, and transforms them
-into Redis data structures. It writes each of these new structures to a Redis target
-database under its own key.
-
-By default, RDI transforms the source data into
-[hashes]({{< relref "/develop/data-types/hashes" >}}) or
-[JSON objects]({{< relref "/develop/data-types/json" >}}) for the target with a
-standard data mapping and a standard format for the key.
-However, you can also provide your own custom transformation [jobs](#job-files)
-for each source table, using your own data mapping and key pattern. You specify these
-jobs declaratively with YAML configuration files that require no coding.
-
-The data tranformation involves two separate stages. First, the data ingested
-during CDC is automatically transformed to a JSON format. Then,
-this JSON data gets passed on to your custom transformation for further processing.
-
-You can provide a job file for each source table you want to transform, but you
-can also add a *default job* for any tables that don't have their own.
-You must specify the full name of the source table in the job file (or the special
-name "*" in the default job) and you
-can also include filtering logic to skip data that matches a particular condition.
-As part of the transformation, you can specify whether you want to store the
-data in Redis as
-[JSON objects]({{< relref "/develop/data-types/json" >}}),
-[hashes]({{< relref "/develop/data-types/hashes" >}}),
-[sets]({{< relref "/develop/data-types/sets" >}}),
-[streams]({{< relref "/develop/data-types/streams" >}}),
-[sorted sets]({{< relref "/develop/data-types/sorted-sets" >}}), or
-[strings]({{< relref "/develop/data-types/strings" >}}).
-
-The diagram below shows the flow of data through the pipeline:
-
-{{< image filename="/images/rdi/ingest/RDIPipeDataflow.webp" >}}
-
-## Pipeline configuration
-
-RDI uses a set of [YAML](https://en.wikipedia.org/wiki/YAML)
-files to configure each pipeline. The following diagram shows the folder
-structure of the configuration:
-
-{{< image filename="images/rdi/ingest/ingest-config-folders.webp" width="600px" >}}
-
-The main configuration for the pipeline is in the `config.yaml` file.
-This specifies the connection details for the source database (such
-as host, username, and password) and also the queries that RDI will use
-to extract the required data. You should place job configurations in the `Jobs`
-folder if you want to specify your own data transformations.
-
-The sections below describe the two types of configuration file in more detail.
-
-## The `config.yaml` file
-
-Here is an example of a `config.yaml` file. Note that the values of the
+Below is an example of a `config.yaml` file. Note that the values of the
 form "`${name}`" refer to secrets that you should set as described in
 [Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}}).
 In particular, you should normally use secrets as shown to set the source
@@ -201,18 +156,20 @@ processors:
 # error_handling: dlq
 ```

+## Sections
+
 The main sections of the file configure [`sources`](#sources) and [`targets`](#targets).

 ### Sources

 The `sources` section has a subsection for the source that
 you need to configure. The source section starts with a unique name
-to identify the source (in the example we have a source
+to identify the source (in the example, there is a source
 called `mysql` but you can choose any name you like). The example
 configuration contains the following data:

 - `type`: The type of collector to use for the pipeline.
-Currently, the only types we support are `cdc` and `external`.
+Currently, the only types RDI supports are `cdc` and `external`.
 If the source type is set to `external`, no collector resources will be created by the operator,
 and all other source sections should be empty or not specified at all.
 - `connection`: The connection details for the source database: `type`, `host`, `port`,
@@ -255,8 +212,8 @@ configuration contains the following data:

 Use this section to provide the connection details for the target Redis
 database(s). As with the sources, you should start each target section
-with a unique name that you are free to choose (here, we have used
-`target` as an example). In the `connection` section, you can specify the
+with a unique name that you are free to choose (here, the example uses the name
+`target`). In the `connection` section, you can specify the
 `type` of the target database, which must be `redis`, along with
 connection details such as `host`, `port`, and credentials (`username` and `password`).
 If you use [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security)/
@@ -269,7 +226,7 @@ that you should set as described in [Set secrets]({{< relref "/integrate/redis-d

 {{< note >}}If you specify `localhost` as the address of either the source or target server during
 installation then the connection will fail if the actual IP address changes for the local
-VM. For this reason, we recommend that you don't use `localhost` for the address. However,
+VM. For this reason, it is recommended that you don't use `localhost` for the address. However,
 if you do encounter this problem, you can fix it using the following commands on the VM
 that is running RDI itself:

@@ -278,53 +235,3 @@ sudo k3s kubectl delete nodes --all
 sudo service k3s restart
 ```
 {{< /note >}}
-
-## Job files
-
-You can use one or more job files to configure which fields from the source tables
-you want to use, and which data structure you want to write to the target. You
-can also optionally specify a transformation to apply to the data before writing it
-to the target. See the
-[Job files]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}})
-section for full details of the file format and examples of common tasks for job files.
-
-## Source preparation
-
-Before using the pipeline you must first prepare your source database to use
-the Debezium connector for *change data capture (CDC)*. See the
-[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
-for more information about CDC.
-Each database type has a different set of preparation steps. You can
-find the preparation guides for the databases that RDI supports in the
-[Prepare source databases]({{< relref "/integrate/redis-data-integration/data-pipelines/prepare-dbs" >}})
-section.
-
-## Deploy a pipeline
-
-When your configuration is ready, you must deploy it to start using the pipeline. See
-[Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}})
-to learn how to do this.
-
-## Pipeline lifecycle
-
-A pipeline goes through the following phases:
-
-1. *Deploy* - when you deploy the pipeline, RDI first validates it before use.
-Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
-1. *Snapshot* - The collector starts the pipeline by creating a snapshot of the full
-dataset. This involves reading all the relevant source data, transforming it and then
-writing it into the Redis target. You should expect this phase to take minutes or
-hours to complete if you have a lot of data.
-1. *CDC* - Once the snapshot is complete, the collector starts listening for updates to
-the source data. Whenever a change is committed to the source, the collector captures
-it and adds it to the target through the pipeline. This phase continues indefinitely
-unless you change the pipeline configuration.
-1. *Update* - If you update the pipeline configuration, the operator applies it
-to the collector and the stream processor. Note that the changes only affect newly-captured
-data unless you reset the pipeline completely. Once RDI has accepted the updates, the
-pipeline returns to the CDC phase with the new configuration.
-1. *Reset* - There are circumstances where you might want to rebuild the dataset
-completely. For example, you might want to apply a new transformation to all the source
-data or refresh the dataset if RDI is disconnected from the
-source for a long time. In situations like these, you can *reset* the pipeline back
-to the snapshot phase. When this is complete, the pipeline continues with CDC as usual.

content/integrate/redis-data-integration/data-pipelines/prepare-dbs/_index.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ linkTitle: Prepare source databases
 summary: Redis Data Integration keeps Redis in sync with the primary database in near
   real time.
 type: integration
-weight: 30
+weight: 1
 ---

 Each database uses a different mechanism to track changes to its data and
