
Commit 7e569ef

Merge pull request #1554 from stoyanr/RDSC-3533-update-rdi-docs-2
RDSC-3533 Update RDI documentation, part 2
2 parents ac84656 + 9ba0156 commit 7e569ef

3 files changed: +155, -149 lines


content/integrate/redis-data-integration/architecture.md

Lines changed: 37 additions & 32 deletions
```diff
@@ -93,23 +93,29 @@ RDI supports the following database sources using [Debezium Server](https://debe
 
 ## How RDI is deployed
 
-RDI is designed with two *planes* that provide its services.
+RDI is designed with three *planes* that provide its services.
+
 The *control plane* contains the processes that keep RDI active.
 It includes:
 
-- An *operator* process that schedules the CDC collector and the
-  stream processor to implement the two phases of the pipeline
-  lifecycle (initial cache loading and change streaming)
-- A [Prometheus](https://prometheus.io/)
-  endpoint to supply metrics about RDI
-- A REST API to control the VM.
+- An *API server* process that exposes a REST API to observe and control RDI.
+- An *operator* process that manages the *data plane* processes.
+- A *metrics exporter* process that reads metrics from the RDI database
+  and exports them as [Prometheus](https://prometheus.io/) metrics.
+
+The *data plane* contains the processes that actually move the data.
+It includes the *CDC collector* and the *stream processor* that implement
+the two phases of the pipeline lifecycle (initial cache loading and change streaming).
 
 The *management plane* provides tools that let you interact
-with the control plane. Use the CLI tool to install and administer RDI
-and to deploy and manage a pipeline. Use the pipeline editor
-(included in Redis Insight) to design or edit a pipeline. The
-diagram below shows the components of the control and management
-planes and the connections between them:
+with the control plane.
+
+- Use the CLI tool to install and administer RDI and to deploy
+  and manage a pipeline.
+- Use the pipeline editor included in Redis Insight to design
+  or edit a pipeline.
+
+The diagram below shows all RDI components and the interactions between them:
 
 {{< image filename="images/rdi/ingest/ingest-control-plane.webp" >}}
 
```

```diff
@@ -118,11 +124,11 @@ deploy RDI.
 
 ### RDI on your own VMs
 
-For this deployment, you must provide two VMs. The
-collector and stream processor are active on one VM while the other is a standby to provide high availability. The operators run on both VMs and use an algorithm to decide which is the active one (the "leader").
-Both the active VM and the standby
-need access to the authentication secrets that RDI uses to encrypt network
-traffic. The diagram below shows this configuration:
+For this deployment, you must provide two VMs. The collector and stream processor
+are active on one VM, while on the other they are in standby to provide high availability.
+The two operators running on both VMs use a leader election algorithm to decide which
+VM is the active one (the "leader").
+The diagram below shows this configuration:
 
 {{< image filename="images/rdi/ingest/ingest-active-passive-vms.webp" >}}
 
```

```diff
@@ -136,27 +142,26 @@ on [Kubernetes (K8s)](https://kubernetes.io/), including Red Hat
 [OpenShift](https://docs.openshift.com/). This creates:
 
 - A K8s [namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) named `rdi`.
-- [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) for the
+  You can also use a different namespace name if you prefer.
+- [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) and
+  [services](https://kubernetes.io/docs/concepts/services-networking/service/) for the
   [RDI operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed" >}}),
   [metrics exporter]({{< relref "/integrate/redis-data-integration/observability" >}}), and API server.
-- A [service account](https://kubernetes.io/docs/concepts/security/service-accounts/) along with a
-  [role](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#restrictions-on-role-creation-or-update)
-  and [role binding](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding) for the RDI operator.
-- A [Configmap](https://kubernetes.io/docs/concepts/configuration/configmap/)
-  for the different components with RDI Redis database details.
+- A [service account](https://kubernetes.io/docs/concepts/security/service-accounts/)
+  and [RBAC resources](https://kubernetes.io/docs/reference/access-authn-authz/rbac) for the RDI operator.
+- A [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) with RDI database details.
 - [Secrets](https://kubernetes.io/docs/concepts/configuration/secret/)
-  with the RDI Redis database credentials and TLS certificates.
+  with the RDI database credentials and TLS certificates.
+- Other optional K8s resources such as [ingresses](https://kubernetes.io/docs/concepts/services-networking/ingress/)
+  that can be enabled depending on your K8s environment and needs.
 
 See [Install on Kubernetes]({{< relref "/integrate/redis-data-integration/installation/install-k8s" >}})
 for more information.
 
 ### Secrets and security considerations
 
-RDI encrypts all network connections with
-[TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security) or
-[mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS).
-The credentials for the connections are saved as secrets and you
-can choose how to provide these secrets to RDI. Note that RDI stores
-all state and configuration data inside the Redis Enterprise cluster
-and does not store any other data on your RDI VMs or anywhere else
-outside the cluster.
+The credentials for the database connections, as well as the certificates
+for [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security) and
+[mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS) are saved in K8s secrets.
+RDI stores all state and configuration data inside the Redis Enterprise cluster
+and does not store any other data on your RDI VMs or anywhere else outside the cluster.
```
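
The revised "Secrets and security considerations" text above stores connection credentials and TLS/mTLS certificates in K8s secrets. As a purely hypothetical illustration (the secret name and key names below are invented for this sketch, not the actual names RDI uses), such a secret could look like:

```yaml
# Hypothetical sketch of a K8s Secret holding database credentials and a CA
# certificate, as described above. Names below are NOT the real RDI names.
apiVersion: v1
kind: Secret
metadata:
  name: source-db          # invented name for illustration
  namespace: rdi           # the namespace the RDI installer creates
type: Opaque
stringData:
  username: myuser
  password: mypassword
  cacert: |                # CA certificate for TLS, in PEM format
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```

See the RDI "Set secrets" documentation for the supported way to provide these values.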

content/integrate/redis-data-integration/data-pipelines/data-pipelines.md

Lines changed: 47 additions & 30 deletions
````diff
@@ -69,11 +69,10 @@ The sections below describe the two types of configuration file in more detail.
 
 ## The `config.yaml` file
 
 Here is an example of a `config.yaml` file. Note that the values of the
-form "`${name}`" refer to environment variables that you should set with the
-[`redis-di set-secret`]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-set-secret" >}})
-command. In particular, you should normally use environment variables as shown to set the source
-and target username and password rather than storing them in plain text in this
-file (see [Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}}) for more information).
+form "`${name}`" refer to secrets that you should set as described in
+[Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}}).
+In particular, you should normally use secrets as shown to set the source
+and target username and password rather than storing them in plain text in this file.
 
 ```yaml
 sources:
````
```diff
@@ -212,30 +211,43 @@ to identify the source (in the example we have a source
 called `mysql` but you can choose any name you like). The example
 configuration contains the following data:
 
-- `type`: The type of collector to use for the pipeline. Currently, the only type we support is `cdc`.
-- `connection`: The connection details for the source database: hostname, port, schema/db name, database credentials and
-  [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security)/
-  [mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS) secrets.
-- `tables`: The dataset you want to collect from the source. This subsection
-  specifies:
-  - `snapshot_sql`: A query that selects the tables to include in the dataset
-    (the default is to include all tables if you don't specify a query here).
+- `type`: The type of collector to use for the pipeline.
+  Currently, the only types we support are `cdc` and `external`.
+  If the source type is set to `external`, no collector resources will be created by the operator,
+  and all other source sections should be empty or not specified at all.
+- `connection`: The connection details for the source database: `type`, `host`, `port`,
+  and credentials (`username` and `password`).
+  - `type` is the source database type, one of `mariadb`, `mysql`, `oracle`, `postgresql`, or `sqlserver`.
+  - If you use [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security)
+    or [mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS) to connect
+    to the source database, you may need to specify additional properties in the
+    `advanced` section with references to the corresponding certificates depending
+    on the source database type. Note that these properties **must** be references to
+    secrets that you should set as described in [Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}}).
+- `databases`: List of all databases to collect data from for source database types
+  that support multiple databases, such as `mysql` and `mariadb`.
+- `schemas`: List of all schemas to collect data from for source database types
+  that support multiple schemas, such as `oracle`, `postgresql`, and `sqlserver`.
+- `tables`: List of all tables to collect data from. Each table is identified by its
+  full name, including a database or schema prefix. If there is a single
+  database or schema, this prefix can be omitted.
+  For each table, you can specify:
   - `columns`: A list of the columns you are interested in (the default is to
-    include all columns if you don't supply a list)
+    include all columns)
   - `keys`: A list of columns to create a composite key if your table
     doesn't already have a [`PRIMARY KEY`](https://www.w3schools.com/sql/sql_primarykey.asp) or
     [`UNIQUE`](https://www.w3schools.com/sql/sql_unique.asp) constraint.
+  - `snapshot_sql`: A query to be used when performing the initial snapshot.
+    By default, a query that contains all listed columns of all listed tables will be used.
 - `advanced`: These optional properties configure other Debezium-specific features.
   The available sub-sections are:
-  - `sink`: All advanced properties for writing to RDI (TLS, memory threshold, etc).
+  - `source`: Properties for reading from the source database.
+    See the Debezium [Source connectors](https://debezium.io/documentation/reference/stable/connectors/)
+    pages for more information about the properties available for each database type.
+  - `sink`: Properties for writing to Redis streams in the RDI database.
     See the Debezium [Redis stream properties](https://debezium.io/documentation/reference/stable/operations/debezium-server.html#_redis_stream)
     page for the full set of available properties.
-  - `source`: All advanced connector properties (for example, RAC nodes).
-    See [Database-specific connection properties](#db-connect-props) below and also
-    see the
-    Debezium [Connectors](https://debezium.io/documentation/reference/stable/connectors/)
-    pages for more information about the properties available for each database type.
-  - `quarkus`: All advanced properties for Debezium server, such as the log level. See the
+  - `quarkus`: Properties for the Debezium server, such as the log level. See the
     Quarkus [Configuration options](https://quarkus.io/guides/all-config)
     docs for the full set of available properties.
 
```

```diff
@@ -244,10 +256,16 @@ configuration contains the following data:
 Use this section to provide the connection details for the target Redis
 database(s). As with the sources, you should start each target section
 with a unique name that you are free to choose (here, we have used
-`my-redis` as an example). In the `connection` section, you can supply the
-`type` of target database, which will generally be `redis` along with the
-`host` and `port` of the server. You can also supply connection credentials
-and TLS/mTLS secrets here if you use them.
+`target` as an example). In the `connection` section, you can specify the
+`type` of the target database, which must be `redis`, along with
+connection details such as `host`, `port`, and credentials (`username` and `password`).
+If you use [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security)
+or [mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS) to connect
+to the target database, you must specify the CA certificate (for TLS),
+and the client certificate and private key (for mTLS) in `cacert`, `cert`, and `key`.
+Note that these certificates **must** be references to secrets
+that you should set as described in [Set secrets]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy#set-secrets" >}})
+(it is not possible to include these certificates as plain text in the file).
 
 {{< note >}}If you specify `localhost` as the address of either the source or target server during
 installation then the connection will fail if the actual IP address changes for the local
```
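
The target description above could be sketched as the following `targets` section. Again this is illustrative, not authoritative: the host and port are invented, and the `${...}` values are secret references, which the text says are mandatory for the certificates:

```yaml
# Illustrative sketch of a targets section, per the description above.
# The host is invented; ${...} values are secret references, not plain text.
targets:
  target:                        # any target name you like
    connection:
      type: redis                # must be redis
      host: redis.example.com    # invented host
      port: 12000
      username: ${TARGET_DB_USERNAME}
      password: ${TARGET_DB_PASSWORD}
      cacert: ${TARGET_DB_CACERT}   # CA certificate (TLS)
      cert: ${TARGET_DB_CERT}       # client certificate (mTLS)
      key: ${TARGET_DB_KEY}         # client private key (mTLS)
```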
```diff
@@ -400,10 +418,9 @@ When your configuration is ready, you must deploy it to start using the pipeline
 [Deploy a pipeline]({{< relref "/integrate/redis-data-integration/data-pipelines/deploy" >}})
 to learn how to do this.
 
-## Ingest pipeline lifecycle
+## Pipeline lifecycle
 
-Once you have created the configuration for a pipeline, it goes through the
-following phases:
+A pipeline goes through the following phases:
 
 1. *Deploy* - when you deploy the pipeline, RDI first validates it before use.
 Then, the [operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed">}}) creates and configures the collector and stream processor that will run the pipeline.
```
```diff
@@ -415,8 +432,8 @@ hours to complete if you have a lot of data.
 the source data. Whenever a change is committed to the source, the collector captures
 it and adds it to the target through the pipeline. This phase continues indefinitely
 unless you change the pipeline configuration.
-1. *Update* - If you update the pipeline configuration, the operator starts applying it
-to the processor and the collector. Note that the changes only affect newly-captured
+1. *Update* - If you update the pipeline configuration, the operator applies it
+to the collector and the stream processor. Note that the changes only affect newly-captured
 data unless you reset the pipeline completely. Once RDI has accepted the updates, the
 pipeline returns to the CDC phase with the new configuration.
 1. *Reset* - There are circumstances where you might want to rebuild the dataset
```
