Skip to content

Commit b6de8f1

Browse files
committed
Improve Citus documentation.
1 parent 56d24de commit b6de8f1

File tree

1 file changed

+119
-0
lines changed

1 file changed

+119
-0
lines changed

docs/ref/pgsql-citus-target.rst

Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,125 @@ backfilled from referenced data. pgloader knows how to do that by generating
7070
a query like the following and importing the result set of such a query
7171
rather than the raw data from the source table.
7272

73+
Citus Migration Example
74+
^^^^^^^^^^^^^^^^^^^^^^^
75+
76+
With the migration command as above, pgloader adds the column ``company_id``
77+
to the tables that have a direct or indirect foreign key reference to the
78+
``companies`` table.
79+
80+
We run pgloader using the following command, where the file
81+
`./test/citus/company.load
82+
<https://github.com/dimitri/pgloader/blob/master/test/citus/company.load>`_
83+
contains the pgloader command as shown above.
84+
85+
::
86+
87+
$ pgloader --client-min-messages sql ./test/citus/company.load
88+
89+
The following SQL statements are all extracted from the log messages that
90+
the pgloader command outputs. We are going to have a look at the
91+
`impressions` table. It gets created with a new column `company_id` in the
92+
first position, as follows:
93+
94+
::
95+
96+
CREATE TABLE "public"."impressions"
97+
(
98+
company_id bigint,
99+
"id" bigserial,
100+
"ad_id" bigint default NULL,
101+
"seen_at" timestamp with time zone default NULL,
102+
"site_url" text default NULL,
103+
"cost_per_impression_usd" numeric(20,10) default NULL,
104+
"user_ip" inet default NULL,
105+
"user_data" jsonb default NULL
106+
);
107+
108+
The original schema for this table does not have the `company_id` column,
109+
which means pgloader now needs to change the primary key definition, the
110+
foreign keys constraints definitions from and to this table, and also to
111+
*backfill* the `company_id` data to this table when doing the COPY phase of
112+
the migration.
113+
114+
Then once the tables have been created, pgloader executes the following SQL
115+
statements::
116+
117+
SELECT create_distributed_table('"public"."companies"', 'id');
118+
SELECT create_distributed_table('"public"."campaigns"', 'company_id');
119+
SELECT create_distributed_table('"public"."ads"', 'company_id');
120+
SELECT create_distributed_table('"public"."clicks"', 'company_id');
121+
SELECT create_distributed_table('"public"."impressions"', 'company_id');
122+
123+
Then when copying the data from the source PostgreSQL database to the new
124+
Citus tables, the new column (here ``company_id``) needs to be backfilled
125+
from the source tables. Here's the SQL query that pgloader uses as a data
126+
source for the ``ads`` table in our example:
127+
128+
::
129+
130+
SELECT "campaigns".company_id::text, "ads".id::text, "ads".campaign_id::text,
131+
"ads".name::text, "ads".image_url::text, "ads".target_url::text,
132+
"ads".impressions_count::text, "ads".clicks_count::text,
133+
"ads".created_at::text, "ads".updated_at::text
134+
135+
FROM "public"."ads"
136+
JOIN "public"."campaigns"
137+
ON ads.campaign_id = campaigns.id
138+
139+
The ``impressions`` table has an indirect foreign key reference to the
140+
``company`` table, which is the table where the distribution key is
141+
specified. pgloader will discover that itself from walking the PostgreSQL
142+
catalogs, and you may also use the following specification in the pgloader
143+
command to explicitely add the indirect dependency:
144+
145+
::
146+
147+
distribute impressions using company_id from ads, campaigns
148+
149+
Given this schema, the SQL query used by pgloader to fetch the data for the
150+
`impressions` table is the following, implementing online backfilling of the
151+
data:
152+
153+
::
154+
155+
SELECT "campaigns".company_id::text, "impressions".id::text,
156+
"impressions".ad_id::text, "impressions".seen_at::text,
157+
"impressions".site_url::text,
158+
"impressions".cost_per_impression_usd::text,
159+
"impressions".user_ip::text,
160+
"impressions".user_data::text
161+
162+
FROM "public"."impressions"
163+
164+
JOIN "public"."ads"
165+
ON impressions.ad_id = ads.id
166+
167+
JOIN "public"."campaigns"
168+
ON ads.campaign_id = campaigns.id
169+
170+
When the data copying is done, then pgloader also has to install the indexes
171+
supporting the primary keys, and add the foreign key definitions to the
172+
schema. Those definitions are not the same as in the source schema, because
173+
of the adding of the distribution column to the table: we need to also add
174+
the column to the primary key and the foreign key constraints.
175+
176+
Here's the commands issued by pgloader for the ``impressions`` table:
177+
178+
::
179+
180+
CREATE UNIQUE INDEX "impressions_pkey"
181+
ON "public"."impressions" (company_id, id);
182+
183+
ALTER TABLE "public"."impressions"
184+
ADD CONSTRAINT "impressions_ad_id_fkey"
185+
FOREIGN KEY(company_id,ad_id)
186+
REFERENCES "public"."ads"(company_id,id)
187+
188+
Given a single line of specification ``distribute companies using id`` then
189+
pgloader implements all the necessary schema changes on the fly when
190+
migrating to Citus, and also dynamically backfills the data.
191+
73192
Citus Migration: Limitations
74193
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
75194

0 commit comments

Comments
 (0)