Commit 9259960

Improve pgloader docs (Table of Contents, titles, organisation). (#1440)
Make it easier to navigate our docs, which are dense enough to warrant proper organisation and a guided Table of Contents.
1 parent 6d73667 commit 9259960

17 files changed: +567 -562 lines changed

docs/batches.rst

Lines changed: 123 additions & 0 deletions

Batch Processing
================

To load data to PostgreSQL, pgloader uses the `COPY` streaming protocol.
While this is the fastest way to load data, `COPY` has an important
drawback: as soon as PostgreSQL emits an error with any bit of data sent to
it, whatever the problem is, the whole data set is rejected by PostgreSQL.

To work around that, pgloader cuts the data into *batches* of 25000 rows
each, so that when a problem occurs it only impacts that many rows of data.
Each batch is kept in memory while the `COPY` streaming happens, in order
to be able to handle errors should any happen.

When PostgreSQL rejects the whole batch, pgloader logs the error message
then isolates the bad row(s) from the accepted ones by retrying the batched
rows in smaller batches. To do that, pgloader parses the *CONTEXT* error
message from the failed COPY, as the message contains the line number where
the error was found in the batch, as in the following example::

    CONTEXT: COPY errors, line 3, column b: "2006-13-11"

Using that information, pgloader will reload all rows in the batch before
the erroneous one, log the erroneous one as rejected, then try loading the
rest of the batch in a single attempt, which may or may not contain other
erroneous data.

At the end of a load containing rejected rows, you will find two files in
the *root-dir* location, under a directory named the same as the target
database of your setup. The files are named after the target table, and
their extensions are `.dat` for the rejected data and `.log` for the file
containing the full PostgreSQL client side logs about the rejected data.

The `.dat` file is formatted in the PostgreSQL text COPY format as
documented in `http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609`.
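
For illustration only, assuming pgloader's default *root-dir* of
`/tmp/pgloader/`, a target database named `mydb` and a target table named
`mytable` (both names are placeholders), the reject files would show up in
paths such as::

    /tmp/pgloader/mydb/mytable.dat
    /tmp/pgloader/mydb/mytable.log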

It is possible to use the following WITH options to control pgloader batch
behavior:

- *on error stop*, *on error resume next*

  This option controls whether pgloader builds batches of data at all. The
  batch implementation allows pgloader to recover from errors by sending
  again the data that PostgreSQL accepts, and by keeping away the data that
  PostgreSQL rejects.

  To enable retrying the data and loading the good parts, use the option
  *on error resume next*, which is the default for file based data loads
  (such as CSV, IXF or DBF).

  When migrating from another RDBMS technology, it's best to have a
  reproducible loading process. In that case it's possible to use *on error
  stop* and fix either the casting rules, the data transformation functions
  or, in some cases, the input data until your migration runs through to
  completion. That's why *on error stop* is the default for SQLite, MySQL
  and MS SQL source kinds, as in the sketch below.
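
For illustration only, here is a minimal load command setting that option;
the connection strings below are placeholders, not a recommended setup::

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb

    WITH on error stop;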

A Note About Performance
------------------------

pgloader has been developed with performance in mind, to be able to cope
with ever-growing needs in loading large amounts of data into PostgreSQL.

The basic architecture it uses is the old Unix pipe model, where a thread
is responsible for loading the data (reading a CSV file, querying MySQL,
etc.) and fills a queue with pre-processed data. Another thread feeds from
the queue, applies some more *transformations* to the input data, and
streams the end result to PostgreSQL using the COPY protocol.

When given a file that the PostgreSQL `COPY` command knows how to parse,
and if the file contains no erroneous data, then pgloader will never be as
fast as just using the PostgreSQL `COPY` command.

Note that while the `COPY` command is restricted to reading either from
its standard input or from a local file on the server's file system, the
command line tool `psql` implements a `\copy` command that knows how to
stream a file local to the client over the network and into the PostgreSQL
server, using the same protocol as pgloader uses.

A Note About Parallelism
------------------------

pgloader uses several concurrent tasks to process the data being loaded:

- a reader task reads the data in and pushes it to a queue,

- at least one writer task feeds from the queue and formats the raw data
  into the PostgreSQL COPY format in batches (so that it's possible to then
  retry a failed batch without reading the data from the source again), and
  then sends the data to PostgreSQL using the COPY protocol.

The parameter *workers* controls how many worker threads are allowed to be
active at any time (that's the parallelism level); the parameter
*concurrency* controls how many tasks are started to handle the data (they
may not all run at the same time, depending on the *workers* setting).

We allow *workers* simultaneous workers to be active at the same time in
the context of a single table. A single unit of work consists of several
kinds of workers:

- a reader getting raw data from the source,
- N writers preparing and sending the data down to PostgreSQL.

The N here is set to the *concurrency* parameter: with a *concurrency* of
2, we start (+ 1 2) = 3 concurrent tasks, with a *concurrency* of 4 we
start (+ 1 4) = 5 concurrent tasks, of which only *workers* may be active
simultaneously.

The defaults are `workers = 4, concurrency = 1` when loading from a
database source, and `workers = 8, concurrency = 2` when loading from
something else (currently, a file). Those defaults are arbitrary and
waiting for feedback from users, so please consider providing feedback if
you play with the settings.
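
For illustration only, overriding those defaults in a load command could
look like the following; the connection strings are again placeholders::

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb

    WITH workers = 8, concurrency = 2;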

As the `CREATE INDEX` threads started by pgloader only wait until
PostgreSQL is done with the real work, those threads are *NOT* counted
into the concurrency levels as detailed here.

By default, pgloader starts as many `CREATE INDEX` threads as the maximum
number of indexes per table found in your source schema. It is possible to
set the `max parallel create index` *WITH* option to another number in
case there are just too many of them to create.
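
As a sketch under the same placeholder setup as above, limiting index
creation parallelism would then read::

    LOAD DATABASE
         FROM mysql://user@localhost/sourcedb
         INTO postgresql:///targetdb

    WITH max parallel create index = 2;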
