Batch Processing
================

To load data into PostgreSQL, pgloader uses the `COPY` streaming protocol.
While this is the fastest way to load data, `COPY` has an important
drawback: as soon as PostgreSQL emits an error with any bit of data sent
to it, whatever the problem is, the whole data set is rejected by
PostgreSQL.

To work around that, pgloader cuts the data into *batches* of 25000 rows
each, so that when a problem occurs it only impacts that many rows of
data. Each batch is kept in memory while the `COPY` streaming happens, in
order to be able to handle errors should any happen.
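
The batch size can be tuned per load. Here is a minimal sketch for a CSV
load, assuming the `batch rows` and `batch size` *WITH* options of recent
pgloader versions, with placeholder file, database and table names::

    LOAD CSV
         FROM '/path/to/file.csv'
         INTO postgresql:///mydb?tablename=mytable
         WITH batch rows = 10000,
              batch size = 20 MB;
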

When PostgreSQL rejects the whole batch, pgloader logs the error message
then isolates the bad row(s) from the accepted ones by retrying the
batched rows in smaller batches. To do that, pgloader parses the *CONTEXT*
error message from the failed COPY, as the message contains the line
number where the error was found in the batch, as in the following
example::

    CONTEXT: COPY errors, line 3, column b: "2006-13-11"

Using that information, pgloader will reload all rows in the batch before
the erroneous one, log the erroneous one as rejected, then try loading the
remainder of the batch in a single attempt, which may or may not contain
other erroneous data.

At the end of a load containing rejected rows, you will find two files in
the *root-dir* location, under a directory named after the target database
of your setup. The files are named after the target table, with the
extension `.dat` for the rejected data and `.log` for the full PostgreSQL
client-side logs about the rejected data.

The `.dat` file is formatted in the PostgreSQL text COPY format as
documented in `http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609`.
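
As an illustration, with *root-dir* set to `/tmp/pgloader` and a load
targeting the hypothetical table `mytable` in the database `mydb`, the
rejected data and its logs would be found in::

    /tmp/pgloader/mydb/mytable.dat
    /tmp/pgloader/mydb/mytable.log
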

It is possible to use the following WITH options to control pgloader batch
behavior:

  - *on error stop*, *on error resume next*

    This option controls pgloader's error recovery behavior. The batch
    implementation allows pgloader to recover from errors by sending again
    the data that PostgreSQL accepts, and by keeping away the data that
    PostgreSQL rejects.

    To enable retrying the data and loading the good parts, use the option
    *on error resume next*, which is the default for file based data loads
    (such as CSV, IXF or DBF).

    When migrating from another RDBMS technology, it's best to have a
    reproducible loading process. In that case it's possible to use *on
    error stop* and fix either the casting rules, the data transformation
    functions or in some cases the input data until your migration runs
    through to completion. That's why *on error stop* is the default for
    the SQLite, MySQL and MS SQL source kinds; see the example after this
    list.
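
As an example, here is a minimal sketch of a MySQL migration set up to
stop at the first error, with `mydb` as a placeholder database name::

    LOAD DATABASE
         FROM mysql://root@localhost/mydb
         INTO postgresql:///mydb
         WITH on error stop;
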

A Note About Performance
------------------------

pgloader has been developed with performance in mind, to be able to cope
with ever-growing needs in loading large amounts of data into PostgreSQL.

The basic architecture it uses is the old Unix pipe model, where a thread
is responsible for loading the data (reading a CSV file, querying MySQL,
etc.) and fills pre-processed data into a queue. Another thread feeds from
the queue, applies some more *transformations* to the input data, and
streams the end result to PostgreSQL using the COPY protocol.

When given a file that the PostgreSQL `COPY` command knows how to parse,
and if the file contains no erroneous data, then pgloader will never be as
fast as just using the PostgreSQL `COPY` command.

Note that while the `COPY` command is restricted to reading either from
its standard input or from a local file on the server's file system, the
command line tool `psql` implements a `\copy` command that knows how to
stream a file local to the client over the network and into the PostgreSQL
server, using the same protocol as pgloader uses.
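
For instance, the following `psql` invocation streams a file that lives on
the client over to the server; `mydb`, `mytable` and the file name are
placeholders::

    psql -d mydb -c "\copy mytable FROM 'data.csv' WITH (FORMAT csv)"
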

A Note About Parallelism
------------------------

pgloader uses several concurrent tasks to process the data being loaded:

  - a reader task reads the data in and pushes it to a queue,

  - at least one writer task feeds from the queue and formats the raw data
    into the PostgreSQL COPY format in batches (so that it's possible to
    then retry a failed batch without reading the data from the source
    again), and then sends the data to PostgreSQL using the COPY protocol.

The parameter *workers* controls how many worker threads may be active at
any time (that's the parallelism level), and the parameter *concurrency*
controls how many tasks are started to handle the data (they may not all
run at the same time, depending on the *workers* setting).

We allow *workers* simultaneous workers to be active at the same time in
the context of a single table. A single unit of work consists of several
kinds of workers:

  - a reader getting raw data from the source,
  - N writers preparing and sending the data down to PostgreSQL.

The N here is set by the *concurrency* parameter: with a *concurrency* of
2, we start 1 + 2 = 3 concurrent tasks; with a *concurrency* of 4, we
start 1 + 4 = 5 concurrent tasks, of which only *workers* may be active
simultaneously.

The defaults are `workers = 4, concurrency = 1` when loading from a
database source, and `workers = 8, concurrency = 2` when loading from
something else (currently, a file). Those defaults are arbitrary and
waiting for feedback from users, so please consider providing feedback if
you play with the settings.
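
Both settings belong in the *WITH* clause. A minimal sketch for a file
based load, with placeholder file, database and table names::

    LOAD CSV
         FROM '/path/to/file.csv'
         INTO postgresql:///mydb?tablename=mytable
         WITH workers = 8, concurrency = 2;
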

As the `CREATE INDEX` threads started by pgloader are only waiting until
PostgreSQL is done with the real work, those threads are *NOT* counted in
the concurrency levels as detailed here.

By default, pgloader starts as many `CREATE INDEX` threads as the maximum
number of indexes per table found in your source schema. It is possible
to set the `max parallel create index` *WITH* option to another number in
case there are just too many of them to create at once.
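
For instance, to limit index builds to two at a time when migrating the
hypothetical MySQL database `mydb`::

    LOAD DATABASE
         FROM mysql://root@localhost/mydb
         INTO postgresql:///mydb
         WITH max parallel create index = 2;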