Commit eab1cbf

More docs improvements.
Explain the feature list of pgloader better for improving discoverability of what can be achieved with our nice little tool.
1 parent ec071af commit eab1cbf

File tree: 4 files changed, +231 -34 lines changed


docs/index.rst

Lines changed: 215 additions & 0 deletions
@@ -18,11 +18,226 @@ mode of operations, pgloader handles both the schema and data parts of the
migration, in a single unmanned command, making it possible to implement
**Continuous Migration**.

Features Overview
=================

pgloader has two modes of operation: loading from files and migrating
databases. In both cases, pgloader uses the PostgreSQL COPY protocol, which
implements **streaming** to send data in a very efficient way.

Loading file content in PostgreSQL
----------------------------------

When loading from files, pgloader implements the following features:

Many source formats supported
    Support for a wide variety of file-based formats is included in
    pgloader: the CSV family, fixed column formats, dBase files (``db3``),
    and IBM IXF files.

    The SQLite database engine is covered in the next section: pgloader
    considers SQLite a database source and implements schema discovery
    from SQLite catalogs.

On the fly data transformation
    Often enough the data as read from a CSV file (or another format) needs
    some tweaking and clean-up before being sent to PostgreSQL.

    For instance, in the `geolite
    <https://github.com/dimitri/pgloader/blob/master/test/archive.load>`_
    example we can see that integer values are rewritten as IP address
    ranges, making it possible to target an ``ip4r`` column directly.

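    As a sketch of what such a command looks like, adapted from that
    example (the file name, encoding, and target table are illustrative)::

        LOAD CSV
             FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-8859-1
                  (
                     startIpNum, endIpNum, locId
                  )
             INTO postgresql:///ip4r?geolite.blocks
                  (
                     -- build a single ip4r range from the two integers
                     iprange ip4r using (ip-range startIpNum endIpNum),
                     locId
                  )
             WITH skip header = 2,
                  fields optionally enclosed by '"',
                  fields terminated by ',';
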
Full Field projections
    pgloader supports loading data into fewer fields than are found in the
    file, or more, computing new values from the data read before sending
    it to PostgreSQL.

Reading files from an archive
    The *zip*, *tar*, and *gzip* archive formats are supported by pgloader:
    the archive is extracted into a temporary directory and the expanded
    files are then loaded.

HTTP(S) support
    pgloader knows how to download a source file or a source archive over
    HTTP directly. It might be better to use ``curl http://... | pgloader``
    and read the data from *standard input*, thus allowing the data to be
    streamed from its source down to PostgreSQL.

Target schema discovery
    When loading into an existing table, pgloader takes the existing
    columns into account and may automatically guess the CSV format for
    you.

On error stop / On error resume next
    In some cases the source data is so damaged as to be impossible to
    migrate in full. When loading from a file, pgloader defaults to the
    ``on error resume next`` option, where the rows rejected by PostgreSQL
    are saved away and the migration continues with the other rows.

    In other cases loading only a part of the input data might not be a
    great idea, and in such cases the ``on error stop`` option is
    available.

Pre/Post SQL commands
    This feature allows pgloader commands to include SQL commands to run
    before and after loading a file. It might be about creating a table
    first, then loading the data into it, and then doing more processing
    on top of the data (implementing an ``ELT`` pipeline), or creating
    specific indexes as soon as the data has been made ready.

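    A minimal sketch of that facility (the table and file names are made
    up for the example)::

        LOAD CSV
             FROM 'measures.csv' (city, reading)
             INTO postgresql:///weather?public.measures (city, reading)
             WITH truncate,
                  fields terminated by ','

           BEFORE LOAD DO
             $$ create table if not exists measures
                 (city text, reading bigint); $$

            AFTER LOAD DO
             $$ create index on measures (city); $$;
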
One-command migration to PostgreSQL
-----------------------------------

When migrating a full database in a single command, pgloader implements the
following features:

One-command migration
    The whole migration is started with a single command line and then runs
    unattended. pgloader is meant to be integrated into fully automated
    tooling that you can repeat as many times as needed.

Schema discovery
    The source database is introspected using its SQL catalogs to get the
    list of tables, attributes (with data types, default values, not null
    constraints, etc), primary key constraints, foreign key constraints,
    indexes, comments, etc. This feeds an internal database catalog of all
    the objects to migrate from the source database to the target database.

User defined casting rules
    Some source databases have ideas about their data types that might not
    be compatible with PostgreSQL's implementation of equivalent data
    types.

    For instance, SQLite since version 3 has a `Dynamic Type System
    <https://www.sqlite.org/datatype3.html>`_, which of course isn't
    compatible with the idea of a `Relation
    <https://en.wikipedia.org/wiki/Relation_(database)>`_. And MySQL
    accepts datetime values for year zero, which doesn't exist in our
    calendar, and has no boolean data type.

    When migrating from another source database technology to PostgreSQL,
    data type casting choices must be made. pgloader implements solid
    defaults that you can rely upon, and a facility for **user defined data
    type casting rules** for specific cases. The idea is to allow users to
    specify how the migration should be done, in order for it to be
    repeatable and included in a *Continuous Migration* process.

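    As a sketch, casting rules look like the following in a load command
    (the connection strings are illustrative)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             CAST type datetime to timestamptz
                       drop default drop not null using zero-dates-to-null,
                  type tinyint to boolean using tinyint-to-boolean;
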
On the fly data transformations
    The user defined casting rules come with on the fly rewriting of the
    data. For instance, zero dates (it's not just the year: MySQL accepts
    ``0000-00-00`` as a valid datetime) are rewritten to NULL values by
    default.

Partial Migrations
    It is possible to include only a partial list of the source database
    tables in the migration, or to exclude some of the tables on the source
    database.

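    A sketch of the filtering clauses, using both a regular expression and
    a literal table name (the names are examples)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
             EXCLUDING TABLE NAMES MATCHING ~/_archive$/;
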
Schema only, Data only
    This is the **ORM compatibility** feature of pgloader, where it is
    possible to create the schema using your ORM and then have pgloader
    migrate the data targeting this already created schema.

    When doing this, it is possible for pgloader to *reindex* the target
    schema: before loading the data from the source database into
    PostgreSQL using COPY, pgloader DROPs the indexes and constraints, and
    reinstalls the exact same definitions once the data has been loaded.
    The reason for operating that way is, of course, data load performance.

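    A sketch of a data-only load into an existing schema, assuming the
    ``data only`` and ``drop indexes`` options described above::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             WITH data only, drop indexes;
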
Repeatable (DROP+CREATE)
    By default, pgloader issues DROP statements in the target PostgreSQL
    database before issuing any CREATE statement, so that you can repeat
    the migration as many times as necessary until the migration
    specifications and rules are bug free.

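    In a load command this shows up as the ``include drop`` option, as in
    this sketch::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             WITH include drop, create tables, create indexes;
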
On error stop / On error resume next
    The default behavior of pgloader when migrating from a database is
    ``on error stop``. The idea is to let the user fix either the migration
    specifications or the source data, and run the process again, until it
    works.

    In some cases the source data is so damaged as to be impossible to
    migrate in full, and it might then be necessary to resort to the ``on
    error resume next`` option, where the rows rejected by PostgreSQL are
    saved away and the migration continues with the other rows.

Pre/Post SQL commands, Post-Schema SQL commands
    While pgloader takes care of rewriting the schema to PostgreSQL
    expectations, and even provides *user-defined data type casting rules*
    support to that end, sometimes it is necessary to add some specific SQL
    commands around the migration. This is of course supported by pgloader
    itself, without having to script around it.

Online ALTER schema
    At times migrating to PostgreSQL is also a good opportunity to review
    and fix bad decisions that were made in the past, or simply decisions
    that are not relevant to PostgreSQL.

    The pgloader command syntax makes it possible to ALTER pgloader's
    internal representation of the target catalogs so that the target
    schema can be created slightly differently from the source one. Changes
    supported include targeting a different *schema* or *table* name.

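    A sketch of the clause (the schema and table names are examples)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

            ALTER SCHEMA 'sakila' RENAME TO 'public'
            ALTER TABLE NAMES MATCHING ~/film/ SET SCHEMA 'media';
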
Materialized Views, or schema rewrite on-the-fly
    In some cases the schema rewriting goes deeper than just renaming the
    SQL objects, amounting to a full normalization exercise. That's worth
    doing because PostgreSQL is great at running a normalized schema in
    production under most workloads.

    pgloader implements full flexibility in on-the-fly schema rewriting by
    making it possible to migrate from a view definition. The view
    attribute list becomes a table definition in PostgreSQL, and the data
    is fetched by querying the view on the source system.

    A SQL view allows implementing content filtering both at the column
    level, using the SELECT projection clause, and at the row level, using
    the WHERE restriction clause, as well as backfilling from reference
    tables thanks to JOINs.

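    A sketch of the clause, assuming a view named ``v_orders`` has already
    been created on the source database::

        LOAD DATABASE
             FROM mysql://root@localhost/shop
             INTO postgresql:///shop

             MATERIALIZE VIEWS v_orders;
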
Distribute to Citus
    When migrating from PostgreSQL to Citus, an important part of the
    process consists of adjusting the schema to the distribution key. Read
    `Preparing Tables and Ingesting Data
    <https://docs.citusdata.com/en/v8.0/use_cases/multi_tenant.html>`_ in
    the Citus documentation for a complete example showing how to do that.

    When using pgloader it's possible to specify the distribution keys and
    reference tables and let pgloader take care of adjusting the table,
    index, primary key, and foreign key definitions all by itself.

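    A sketch of the distribution clauses, with table and column names
    taken from the multi-tenant example in the Citus documentation::

        LOAD DATABASE
             FROM postgresql://user@localhost/hammerdb
             INTO postgresql://user@coordinator/hammerdb

             DISTRIBUTE companies USING id
             DISTRIBUTE campaigns USING company_id
             DISTRIBUTE ads USING company_id FROM campaigns;
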
Encoding Overrides
    MySQL doesn't actually enforce that the encoding of the data in the
    database matches the encoding declared in the metadata, defined at the
    database, table, or attribute level. Sometimes it's necessary to
    override the metadata in order to make sense of the text, and pgloader
    makes it easy to do so.

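    A sketch of the override clause (the table name pattern and encoding
    are illustrative)::

        LOAD DATABASE
             FROM mysql://root@localhost/legacy
             INTO postgresql:///legacy

             DECODING TABLE NAMES MATCHING ~/messages/ AS utf8;
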
Continuous Migration
--------------------

pgloader is meant to migrate a whole database in a single command line and
without any manual intervention. The goal is to be able to set up a
*Continuous Integration* environment as described in the `Project
Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.

1. Set up your target PostgreSQL Architecture
2. Fork a Continuous Integration environment that uses PostgreSQL
3. Migrate the data over and over again every night, from production
4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
5. Migrate without surprise and enjoy!

In order to be able to follow this great methodology, you need tooling to
implement the third step in a fully automated way. That's pgloader.

.. toctree::
   :maxdepth: 2
   :caption: Table Of Contents:

   intro
   quickstart
   tutorial/tutorial
   pgloader
   ref/csv

docs/intro.rst

Lines changed: 6 additions & 23 deletions
@@ -35,30 +35,14 @@ expected input properties must be given to pgloader. In the case of a
database, pgloader connects to the live service and knows how to fetch the
metadata it needs directly from it.

-Continuous Migration
---------------------
-
-pgloader is meant to migrate a whole database in a single command line and
-without any manual intervention. The goal is to be able to setup a
-*Continuous Integration* environment as described in the `Project
-Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
-PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.
-
-1. Setup your target PostgreSQL Architecture
-2. Fork a Continuous Integration environment that uses PostgreSQL
-3. Migrate the data over and over again every night, from production
-4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
-5. Migrate without suprise and enjoy!
-
-In order to be able to follow this great methodology, you need tooling to
-implement the third step in a fully automated way. That's pgloader.
-
Features Matrix
---------------

Here's a comparison of the features supported depending on the source
-database engine. Most features that are not supported can be added to
-pgloader, it's just that nobody had the need to do so yet.
+database engine. Some features that are not supported can be added to
+pgloader, it's just that nobody had the need to do so yet. Those features
+are marked with ✗. Empty cells are used when the feature doesn't make sense
+for the selected source database.

========================== ======= ====== ====== =========== =========
Feature                    SQLite  MySQL  MS SQL PostgreSQL  Redshift
@@ -71,14 +55,13 @@ Schema only ✓ ✓ ✓ ✓
Data only                  ✓       ✓      ✓      ✓           ✓
Repeatable (DROP+CREATE)   ✓       ✓      ✓      ✓           ✓
User defined casting rules ✓       ✓      ✓      ✓           ✓
-Encoding Overrides         ✓       ✗      ✗      ✗
+Encoding Overrides
On error stop              ✓       ✓      ✓      ✓           ✓
On error resume next       ✓       ✓      ✓      ✓           ✓
Pre/Post SQL commands      ✓       ✓      ✓      ✓           ✓
Post-Schema SQL commands   ✗       ✓      ✓      ✓           ✓
Primary key support        ✓       ✓      ✓      ✓           ✓
-Foreign key support        ✓       ✓      ✓      ✓           ✗
-Incremental data loading   ✓       ✓      ✓      ✓           ✓
+Foreign key support        ✓       ✓      ✓      ✓
Online ALTER schema        ✓       ✓      ✓      ✓           ✓
Materialized views         ✗       ✓      ✓      ✓           ✓
Distribute to Citus        ✗       ✓      ✓      ✓           ✓

docs/tutorial/quickstart.rst renamed to docs/quickstart.rst

Lines changed: 9 additions & 9 deletions
@@ -1,10 +1,10 @@
-PgLoader Quick Start
---------------------
+Pgloader Quick Start
+====================

In simple cases, pgloader is very easy to use.

CSV
-^^^
+---

Load data from a CSV file into a pre-existing table in your database::

@@ -26,7 +26,7 @@ For documentation about the available syntaxes for the `--field` and
Note also that the PostgreSQL URI includes the target *tablename*.

Reading from STDIN
-^^^^^^^^^^^^^^^^^^
+------------------

File based pgloader sources can be loaded from the standard input, as in the
following example::
@@ -46,7 +46,7 @@ pgloader with this technique, using the Unix pipe::
    gunzip -c source.gz | pgloader --type csv ... - pgsql:///target?foo

Loading from CSV available through HTTP
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+---------------------------------------

The same command as just above can also be run if the CSV file happens to be
found on a remote HTTP location::
@@ -84,7 +84,7 @@ Also notice that the same command will work against an archived version of
the same data.

Streaming CSV data from an HTTP compressed file
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------------------------

Finally, it's important to note that pgloader first fetches the content from
the HTTP URL to a local file, then expands the archive when it's
@@ -110,7 +110,7 @@ and the commands and pgloader will take care of streaming the data down to
PostgreSQL.

Migrating from SQLite
-^^^^^^^^^^^^^^^^^^^^^
+---------------------

The following command will open the SQLite database, discover its table
definitions including indexes and foreign keys, migrate those definitions
@@ -121,7 +121,7 @@ and then migrate the data over::
    pgloader ./test/sqlite/sqlite.db postgresql:///newdb

Migrating from MySQL
-^^^^^^^^^^^^^^^^^^^^
+--------------------

Just create a database to host the MySQL data and definitions and have
pgloader do the migration for you in a single command line::
@@ -130,7 +130,7 @@ pgloader do the migration for you in a single command line::
    pgloader mysql://user@localhost/sakila postgresql:///pagila

Fetching an archived DBF file from a HTTP remote location
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+---------------------------------------------------------

It's possible for pgloader to download a file from HTTP, unarchive it, and
only then open it to discover the schema, then load the data::

docs/tutorial/tutorial.rst

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,6 @@
-PgLoader Tutorial
+Pgloader Tutorial
=================

-.. include:: quickstart.rst
.. include:: csv.rst
.. include:: fixed.rst
.. include:: geolite.rst
