@@ -18,11 +18,226 @@ mode of operations, pgloader handles both the schema and data parts of the
migration, in a single unmanned command, making it possible to implement
**Continuous Migration**.

+ Features Overview
+ =================
+
+ pgloader has two modes of operation: loading data from files, and
+ migrating whole databases. In both cases, pgloader uses the PostgreSQL
+ COPY protocol, which implements **streaming** to send data in a very
+ efficient way.
+
+ Loading file content in PostgreSQL
+ ----------------------------------
+
+ When loading from files, pgloader implements the following features:
+
+ Many source formats supported
+   Support for a wide variety of file-based formats is included in
+   pgloader: the CSV family, fixed column formats, dBase files (``db3``),
+   and IBM IXF files.
+
+   The SQLite database engine is accounted for in the next section:
+   pgloader considers SQLite as a database source and implements schema
+   discovery from SQLite catalogs.
+
+ On the fly data transformation
+   Often enough the data as read from a CSV file (or another format) needs
+   some tweaking and clean-up before being sent to PostgreSQL.
+
+   For instance in the `geolite
+   <https://github.com/dimitri/pgloader/blob/master/test/archive.load>`_
+   example we can see that integer values are rewritten as IP address
+   ranges, making it possible to target an ``ip4r`` column directly, as in
+   the sketch below.
+
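+   A minimal sketch of such a transformation, borrowing from the geolite
+   example (the file name, target connection string, and table name are
+   illustrative)::
+
+       LOAD CSV
+            FROM 'GeoLiteCity-Blocks.csv'
+                 (
+                    startIpNum, endIpNum, locId
+                 )
+            INTO postgresql:///geolite?blocks
+                 (
+                    iprange ip4r using (ip-range startIpNum endIpNum),
+                    locId
+                 )
+            WITH skip header = 2,
+                 fields terminated by ',';
+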
+ Full Field projections
+   pgloader supports loading data into fewer fields than are found in the
+   file, or more, doing some computation on the data read before sending
+   it to PostgreSQL.
+
+ Reading files from an archive
+   Archive formats *zip*, *tar*, and *gzip* are supported by pgloader: the
+   archive is extracted in a temporary directory and the expanded files
+   are then loaded, as sketched below.
+
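+   A compact sketch of loading from an archive (the URL, file name, and
+   target are illustrative)::
+
+       LOAD ARCHIVE
+            FROM http://example.com/data.zip
+            INTO postgresql:///target
+
+            LOAD CSV
+                 FROM FILENAME MATCHING ~/data.csv/
+                      (id, value)
+                 INTO postgresql:///target?tablename=data
+                 WITH skip header = 1,
+                      fields terminated by ',';
+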
+ HTTP(S) support
+   pgloader knows how to download a source file or a source archive using
+   HTTP directly. It might be better to use ``curl -O- http://... |
+   pgloader`` and read the data from *standard input*, thus allowing
+   streaming of the data from its source down to PostgreSQL.
+
+ Target schema discovery
+   When loading into an existing table, pgloader takes into account the
+   existing columns and may automatically guess the CSV format for you.
+
+ On error stop / On error resume next
+   In some cases the source data is so damaged as to be impossible to
+   migrate in full; when loading from a file, the default for pgloader is
+   the ``on error resume next`` option, where the rows rejected by
+   PostgreSQL are saved away and the migration continues with the other
+   rows.
+
+   In other cases loading only a part of the input data might not be a
+   great idea, and in such cases it's possible to use the ``on error
+   stop`` option.
+
+ Pre/Post SQL commands
+   This feature allows pgloader commands to include SQL commands to run
+   before and after loading a file: creating a table first, then loading
+   the data into it, and then doing more processing on top of the data
+   (implementing an ``ELT`` pipeline), or creating specific indexes as
+   soon as the data has been made ready. See the sketch below.
+
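+   A minimal sketch of the clauses involved (the file, table, and column
+   names are illustrative)::
+
+       LOAD CSV
+            FROM 'events.csv' (id, payload)
+            INTO postgresql:///analytics?tablename=events
+            WITH skip header = 1,
+                 fields terminated by ','
+
+       BEFORE LOAD DO
+       $$ create table if not exists events (id bigint, payload text); $$
+
+       AFTER LOAD DO
+       $$ create index on events (id); $$;
+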
+ One-command migration to PostgreSQL
+ -----------------------------------
+
+ When migrating a full database in a single command, pgloader implements the
+ following features:
+
+ One-command migration
+   The whole migration is started with a single command line and then runs
+   unattended. pgloader is meant to be integrated into fully automated
+   tooling that you can repeat as many times as needed, as in the example
+   below.
+
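+   For instance, migrating a MySQL database can be as simple as the
+   following command line (the connection strings are illustrative)::
+
+       $ pgloader mysql://user@localhost/sakila postgresql:///sakila
+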
+ Schema discovery
+   The source database is introspected using its SQL catalogs to get the
+   list of tables, attributes (with data types, default values, not null
+   constraints, etc), primary key constraints, foreign key constraints,
+   indexes, comments, etc. This feeds an internal database catalog of all
+   the objects to migrate from the source database to the target database.
+
+ User defined casting rules
+   Some source databases have ideas about their data types that might not
+   be compatible with PostgreSQL's implementation of equivalent data
+   types.
+
+   For instance, SQLite since version 3 has a `Dynamic Type System
+   <https://www.sqlite.org/datatype3.html>`_ which of course isn't
+   compatible with the idea of a `Relation
+   <https://en.wikipedia.org/wiki/Relation_(database)>`_. Or MySQL accepts
+   datetime values for year zero, which doesn't exist in our calendar, and
+   doesn't have a boolean data type.
+
+   When migrating from another source database technology to PostgreSQL,
+   data type casting choices must be made. pgloader implements solid
+   defaults that you can rely upon, and a facility for **user defined data
+   type casting rules** for specific cases. The idea is to allow users to
+   specify how the migration should be done, in order for it to be
+   repeatable and included in a *Continuous Migration* process, as in the
+   sketch below.
+
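+   A sketch of two casting rules (which rules you need depends entirely on
+   the source schema)::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/sakila
+            INTO postgresql:///sakila
+
+       CAST type datetime to timestamptz
+                 drop default drop not null using zero-dates-to-null,
+
+            type tinyint to boolean using tinyint-to-boolean;
+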
+ On the fly data transformations
+   The user defined casting rules come with on the fly rewriting of the
+   data. For instance zero dates (it's not just the year, MySQL accepts
+   ``0000-00-00`` as a valid datetime) are rewritten to NULL values by
+   default.
+
+ Partial Migrations
+   It is possible to include only a partial list of the source database
+   tables in the migration, or to exclude some of the tables on the source
+   database, as sketched below.
+
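+   A sketch of the matching clauses (the table names and patterns are
+   illustrative)::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/sakila
+            INTO postgresql:///sakila
+
+       INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
+       EXCLUDING TABLE NAMES MATCHING ~/_archive$/;
+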
+ Schema only, Data only
+   This is the **ORM compatibility** feature of pgloader, where it is
+   possible to create the schema using your ORM and then have pgloader
+   migrate the data targeting this already created schema, as sketched
+   below.
+
+   When doing this, it is possible for pgloader to *reindex* the target
+   schema: before loading the data from the source database into
+   PostgreSQL using COPY, pgloader DROPs the indexes and constraints, and
+   reinstalls the exact same definitions of them once the data has been
+   loaded.
+
+   The reason for operating that way is of course data load performance.
+
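+   A sketch of a data-only run, assuming the ORM has already created the
+   target schema::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/appdb
+            INTO postgresql:///appdb
+
+       WITH data only, truncate;
+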
+ Repeatable (DROP+CREATE)
+   By default, pgloader issues DROP statements in the target PostgreSQL
+   database before issuing any CREATE statement, so that you can repeat
+   the migration as many times as necessary until migration specifications
+   and rules are bug free.
+
+ On error stop / On error resume next
+   The default behavior of pgloader when migrating from a database is ``on
+   error stop``. The idea is to let the user fix either the migration
+   specifications or the source data, and run the process again, until it
+   works.
+
+   In some cases the source data is so damaged as to be impossible to
+   migrate in full, and it might then be necessary to resort to the ``on
+   error resume next`` option, where the rows rejected by PostgreSQL are
+   saved away and the migration continues with the other rows.
+
+ Pre/Post SQL commands, Post-Schema SQL commands
+   While pgloader takes care of rewriting the schema to PostgreSQL
+   expectations, and even provides *user-defined data type casting rules*
+   support to that end, sometimes it is necessary to add some specific SQL
+   commands around the migration. It's of course supported right from
+   pgloader itself, without having to script around it.
+
+ Online ALTER schema
+   At times migrating to PostgreSQL is also a good opportunity to review
+   and fix bad decisions that were made in the past, or simply decisions
+   that are not relevant to PostgreSQL.
+
+   The pgloader command syntax allows altering pgloader's internal
+   representation of the target catalogs so that the target schema can be
+   created a little differently from the source one. Supported changes
+   include targeting a different *schema* or *table* name, as sketched
+   below.
+
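+   A sketch of the clause (the schema and name patterns are
+   illustrative)::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/appdb
+            INTO postgresql:///appdb
+
+       ALTER SCHEMA 'appdb' RENAME TO 'public'
+       ALTER TABLE NAMES MATCHING ~/^tbl/ SET SCHEMA 'legacy';
+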
+ Materialized Views, or schema rewrite on-the-fly
+   In some cases the schema rewriting goes deeper than just renaming the
+   SQL objects, becoming a full normalization exercise. This is worth
+   doing, because PostgreSQL is great at running a normalized schema in
+   production under most workloads.
+
+   pgloader implements full flexibility in on-the-fly schema rewriting by
+   making it possible to migrate from a view definition. The view
+   attribute list becomes a table definition in PostgreSQL, and the data
+   is fetched by querying the view on the source system, as sketched
+   below.
+
+   A SQL view makes it possible to implement content filtering both at the
+   column level, using the SELECT projection clause, and at the row level,
+   using the WHERE restriction clause, as well as backfilling from
+   reference tables thanks to JOINs.
+
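+   A sketch of the clause (the view name and its query are
+   illustrative)::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/appdb
+            INTO postgresql:///appdb
+
+       MATERIALIZE VIEWS mv_users AS
+       $$
+       select u.id, u.name, c.country
+         from users u join countries c on c.id = u.country_id
+       $$;
+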
+ Distribute to Citus
+   When migrating from PostgreSQL to Citus, an important part of the
+   process consists of adjusting the schema to the distribution key. Read
+   `Preparing Tables and Ingesting Data
+   <https://docs.citusdata.com/en/v8.0/use_cases/multi_tenant.html>`_ in
+   the Citus documentation for a complete example showing how to do that.
+
+   When using pgloader it's possible to specify the distribution keys and
+   reference tables and let pgloader take care of adjusting the table,
+   index, primary key, and foreign key definitions all by itself, as
+   sketched below.
+
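+   A sketch of the clause, borrowing the table names from the Citus
+   multi-tenant tutorial::
+
+       LOAD DATABASE
+            FROM postgresql://user@localhost/appdb
+            INTO postgresql://user@coordinator/appdb
+
+       DISTRIBUTE companies USING id
+       DISTRIBUTE campaigns USING company_id;
+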
+ Encoding Overrides
+   MySQL doesn't actually enforce the encoding of the data in the database
+   to match the encoding declared in the metadata, which is defined at the
+   database, table, or attribute level. Sometimes it's necessary to
+   override the metadata in order to make sense of the text, and pgloader
+   makes it easy to do so, as sketched below.
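+
+   A sketch of the clause (the name patterns are illustrative)::
+
+       LOAD DATABASE
+            FROM mysql://user@localhost/appdb
+            INTO postgresql:///appdb
+
+       DECODING TABLE NAMES MATCHING ~/messed/, ~/encoding/ AS utf8;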
+
+
+ Continuous Migration
+ --------------------
+
+ pgloader is meant to migrate a whole database in a single command line and
+ without any manual intervention. The goal is to be able to set up a
+ *Continuous Integration* environment as described in the `Project
+ Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
+ PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.
+
+ 1. Set up your target PostgreSQL architecture
+ 2. Fork a Continuous Integration environment that uses PostgreSQL
+ 3. Migrate the data over and over again every night, from production
+ 4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
+ 5. Migrate without surprise and enjoy!
+
+ In order to be able to follow this great methodology, you need tooling to
+ implement the third step in a fully automated way. That's pgloader.
+
.. toctree::
   :maxdepth: 2
   :caption: Table Of Contents:

   intro
+   quickstart
   tutorial/tutorial
   pgloader
   ref/csv