Commit eab1cbf

More docs improvements.
Explain the feature list of pgloader better for improving discoverability of what can be achieved with our nice little tool.
1 parent ec071af commit eab1cbf

File tree: 4 files changed, +231 -34 lines changed


docs/index.rst

Lines changed: 215 additions & 0 deletions
@@ -18,11 +18,226 @@ mode of operations, pgloader handles both the schema and data parts of the
migration, in a single unmanned command, making it possible to implement
**Continuous Migration**.

Features Overview
=================

pgloader has two modes of operation: loading from files and migrating
databases. In both cases, pgloader uses the PostgreSQL COPY protocol, which
implements **streaming** to send data in a very efficient way.

Loading file content in PostgreSQL
----------------------------------

When loading from files, pgloader implements the following features:

Many source formats supported
    Support for a wide variety of file-based formats is included in
    pgloader: the CSV family, fixed column formats, dBase files (``db3``),
    and IBM IXF files.

    The SQLite database engine is covered in the next section: pgloader
    considers SQLite a database source and implements schema discovery
    from SQLite catalogs.

On the fly data transformation
    Often enough the data as read from a CSV file (or another format) needs
    some tweaking and clean-up before being sent to PostgreSQL.

    For instance, in the `geolite
    <https://github.com/dimitri/pgloader/blob/master/test/archive.load>`_
    example we can see that integer values are rewritten as IP address
    ranges, making it possible to target an ``ip4r`` column directly.

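    As a sketch of what such a command looks like, adapted from that
    example (the file name, encoding, and target table are illustrative)::

        LOAD CSV
             FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-8859-1
                  (
                     startIpNum, endIpNum, locId
                  )
             INTO postgresql:///ip4r?geolite.blocks
                  (
                     -- build a single ip4r range from the two integers
                     iprange ip4r using (ip-range startIpNum endIpNum),
                     locId
                  )
             WITH skip header = 2,
                  fields optionally enclosed by '"',
                  fields terminated by ',';
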
Full Field projections
    pgloader supports loading data into fewer fields than are found in the
    file, or more, computing new values from the data read before sending
    it to PostgreSQL.

Reading files from an archive
    The *zip*, *tar*, and *gzip* archive formats are supported by pgloader:
    the archive is extracted into a temporary directory and the expanded
    files are then loaded.

HTTP(S) support
    pgloader knows how to download a source file or a source archive over
    HTTP directly. It might be better to use ``curl http://... | pgloader``
    and read the data from *standard input*, thus allowing the data to be
    streamed from its source down to PostgreSQL.

Target schema discovery
    When loading into an existing table, pgloader takes the existing
    columns into account and may automatically guess the CSV format for
    you.

On error stop / On error resume next
    In some cases the source data is so damaged as to be impossible to
    migrate in full. When loading from a file, pgloader defaults to the
    ``on error resume next`` option, where the rows rejected by PostgreSQL
    are saved away and the migration continues with the other rows.

    In other cases loading only a part of the input data might not be a
    great idea, and in such cases the ``on error stop`` option is
    available.

Pre/Post SQL commands
    This feature allows pgloader commands to include SQL commands to run
    before and after loading a file. It might be about creating a table
    first, then loading the data into it, and then doing more processing
    on top of the data (implementing an ``ELT`` pipeline), or creating
    specific indexes as soon as the data has been made ready.

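    A minimal sketch of that facility (the table and file names are made
    up for the example)::

        LOAD CSV
             FROM 'measures.csv' (city, reading)
             INTO postgresql:///weather?public.measures (city, reading)
             WITH truncate,
                  fields terminated by ','

           BEFORE LOAD DO
             $$ create table if not exists measures
                 (city text, reading bigint); $$

            AFTER LOAD DO
             $$ create index on measures (city); $$;
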
One-command migration to PostgreSQL
-----------------------------------

When migrating a full database in a single command, pgloader implements the
following features:

One-command migration
    The whole migration is started with a single command line and then runs
    unattended. pgloader is meant to be integrated into fully automated
    tooling that you can repeat as many times as needed.

Schema discovery
    The source database is introspected using its SQL catalogs to get the
    list of tables, attributes (with data types, default values, not null
    constraints, etc), primary key constraints, foreign key constraints,
    indexes, comments, etc. This feeds an internal database catalog of all
    the objects to migrate from the source database to the target database.

User defined casting rules
    Some source databases have ideas about their data types that might not
    be compatible with PostgreSQL's implementation of equivalent data
    types.

    For instance, SQLite since version 3 has a `Dynamic Type System
    <https://www.sqlite.org/datatype3.html>`_, which of course isn't
    compatible with the idea of a `Relation
    <https://en.wikipedia.org/wiki/Relation_(database)>`_. And MySQL
    accepts datetime values for year zero, which doesn't exist in our
    calendar, and has no boolean data type.

    When migrating from another source database technology to PostgreSQL,
    data type casting choices must be made. pgloader implements solid
    defaults that you can rely upon, and a facility for **user defined data
    type casting rules** for specific cases. The idea is to allow users to
    specify how the migration should be done, in order for it to be
    repeatable and included in a *Continuous Migration* process.

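    As a sketch, casting rules look like the following in a load command
    (the connection strings are illustrative)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             CAST type datetime to timestamptz
                       drop default drop not null using zero-dates-to-null,
                  type tinyint to boolean using tinyint-to-boolean;
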
On the fly data transformations
    The user defined casting rules come with on the fly rewriting of the
    data. For instance, zero dates (it's not just the year: MySQL accepts
    ``0000-00-00`` as a valid datetime) are rewritten to NULL values by
    default.

Partial Migrations
    It is possible to include only a partial list of the source database
    tables in the migration, or to exclude some of the tables on the source
    database.

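    A sketch of the filtering clauses, using both a regular expression and
    a literal table name (the names are examples)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
             EXCLUDING TABLE NAMES MATCHING ~/_archive$/;
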
Schema only, Data only
    This is the **ORM compatibility** feature of pgloader, where it is
    possible to create the schema using your ORM and then have pgloader
    migrate the data targeting this already created schema.

    When doing this, it is possible for pgloader to *reindex* the target
    schema: before loading the data from the source database into
    PostgreSQL using COPY, pgloader DROPs the indexes and constraints, and
    reinstalls the exact same definitions once the data has been loaded.
    The reason for operating that way is, of course, data load performance.

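    A sketch of a data-only load into an existing schema, assuming the
    ``data only`` and ``drop indexes`` options described above::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             WITH data only, drop indexes;
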
Repeatable (DROP+CREATE)
    By default, pgloader issues DROP statements in the target PostgreSQL
    database before issuing any CREATE statement, so that you can repeat
    the migration as many times as necessary until the migration
    specifications and rules are bug free.

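    In a load command this shows up as the ``include drop`` option, as in
    this sketch::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

             WITH include drop, create tables, create indexes;
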
On error stop / On error resume next
    The default behavior of pgloader when migrating from a database is
    ``on error stop``. The idea is to let the user fix either the migration
    specifications or the source data, and run the process again, until it
    works.

    In some cases the source data is so damaged as to be impossible to
    migrate in full, and it might then be necessary to resort to the ``on
    error resume next`` option, where the rows rejected by PostgreSQL are
    saved away and the migration continues with the other rows.

Pre/Post SQL commands, Post-Schema SQL commands
    While pgloader takes care of rewriting the schema to PostgreSQL
    expectations, and even provides *user-defined data type casting rules*
    support to that end, sometimes it is necessary to add some specific SQL
    commands around the migration. This is of course supported by pgloader
    itself, without having to script around it.

Online ALTER schema
    At times migrating to PostgreSQL is also a good opportunity to review
    and fix bad decisions that were made in the past, or simply decisions
    that are not relevant to PostgreSQL.

    The pgloader command syntax makes it possible to ALTER pgloader's
    internal representation of the target catalogs so that the target
    schema can be created slightly differently from the source one. Changes
    supported include targeting a different *schema* or *table* name.

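    A sketch of the clause (the schema and table names are examples)::

        LOAD DATABASE
             FROM mysql://root@localhost/sakila
             INTO postgresql:///sakila

            ALTER SCHEMA 'sakila' RENAME TO 'public'
            ALTER TABLE NAMES MATCHING ~/film/ SET SCHEMA 'media';
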
Materialized Views, or schema rewrite on-the-fly
    In some cases the schema rewriting goes deeper than just renaming the
    SQL objects, amounting to a full normalization exercise. That's worth
    doing because PostgreSQL is great at running a normalized schema in
    production under most workloads.

    pgloader implements full flexibility in on-the-fly schema rewriting by
    making it possible to migrate from a view definition. The view
    attribute list becomes a table definition in PostgreSQL, and the data
    is fetched by querying the view on the source system.

    A SQL view allows implementing content filtering both at the column
    level, using the SELECT projection clause, and at the row level, using
    the WHERE restriction clause, as well as backfilling from reference
    tables thanks to JOINs.

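    A sketch of the clause, assuming a view named ``v_orders`` has already
    been created on the source database::

        LOAD DATABASE
             FROM mysql://root@localhost/shop
             INTO postgresql:///shop

             MATERIALIZE VIEWS v_orders;
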
Distribute to Citus
    When migrating from PostgreSQL to Citus, an important part of the
    process consists of adjusting the schema to the distribution key. Read
    `Preparing Tables and Ingesting Data
    <https://docs.citusdata.com/en/v8.0/use_cases/multi_tenant.html>`_ in
    the Citus documentation for a complete example showing how to do that.

    When using pgloader it's possible to specify the distribution keys and
    reference tables and let pgloader take care of adjusting the table,
    index, primary key, and foreign key definitions all by itself.

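    A sketch of the distribution clauses, with table and column names
    taken from the multi-tenant example in the Citus documentation::

        LOAD DATABASE
             FROM postgresql://user@localhost/hammerdb
             INTO postgresql://user@coordinator/hammerdb

             DISTRIBUTE companies USING id
             DISTRIBUTE campaigns USING company_id
             DISTRIBUTE ads USING company_id FROM campaigns;
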
Encoding Overrides
    MySQL doesn't actually enforce that the encoding of the data in the
    database matches the encoding declared in the metadata, defined at the
    database, table, or attribute level. Sometimes it's necessary to
    override the metadata in order to make sense of the text, and pgloader
    makes it easy to do so.

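    A sketch of the override clause (the table name pattern and encoding
    are illustrative)::

        LOAD DATABASE
             FROM mysql://root@localhost/legacy
             INTO postgresql:///legacy

             DECODING TABLE NAMES MATCHING ~/messages/ AS utf8;
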
Continuous Migration
--------------------

pgloader is meant to migrate a whole database in a single command line and
without any manual intervention. The goal is to be able to set up a
*Continuous Integration* environment as described in the `Project
Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.

1. Set up your target PostgreSQL Architecture
2. Fork a Continuous Integration environment that uses PostgreSQL
3. Migrate the data over and over again every night, from production
4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
5. Migrate without surprise and enjoy!

In order to be able to follow this great methodology, you need tooling to
implement the third step in a fully automated way. That's pgloader.

.. toctree::
   :maxdepth: 2
   :caption: Table Of Contents:

   intro
   quickstart
   tutorial/tutorial
   pgloader
   ref/csv

docs/intro.rst

Lines changed: 6 additions & 23 deletions
@@ -35,30 +35,14 @@ expected input properties must be given to pgloader. In the case of a
database, pgloader connects to the live service and knows how to fetch the
metadata it needs directly from it.

-Continuous Migration
---------------------
-
-pgloader is meant to migrate a whole database in a single command line and
-without any manual intervention. The goal is to be able to setup a
-*Continuous Integration* environment as described in the `Project
-Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
-PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.
-
-1. Setup your target PostgreSQL Architecture
-2. Fork a Continuous Integration environment that uses PostgreSQL
-3. Migrate the data over and over again every night, from production
-4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
-5. Migrate without suprise and enjoy!
-
-In order to be able to follow this great methodology, you need tooling to
-implement the third step in a fully automated way. That's pgloader.
-
Features Matrix
---------------

Here's a comparison of the features supported depending on the source
-database engine. Most features that are not supported can be added to
-pgloader, it's just that nobody had the need to do so yet.
+database engine. Some features that are not supported can be added to
+pgloader, it's just that nobody had the need to do so yet. Those features
+are marked with ✗. Empty cells are used when the feature doesn't make sense
+for the selected source database.

========================== ======= ====== ====== =========== =========
Feature                    SQLite  MySQL  MS SQL PostgreSQL  Redshift
@@ -71,14 +55,13 @@ Schema only ✓ ✓ ✓ ✓
Data only                  ✓       ✓      ✓      ✓           ✓
Repeatable (DROP+CREATE)   ✓       ✓      ✓      ✓           ✓
User defined casting rules ✓       ✓      ✓      ✓           ✓
-Encoding Overrides         ✓       ✗      ✗      ✗
+Encoding Overrides
On error stop              ✓       ✓      ✓      ✓           ✓
On error resume next       ✓       ✓      ✓      ✓           ✓
Pre/Post SQL commands      ✓       ✓      ✓      ✓           ✓
Post-Schema SQL commands   ✗       ✓      ✓      ✓           ✓
Primary key support        ✓       ✓      ✓      ✓           ✓
-Foreign key support        ✓       ✓      ✓      ✓           ✗
-Incremental data loading   ✓       ✓      ✓      ✓           ✓
+Foreign key support        ✓       ✓      ✓      ✓
Online ALTER schema        ✓       ✓      ✓      ✓           ✓
Materialized views         ✗       ✓      ✓      ✓           ✓
Distribute to Citus        ✗       ✓      ✓      ✓           ✓

docs/tutorial/quickstart.rst renamed to docs/quickstart.rst

Lines changed: 9 additions & 9 deletions
@@ -1,10 +1,10 @@
-PgLoader Quick Start
---------------------
+Pgloader Quick Start
+====================

In simple cases, pgloader is very easy to use.

CSV
-^^^
+---

Load data from a CSV file into a pre-existing table in your database::

@@ -26,7 +26,7 @@ For documentation about the available syntaxes for the `--field` and
Note also that the PostgreSQL URI includes the target *tablename*.

Reading from STDIN
-^^^^^^^^^^^^^^^^^^
+------------------

File based pgloader sources can be loaded from the standard input, as in the
following example::
@@ -46,7 +46,7 @@ pgloader with this technique, using the Unix pipe::
    gunzip -c source.gz | pgloader --type csv ... - pgsql:///target?foo

Loading from CSV available through HTTP
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+---------------------------------------

The same command as just above can also be run if the CSV file happens to be
found on a remote HTTP location::
@@ -84,7 +84,7 @@ Also notice that the same command will work against an archived version of
the same data.

Streaming CSV data from an HTTP compressed file
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------------------------

Finally, it's important to note that pgloader first fetches the content from
the HTTP URL to a local file, then expands the archive when it's
@@ -110,7 +110,7 @@ and the commands and pgloader will take care of streaming the data down to
PostgreSQL.

Migrating from SQLite
-^^^^^^^^^^^^^^^^^^^^^
+---------------------

The following command will open the SQLite database, discover its table
definitions including indexes and foreign keys, migrate those definitions
@@ -121,7 +121,7 @@ and then migrate the data over::
    pgloader ./test/sqlite/sqlite.db postgresql:///newdb

Migrating from MySQL
-^^^^^^^^^^^^^^^^^^^^
+--------------------

Just create a database to host the MySQL data and definitions and have
pgloader do the migration for you in a single command line::
@@ -130,7 +130,7 @@ pgloader do the migration for you in a single command line::
    pgloader mysql://user@localhost/sakila postgresql:///pagila

Fetching an archived DBF file from a HTTP remote location
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+---------------------------------------------------------

It's possible for pgloader to download a file from HTTP, unarchive it, and
only then open it to discover the schema, then load the data::

docs/tutorial/tutorial.rst

Lines changed: 1 addition & 2 deletions
@@ -1,7 +1,6 @@
-PgLoader Tutorial
+Pgloader Tutorial
=================

-.. include:: quickstart.rst
.. include:: csv.rst
.. include:: fixed.rst
.. include:: geolite.rst
