Summary
As a data analyst, I want support for DuckDB so that I can easily move data between DuckDB and other databases and also benefit from DuckDB's fast CSV loading capabilities.
Description
DuckDB is an in-process, column-based analytical database. It is designed for working with large files. The column-based design makes it well suited to analytical queries, e.g. aggregations over whole columns or large joins.
DuckDB works as a standalone application, in a similar way to SQLite. It comes with nice tools for reading CSV files that can do neat things like auto-detect data types. These could be useful in ETL workflows.
https://duckdb.org/docs/data/csv/overview
Adding DuckDB support to ETL Helper would allow users to use etl.copy_rows to pull data from PostgreSQL, Oracle etc. directly into DuckDB for analysis.
Implementation
The DuckDB Python library is compatible with the DB API 2.0 specification that ETL Helper uses.
https://duckdb.org/docs/api/python/dbapi
This should make it easy to add to ETL Helper. It is just a case of adding a DuckDbHelper and the appropriate tests: https://github.com/BritishGeologicalSurvey/etlhelper/blob/main/CONTRIBUTING.md#support-for-more-database-types
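To show why DB API 2.0 compatibility matters here, the sketch below copies rows between two connections using only calls from that specification (`cursor`, `execute`, `fetchmany`, `executemany`, `commit`). It is not ETL Helper's implementation: `copy_rows_sketch` is a hypothetical function written for this example, and the standard-library `sqlite3` module stands in for both databases so the code runs without extra dependencies. The assumption is that a connection from `duckdb.connect()` could take the place of the target, since it offers the same methods.

```python
import sqlite3


def copy_rows_sketch(select_sql, source_conn, insert_sql, target_conn,
                     chunk_size=1000):
    """Copy rows between two DB API 2.0 connections in chunks.

    Hypothetical helper for illustration only; returns the number of
    rows copied.
    """
    source_cursor = source_conn.cursor()
    target_cursor = target_conn.cursor()
    source_cursor.execute(select_sql)

    copied = 0
    while True:
        # fetchmany keeps memory use bounded for large result sets
        rows = source_cursor.fetchmany(chunk_size)
        if not rows:
            break
        target_cursor.executemany(insert_sql, rows)
        copied += len(rows)

    target_conn.commit()
    return copied


# Stand-in source database (in practice PostgreSQL, Oracle etc.)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src (id INTEGER, name TEXT)")
source.executemany("INSERT INTO src VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c")])

# Stand-in target database (assumed replaceable by duckdb.connect())
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dest (id INTEGER, name TEXT)")

copied = copy_rows_sketch(
    "SELECT id, name FROM src ORDER BY id",
    source,
    "INSERT INTO dest VALUES (?, ?)",
    target,
    chunk_size=2,
)
dest_rows = target.execute("SELECT id, name FROM dest ORDER BY id").fetchall()
print(copied, dest_rows)
```

Because the function only touches the standard cursor interface, the database-specific details (connection strings, parameter style, exception classes) are what remain to be handled for a new database type.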
Acceptance criteria
- DbHelper for DuckDB is added
- Integration test suite is added and runs in GitHub Actions
- Documentation updated