This repo is a demonstration of using DuckDB and dbt to build and analyze data.

A fictional dataset about local agriculture called Farm To Market was created for the purposes of this repo.

The code is organized into four stages to show a progression from simple queries to a data pipeline:
- Stage 0: confirm everything is set up correctly
- Stage 1: load and explore data
- Stage 2: use a data pipeline to build Farm to Market
- Stage 3: use dbt to build Farm to Market
Farm To Market captures where local food is sold and might be grown in New York City.
This dataset combines the locations of farmers markets and potential farms (community gardens) to highlight availability and potential local suppliers of healthy produce in NYC.
The following diagram shows the lineage of sources and models in the Stage 3 dbt project:
```mermaid
graph LR
    source_NYC_Farmers_Markets["source.NYC_Farmers_Markets"]
    source_GreenThumb_Block_Lot["source.GreenThumb_Block-Lot"]
    source_GreenThumb_Garden_Info["source.GreenThumb_Garden_Info"]
    source_Borough_Boundaries["source.Borough_Boundaries"]
    stg_boroughs["stg__boroughs"]
    stg_garden_info["stg__garden_info"]
    stg_markets["stg__markets"]
    stg_garden_block_lot["stg__garden_block_lot"]
    int_farms["int__farms"]
    farms["farms"]
    farm_to_market["farm_to_market"]
    markets["markets"]

    %% Source successors
    source_NYC_Farmers_Markets --> stg_markets
    source_GreenThumb_Block_Lot --> stg_garden_block_lot
    source_GreenThumb_Garden_Info --> stg_garden_info
    source_Borough_Boundaries --> stg_boroughs

    %% Staging model successors
    stg_boroughs --> int_farms
    stg_garden_info --> int_farms
    stg_garden_block_lot --> int_farms
    stg_markets --> farm_to_market
    stg_markets --> markets

    %% Intermediate model successors
    int_farms --> farms
    int_farms --> farm_to_market
```
The dataset's final models:

| Model | Description |
| --- | --- |
| `markets` | Each row is a farmers market |
| `farms` | Each row is a potential farm |
| `farm_to_market` | Each row is a market and farm pair |
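Once the dataset is built (in Stage 2 or Stage 3), these models can be queried directly with DuckDB. A minimal sketch, where both the database path and the column names are illustrative assumptions rather than the repo's actual names:

```python
import duckdb

# Both the database path and the column names below are assumptions
# for illustration; adjust them to match the built dataset.
con = duckdb.connect("data/farm_to_market.duckdb", read_only=True)

# Count how many potential farms are paired with each market.
con.sql("""
    SELECT market_name, count(*) AS potential_farms
    FROM farm_to_market
    GROUP BY market_name
    ORDER BY potential_farms DESC
    LIMIT 5
""").show()
```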
Source data:
- NYC Borough Boundaries (GeoJSON source)
- NYC Farmers Markets (CSV source)
- GreenThumb Garden Info (CSV source)
- GreenThumb Block-Lot (CSV source)
> [!NOTE]
> All example commands are written for Bash on Unix-based operating systems. Run `echo $SHELL` to determine your shell and, when necessary, find the relevant docs to run the correct commands for your shell/operating system.
### Required

- Python 3.12 for running python code (any version >=3.9 works)
- git for cloning this repo and installing packages
- A `bash` terminal
### Optional

- VS Code for an integrated development environment (IDE) and the Python extension
- DBeaver for querying a database
- Clone this repo and navigate to the new folder:

  ```bash
  git clone https://github.com/damonmcc/duckdb-dbt.git
  cd duckdb-dbt
  ```
- Create a python virtual environment named `.venv`, either using the commands below or using the VS Code command `Python: Create Environment`:

  ```bash
  which python
  python --version
  python -m venv .venv
  ```
- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```
- Install packages and confirm setup:

  ```bash
  python -m pip install --requirement setup/requirements.txt
  pip list
  ```
- Run a python script to confirm everything is set up:

  ```bash
  python -m stage_0.sanity_check
  ```
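What a sanity check verifies is simple to sketch. The following is a minimal example of the kind of checks such a script might run, assuming only the packages installed above; it is not the repo's actual script:

```python
import sys

import duckdb

# Confirm the Python version meets the stated minimum (>=3.9).
assert sys.version_info >= (3, 9), "Python 3.9 or newer is required"

# Confirm DuckDB is importable and can run a trivial in-memory query.
assert duckdb.sql("SELECT 1 + 1").fetchone() == (2,)

print(f"Setup looks good: Python {sys.version.split()[0]}, DuckDB {duckdb.__version__}")
```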
## Load and explore data from various sources

These datasets were chosen to show some of the ways source data can be imported with DuckDB (a sketch of such imports follows the list):
- PLUTO from the NYC Department of City Planning (source)
- NYC Airbnb data (source)
- Trip record data from the NYC Taxi and Limousine Commission (TLC) (source)
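A minimal sketch of those imports, where the database and file paths are assumptions for illustration and `stage_1.load` is the authoritative version:

```python
import duckdb

con = duckdb.connect("data/stage_1.duckdb")

# CSV: DuckDB infers column names and types directly from the file.
con.sql("""
    CREATE OR REPLACE TABLE pluto AS
    SELECT * FROM read_csv('data/source_data/pluto.csv')
""")

# Parquet: TLC trip records are published as parquet files
# (this filename is illustrative).
con.sql("""
    CREATE OR REPLACE TABLE taxi_trips AS
    SELECT * FROM read_parquet('data/source_data/yellow_tripdata_2024-01.parquet')
""")
```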
- Download PLUTO from NYC Open Data by navigating to `Export` -> `Download file` -> `Export format: CSV`

  > [!TIP]
  > If the NYC Open Data download takes too long, try downloading it from the NYC Department of City Planning.
- Rename the downloaded csv file to `pluto.csv` and move it to `data/source_data/`
- Run a python script to download the other two sources and load all three sources into a database:

  ```bash
  python -m stage_1.load
  ```
- Use the Jupyter notebook `stage_1/explore.ipynb` or DBeaver to explore the data
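In the notebook, exploration can start with DuckDB's built-in introspection. A minimal sketch, assuming the same illustrative database path as above:

```python
import duckdb

con = duckdb.connect("data/stage_1.duckdb", read_only=True)

# List the tables created by the load script.
con.sql("SHOW TABLES").show()

# SUMMARIZE computes per-column statistics (min, max, null counts, etc.).
con.sql("SUMMARIZE pluto").show()
```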
## Use a data pipeline to build Farm To Market
- Download all Farm To Market source data from their Open Data pages by navigating to `Export` -> `Download file`. Depending on the dataset, either download a CSV or a GeoJSON file.
- Rename the downloaded files to remove the dates and move them to `data/source_data/`
- Run a python script to load all source data into a database:

  ```bash
  python -m stage_2.load
  ```
- Run python scripts to transform and export data (a sketch of this pattern follows the list):

  ```bash
  python -m stage_2.transform
  python -m stage_2.export
  ```
- Use the Jupyter notebook `stage_2/analyze.ipynb` to review and analyze the dataset
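The transform and export steps follow a common DuckDB pattern: SQL builds tables, and `COPY` writes them out. A minimal sketch of that pattern, with table, column, and path names that are assumptions rather than the repo's actual code:

```python
import duckdb

con = duckdb.connect("data/farm_to_market.duckdb")

# Transform: derive a cleaned table from a raw source table.
con.sql("""
    CREATE OR REPLACE TABLE markets AS
    SELECT marketname AS market_name, borough, latitude, longitude
    FROM nyc_farmers_markets
""")

# Export: COPY writes query results straight to a file.
con.sql("COPY markets TO 'data/product/markets.csv' (HEADER, DELIMITER ',')")
```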
## Use dbt to build Farm to Market
- Download and rename source data as described in Stage 2
- Navigate to the `stage_3` folder:

  ```bash
  cd stage_3
  ```
- Install dbt packages and confirm setup:

  ```bash
  dbt deps
  dbt debug
  ```
- Test source data:

  ```bash
  dbt test --select "source:*"
  ```
- Build the dataset:

  ```bash
  dbt build
  ```
- Generate and view data documentation:

  ```bash
  dbt docs generate
  dbt docs serve
  ```
- Inspect and use the data by using the notebooks in `stage_3/analysis/`
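Because the dbt-duckdb adapter materializes models in a DuckDB database file, the built models can be queried like any other table. A minimal sketch, assuming an illustrative database path (the real one is set in the stage_3 dbt profile):

```python
import duckdb

# The database path is configured in stage_3's dbt profile; this one is an assumption.
con = duckdb.connect("stage_3/farm_to_market.duckdb", read_only=True)

# Every model in the lineage diagram is queryable after `dbt build`.
con.sql("SHOW TABLES").show()

# Pull a model into a pandas DataFrame for analysis in a notebook.
markets_df = con.sql("SELECT * FROM markets").df()
print(markets_df.head())
```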
- Format python files with black:

  ```bash
  black directory_or_file_path
  ```

- Format SQL files with sqlfluff:

  ```bash
  sqlfluff fix directory_or_file_path
  ```

- Add new python packages to `requirements.in` and recompile with:

  ```bash
  ./setup/compile_python_packages.sh
  ```