This repo is a demonstration of using DuckDB and dbt to build and analyze data.

A fictional dataset about local agriculture called Farm To Market was created for the purposes of this repo.

The code is organized into four stages to show a progression from simple queries to a data pipeline:
- Stage 0: confirm everything is set up correctly
- Stage 1: load and explore data
- Stage 2: use a data pipeline to build Farm to Market
- Stage 3: use dbt to build Farm to Market
Farm To Market captures where local food is sold and might be grown in New York City.
This dataset combines the locations of farmers markets and potential farms (community gardens) to highlight availability and potential local suppliers of healthy produce in NYC.
The following diagram shows the lineage of sources and models in the Stage 3 dbt project:
```mermaid
graph LR
    source_NYC_Farmers_Markets["source.NYC_Farmers_Markets"]
    source_GreenThumb_Block_Lot["source.GreenThumb_Block-Lot"]
    source_GreenThumb_Garden_Info["source.GreenThumb_Garden_Info"]
    source_Borough_Boundaries["source.Borough_Boundaries"]
    stg_boroughs["stg__boroughs"]
    stg_garden_info["stg__garden_info"]
    stg_markets["stg__markets"]
    stg_garden_block_lot["stg__garden_block_lot"]
    int_farms["int__farms"]
    farms["farms"]
    farm_to_market["farm_to_market"]
    markets["markets"]

    %% Source successors
    source_NYC_Farmers_Markets --> stg_markets
    source_GreenThumb_Block_Lot --> stg_garden_block_lot
    source_GreenThumb_Garden_Info --> stg_garden_info
    source_Borough_Boundaries --> stg_boroughs

    %% Staging model successors
    stg_boroughs --> int_farms
    stg_garden_info --> int_farms
    stg_garden_block_lot --> int_farms
    stg_markets --> farm_to_market
    stg_markets --> markets

    %% Intermediate model successors
    int_farms --> farms
    int_farms --> farm_to_market
```
The dataset's final models:

| Model | Description |
| --- | --- |
| `markets` | Each row is a farmers market |
| `farms` | Each row is a potential farm |
| `farm_to_market` | Each row is a market and farm pair |
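Once the dataset is built (in Stage 2 or Stage 3), these models can be queried directly with DuckDB. A minimal sketch, where both the database path and the column names are illustrative assumptions rather than the repo's actual names:

```python
import duckdb

# Both the database path and the column names below are assumptions
# for illustration; adjust them to match the built dataset.
con = duckdb.connect("data/farm_to_market.duckdb", read_only=True)

# Count how many potential farms are paired with each market.
con.sql("""
    SELECT market_name, count(*) AS potential_farms
    FROM farm_to_market
    GROUP BY market_name
    ORDER BY potential_farms DESC
    LIMIT 5
""").show()
```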
Source data:
- NYC Borough Boundaries (GeoJSON source)
- NYC Farmers Markets (CSV source)
- GreenThumb Garden Info (CSV source)
- GreenThumb Block-Lot (CSV source)
> [!NOTE]
> All example commands are written for Bash on Unix-based operating systems. Run `echo $SHELL` to determine your shell and, when necessary, find the relevant docs to run the correct commands for your shell/operating system.
### Required

- Python 3.12 for running python code (any version >=3.9 works)
- git for cloning this repo and installing packages
- A `bash` terminal
### Optional

- VS Code for an integrated development environment (IDE) and the Python extension
- DBeaver for querying a database
- Clone this repo and navigate to the new folder:

  ```bash
  git clone https://github.com/damonmcc/duckdb-dbt.git
  cd duckdb-dbt
  ```
- Create a python virtual environment named `.venv`, either using the commands below or using the VS Code command `Python: Create Environment`:

  ```bash
  which python
  python --version
  python -m venv .venv
  ```
- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```
- Install packages and confirm setup:

  ```bash
  python -m pip install --requirement setup/requirements.txt
  pip list
  ```
- Run a python script to confirm everything is set up:

  ```bash
  python -m stage_0.sanity_check
  ```
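What a sanity check verifies is simple to sketch. The following is a minimal example of the kind of checks such a script might run, assuming only the packages installed above; it is not the repo's actual script:

```python
import sys

import duckdb

# Confirm the Python version meets the stated minimum (>=3.9).
assert sys.version_info >= (3, 9), "Python 3.9 or newer is required"

# Confirm DuckDB is importable and can run a trivial in-memory query.
assert duckdb.sql("SELECT 1 + 1").fetchone() == (2,)

print(f"Setup looks good: Python {sys.version.split()[0]}, DuckDB {duckdb.__version__}")
```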
## Load and explore data from various sources

These datasets were chosen to show some of the ways source data can be imported with DuckDB (a sketch of such imports follows the list):
- PLUTO from the NYC Department of City Planning (source)
- NYC Airbnb data (source)
- Trip record data from the NYC Taxi and Limousine Commission (TLC) (source)
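A minimal sketch of those imports, where the database and file paths are assumptions for illustration and `stage_1.load` is the authoritative version:

```python
import duckdb

con = duckdb.connect("data/stage_1.duckdb")

# CSV: DuckDB infers column names and types directly from the file.
con.sql("""
    CREATE OR REPLACE TABLE pluto AS
    SELECT * FROM read_csv('data/source_data/pluto.csv')
""")

# Parquet: TLC trip records are published as parquet files
# (this filename is illustrative).
con.sql("""
    CREATE OR REPLACE TABLE taxi_trips AS
    SELECT * FROM read_parquet('data/source_data/yellow_tripdata_2024-01.parquet')
""")
```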
- Download PLUTO from NYC Open Data by navigating to `Export` -> `Download file` -> `Export format: CSV`

  > [!TIP]
  > If the NYC Open Data download takes too long, try downloading it from the NYC Department of City Planning.
- Rename the downloaded csv file to `pluto.csv` and move it to `data/source_data/`
- Run a python script to download the other two sources and load all three sources into a database:

  ```bash
  python -m stage_1.load
  ```
- Use the Jupyter notebook `stage_1/explore.ipynb` or DBeaver to explore the data
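In the notebook, exploration can start with DuckDB's built-in introspection. A minimal sketch, assuming the same illustrative database path as above:

```python
import duckdb

con = duckdb.connect("data/stage_1.duckdb", read_only=True)

# List the tables created by the load script.
con.sql("SHOW TABLES").show()

# SUMMARIZE computes per-column statistics (min, max, null counts, etc.).
con.sql("SUMMARIZE pluto").show()
```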
## Use a data pipeline to build Farm To Market
- Download all Farm To Market source data from their Open Data pages by navigating to `Export` -> `Download file`. Depending on the dataset, either download a CSV or a GeoJSON file.
- Rename the downloaded files to remove the dates and move them to `data/source_data/`
- Run a python script to load all source data into a database:

  ```bash
  python -m stage_2.load
  ```
- Run python scripts to transform and export data (a sketch of this pattern follows the list):

  ```bash
  python -m stage_2.transform
  python -m stage_2.export
  ```
- Use the Jupyter notebook `stage_2/analyze.ipynb` to review and analyze the dataset
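The transform and export steps follow a common DuckDB pattern: SQL builds tables, and `COPY` writes them out. A minimal sketch of that pattern, with table, column, and path names that are assumptions rather than the repo's actual code:

```python
import duckdb

con = duckdb.connect("data/farm_to_market.duckdb")

# Transform: derive a cleaned table from a raw source table.
con.sql("""
    CREATE OR REPLACE TABLE markets AS
    SELECT marketname AS market_name, borough, latitude, longitude
    FROM nyc_farmers_markets
""")

# Export: COPY writes query results straight to a file.
con.sql("COPY markets TO 'data/product/markets.csv' (HEADER, DELIMITER ',')")
```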
## Use dbt to build Farm to Market
- Download and rename source data as described in Stage 2
- Navigate to the `stage_3` folder:

  ```bash
  cd stage_3
  ```
- Install dbt packages and confirm setup:

  ```bash
  dbt deps
  dbt debug
  ```
- Test source data:

  ```bash
  dbt test --select "source:*"
  ```
- Build the dataset:

  ```bash
  dbt build
  ```
- Generate and view data documentation:

  ```bash
  dbt docs generate
  dbt docs serve
  ```
- Inspect and use the data by using the notebooks in `stage_3/analysis/`
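Because the dbt-duckdb adapter materializes models in a DuckDB database file, the built models can be queried like any other table. A minimal sketch, assuming an illustrative database path (the real one is set in the stage_3 dbt profile):

```python
import duckdb

# The database path is configured in stage_3's dbt profile; this one is an assumption.
con = duckdb.connect("stage_3/farm_to_market.duckdb", read_only=True)

# Every model in the lineage diagram is queryable after `dbt build`.
con.sql("SHOW TABLES").show()

# Pull a model into a pandas DataFrame for analysis in a notebook.
markets_df = con.sql("SELECT * FROM markets").df()
print(markets_df.head())
```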
- Format python files with black:

  ```bash
  black directory_or_file_path
  ```

- Format SQL files with sqlfluff:

  ```bash
  sqlfluff fix directory_or_file_path
  ```

- Add new python packages to `requirements.in` and recompile with:

  ```bash
  ./setup/compile_python_packages.sh
  ```