Generate billions of records of realistic synthetic coffee shop sales data for big data and performance testing. This repository includes implementations for both Databricks/Spark and DuckLake.
Features:
- Scalable: Generate from thousands to billions of records
- Realistic patterns: 2D time-trend modeling with seasonal variations (sketched below)
- Rich dimensions: 1000 unique store locations across multiple regions
- Product seasonality: Summer vs. winter product preferences
- Discount patterns: Realistic promotional discounting
- Multiple implementations: Choose between Databricks and DuckLake
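The time-trend and seasonality features boil down to weighting order volume and product choice by date. A minimal illustrative sketch follows; the product names and coefficients here are hypothetical, not values taken from the generators:

```python
import math
import random

# Hypothetical product pools; the real generator defines 13 products.
HOT_DRINKS = ["Latte", "Cappuccino", "Hot Chocolate"]
COLD_DRINKS = ["Iced Latte", "Cold Brew", "Frappe"]

def daily_order_weight(day_index: int, days_per_year: int = 365) -> float:
    """Relative order volume for a given day: long-term trend x seasonality."""
    trend = 1.0 + 0.0005 * day_index                 # slow multi-year growth
    season = 1.0 + 0.3 * math.sin(2 * math.pi * day_index / days_per_year)
    return trend * season

def pick_product(day_index: int) -> str:
    """Bias toward cold drinks in summer, hot drinks in winter."""
    # Probability of choosing a cold drink, ranging from 0.1 to 0.9,
    # peaking around late June (day ~171).
    summer_bias = 0.5 + 0.4 * math.sin(2 * math.pi * (day_index - 80) / 365)
    pool = COLD_DRINKS if random.random() < summer_bias else HOT_DRINKS
    return random.choice(pool)
```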
DuckLake Implementation
A modern lakehouse implementation built on DuckLake's simplified architecture.
Files:
- ducklake_data_generator.py - Main data generator
- fast_bulk_generator.py - Optimized bulk loader for large datasets
- query_ducklake_data.py - Example queries
Setup:
# Install dependencies
pip install -r requirements.txt
# Copy and configure environment
cp .env.example .env
# Edit .env with your credentials
# Generate data
python ducklake_data_generator.py --data-path gs://your-bucket/ducklake/ --total-orders 1000000
Databricks Implementation
The original implementation, targeting Databricks environments.
Files:
- Data Generator V2.py - Databricks notebook
- Spreader.sql - Distribution script
- Performance Test Queries.sql - Benchmark queries
Schema:
- dim_locations: 1000 stores across US regions
- dim_products: 13 coffee shop products with pricing
- fact_sales: Transaction-level sales data with:
  - Order and line item details
  - Time dimensions (date, time of day, season)
  - Location and region
  - Product details with quantity and pricing
  - Discount information
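As orientation, here is a minimal sketch of this star schema as DuckDB tables. The README names only the tables and high-level fields, so the exact column names and types below are assumptions; the generator scripts define the authoritative schema:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_locations (
        location_id INTEGER,
        store_name  VARCHAR,
        region      VARCHAR
    )
""")
con.execute("""
    CREATE TABLE dim_products (
        product_id   INTEGER,
        product_name VARCHAR,
        unit_price   DECIMAL(6, 2)
    )
""")
con.execute("""
    CREATE TABLE fact_sales (
        order_id     BIGINT,       -- order details
        line_item_id INTEGER,      -- line item within the order
        order_date   DATE,         -- time dimensions
        time_of_day  VARCHAR,
        season       VARCHAR,
        location_id  INTEGER,      -- joins to dim_locations
        product_id   INTEGER,      -- joins to dim_products
        quantity     INTEGER,      -- product details
        unit_price   DECIMAL(6, 2),
        discount_pct DECIMAL(4, 2) -- discount information
    )
""")
```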
Prerequisites:
- Python 3.10+
- PostgreSQL database
- Cloud storage (GCS or S3)
Environment Setup:
POSTGRES_CONN_STR=postgresql://user:pass@host:port/db
GCS_KEY_ID=your-gcs-key
GCS_SECRET=your-gcs-secret
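A hedged sketch of how these variables might be wired up from Python, using python-dotenv and DuckDB's secrets manager; the actual wiring lives in the generator scripts:

```python
import os
import duckdb
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env into the process environment

con = duckdb.connect()
# Register GCS HMAC credentials with DuckDB's secrets manager so that
# gs:// paths become readable and writable.
con.execute(f"""
    CREATE SECRET (
        TYPE gcs,
        KEY_ID '{os.environ["GCS_KEY_ID"]}',
        SECRET '{os.environ["GCS_SECRET"]}'
    )
""")
```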
Generate Data:
# Small dataset (100K rows)
python ducklake_data_generator.py --data-path gs://bucket/path --total-orders 100000
# Large dataset (20M rows)
python fast_bulk_generator.py 20000000
Query Data:
python query_ducklake_data.py --data-path gs://bucket/path
Performance:
- Generation: ~3 minutes for 20M rows using the bulk generator (DuckLake)
- Query performance: Sub-second for analytical queries on 20M rows
- Storage: Efficient Parquet format with automatic compression
Querying directly from DuckDB:
-- Install and load the DuckLake extension (nightly build)
FORCE INSTALL ducklake FROM core_nightly;
LOAD ducklake;
-- Attach the lake: PostgreSQL stores the metadata, object storage stores the data
ATTACH 'ducklake:postgres:your-connection-string' AS lake (
    DATA_PATH 'gs://your-bucket/path',
    METADATA_SCHEMA 'your_schema'
);
USE lake;
-- Sanity check
SELECT COUNT(*) FROM coffeesales5b_v2_seed.fact_sales;
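The same connection works programmatically. Below is a minimal sketch with the duckdb Python package, reusing the environment variables from the setup step; the revenue query assumes quantity and unit_price column names, which may differ in the actual schema:

```python
import os
import duckdb

con = duckdb.connect()
con.execute("FORCE INSTALL ducklake FROM core_nightly")
con.execute("LOAD ducklake")

# Metadata lives in PostgreSQL; data files live in object storage.
conn_str = os.environ["POSTGRES_CONN_STR"]
con.execute(f"""
    ATTACH 'ducklake:postgres:{conn_str}' AS lake (
        DATA_PATH 'gs://your-bucket/path',
        METADATA_SCHEMA 'your_schema'
    )
""")
con.execute("USE lake")

# Example analytical query: revenue by season.
rows = con.execute("""
    SELECT season, SUM(quantity * unit_price) AS revenue
    FROM coffeesales5b_v2_seed.fact_sales
    GROUP BY season
    ORDER BY revenue DESC
""").fetchall()
for season, revenue in rows:
    print(season, revenue)
```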
This project is licensed under the MIT License - see the LICENSE file for details.
Original Databricks version by Josue Bogran
DuckLake implementation added by the community.