Generate billions of records of realistic synthetic coffee shop sales data for big data and performance testing. This repository includes implementations for both Databricks/Spark and DuckLake.
Features:
- Scalable: Generate from thousands to billions of records
- Realistic patterns: 2D time-trend modeling with seasonal variations (sketched below)
- Rich dimensions: 1000 unique store locations across multiple regions
- Product seasonality: Summer vs. winter product preferences
- Discount patterns: Realistic promotional discounting
- Multiple implementations: Choose between Databricks and DuckLake
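The time-trend and seasonality features boil down to weighting order volume and product choice by date. A minimal illustrative sketch follows; the product names and coefficients here are hypothetical, not values taken from the generators:

```python
import math
import random

# Hypothetical product pools; the real generator defines 13 products.
HOT_DRINKS = ["Latte", "Cappuccino", "Hot Chocolate"]
COLD_DRINKS = ["Iced Latte", "Cold Brew", "Frappe"]

def daily_order_weight(day_index: int, days_per_year: int = 365) -> float:
    """Relative order volume for a given day: long-term trend x seasonality."""
    trend = 1.0 + 0.0005 * day_index                 # slow multi-year growth
    season = 1.0 + 0.3 * math.sin(2 * math.pi * day_index / days_per_year)
    return trend * season

def pick_product(day_index: int) -> str:
    """Bias toward cold drinks in summer, hot drinks in winter."""
    # Probability of choosing a cold drink, ranging from 0.1 to 0.9,
    # peaking around late June (day ~171).
    summer_bias = 0.5 + 0.4 * math.sin(2 * math.pi * (day_index - 80) / 365)
    pool = COLD_DRINKS if random.random() < summer_bias else HOT_DRINKS
    return random.choice(pool)
```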
DuckLake Implementation
A modern lakehouse implementation built on DuckLake's simplified architecture.
Files:
- ducklake_data_generator.py - Main data generator
- fast_bulk_generator.py - Optimized bulk loader for large datasets
- query_ducklake_data.py - Example queries
Setup:
# Install dependencies
pip install -r requirements.txt
# Copy and configure environment
cp .env.example .env
# Edit .env with your credentials
# Generate data
python ducklake_data_generator.py --data-path gs://your-bucket/ducklake/ --total-orders 1000000
Databricks Implementation
The original implementation, targeting Databricks environments.
Files:
- Data Generator V2.py - Databricks notebook
- Spreader.sql - Distribution script
- Performance Test Queries.sql - Benchmark queries
Schema:
- dim_locations: 1000 stores across US regions
- dim_products: 13 coffee shop products with pricing
- fact_sales: Transaction-level sales data with:
  - Order and line item details
  - Time dimensions (date, time of day, season)
  - Location and region
  - Product details with quantity and pricing
  - Discount information
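As orientation, here is a minimal sketch of this star schema as DuckDB tables. The README names only the tables and high-level fields, so the exact column names and types below are assumptions; the generator scripts define the authoritative schema:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_locations (
        location_id INTEGER,
        store_name  VARCHAR,
        region      VARCHAR
    )
""")
con.execute("""
    CREATE TABLE dim_products (
        product_id   INTEGER,
        product_name VARCHAR,
        unit_price   DECIMAL(6, 2)
    )
""")
con.execute("""
    CREATE TABLE fact_sales (
        order_id     BIGINT,       -- order details
        line_item_id INTEGER,      -- line item within the order
        order_date   DATE,         -- time dimensions
        time_of_day  VARCHAR,
        season       VARCHAR,
        location_id  INTEGER,      -- joins to dim_locations
        product_id   INTEGER,      -- joins to dim_products
        quantity     INTEGER,      -- product details
        unit_price   DECIMAL(6, 2),
        discount_pct DECIMAL(4, 2) -- discount information
    )
""")
```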
Prerequisites:
- Python 3.10+
- PostgreSQL database
- Cloud storage (GCS or S3)
Environment Setup:
POSTGRES_CONN_STR=postgresql://user:pass@host:port/db
GCS_KEY_ID=your-gcs-key
GCS_SECRET=your-gcs-secret
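A hedged sketch of how these variables might be wired up from Python, using python-dotenv and DuckDB's secrets manager; the actual wiring lives in the generator scripts:

```python
import os
import duckdb
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env into the process environment

con = duckdb.connect()
# Register GCS HMAC credentials with DuckDB's secrets manager so that
# gs:// paths become readable and writable.
con.execute(f"""
    CREATE SECRET (
        TYPE gcs,
        KEY_ID '{os.environ["GCS_KEY_ID"]}',
        SECRET '{os.environ["GCS_SECRET"]}'
    )
""")
```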
Generate Data:
# Small dataset (100K rows)
python ducklake_data_generator.py --data-path gs://bucket/path --total-orders 100000
# Large dataset (20M rows)
python fast_bulk_generator.py 20000000
Query Data:
python query_ducklake_data.py --data-path gs://bucket/path
Performance:
- Generation: ~3 minutes for 20M rows using the bulk generator (DuckLake)
- Query performance: Sub-second for analytical queries on 20M rows
- Storage: Efficient Parquet format with automatic compression
Querying directly from DuckDB:
-- Install and load the DuckLake extension (nightly build)
FORCE INSTALL ducklake FROM core_nightly;
LOAD ducklake;
-- Attach the lake: PostgreSQL stores the metadata, object storage stores the data
ATTACH 'ducklake:postgres:your-connection-string' AS lake (
    DATA_PATH 'gs://your-bucket/path',
    METADATA_SCHEMA 'your_schema'
);
USE lake;
-- Sanity check
SELECT COUNT(*) FROM coffeesales5b_v2_seed.fact_sales;
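The same connection works programmatically. Below is a minimal sketch with the duckdb Python package, reusing the environment variables from the setup step; the revenue query assumes quantity and unit_price column names, which may differ in the actual schema:

```python
import os
import duckdb

con = duckdb.connect()
con.execute("FORCE INSTALL ducklake FROM core_nightly")
con.execute("LOAD ducklake")

# Metadata lives in PostgreSQL; data files live in object storage.
conn_str = os.environ["POSTGRES_CONN_STR"]
con.execute(f"""
    ATTACH 'ducklake:postgres:{conn_str}' AS lake (
        DATA_PATH 'gs://your-bucket/path',
        METADATA_SCHEMA 'your_schema'
    )
""")
con.execute("USE lake")

# Example analytical query: revenue by season.
rows = con.execute("""
    SELECT season, SUM(quantity * unit_price) AS revenue
    FROM coffeesales5b_v2_seed.fact_sales
    GROUP BY season
    ORDER BY revenue DESC
""").fetchall()
for season, revenue in rows:
    print(season, revenue)
```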
This project is licensed under the MIT License - see the LICENSE file for details.
Original Databricks version by Josue Bogran
DuckLake implementation added by the community.