Coffee Shop Synthetic Data Generator

Generate billions of records of realistic synthetic coffee shop sales data for big data and performance testing. This repository includes implementations for both Databricks/Spark and DuckLake.

Features

  • Scalable: Generate from thousands to billions of records
  • Realistic patterns: 2D time-trend modeling with seasonal variations
  • Rich dimensions: 1000 unique store locations across multiple regions
  • Product seasonality: Summer vs. winter product preferences (a weighting sketch follows this list)
  • Discount patterns: Realistic promotional discounting
  • Multiple implementations: Choose between Databricks or DuckLake
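
To make the seasonality features concrete, here is a minimal sketch of season-weighted product selection (the function names, products, and weights below are hypothetical illustrations, not taken from the generator):

import random
from datetime import date

# Hypothetical weights: cold drinks favored in summer, hot drinks in winter.
SEASONAL_WEIGHTS = {
    "summer": {"Iced Latte": 3.0, "Cold Brew": 2.5, "Hot Chocolate": 0.3},
    "winter": {"Iced Latte": 0.4, "Cold Brew": 0.5, "Hot Chocolate": 3.0},
}

def season_for(d: date) -> str:
    # Simple two-season split, purely for illustration.
    return "summer" if d.month in (6, 7, 8) else "winter"

def pick_product(d: date, products: list[str]) -> str:
    # Products not listed for a season keep a neutral weight of 1.0.
    weights = [SEASONAL_WEIGHTS[season_for(d)].get(p, 1.0) for p in products]
    return random.choices(products, weights=weights, k=1)[0]

# Example: an Iced Latte is roughly 10x likelier than a Hot Chocolate in July.
print(pick_product(date(2024, 7, 4), ["Iced Latte", "Cold Brew", "Hot Chocolate"]))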

Implementations

1. DuckLake Version (Recommended)

A modern lakehouse implementation built on DuckLake's simplified architecture: catalog metadata in PostgreSQL, data as Parquet files in object storage.

Files:

  • ducklake_data_generator.py - Main data generator
  • fast_bulk_generator.py - Optimized bulk loader for large datasets
  • query_ducklake_data.py - Example queries

Setup:

# Install dependencies
pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your credentials

# Generate data
python ducklake_data_generator.py --data-path gs://your-bucket/ducklake/ --total-orders 1000000
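
Inside the generator, the .env credentials have to reach DuckDB somehow. Here is a minimal sketch of that wiring, assuming duckdb and python-dotenv are installed (the actual code in ducklake_data_generator.py may differ):

import os
import duckdb
from dotenv import load_dotenv

load_dotenv()  # reads POSTGRES_CONN_STR, GCS_KEY_ID, GCS_SECRET from .env

con = duckdb.connect()
con.execute("FORCE INSTALL ducklake FROM core_nightly; LOAD ducklake;")

# Register HMAC credentials so DuckDB can read and write gs:// paths.
con.execute(f"""
    CREATE SECRET (
        TYPE GCS,
        KEY_ID '{os.environ['GCS_KEY_ID']}',
        SECRET '{os.environ['GCS_SECRET']}'
    );
""")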

2. Databricks/Spark Version

Original implementation for Databricks environments.

Files:

  • Data Generator V2.py - Databricks notebook
  • Spreader.sql - Distribution script
  • Performance Test Queries.sql - Benchmark queries
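
For orientation, synthetic fact rows in a Databricks notebook are typically generated with Spark column expressions along these lines (an illustrative sketch assuming the notebook-provided spark session and a hypothetical coffee.fact_sales target, not the actual contents of Data Generator V2.py):

from pyspark.sql import functions as F

n_orders = 1_000_000
orders = (
    spark.range(n_orders)  # one row per synthetic order
    .withColumn("order_id", F.col("id"))
    .withColumn("location_id", (F.rand() * 1000).cast("int"))  # 1000 stores
    .withColumn("product_id", (F.rand() * 13).cast("int"))     # 13 products
    .withColumn("quantity", (F.rand() * 3 + 1).cast("int"))    # 1-3 items
    .withColumn("order_date", F.expr("date_add('2024-01-01', cast(rand() * 365 as int))"))
)
orders.write.mode("overwrite").saveAsTable("coffee.fact_sales")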

Data Model

Dimension Tables

  • dim_locations: 1000 stores across US regions
  • dim_products: 13 coffee shop products with pricing

Fact Table

  • fact_sales: Transaction-level sales data with:
    • Order and line item details
    • Time dimensions (date, time of day, season)
    • Location and region
    • Product details with quantity and pricing
    • Discount information
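
A plausible DDL for this model, written for DuckDB (column names beyond those described above are assumptions; check the generator scripts for the authoritative schema):

import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE dim_locations (
        location_id INTEGER PRIMARY KEY,
        store_name  VARCHAR,
        region      VARCHAR
    );
    CREATE TABLE dim_products (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR,
        unit_price   DECIMAL(6, 2)
    );
    CREATE TABLE fact_sales (
        order_id     BIGINT,
        line_item_id INTEGER,
        order_date   DATE,
        time_of_day  VARCHAR,
        season       VARCHAR,
        location_id  INTEGER,
        region       VARCHAR,
        product_id   INTEGER,
        quantity     INTEGER,
        unit_price   DECIMAL(6, 2),
        discount_pct DECIMAL(4, 2)
    );
""")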

Quick Start with DuckLake

  1. Prerequisites

    • Python 3.10+
    • PostgreSQL database
    • Cloud storage (GCS or S3)
  2. Environment Setup

    POSTGRES_CONN_STR=postgresql://user:pass@host:port/db
    GCS_KEY_ID=your-gcs-key
    GCS_SECRET=your-gcs-secret
  3. Generate Data

    # Small dataset (100K rows)
    python ducklake_data_generator.py --data-path gs://bucket/path --total-orders 100000
    
    # Large dataset (20M rows)
    python fast_bulk_generator.py 20000000
  4. Query Data (see the query sketch after this list)

    python query_ducklake_data.py --data-path gs://bucket/path
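
An analytical query in the spirit of query_ducklake_data.py might look like this from Python (column names follow the data model above; con is a DuckDB connection attached to the lake as sketched in the Setup section, and fetchdf requires pandas):

# Revenue by region and season; a sketch, not the script's actual query.
df = con.execute("""
    SELECT region, season, SUM(quantity * unit_price) AS revenue
    FROM fact_sales
    GROUP BY region, season
    ORDER BY revenue DESC
""").fetchdf()
print(df.head())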

Performance Characteristics

  • DuckLake: ~3 minutes for 20M rows with the bulk generator (roughly 110K rows/second)
  • Query performance: Sub-second for analytical queries on 20M rows
  • Storage: Efficient Parquet format with automatic compression

Connect via DuckDB CLI

FORCE INSTALL ducklake FROM core_nightly;
LOAD ducklake;
ATTACH 'ducklake:postgres:your-connection-string' AS lake (
    DATA_PATH 'gs://your-bucket/path',
    METADATA_SCHEMA 'your_schema'
);
USE lake;
SELECT COUNT(*) FROM coffeesales5b_v2_seed.fact_sales;
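
The same session works from the DuckDB Python client; a minimal equivalent of the CLI snippet above, using the same placeholders:

import duckdb

con = duckdb.connect()
con.execute("FORCE INSTALL ducklake FROM core_nightly; LOAD ducklake;")
con.execute("""
    ATTACH 'ducklake:postgres:your-connection-string' AS lake (
        DATA_PATH 'gs://your-bucket/path',
        METADATA_SCHEMA 'your_schema'
    );
    USE lake;
""")
print(con.execute("SELECT COUNT(*) FROM coffeesales5b_v2_seed.fact_sales").fetchone()[0])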

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Original Databricks version by Josue Bogran

DuckLake implementation added by the community.
