🏭 Mock Data Factory

🧰 What Is Mock Data Factory?

This project is designed to generate large-scale synthetic datasets using SQL and BigQuery. It's particularly useful for:

Testing and prototyping data pipelines
Demonstrating dashboards or analytical models with realistic (but fake) data
Simulating business scenarios with high data volumes

🔧 Key Features

SQL-Based Data Generation: Utilizes SQL for creating synthetic data tailored for BigQuery environments.
Modular Models: Includes separate models for customers, products, and orders, allowing for flexible data generation.
Parameterized Row Volume: Supports generating datasets with varying sizes (e.g., 10K, 1M rows) to suit different testing needs.
DBT-Ready Structure: Structured to integrate seamlessly with DBT (Data Build Tool) for scalable development.
Pipeline Integration: Designed for easy integration with data pipeline tools like Airflow and Cloud Composer.

🚀 How to Get Started

Clone the repository:

git clone https://github.com/CamilaJaviera91/mock-data-factory.git
cd mock-data-factory/mockdata

Install dependencies:

Make sure Python and pip are installed, then run:

pip install -r requirements.txt

Configure DBT:

Ensure DBT is installed and set up to connect to your BigQuery project.

Generate synthetic data:

Run the Python script:

python scripts/generate_fake_data.py

Run DBT models:

Compile and execute the models to populate data in BigQuery:

dbt run

🗂️ Project Structure

The repository follows a structure compatible with DBT (Data Build Tool), which makes it easy to integrate with orchestration tools like Airflow or Cloud Composer.

├── .gitignore
├── logs
│   └── dbt.log
├── mockdata
│   ├── analyses
│   │   └── .gitkeep
│   ├── dbt_project.yml
│   ├── .gitignore
│   ├── macros
│   │   └── .gitkeep
│   ├── models
│   │   ├── client.sql
│   │   ├── orders.sql
│   │   ├── product.sql
│   │   └── schema.yml
│   ├── scripts
│   │   └── generate_fake_data.py
│   ├── README.md
│   ├── seeds
│   │   └── .gitkeep
│   ├── snapshots
│   │   └── .gitkeep
│   └── tests
│       └── .gitkeep
└── README.md

🧪 DBT Integration

Try running the following commands.

Build or materialize the tables/views defined by your models:

dbt run

Run the tests defined for your models to check data quality and correctness:

dbt test
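
The commands shown in this README can also be chained from a single Python entry point, which is one way to call them from an orchestrator task. The following is a minimal sketch, not part of the repository: PIPELINE_STEPS and run_pipeline are hypothetical names, and the paths assume the working directory is mockdata/.

```python
import subprocess
import sys

# Steps taken from this README; paths are relative to the mockdata/ directory.
PIPELINE_STEPS = [
    [sys.executable, "scripts/generate_fake_data.py"],
    ["dbt", "run"],
    ["dbt", "test"],
]


def run_pipeline(steps=PIPELINE_STEPS, cwd=".", dry_run=False):
    """Run each step in order and stop at the first failure.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    executed = []
    for cmd in steps:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run after a failed one.
            subprocess.run(cmd, cwd=cwd, check=True)
    return executed


if __name__ == "__main__":
    # Print the planned commands without running them.
    for line in run_pipeline(dry_run=True):
        print(line)
```

Calling run_pipeline(cwd="mockdata") would execute the three steps for real; the dry run is useful for checking the order first.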

🐍 Python Script: generate_fake_data.py

Located in the mockdata/scripts/ directory, this Python script uses the Faker library to generate synthetic data for a PostgreSQL database. It creates and populates the tables listed under Table Schemas: client, product, salesman, and orders.
The script is customizable and can be extended to fit specific testing scenarios.

🔧 Key Features

SQL-based data generation tailored for BigQuery.
Modular models for entities like customers, products, and orders.
Parameterized row volume, allowing generation of datasets with varying sizes (e.g., 10K, 1M rows).
DBT-ready structure for scalable development.
Easy integration with data pipeline tools such as Airflow and Cloud Composer.

🚀 Setup

⚙️ Step 1: Prerequisites

Make sure you have the following installed:

Python 3.7+
PostgreSQL (running locally or accessible remotely)

🔗 Step 2: Configure Database Connection

Update the PostgreSQL credentials in the script:

conn = psycopg2.connect(
    host="localhost",
    database="db_name",
    user="user_name",
    password="*****"
)

Adjust the values to match your environment.
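
As an alternative to editing credentials inside the script, the connection values can be read from environment variables. This is a sketch under assumptions, not how the script currently works: build_conn_params and connect are hypothetical helpers, and the variable names follow the standard libpq ones (PGHOST, PGDATABASE, PGUSER, PGPASSWORD).

```python
import os


def build_conn_params(env=None):
    """Collect psycopg2.connect() keyword arguments from environment variables.

    Falls back to the placeholder values used in this README when a
    variable is not set.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", "db_name"),
        "user": env.get("PGUSER", "user_name"),
        "password": env.get("PGPASSWORD", ""),
    }


def connect(env=None):
    """Open a PostgreSQL connection using the collected parameters."""
    # Imported lazily so build_conn_params() works without psycopg2 installed.
    import psycopg2

    return psycopg2.connect(**build_conn_params(env))
```

This keeps the password out of version control; only the environment of the machine running the script needs to hold it.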

▶️ Step 3: Run the Script

Once configured, run the script from the mockdata/scripts/ directory:

python generate_fake_data.py

You'll see log messages indicating progress:

✅ Successfully connected to PostgreSQL.
✅ Tables created or verified.
✅ Inserted 200 clients.
✅ Inserted 30 products.
✅ Inserted 500 orders.
✅ PostgreSQL connection closed. Data generation complete.

🗃️ Table Schemas

client

Column	Type
client_id	SERIAL PRIMARY KEY
name	TEXT
email	TEXT
address	TEXT
city	TEXT

product

Column	Type
product_id	SERIAL PRIMARY KEY
name	TEXT
price	NUMERIC(10,2)
category	TEXT

salesman

Column	Type
salesman_id	SERIAL PRIMARY KEY
name	TEXT
city	TEXT

orders

Column	Type
order_id	SERIAL PRIMARY KEY
client_id	INTEGER (FK to client)
product_id	INTEGER (FK to product)
order_date	DATE
quantity	INTEGER
total	NUMERIC(10,2)
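
To illustrate how the foreign keys in orders relate to the other tables, here is a small sketch that builds order rows from existing client and product IDs. It is not the script's actual code: build_orders is a hypothetical function, the date range and quantity range are arbitrary, and computing total as unit price times quantity is an assumption about how the column is meant to be filled.

```python
import random
from datetime import date, timedelta
from decimal import Decimal


def build_orders(n, client_ids, products, seed=None):
    """Return n order rows as tuples matching the orders columns (minus order_id).

    client_ids is a list of existing client IDs.
    products maps product_id -> unit price (Decimal).
    """
    rng = random.Random(seed)
    product_ids = sorted(products)
    start = date(2024, 1, 1)  # arbitrary start of the simulated period
    rows = []
    for _ in range(n):
        client_id = rng.choice(client_ids)    # must reference an existing client
        product_id = rng.choice(product_ids)  # must reference an existing product
        quantity = rng.randint(1, 10)
        order_date = start + timedelta(days=rng.randrange(365))
        # Assumption: total = unit price * quantity, rounded to 2 decimals
        # to fit NUMERIC(10,2).
        total = (products[product_id] * quantity).quantize(Decimal("0.01"))
        rows.append((client_id, product_id, order_date, quantity, total))
    return rows
```

Because every client_id and product_id is drawn from IDs that already exist, rows built this way would not violate the foreign key constraints on insert.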

🗃️ Model

resume

Column	Type
year	NUMERIC(10,2)
store	TEXT
total	NUMERIC(10,2)

🛠️ Customization

To generate more or fewer rows, change the arguments in generate_clients(), generate_products(), generate_salesman() or generate_orders(). You can modify the list of product names and categories to suit your business domain.
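
The idea of a row-count argument can be sketched as follows. This is an illustration only: make_clients is a hypothetical stand-in for the script's generator functions, whose real signatures are not shown here, and it uses the standard library random module with small sample lists in place of Faker so that it runs without extra dependencies.

```python
import random

# Small illustrative value lists; the real script draws values from Faker.
FIRST_NAMES = ["Ana", "Luis", "Marta", "Pedro"]
LAST_NAMES = ["Rojas", "Soto", "Vega", "Mora"]
CITIES = ["Santiago", "Lima", "Bogota"]


def make_clients(n, seed=None):
    """Return n rows of (name, email, address, city) for the client table.

    Passing a seed makes the output reproducible between runs.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        name = f"{first} {last}"
        # The running index keeps e-mail addresses unique.
        email = f"{first}.{last}{i}@example.com".lower()
        address = f"{rng.randint(1, 999)} Main Street"
        rows.append((name, email, address, rng.choice(CITIES)))
    return rows


if __name__ == "__main__":
    # Changing the first argument changes how many rows are produced.
    print(len(make_clients(200, seed=42)))
```

Swapping the lists at the top for domain-specific values is the same kind of change as editing the product names and categories mentioned above.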

📎 Related Projects

I have other related repositories that might be of interest:

sql-mock-data: Earlier version or a related project focusing on SQL-based mock data generation.
- repository
dbt-transformations-sql-mock-data: This project contains transformations and documentation for the data model generated in sql-mock-data.
- repository

💭 Final Thoughts

The mock-data-factory repository offers a robust framework for generating synthetic datasets, making it a valuable tool for data engineers and analysts working with BigQuery and DBT. Its modular design and integration capabilities make it adaptable to various testing and development scenarios.
If you need assistance setting up the project or customizing it for your specific use case, feel free to ask!

📚 Learn More About dbt

📖 Read the official dbt documentation — A great starting point to understand how dbt works and how to get started.
💬 Visit Discourse — Explore frequently asked questions and community discussions.
💻 Join the dbt Slack community — Get live support, ask questions, and connect with other data practitioners.
📅 Browse upcoming dbt events — Find webinars, meetups, and conferences near you.
📰 Read the dbt blog — Stay up to date with product updates, best practices, and community highlights.

📄 License

This project is released under the MIT License.
