Data Engineering Capstone Project – IBM Data Engineering Professional Certificate

Overview

This capstone project simulates the role of a Data Engineer at SoftCart, an e-commerce company with global reach. The project integrates OLTP and NoSQL systems, a PostgreSQL data warehouse, ETL pipelines, a BI dashboard, and Spark-based machine learning. It was completed as the final project for Course 14 – Data Engineering Capstone Project in the IBM Data Engineering Professional Certificate.


Objectives

  • Design and implement a complete data platform for SoftCart
  • Build OLTP and NoSQL databases for operational data
  • Design a star-schema data warehouse and load data into it
  • Automate ETL pipelines for incremental data loads
  • Create a BI dashboard using IBM Cognos Analytics
  • Perform search analytics and sales prediction using Apache Spark

Tools & Technologies

Category           Tools/Technologies
Databases          MySQL, MongoDB, PostgreSQL
ETL & Automation   Bash, Python (mysql-connector-python, psycopg2), Apache Airflow
Data Warehouse     PostgreSQL (staging and reporting)
Big Data & ML      Apache Spark, PySpark, Spark MLlib
Visualization      IBM Cognos Analytics

Data Platform Architecture

The following diagram illustrates the end-to-end architecture implemented in this project, covering data ingestion, processing, storage, and analytics:

[Data platform architecture diagram: images/data-platform-architecture.png]

Ingestion Layer

  • Web Server captures customer activity and sends event data to downstream systems.

Operational Storage Layer

  • MySQL stores structured transactional (OLTP) data from sales operations.
  • MongoDB stores unstructured catalog data in a NoSQL format.

Data Warehousing Layer

  • PostgreSQL serves as both the staging and reporting data warehouse, used for cleansing, transformation, and analytics.
  • (Note: DB2 was optional in the project; PostgreSQL was used in this implementation.)

Processing Layer

  • Apache Spark performs scalable batch processing and powers machine learning tasks using PySpark and MLlib.

Visualization Layer

  • IBM Cognos Analytics is used to deliver business dashboards, KPIs, and interactive visualizations.
  • (Note: Google Looker Studio was optional in the project; IBM Cognos Analytics was used in this implementation.)

This layered architecture enables modular, scalable, and maintainable data engineering pipelines, supporting robust analytics from raw ingestion to insight delivery.


Results & Benchmarks

  • Warehouse query performance: Reduced “total sales per country” query time from ~78 ms to <1 ms by using a PostgreSQL materialized view (~99% faster).
  • Warehouse scale (repo data): 300,000 fact rows; 1,096 dates; 56 countries.
  • Repro notes: timings measured locally in psql with \timing enabled; the materialized view is defined as total_sales_per_country.
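
For reference, here is a minimal psycopg2 sketch of how a materialized view like this can be defined and queried. The fact/dimension column names (amount, countryid, country) and connection details are assumptions; the authoritative definition lives in sql/aggregation_queries_module3_part2.sql.

import psycopg2

# Hedged sketch only: column names and credentials are placeholders.
conn = psycopg2.connect(host="localhost", dbname="warehouse", user="postgres", password="secret")
cur = conn.cursor()
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS total_sales_per_country AS
    SELECT c.country, SUM(f.amount) AS total_sales
    FROM factsales f
    JOIN dimcountry c ON f.countryid = c.countryid
    GROUP BY c.country;
""")
cur.execute("REFRESH MATERIALIZED VIEW total_sales_per_country;")  # re-run after new fact rows are loaded
cur.execute("SELECT * FROM total_sales_per_country ORDER BY total_sales DESC LIMIT 5;")
print(cur.fetchall())  # the reporting query now reads precomputed rows
conn.commit()
cur.close()
conn.close()

Because the aggregation is precomputed, the reporting query becomes a simple scan of the view, which is where the ~99% speedup comes from.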

Project Modules & Components

Module 1: OLTP Database (MySQL)

  • Created sales_data table and populated it with oltpdata.csv
  • Set up an index on the timestamp field
  • Automated export of the data using a Bash script

Files:

  • sql/sales_module1.sql
  • data/oltpdata.csv
  • python/scripts/datadump.sh
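
A minimal mysql-connector-python sketch of the Module 1 steps once the table has been loaded from oltpdata.csv; the database name, credentials, and index name are placeholders, and the authoritative SQL is in sql/sales_module1.sql.

import mysql.connector

# Hedged sketch: connection details and the index name are placeholders.
conn = mysql.connector.connect(host="localhost", user="root", password="secret", database="sales")
cur = conn.cursor()
cur.execute("CREATE INDEX ts_idx ON sales_data (`timestamp`)")  # index on the timestamp field
cur.execute("SELECT COUNT(*) FROM sales_data")                  # sanity-check the load from oltpdata.csv
print("rows loaded:", cur.fetchone()[0])
conn.commit()
cur.close()
conn.close()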

Module 2: NoSQL Catalog (MongoDB)

  • Imported catalog data from catalog.json into MongoDB
  • Created indexes and executed aggregation queries
  • Exported selected fields into CSV

Files:

  • data/catalog.json
  • mongodb/module2_mongodb_commands.txt
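
The repository drives MongoDB through shell commands (module2_mongodb_commands.txt); the pymongo sketch below is an equivalent illustration. The database/collection names (catalog, electronics) and the field names used here are assumptions.

from pymongo import MongoClient

# Hedged sketch: database, collection, and field names are assumptions.
client = MongoClient("mongodb://localhost:27017/")
coll = client["catalog"]["electronics"]

coll.create_index("type")                                        # index to support type-based queries
print("laptops:", coll.count_documents({"type": "laptop"}))      # simple count query

# aggregation example: average screen size per product type
pipeline = [
    {"$group": {"_id": "$type", "avg_screen": {"$avg": "$screen size"}}},
]
for doc in coll.aggregate(pipeline):
    print(doc)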

Module 3: Data Warehouse Design & Queries (PostgreSQL)

  • Designed and implemented a star schema with fact and dimension tables
  • Loaded data from the CSVs into PostgreSQL staging tables
  • Performed advanced aggregations with ROLLUP, CUBE, GROUPING SETS, and materialized views

Files:

  • sql/warehouse_schema.sql – ERD schema creation
  • sql/create_script_module3_given.sql – Provided setup script
  • sql/aggregation_queries_module3_part2.sql – Aggregation queries
  • data/DimDate.csv, DimCategory.csv, DimCountry.csv, FactSales.csv
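
One of the aggregation patterns from this module, sketched through psycopg2; the star-schema table and column names below are inferred from the CSV file names and may differ from sql/create_script_module3_given.sql.

import psycopg2

# Hedged sketch: table/column names are inferred, credentials are placeholders.
conn = psycopg2.connect(host="localhost", dbname="warehouse", user="postgres", password="secret")
cur = conn.cursor()
cur.execute("""
    SELECT d.year, c.country, SUM(f.amount) AS total_sales
    FROM factsales f
    JOIN dimdate d    ON f.dateid = d.dateid
    JOIN dimcountry c ON f.countryid = c.countryid
    GROUP BY ROLLUP (d.year, c.country)   -- adds per-year subtotal and grand-total rows
    ORDER BY d.year, c.country;
""")
for year, country, total in cur.fetchall():
    print(year, country, total)
cur.close()
conn.close()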

Module 4: Business Intelligence Dashboard (IBM Cognos)

  • Created a BI dashboard with:
    • Line chart: Total monthly sales for 2020
    • Pie chart: Category-wise sales for 2020
    • Bar chart: Quarterly mobile phone sales

Files:

  • dashboards/total-sales-per-month-in-2020.png – Line chart screenshot
  • dashboards/total-sales-per-category-in-2020.png – Pie chart screenshot
  • dashboards/quarterly-sales-of-mobile-phones.png – Bar chart screenshot

Module 5: ETL Pipelines

  • Modified and executed sync_sales_data.py to sync new rows from MySQL → PostgreSQL
  • Created an Airflow DAG (process_web_log.py) to extract, transform, and load web server log data

Files:

  • data/sales.csv, data/sales_dump_module5.sql
  • python/sync_sales_data.py – Final ETL script
  • python/templates/mysqlconnect.py, postgresqlconnect.py – Connection templates
  • airflow/process_web_log.py – DAG script
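
A hedged sketch of how an extract/transform/load DAG of this shape is typically wired up in Airflow 2; the task commands, file paths, and the excluded IP are placeholders, and airflow/process_web_log.py is the actual implementation.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hedged sketch: paths, schedule, and the filtered IP are placeholders.
LOG_DIR = "/home/project/airflow/dags"  # assumed working directory

default_args = {
    "owner": "softcart",
    "start_date": datetime(2023, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="process_web_log",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # extract: keep only the visitor IP column of the raw access log
    extract = BashOperator(
        task_id="extract_data",
        bash_command=f"cut -d' ' -f1 {LOG_DIR}/accesslog.txt > {LOG_DIR}/extracted_data.txt",
    )
    # transform: drop requests from one excluded IP (placeholder value)
    transform = BashOperator(
        task_id="transform_data",
        bash_command=f"grep -v '192.0.2.1' {LOG_DIR}/extracted_data.txt > {LOG_DIR}/transformed_data.txt",
    )
    # load: archive the transformed file
    load = BashOperator(
        task_id="load_data",
        bash_command=f"tar -cvf {LOG_DIR}/weblog.tar -C {LOG_DIR} transformed_data.txt",
    )

    extract >> transform >> load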

Module 6: Spark ML for Sales Forecasting

  • Ingested searchterms.csv using PySpark
  • Ran frequency analytics on the search terms
  • Loaded a pretrained Spark ML model and predicted sales for 2023

Files:

  • sparkml/Spark_MLOps.ipynb
  • sparkml/searchterms.csv
  • sparkml/sales_prediction.model/ – Trained model folder
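
A hedged PySpark sketch of the notebook's flow; the search-term column name and the model type (a linear regression over a single year feature) are assumptions about the artifact in sparkml/sales_prediction.model/, and Spark_MLOps.ipynb is the authoritative version.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegressionModel

# Hedged sketch: column names and the model class are assumptions.
spark = SparkSession.builder.appName("SoftCartSearchAnalytics").getOrCreate()

# ingest the search data and run a simple frequency analysis
terms = spark.read.csv("sparkml/searchterms.csv", header=True, inferSchema=True)
print("rows:", terms.count())
terms.groupBy("searchterm").count().orderBy("count", ascending=False).show(5)

# load the pretrained model and predict sales for 2023
model = LinearRegressionModel.load("sparkml/sales_prediction.model")
features = VectorAssembler(inputCols=["year"], outputCol="features")
future = features.transform(spark.createDataFrame([(2023,)], ["year"]))
model.transform(future).select("prediction").show()

spark.stop()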

Repository Structure

Data_Engineering_Capstone_Project/
├── README.md                                  # Project overview and instructions
├── airflow/
│   └── process_web_log.py                     # Airflow DAG for processing web logs
├── dashboards/                                # Cognos dashboard screenshots
│   ├── total-sales-per-month-in-2020.png
│   ├── total-sales-per-category-in-2020.png
│   └── quarterly-sales-of-mobile-phones.png 
├── data/
│   ├── oltpdata.csv                           # Input file for OLTP (Module 1)
│   ├── sales.csv                              # MySQL sales staging data (Module 5)
│   ├── sales_data.sql                         # SQL dump of sales_data table from MySQL (via datadump.sh)
│   ├── sales_dump_module5.sql                 # Provided SQL dump for MySQL setup (Module 5)
│   ├── catalog.json                           # JSON data to be imported into MongoDB (Module 2)
│   ├── ecommerce.csv                          # Source data for the Cognos BI dashboard (Module 4)
│   ├── accesslog.txt                          # Raw web log file used in Airflow DAG (Module 5)
│   ├── DimDate.csv                            # Data warehouse dimension and fact table CSVs (Module 3)
│   ├── DimCategory.csv
│   ├── DimCountry.csv
│   └── FactSales.csv
├── images/
│   └── data-platform-architecture.png         # Architecture diagram used in README
├── mongodb/
│   └── module2_mongodb_commands.txt           # MongoDB shell commands for catalog import and analysis
├── python/
│   ├── sync_sales_data.py                     # Python ETL script for incremental sync from MySQL to PostgreSQL
│   ├── scripts/
│   │   └── datadump.sh                        # Bash script to export MySQL data
│   └── templates/
│       ├── mysqlconnect.py                    # Sample connection and setup for MySQL (provided template)
│       └── postgresqlconnect.py               # Sample connection and setup for PostgreSQL (provided template)
├── sparkml/
│   ├── Spark_MLOps.ipynb                      # PySpark notebook: ingest, train, and evaluate sales prediction model
│   ├── searchterms.csv                        # Input dataset for Spark ML pipeline
│   └── sales_prediction.model/                # Saved trained model and metadata
├── sql/
│   ├── sales_module1.sql                      # OLTP schema and operations (Module 1)
│   ├── warehouse_schema.sql                   # Data warehouse schema (custom ERD version)
│   ├── create_script_module3_given.sql        # Provided CREATE TABLE script (Module 3)
│   └── aggregation_queries_module3_part2.sql  # ROLLUP, CUBE, GROUPING SETS, MQT (Module 3)

How to Run Locally

Note: Some functionality (e.g., Cognos Analytics) is cloud-based or part of IBM Skills Network Labs. Adapt steps as needed for your local environment.

  1. OLTP Setup:

    • Load oltpdata.csv into MySQL.
    • Run sales_module1.sql and datadump.sh.
  2. NoSQL Setup:

    • Use module2_mongodb_commands.txt to import and query catalog.json.
  3. Data Warehouse:

    • Run warehouse_schema.sql or create_script_module3_given.sql on PostgreSQL.
    • Load the CSVs into the appropriate tables (see the loading sketch after this list).
  4. ETL Pipeline:

    • Run sync_sales_data.py to insert incremental rows.
    • Execute process_web_log.py DAG in Apache Airflow.
  5. Spark ML:

    • Open Spark_MLOps.ipynb, ingest searchterms.csv, and load the model to generate predictions.
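
For step 3, a minimal loading sketch using psycopg2's COPY support; table names are assumed to mirror the CSV file names, and the header/delimiter options are assumptions about the files in data/.

import psycopg2

# Hedged sketch: table names, CSV options, and credentials are placeholders.
conn = psycopg2.connect(host="localhost", dbname="warehouse", user="postgres", password="secret")
cur = conn.cursor()
for table, path in [("dimdate", "data/DimDate.csv"),
                    ("dimcategory", "data/DimCategory.csv"),
                    ("dimcountry", "data/DimCountry.csv"),
                    ("factsales", "data/FactSales.csv")]:
    with open(path) as f:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()
cur.close()
conn.close()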

License

This project was completed as part of the IBM Data Engineering Professional Certificate and is intended for educational use.
