About: PandaPodium Insights Hub: End-to-end cycling e-commerce analytics platform, inspired by PandaPodium.cc. Showcases batch & stream processing (PySpark, PyFlink), IaC (Terraform), orchestration (Airflow), dbt, data lake/warehouse (S3, Glue, Athena), & viz (Tableau/Metabase) for product, sales, & customer insights.
🚧 Work in Progress: This project is currently under active development. Features and documentation may change frequently until a stable release is finalized.
The PandaPodium Insights Hub is a comprehensive, end-to-end data analytics platform designed to simulate and analyze the operations of a cycling e-commerce business, with specific inspiration drawn from PandaPodium.cc. This project demonstrates the creation of a scalable data infrastructure capable of ingesting, processing, transforming, and visualizing data from various sources to derive actionable business insights.
It covers the entire data lifecycle, from raw data ingestion (batch and stream) to sophisticated data modeling and interactive dashboarding, leveraging a modern, cloud-native (AWS) tech stack.
Panda Podium aims to provide "value bike parts" and the "best service of any bike part website on the internet" to a global customer base, sourcing products primarily from Far Eastern manufacturers. As a growing e-commerce entity, the ability to effectively harness data is crucial for:
- Understanding customer behavior and preferences.
- Optimizing product catalog and inventory.
- Analyzing sales trends and regional performance.
- Enhancing marketing effectiveness.
- Improving overall operational efficiency and service delivery.
This project was built as a speculative demonstration of how a modern data engineering platform can directly support these goals, providing Panda Podium with the tools to make data-driven decisions and further enhance their competitive edge in the cycling market. It showcases a practical application of a robust data stack to solve real-world e-commerce challenges.
This project utilizes a range of modern data engineering tools and technologies:
- Cloud Provider: AWS
- Infrastructure as Code (IaC): Terraform
- Workflow Orchestration: Apache Airflow
- Containerization: Docker
- Data Lake:
- Production: Amazon S3
- Development: Local file system / PostgreSQL
- Data Warehouse:
- Production: Amazon S3 (with AWS Glue Data Catalog & Amazon Athena)
- Development: PostgreSQL
- Data Ingestion (Batch):
- Python
- `dlt` (data load tool)
- Batch Processing:
- PySpark (with local and Amazon EMR configurations; a minimal session-configuration sketch follows this list)
- Data Transformation:
- `dbt` (data build tool)
- Testing & Documentation (Data Models):
- `dbt`
- Streaming Data Ingestion & Processing:
- Messaging (Dev): Kafka (Redpanda)
- Messaging (Prod): Amazon MSK (Managed Streaming for Apache Kafka)
- Stream Processing (Dev): PyFlink (local cluster)
- Stream Processing (Prod): AWS Managed Service for Apache Flink
- Python Kafka Clients
- Analytics: Python (Pandas, NumPy), SQL
- Data Visualization:
- Development: Metabase
- Production: Tableau Public
- DevOps & Workflow:
- Git (for version control)
- Terraform Workspaces (for environment isolation: dev/prod)
- Integrate CI/CD pipelines for automated testing and deployment
- Use `Make` for common development tasks
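As a quick illustration of the local-vs-EMR split mentioned above, here is a minimal PySpark session helper. It is a sketch only: the module path, app name, and config values are assumptions and not taken from this repository.

```python
# Hypothetical helper (e.g., src/batch/session.py) -- illustrative only.
from pyspark.sql import SparkSession


def get_spark_session(env: str = "dev") -> SparkSession:
    """Build a SparkSession for local development or an EMR deployment."""
    builder = SparkSession.builder.appName("pandapodium-batch")
    if env == "dev":
        # Local mode: use all cores and keep shuffle partitions small for small datasets.
        builder = builder.master("local[*]").config("spark.sql.shuffle.partitions", "8")
    # In prod the master (YARN on EMR) is supplied by spark-submit, so no .master() call here.
    return builder.getOrCreate()


if __name__ == "__main__":
    spark = get_spark_session("dev")
    spark.createDataFrame(
        [(1, "frameset"), (2, "crankset")], ["product_id", "category"]
    ).show()
```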
(This section is a placeholder. I'll create an architecture diagram and embed it here, likely using a tool such as diagrams.net (draw.io), Lucidchart, or PowerPoint/Google Slides exported as an image. The ASCII sketch below captures the intended flow.)
```
Data Sources (Scraped, Synthetic Ecom, Public Streams)
        |
        v
Ingestion (dlt, Kafka/MSK, PyFlink for Stream)
        |
        v
Storage (S3 Data Lake, Postgres for Dev)
        |
        v
Processing & Transformation (PySpark, dbt, Athena, Glue)
        |
        v
Visualization (Metabase, Tableau Public)

Orchestration (Airflow) coordinates the ingestion-through-transformation stages.
Infrastructure (Terraform) and Containerization (Docker) underpin the whole platform.
```
Brief Explanation:
- Data Sources: Product catalog data is scraped from PandaPodium.cc. Sales transactions, customer data, and web clickstream data are synthetically generated or adapted from public e-commerce datasets. Public real-time streams (e.g., Wikimedia EventStreams, GTFS-Realtime) are used for the streaming component.
- Ingestion: Batch data is loaded using `dlt` and Python scripts orchestrated by Airflow (a minimal `dlt` sketch appears after this list). Streaming data is ingested via Kafka/MSK and processed by PyFlink.
- Storage: Raw and processed data resides in Amazon S3, forming the data lake. PostgreSQL is used for development warehousing to minimize cloud costs.
- Processing & Transformation: PySpark handles large-scale batch transformations. `dbt` is used for SQL-based data modeling and transformation within the data warehouse (Athena for prod, Postgres for dev). AWS Glue Data Catalog provides schemas for Athena.
- Orchestration: Apache Airflow manages and schedules batch data pipelines.
- Infrastructure: All AWS infrastructure is provisioned and managed using Terraform. Docker is used for containerizing applications/services where appropriate.
- Visualization: Metabase is used for development dashboards, while Tableau Public is used for production-ready, shareable visualizations.
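The `dlt` sketch referenced in the Ingestion bullet above might look roughly like the following. It is a minimal, hypothetical example: the resource name, fields, and dataset name are assumptions, and credentials are expected to come from dlt's usual config/secrets files.

```python
# Minimal, hypothetical dlt pipeline loading scraped catalog rows into the dev warehouse.
import dlt


@dlt.resource(name="product_catalog", write_disposition="replace")
def product_catalog():
    # In the real project these rows would come from the scraping scripts.
    yield {"product_id": "PP-001", "name": "Carbon Handlebar", "brand": "ExampleBrand", "price_usd": 129.0}
    yield {"product_id": "PP-002", "name": "Ceramic Bottom Bracket", "brand": "ExampleBrand", "price_usd": 89.0}


# Dev target is Postgres; prod would swap in an S3/Athena-backed destination.
pipeline = dlt.pipeline(
    pipeline_name="pandapodium_batch",
    destination="postgres",
    dataset_name="raw",
)

if __name__ == "__main__":
    print(pipeline.run(product_catalog()))
```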
(Data exploration is still in progress, so data sources are likely to be added or removed.)
- Product Catalog (Scraped):
- Source: PandaPodium.cc (publicly available product information)
- Fields: Product Name, Prices, Brand, Image URLs, Descriptions, Specifications, Categories, Stock Status, etc.
- Method: Web scraping (with ethical considerations and respect for `robots.txt`, as would apply in a real-world scenario).
- Sales Transactions (Synthetic/Public):
- Source: Adapted from public e-commerce datasets (e.g., Kaggle datasets) and enhanced with synthetic data generation (Faker).
- Fields: Transaction ID, Customer ID, Product ID (linked to scraped catalog), Quantity, Price, Timestamp, Region.
- Customer Data (Synthetic/Public):
- Source: Adapted from public e-commerce datasets and enhanced with synthetic data generation (Faker).
- Fields: Customer ID, Name, Region, Signup Date, Age (simulated).
- Web Clickstream Data (Synthetic):
- Source: Generated synthetically using Python (Faker, random) to simulate user interactions.
- Fields: Session ID, User ID, Timestamp, Event Type (page_view, product_view, add_to_cart), Product ID (linked to scraped catalog).
- This data is primarily intended for the streaming pipeline (a generator sketch appears after this list).
- Public Real-Time Streams (for Streaming Component):
- Source: Examples include Wikimedia EventStreams (recent changes) or GTFS-Realtime feeds from public transit agencies.
- Purpose: To demonstrate real-time data ingestion and processing capabilities.
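As referenced above, a synthetic clickstream generator along these lines could feed the streaming pipeline. This is a hypothetical sketch: the event weights, product IDs, and the stdout output (rather than a Kafka producer) are assumptions for illustration.

```python
# Hypothetical synthetic clickstream generator using Faker -- illustrative only.
import json
import random
import time
import uuid

from faker import Faker

fake = Faker()
EVENT_TYPES = ["page_view", "product_view", "add_to_cart"]
PRODUCT_IDS = ["PP-001", "PP-002", "PP-003"]  # would be joined to the scraped catalog


def generate_event() -> dict:
    """Produce one clickstream event matching the fields listed above."""
    return {
        "session_id": str(uuid.uuid4()),
        "user_id": fake.uuid4(),
        "timestamp": fake.iso8601(),
        "event_type": random.choices(EVENT_TYPES, weights=[0.6, 0.3, 0.1])[0],
        "product_id": random.choice(PRODUCT_IDS),
    }


if __name__ == "__main__":
    # Emit one event per second; in the project these would be produced to Kafka/Redpanda.
    while True:
        print(json.dumps(generate_event()))
        time.sleep(1)
```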
- Automated Data Pipelines: End-to-end batch and streaming data pipelines orchestrated by Apache Airflow (a DAG skeleton appears after this list).
- Scalable Data Processing: Utilizes PySpark for distributed batch processing and PyFlink for scalable stream processing.
- Infrastructure as Code: All cloud infrastructure managed by Terraform for reproducibility and version control.
- Modular Data Modeling: `dbt` for robust, version-controlled, and testable data transformations in the warehouse.
- Environment Parity: Clear separation between `dev` and `prod` environments using Terraform workspaces and distinct configurations.
- Interactive Dashboards:
- Development dashboards in Metabase for quick analytics and monitoring.
- Production-quality dashboards in Tableau Public for stakeholder presentation.
- Comprehensive Analytics: Enables analysis of sales performance, customer segmentation, product trends, and (simulated) real-time user activity.
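The DAG skeleton referenced above could look roughly like this (Airflow 2.x assumed). The task IDs, file paths, and bash commands are hypothetical placeholders, not the project's actual DAG.

```python
# Hypothetical DAG skeleton (e.g., airflow/dags/batch_pipeline_dag.py) -- illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pandapodium_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # Load raw batch data with dlt (script path is a placeholder).
    ingest_raw = BashOperator(
        task_id="ingest_raw",
        bash_command="python /opt/airflow/scripts/run_dlt_pipeline.py",
    )
    # Run PySpark transformations (script path is a placeholder).
    transform_spark = BashOperator(
        task_id="transform_spark",
        bash_command="spark-submit /opt/airflow/src/batch/transform.py",
    )
    # Build warehouse models with dbt.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt --target dev",
    )

    ingest_raw >> transform_spark >> dbt_run
```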
```
pandapodium-insights-hub/
├── airflow/              # Airflow DAGs and related configurations
│   ├── dags/
│   └── ...
├── data/                 # Data archives for EDA, sampling, and backups
├── dbt/                  # dbt project for data transformations
│   ├── models/
│   ├── tests/
│   ├── snapshots/
│   └── dbt_project.yml
├── docker/               # Dockerfiles and docker-compose files
│   ├── airflow/
│   ├── redpanda/
│   └── ...
├── notebooks/            # Jupyter notebooks for exploration and ad-hoc analysis
├── scripts/              # Helper scripts (e.g., data generation, scraping)
│   ├── scraping/
│   └── data_generation/
├── src/                  # Source code for data processing jobs
│   ├── batch/            # PySpark jobs
│   └── streaming/        # PyFlink jobs
├── terraform/            # Terraform configurations for infrastructure
│   ├── modules/
│   ├── dev/
│   └── prod/
├── tests/                # Unit and integration tests for src code
├── .gitignore
├── LICENSE
└── README.md
```
(This section is a placeholder and will be highly specific to the actual project setup. Clear, step-by-step instructions will be added later.)
Prerequisites:
- AWS Account & AWS CLI configured
- Terraform installed
- Docker & Docker Compose installed
- Python (e.g., 3.8+) & pip
- `dbt-core` and the relevant `dbt` adapter (e.g., `dbt-postgres`, `dbt-athena`)
- Apache Airflow (can be run via Docker)
- (Optional) Redpanda/Kafka installed or running via Docker for local streaming dev
- (Optional) Metabase installed or running via Docker
- (Optional) Tableau Public Desktop
Steps:
- Clone the Repository:

  ```bash
  git clone https://github.com/pizofreude/pandapodium-insights-hub.git
  cd pandapodium-insights-hub
  ```
- Set up a Python Virtual Environment (Recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt  # Create a requirements.txt for Python dependencies
  ```
- Configure AWS Credentials: Ensure your AWS CLI is configured with the necessary permissions. Terraform will use these credentials.

  ```bash
  aws configure
  ```
- Deploy Infrastructure with Terraform:
  - Navigate to the appropriate environment directory (e.g., `terraform/dev/`).
  - Initialize Terraform: `terraform init`
  - (Optional) Plan the deployment: `terraform plan`
  - Apply the configuration: `terraform apply`
  - Note: You might need to manually create an S3 bucket for the Terraform remote state backend before running `terraform init`, if you configure it that way.
- Set up the `dbt` Project:
  - Navigate to the `dbt/` directory.
  - Configure your `profiles.yml` (typically located in `~/.dbt/profiles.yml`) with connection details for your dev (Postgres) and prod (Athena) environments.
  - Install dbt dependencies: `dbt deps`
- Run Data Generation/Scraping Scripts:
  - Execute scripts in the `scripts/` directory to populate initial data (e.g., scrape the product catalog, generate synthetic sales data).
  - (Provide specific commands here) [TO-DO]
- Start Local Services (Docker Compose): If using Docker for Airflow, Redpanda, Postgres (dev), and Metabase:

  ```bash
  docker-compose up -d  # From the directory containing your docker-compose.yml
  ```
- Configure Airflow Connections: Once Airflow is running, configure connections to AWS, Spark, etc., through the Airflow UI.
- Trigger Airflow DAGs:
  - Access the Airflow UI (typically `http://localhost:8080`).
  - Unpause and trigger the relevant DAGs for batch ingestion, PySpark processing, and `dbt` model runs.
- Run `dbt` Models (Manual):
  - Navigate to `dbt/`.
  - To run and test models for development:

    ```bash
    dbt run --target dev
    dbt test --target dev
    ```

  - To generate and serve documentation:

    ```bash
    dbt docs generate --target dev
    dbt docs serve --target dev
    ```
- Start Streaming Pipeline:
  - Ensure Redpanda/MSK is running and topics are created.
  - Submit your PyFlink job (an illustrative job sketch appears after these steps).
  - (Provide specific commands for submitting Flink jobs, e.g., using `flink run`) [TO-DO]
- Access Dashboards:
  - Metabase (Dev): Connect Metabase to your development PostgreSQL database and build/view dashboards.
  - Tableau Public (Prod): Connect Tableau Public to Athena (or use data extracts) to create and view production dashboards.
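For the streaming step above, a PyFlink job might look roughly like the sketch below. The topic name, bootstrap servers, and field names are assumptions, and the Kafka SQL connector JAR must be available to Flink; treat this as an illustration rather than the project's actual job.

```python
# Hypothetical PyFlink Table API job: count clickstream events per type -- illustrative only.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: JSON clickstream events from a Kafka/Redpanda topic (names are placeholders).
t_env.execute_sql("""
    CREATE TABLE clickstream (
        session_id STRING,
        user_id STRING,
        `timestamp` STRING,
        event_type STRING,
        product_id STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clickstream',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'pandapodium-dev',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Sink: print results to stdout for local development.
t_env.execute_sql("""
    CREATE TABLE event_counts (
        event_type STRING,
        cnt BIGINT
    ) WITH ('connector' = 'print')
""")

# Continuous aggregation; results are emitted as a changelog stream.
t_env.execute_sql("""
    INSERT INTO event_counts
    SELECT event_type, COUNT(*) AS cnt
    FROM clickstream
    GROUP BY event_type
""").wait()
```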
(Include screenshots & links to Tableau Public dashboards here. Describe what insights they provide.) [TO-DO]
Example 1: Sales Overview Dashboard
- Shows total sales, sales by product category, sales by region over time.
- Helps identify top-performing products and regions.
Example 2: Customer Segmentation Dashboard
- Segments customers based on purchase history, frequency, and value.
- Aids in targeted marketing campaigns.
- Implement more sophisticated Machine Learning models (e.g., sales forecasting, recommendation engine).
- Expand real-time analytics capabilities (e.g., real-time inventory tracking, fraud detection).
- Add more comprehensive data quality checks and alerting.
- Develop a more interactive front-end for data exploration.
While this is primarily a portfolio project, contributions, suggestions, and feedback are welcome :) Please feel free to open an issue or submit a pull request. If possible, kindly follow these conventions:
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'feat: Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
- PandaPodium.cc for the inspiration and publicly available product information.
- The open-source community for the amazing tools and libraries used in this project.
- Publicly available datasets from Kaggle, NYC Citi Bike, Wikimedia, GTFS providers, etc.