This project implements an end-to-end data pipeline for processing Sephora skincare product and review data, using Google Cloud Platform (GCP), PySpark, Docker, Airflow, dbt, Great Expectations, and Terraform.
The dataset consists of:
- `product_info.csv`: product name, brand, category, price, ingredients, highlights, etc.
- `reviews_1.csv` to `reviews_5.csv`: customer reviews with product ID, rating, feedback_count, etc.
📥 Download: Link to dataset
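For a quick sanity check before running the full pipeline, the raw files can be inspected locally with PySpark. The snippet below is a minimal sketch, assuming the CSVs have been downloaded into a local `data/` folder; the column names (`product_id`, `rating`) are taken from the dataset description above and may differ slightly from the actual headers.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session just for inspection (the pipeline itself runs on Dataproc)
spark = SparkSession.builder.appName("sephora-raw-check").getOrCreate()

# Hypothetical local paths; in the pipeline the files live in the GCS data lake
products = spark.read.csv("data/product_info.csv", header=True, inferSchema=True)
reviews = spark.read.csv("data/reviews_*.csv", header=True, inferSchema=True)

products.printSchema()
print("products:", products.count(), "reviews:", reviews.count())

# Example check: average rating per product across all review files
reviews.groupBy("product_id").agg(F.avg("rating").alias("avg_rating")).show(5)
```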
The pipeline includes the following stages:
- Provision cloud infrastructure on GCP using IaC (Terraform)
- Upload raw CSVs to a data lake (Google Cloud Storage)
- Submit a PySpark job for transformation and cleaning (Dataproc)
- Load processed data into a data warehouse (BigQuery)
- Build a dimensional model with SQL (dbt via Cosmos)
- Run data quality checks (Great Expectations + dbt)
- Orchestrate the entire pipeline (Airflow via Astro CLI)
- Visualize insights (Looker Studio)
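The stages above are tied together by a single Airflow DAG. The snippet below is only a minimal sketch of the Dataproc and BigQuery steps, not the project's actual DAG; the project ID, region, bucket, cluster, file paths, and table names are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT_ID = "my-gcp-project"   # hypothetical
REGION = "us-central1"          # hypothetical
BUCKET = "sephora-raw-data"     # hypothetical

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def sephora_pipeline():
    # Submit the PySpark cleaning job to the Dataproc cluster
    clean_with_spark = DataprocSubmitJobOperator(
        task_id="clean_with_spark",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "sephora-cluster"},  # hypothetical cluster
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/spark/clean.py"},
        },
    )

    # Load the processed files from the data lake into the warehouse
    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket=BUCKET,
        source_objects=["processed/products/*.parquet"],  # hypothetical path
        destination_project_dataset_table=f"{PROJECT_ID}.sephora.products",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    clean_with_spark >> load_to_bigquery

sephora_pipeline()
```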
| Tool / Language | Purpose |
|---|---|
| Google Cloud Storage | Store raw CSV files + PySpark code |
| Dataproc + PySpark | Data transformation and cleaning |
| BigQuery | Data warehousing and SQL-based querying |
| dbt | Data modeling and testing |
| Great Expectations | Data validation and testing |
| Airflow (Astro CLI) | Orchestration of the entire pipeline |
| Terraform | Provision and manage cloud infrastructure (IaC) |
| Docker | Containerization of services |
| Looker Studio | Final dashboard for visualization |
dbt models: The dbt project follows a four-layer architecture:
- source: `products`, `reviews`
- staging: `stg_products`, `stg_reviews`, `stg_customers`, `stg_brands`
- star_schema: `dim_products`, `dim_date`, `dim_customers`, `fact_care`
- marts: `brands_rating`, `products_skin_type`, `rating_price_product`, `total_feedback_product`
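These layers are run from Airflow through Cosmos rather than standalone dbt commands. The snippet below is a minimal sketch of how the dbt project could be wired into the DAG; the profile name and target are assumptions, not the project's exact configuration.

```python
from cosmos import DbtTaskGroup, ProjectConfig, ProfileConfig

# Path matches the dbt commands shown later in this README
DBT_PROJECT_PATH = "/usr/local/airflow/include/dbt"

# Used inside a DAG definition; Cosmos renders one Airflow task per dbt model
dbt_models = DbtTaskGroup(
    group_id="dbt_models",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=ProfileConfig(
        profile_name="sephora",   # hypothetical profile name in profiles.yml
        target_name="dev",        # hypothetical target
        profiles_yml_filepath=f"{DBT_PROJECT_PATH}/profiles.yml",
    ),
)
```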
The dashboard provides insights on:
- Average product prices and ratings by skin type
- Most-reviewed skincare products
- Distribution of product price versus rating
🖼️ Dashboard Preview:
- A Google Cloud Platform (GCP) account with a service account that has admin access to GCS, Dataproc, and BigQuery.
- Docker & Astro CLI installed (refer to the official Astro documentation).
- Python & PySpark installed.
- Terraform installed.
```bash
# Start the local Airflow environment (add --wait 3m if startup is slow)
astro dev start

# Open a shell inside the Airflow container
astro dev bash

# Provision the GCP infrastructure with Terraform
cd terraform/
terraform init
terraform plan
terraform apply

# Install dbt packages and build the models
cd include/dbt
dbt deps
dbt run --profiles-dir /usr/local/airflow/include/dbt/
```