This project implements an end-to-end data pipeline for processing Sephora skincare product and review data, using Google Cloud Platform (GCP), PySpark, Docker, Airflow, dbt, Great Expectations, and Terraform.
The dataset consists of:
- `product_info.csv`: product name, brand, category, price, ingredients, highlights, etc.
- `reviews_1.csv` to `reviews_5.csv`: customer reviews with product ID, rating, feedback_count, etc.
📥 Download: Link to dataset
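For a quick sanity check before running the full pipeline, the raw files can be inspected locally with PySpark. The snippet below is a minimal sketch, assuming the CSVs have been downloaded into a local `data/` folder; the column names (`product_id`, `rating`) are taken from the dataset description above and may differ slightly from the actual headers.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session just for inspection (the pipeline itself runs on Dataproc)
spark = SparkSession.builder.appName("sephora-raw-check").getOrCreate()

# Hypothetical local paths; in the pipeline the files live in the GCS data lake
products = spark.read.csv("data/product_info.csv", header=True, inferSchema=True)
reviews = spark.read.csv("data/reviews_*.csv", header=True, inferSchema=True)

products.printSchema()
print("products:", products.count(), "reviews:", reviews.count())

# Example check: average rating per product across all review files
reviews.groupBy("product_id").agg(F.avg("rating").alias("avg_rating")).show(5)
```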
The pipeline includes the following stages:
- Provision cloud infrastructure on GCP using IaC (Terraform)
- Upload raw CSVs to a data lake (Google Cloud Storage)
- Submit a PySpark job for transformation and cleaning (Dataproc)
- Load processed data into a data warehouse (BigQuery)
- Build a dimensional model with SQL (dbt via Cosmos)
- Run data quality checks (Great Expectations + dbt)
- Orchestrate the entire pipeline (Airflow via Astro CLI)
- Visualize insights (Looker Studio)
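The stages above are tied together by a single Airflow DAG. The snippet below is only a minimal sketch of the Dataproc and BigQuery steps, not the project's actual DAG; the project ID, region, bucket, cluster, file paths, and table names are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT_ID = "my-gcp-project"   # hypothetical
REGION = "us-central1"          # hypothetical
BUCKET = "sephora-raw-data"     # hypothetical

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def sephora_pipeline():
    # Submit the PySpark cleaning job to the Dataproc cluster
    clean_with_spark = DataprocSubmitJobOperator(
        task_id="clean_with_spark",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "sephora-cluster"},  # hypothetical cluster
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/spark/clean.py"},
        },
    )

    # Load the processed files from the data lake into the warehouse
    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket=BUCKET,
        source_objects=["processed/products/*.parquet"],  # hypothetical path
        destination_project_dataset_table=f"{PROJECT_ID}.sephora.products",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    clean_with_spark >> load_to_bigquery

sephora_pipeline()
```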
| Tool / Language | Purpose |
|---|---|
| Google Cloud Storage | Store raw CSV files + PySpark code |
| Dataproc + PySpark | Data transformation and cleaning |
| BigQuery | Data warehousing and SQL-based querying |
| dbt | Data modeling and testing |
| Great Expectations | Data validation and testing |
| Airflow (Astro CLI) | Orchestration of the entire pipeline |
| Terraform | Provision and manage cloud infrastructure (IaC) |
| Docker | Containerization of services |
| Looker Studio | Final dashboard for visualization |
dbt models: The dbt project follows a four-layer architecture:
- source: `products`, `reviews`
- staging: `stg_products`, `stg_reviews`, `stg_customers`, `stg_brands`
- star_schema: `dim_products`, `dim_date`, `dim_customers`, `fact_care`
- marts: `brands_rating`, `products_skin_type`, `rating_price_product`, `total_feedback_product`
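These layers are run from Airflow through Cosmos rather than standalone dbt commands. The snippet below is a minimal sketch of how the dbt project could be wired into the DAG; the profile name and target are assumptions, not the project's exact configuration.

```python
from cosmos import DbtTaskGroup, ProjectConfig, ProfileConfig

# Path matches the dbt commands shown later in this README
DBT_PROJECT_PATH = "/usr/local/airflow/include/dbt"

# Used inside a DAG definition; Cosmos renders one Airflow task per dbt model
dbt_models = DbtTaskGroup(
    group_id="dbt_models",
    project_config=ProjectConfig(DBT_PROJECT_PATH),
    profile_config=ProfileConfig(
        profile_name="sephora",   # hypothetical profile name in profiles.yml
        target_name="dev",        # hypothetical target
        profiles_yml_filepath=f"{DBT_PROJECT_PATH}/profiles.yml",
    ),
)
```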
The dashboard provides insights on:
- Average product prices and ratings by skin type
- Most-reviewed skincare products
- Distribution of product price versus rating
🖼️ Dashboard Preview:
- A Google Cloud Platform (GCP) account with a service account that has admin access to GCS, Dataproc, and BigQuery.
- Docker & Astro CLI installed (refer to the official Astro documentation).
- Python & PySpark installed.
- Terraform installed.
```bash
# Start the local Airflow environment (add --wait 3m if startup is slow)
astro dev start

# Open a shell inside the Airflow container
astro dev bash

# Provision the GCP infrastructure with Terraform
cd terraform/
terraform init
terraform plan
terraform apply

# Install dbt packages and build the models
cd include/dbt
dbt deps
dbt run --profiles-dir /usr/local/airflow/include/dbt/
```