- Python >=3.8
- Apache Airflow 2.10
- Google Cloud Platform account with Cloud Storage and BigQuery enabled
This project implements an ETL (Extract, Transform, Load) pipeline using Apache Airflow to process data from the DummyJSON API. The pipeline extracts user, product, and cart data, transforms it according to specified schemas, and loads it into Google Cloud Storage and BigQuery for analysis.
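The wiring of `dags/etl_dag.py` is not reproduced here, but a minimal sketch of a DAG along these lines might look as follows. The task callables, schedule, and task ids are illustrative assumptions; only the DAG id `dummy_pipeline` comes from the project.

```python
# Illustrative sketch only -- the real DAG lives in dags/etl_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data(**_):
    """Placeholder for the real extract logic (scripts/helpers.py)."""


def transform_data(**_):
    """Placeholder for the real transform logic."""


def load_data(**_):
    """Placeholder for the real load logic."""


with DAG(
    dag_id="dummy_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> transform >> load
```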
Project structure:

```
project_root/
├── dags/
│   └── etl_dag.py
├── schemas/
│   ├── users.py
│   ├── products.py
│   └── carts.py
├── scripts/
│   ├── helpers.py
│   └── bg_sql_scripts.py
├── config/
│   └── config.py
├── tests/
│   └── test_schemas.py
├── requirements.txt
└── README.md
```
- Clone the repository:

  ```bash
  git clone <repository-url>
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # Linux/macOS
  source venv/bin/activate
  # Windows
  .\venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The extract task (see the sketch after this list):
- Fetches data from the DummyJSON API endpoints (users, products, carts)
- Handles pagination
- Stores the raw data locally
- Implements error handling and retries
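A minimal sketch of this step, assuming the `requests` library and DummyJSON's `limit`/`skip` pagination. The helper names and local paths are illustrative, not the project's actual `scripts/helpers.py`.

```python
# Illustrative extraction sketch -- the project's real helpers live in scripts/helpers.py.
import json
import time
from pathlib import Path
from typing import List

import requests

BASE_URL = "https://dummyjson.com"


def fetch_all(resource: str, page_size: int = 100, max_retries: int = 3) -> List[dict]:
    """Fetch every record for a resource (users/products/carts),
    following DummyJSON's limit/skip pagination and retrying transient failures."""
    records, skip = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                resp = requests.get(
                    f"{BASE_URL}/{resource}",
                    params={"limit": page_size, "skip": skip},
                    timeout=10,
                )
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff
        payload = resp.json()
        records.extend(payload[resource])  # e.g. payload["users"]
        skip += page_size
        if skip >= payload["total"]:
            return records


def save_raw(resource: str, records: List[dict], base_path: str = "data/raw") -> Path:
    """Store the raw API payload locally before any transformation (path is an assumption)."""
    out = Path(base_path) / f"{resource}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))
    return out
```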
The transform task (see the sketch after this list):
- Validates and cleans data using Pydantic models
- Flattens nested structures
- Filters products against a price threshold (< 50)
- Calculates cart values
- Performs data quality checks
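A minimal sketch of the validation and derivation logic, assuming Pydantic models with illustrative field names (the real models live in `schemas/`). Keeping products priced under 50 is an assumed reading of the threshold.

```python
# Illustrative transform sketch -- real models live in schemas/, real transforms in scripts/helpers.py.
from typing import List

from pydantic import BaseModel

PRICE_THRESHOLD = 50  # assumed: products priced below this are kept


class Product(BaseModel):
    id: int
    title: str
    price: float
    category: str  # extra API fields are ignored by default


class CartItem(BaseModel):
    id: int
    price: float
    quantity: int


def clean_products(raw_products: List[dict]) -> List[dict]:
    """Validate raw API records with Pydantic and apply the price filter."""
    validated = [Product(**p) for p in raw_products]  # raises on malformed records
    return [p.dict() for p in validated if p.price < PRICE_THRESHOLD]  # .model_dump() in Pydantic v2


def cart_value(items: List[CartItem]) -> float:
    """Derive the total value of a cart from its line items."""
    return sum(item.price * item.quantity for item in items)
```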
The load task (see the sketch after this list):
- Uploads raw and cleaned data to Google Cloud Storage
- Creates BigQuery tables
- Loads transformed data into BigQuery
- Validates data loads
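A hedged sketch using the Google provider's transfer operators. The bucket, object paths, and destination table are placeholders standing in for the `.env` settings described below, and the task ids are assumptions.

```python
# Illustrative load sketch -- the real task definitions live in dags/etl_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

with DAG(dag_id="load_sketch", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    upload_products = LocalFilesystemToGCSOperator(
        task_id="upload_products_to_gcs",
        src="data/clean/products.json",        # assumed local output of the transform step
        dst="clean/products.json",             # object name inside the bucket
        bucket="your-bucket-name",             # BUCKET_NAME from .env
        gcp_conn_id="google_cloud_default",    # GCP_CONNECTION_ID from .env
    )

    load_products = GCSToBigQueryOperator(
        task_id="load_products_to_bigquery",
        bucket="your-bucket-name",
        source_objects=["clean/products.json"],
        destination_project_dataset_table="your-project.your_dataset.products",  # BIGQUERY_PROJECT_DATASET
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
        gcp_conn_id="google_cloud_default",
    )

    upload_products >> load_products
```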
Analysis queries (an illustrative example follows this list):
- User purchase summary query
- Category sales analysis query
- Detailed cart information query
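As one illustration, a user purchase summary along these lines could be defined in `scripts/bg_sql_scripts.py`. The table and column names below are hypothetical.

```python
# Hypothetical user purchase summary -- the real statements live in scripts/bg_sql_scripts.py.
USER_PURCHASE_SUMMARY = """
SELECT
    u.id         AS user_id,
    u.first_name,
    u.last_name,
    COUNT(c.id)  AS cart_count,
    SUM(c.total) AS total_spent
FROM `your-project.your_dataset.users` AS u
LEFT JOIN `your-project.your_dataset.carts` AS c
    ON c.user_id = u.id
GROUP BY u.id, u.first_name, u.last_name
ORDER BY total_spent DESC
"""
```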
Create a `.env` file with:

```
BUCKET_NAME=
BUCKET_PATH=
BIGQUERY_PROJECT_DATASET=
BASE_FILE_PATH=
GCP_CONNECTION_ID=
```
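A sketch of how `config/config.py` might read these values, assuming `python-dotenv`; the actual project may load them differently.

```python
# Sketch of config/config.py -- assumes python-dotenv is available.
import os

from dotenv import load_dotenv

load_dotenv()  # read the .env file in the project root

BUCKET_NAME = os.getenv("BUCKET_NAME")
BUCKET_PATH = os.getenv("BUCKET_PATH")
BIGQUERY_PROJECT_DATASET = os.getenv("BIGQUERY_PROJECT_DATASET")
BASE_FILE_PATH = os.getenv("BASE_FILE_PATH")
GCP_CONNECTION_ID = os.getenv("GCP_CONNECTION_ID", "google_cloud_default")
```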
Required permissions for GCP service account:
- Storage Object Admin
- BigQuery Data Editor
- BigQuery Job User
Place your GCP service account key in `config/gcp.json`.
Run tests using pytest:

```bash
pytest tests/
```
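For illustration, a schema test could look like the following. The `Product` model and its fields are assumptions, not the project's actual schema.

```python
# Illustrative schema test -- the real tests live in tests/test_schemas.py.
import pytest
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    id: int
    title: str
    price: float
    category: str


def test_valid_product_parses():
    product = Product(id=1, title="Pen", price=1.99, category="stationery")
    assert product.price == pytest.approx(1.99)


def test_invalid_price_is_rejected():
    with pytest.raises(ValidationError):
        Product(id=1, title="Pen", price="not-a-number", category="stationery")
```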
- Ensure all configuration is set up correctly
- Start the Airflow webserver and scheduler:

  ```bash
  airflow webserver
  airflow scheduler
  ```

- Access the Airflow UI at http://localhost:8080
- Trigger the DAG named `dummy_pipeline` through the Airflow UI