ETL Pipeline

Requirements

  • Python >=3.8
  • Apache Airflow 2.10
  • Google Cloud Platform account with required services enabled

Overview

This project implements an ETL (Extract, Transform, Load) pipeline using Apache Airflow to process data from the DummyJSON API. The pipeline extracts user, product, and cart data, transforms it according to specified schemas, and loads it into Google Cloud Storage and BigQuery for analysis.
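
For orientation, a minimal sketch of the DAG structure is shown below, using the Airflow TaskFlow API. The task names and bodies are illustrative placeholders; the actual DAG is defined in dags/etl_dag.py.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def dummy_pipeline():
    @task
    def extract():
        # Pull users, products and carts from the DummyJSON API
        ...

    @task
    def transform(raw_data):
        # Validate, flatten and filter the raw payloads with the Pydantic schemas
        ...

    @task
    def load(clean_data):
        # Upload to Google Cloud Storage and load into BigQuery
        ...

    load(transform(extract()))


dummy_pipeline()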

Project Structure

project_root/
├── dags/
│   └── etl_dag.py
├── schemas/
│   ├── users.py
│   ├── products.py
│   └── carts.py
├── scripts/
│   ├── helpers.py
│   └── bg_sql_scripts.py
├── config/
│   └── config.py
├── tests/
│   └── test_schemas.py
├── requirements.txt
└── README.md

Installation

Local Development Setup

  1. Clone the repository:
git clone <repository-url>
  2. Create and activate a virtual environment:
python -m venv venv

# Linux/macOS
source venv/bin/activate

# Windows
.\venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Pipeline Components

1. Data Extraction

  • Fetches data from DummyJSON API endpoints
  • Handles pagination
  • Stores raw data locally
  • Implements error handling and retries
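
A minimal sketch of paginated extraction against DummyJSON is shown below. The function name and defaults are assumptions rather than the project's actual scripts/helpers.py code; retries are typically delegated to Airflow task-level retries.

from typing import Dict, List

import requests


def fetch_all(resource: str, base_url: str = "https://dummyjson.com", page_size: int = 100) -> List[Dict]:
    """Page through a DummyJSON endpoint (e.g. 'users', 'products', 'carts')."""
    records, skip = [], 0
    while True:
        response = requests.get(
            f"{base_url}/{resource}",
            params={"limit": page_size, "skip": skip},
            timeout=30,
        )
        response.raise_for_status()  # surface HTTP errors so the Airflow task can retry
        payload = response.json()
        batch = payload.get(resource, [])
        records.extend(batch)
        skip += page_size
        if not batch or skip >= payload.get("total", 0):
            break
    return records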

2. Data Transformation

  • Validates and cleans data using Pydantic models
  • Flattens nested structures
  • Filters products based on a price threshold (price < 50)
  • Calculates cart values
  • Performs data quality checks
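
As an illustration, a transformation step built on a Pydantic model might look like the sketch below (assuming Pydantic v2; the model and function names are placeholders, not the project's schemas/products.py):

from typing import Dict, List

from pydantic import BaseModel


class Product(BaseModel):
    id: int
    title: str
    price: float
    category: str


def transform_products(raw_products: List[Dict], price_threshold: float = 50.0) -> List[Dict]:
    # Validate each record against the schema, then keep products under the threshold
    validated = [Product(**item) for item in raw_products]
    return [product.model_dump() for product in validated if product.price < price_threshold]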

3. Data Loading

  • Uploads raw and cleaned data
  • Creates BigQuery tables
  • Loads transformed data into BigQuery
  • Validates data loads
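
The loading step can be expressed with the Google provider operators, roughly as sketched below (bucket, table, and connection id values are placeholders corresponding to the environment variables described under Configuration):

from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

upload_clean_users = LocalFilesystemToGCSOperator(
    task_id="upload_clean_users",
    src="/tmp/clean/users.json",
    dst="clean/users.json",
    bucket="<BUCKET_NAME>",
    gcp_conn_id="<GCP_CONNECTION_ID>",
)

load_users_to_bq = GCSToBigQueryOperator(
    task_id="load_users_to_bq",
    bucket="<BUCKET_NAME>",
    source_objects=["clean/users.json"],
    destination_project_dataset_table="<BIGQUERY_PROJECT_DATASET>.users",
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_TRUNCATE",
    gcp_conn_id="<GCP_CONNECTION_ID>",
)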

4. Data Analysis

  • User purchase summary query
  • Category sales analysis query
  • Detailed cart information query
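
The analysis queries presumably live in scripts/bg_sql_scripts.py as SQL strings; the example below shows the rough shape a user purchase summary might take (table and column names are assumptions, not the project's actual SQL):

USER_PURCHASE_SUMMARY = """
SELECT
  u.id AS user_id,
  u.first_name,
  u.last_name,
  COUNT(c.id) AS cart_count,
  SUM(c.total) AS total_spent
FROM `<project.dataset>.users` AS u
LEFT JOIN `<project.dataset>.carts` AS c
  ON c.user_id = u.id
GROUP BY u.id, u.first_name, u.last_name
ORDER BY total_spent DESC
"""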

Configuration

Environment Variables

Create a .env file with:

BUCKET_NAME=
BUCKET_PATH=
BIGQUERY_PROJECT_DATASET=
BASE_FILE_PATH=
GCP_CONNECTION_ID=
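
For illustration, config/config.py could read these values along the lines of the sketch below (assuming python-dotenv; the project's actual loading mechanism may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # read the .env file from the project root

BUCKET_NAME = os.getenv("BUCKET_NAME")
BUCKET_PATH = os.getenv("BUCKET_PATH")
BIGQUERY_PROJECT_DATASET = os.getenv("BIGQUERY_PROJECT_DATASET")
BASE_FILE_PATH = os.getenv("BASE_FILE_PATH")
GCP_CONNECTION_ID = os.getenv("GCP_CONNECTION_ID")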

GCP Setup

Required permissions for GCP service account:

  • Storage Object Admin
  • BigQuery Data Editor
  • BigQuery Job User

Place your GCP service account key in config/gcp.json

Testing

Run tests using pytest:

pytest tests/
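
A schema test in tests/test_schemas.py might look roughly like this (the Product model and its fields are assumed for illustration):

import pytest
from pydantic import ValidationError

from schemas.products import Product  # hypothetical model name


def test_product_rejects_non_numeric_price():
    with pytest.raises(ValidationError):
        Product(id=1, title="Pen", price="not-a-number", category="stationery")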

Usage

  1. Ensure all configuration is set up correctly
  2. Start the Airflow webserver and scheduler:
airflow webserver
airflow scheduler
  3. Access the Airflow UI at http://localhost:8080
  4. Trigger the DAG named dummy_pipeline through the Airflow UI
