GitHub - swatir-git/airflow-etl: ETL project with Spark and Airflow

ETL Pipeline

Project Overview

This project extracts trending video data for various regions from the YouTube API, processes it to identify the top tags associated with the video categories of trending videos, and then loads the results into an Azure Data Lake Storage (ADLS) Gen 2 container. The pipeline is built with Python, Apache Airflow, and PySpark, and is designed for scalable, efficient daily batch runs.

Features

Data Extraction: Fetches trending video data for various regions using the YouTube Data API v3.
Data Transformation: Processes and transforms the raw video data to extract and analyze top tags associated with video categories.
Data Loading: Loads the transformed data into an ADLS Gen 2 container for further analysis or reporting.
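
The extraction and transformation features above can be illustrated with a minimal plain-Python sketch. It is not the repository's code: the project performs the transformation with PySpark, while this example uses `collections.Counter` to show the same aggregation on a small scale. The function names `build_trending_url` and `top_tags_by_category` and the sample data are hypothetical. The endpoint and the `chart=mostPopular` and `regionCode` parameters belong to the YouTube Data API v3 `videos` resource, and the sample items follow the `snippet.categoryId` and `snippet.tags` fields of its response.

```python
from collections import Counter, defaultdict
from urllib.parse import urlencode

# Endpoint of the YouTube Data API v3 "videos" resource.
VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_url(region_code, api_key, max_results=50):
    """Build the request URL for one region's most popular videos."""
    params = {
        "part": "snippet",
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{VIDEOS_ENDPOINT}?{urlencode(params)}"


def top_tags_by_category(items, n=3):
    """Count tags per categoryId and keep the n most frequent for each."""
    counts = defaultdict(Counter)
    for item in items:
        snippet = item.get("snippet", {})
        category = snippet.get("categoryId")
        if category is None:
            continue
        # Normalize case so "Music" and "music" count as one tag.
        counts[category].update(tag.lower() for tag in snippet.get("tags", []))
    return {cat: [tag for tag, _ in c.most_common(n)] for cat, c in counts.items()}


# Hypothetical sample shaped like the API's response items.
sample_items = [
    {"snippet": {"categoryId": "10", "tags": ["Music", "live"]}},
    {"snippet": {"categoryId": "10", "tags": ["music", "cover"]}},
    {"snippet": {"categoryId": "20", "tags": ["gaming"]}},
    {"snippet": {"categoryId": "20"}},  # a video may carry no tags
]

top_tags = top_tags_by_category(sample_items, n=1)
```

In a PySpark version, the same idea would typically be expressed by exploding the tags array, grouping by category and tag, and ranking the counts within each category.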

Tools and Technologies

Python: Programming language used for the core logic of the ETL pipeline.
Apache Airflow: Used for orchestrating and scheduling the ETL workflow.
PySpark: Used for large-scale data processing and transformation.
YouTube Data API: Provides access to trending video data.
Azure Data Lake Storage Gen 2 (ADLS Gen 2): Cloud storage used for storing the processed data.
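
For the ADLS Gen 2 storage listed above, a daily batch run needs a destination path for each run. The sketch below shows one possible way to build it, using the `abfss://` URI scheme that the Hadoop ABFS driver uses to address ADLS Gen 2. The function name, the container and storage account names, and the `run_date=` folder layout are all assumptions made for illustration; the repository's actual layout is not described here.

```python
from datetime import date


def adls_output_path(container, account, run_date, prefix="trending_tags"):
    """Return an abfss:// URI with one folder per daily run."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{prefix}/run_date={run_date.isoformat()}/"
    )


# Hypothetical container and storage account names.
path = adls_output_path("youtube", "examplestorage", date(2024, 1, 15))
```

A PySpark job with credentials configured for the storage account could then write to such a path, for example with `df.write.mode("overwrite").parquet(path)`.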

