# VidFlow: YouTube Data Automation

> **Note:** Project files are temporarily unavailable while some implementation issues are being resolved. The complete codebase will be uploaded soon. In the meantime, please refer to this README for the project architecture and technical details.
VidFlow is a comprehensive data engineering project that builds an ETL (Extract, Transform, Load) pipeline for YouTube trending videos data. The pipeline extracts data using the YouTube API, processes it with Apache Spark, stores it in AWS S3, and makes it available for analysis through AWS Athena and visualization in Tableau.
## Table of Contents

- [Architecture](#architecture)
- [Technologies Used](#technologies-used)
- [Data Source](#data-source)
- [Project Structure](#project-structure)
- [Setup and Installation](#setup-and-installation)
- [Running the Pipeline](#running-the-pipeline)
- [Data Schema](#data-schema)
- [Analytics Capabilities](#analytics-capabilities)
- [Visualizations](#visualizations)
- [Future Enhancements](#future-enhancements)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
## Architecture

This project follows a modern data engineering architecture:

1. **Data Extraction**: The YouTube API is used to fetch trending videos data
2. **Data Processing**: Apache Spark processes and transforms the raw data
3. **Data Storage**: Processed data is stored in AWS S3 in Parquet format
4. **Data Cataloging**: An AWS Glue Crawler catalogs the data for querying
5. **Data Querying**: AWS Athena provides SQL querying capabilities
6. **Data Visualization**: Tableau connects to Athena for creating dashboards
## Technologies Used

- **Apache Spark**: Distributed data processing
- **Python**: Core programming language
- **AWS Services**:
  - **S3**: Object storage for the data lake
  - **EC2**: Compute for data processing
  - **Glue**: Data catalog service
  - **Athena**: Serverless query service
  - **IAM**: Identity and access management
- **Tableau**: Data visualization and dashboards
- **Git**: Version control
- **Docker**: Containerization (optional)
## Data Source

The project uses the YouTube Data API v3 to collect trending videos data. The API provides access to various YouTube resources, including:
- Video metadata (title, description, publish date, etc.)
- Channel information
- Video statistics (views, likes, comments, etc.)
- Video categories
- Geographic trending data
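For illustration, here is a minimal extraction sketch built on `google-api-python-client`. The helper name `fetch_trending` and the inline API key are placeholders; the project's actual connector in `src/extraction/youtube_api.py` may be structured differently.

```python
# Minimal sketch: fetch the "mostPopular" (trending) chart for one region.
# fetch_trending and the literal API key are illustrative placeholders; the
# project's real connector (src/extraction/youtube_api.py) may differ.
from googleapiclient.discovery import build


def fetch_trending(api_key, region_code="US", max_results=50):
    """Return the raw API items for the current trending videos in a region."""
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.videos().list(
        part="snippet,statistics",   # metadata plus view/like/comment counts
        chart="mostPopular",         # the trending chart
        regionCode=region_code,
        maxResults=max_results,      # the API caps a single page at 50
    ).execute()
    return response.get("items", [])


if __name__ == "__main__":
    items = fetch_trending("YOUR_YOUTUBE_API_KEY")
    print(f"Fetched {len(items)} trending videos")
```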
## Project Structure

```
youtube-data-analysis/
│
├── config/
│   └── config.ini                 # Configuration file for API keys and AWS settings
│
├── src/
│   ├── extraction/                # Data extraction scripts
│   │   ├── youtube_api.py         # YouTube API connector
│   │   └── extract_data.py        # Main extraction script
│   │
│   ├── transformation/            # Data transformation scripts
│   │   ├── spark_jobs/            # Spark transformation jobs
│   │   └── transform_data.py      # Main transformation script
│   │
│   ├── loading/                   # Data loading scripts
│   │   └── load_to_s3.py          # Script to load data to S3
│   │
│   └── utils/                     # Utility functions
│       ├── s3_utils.py            # S3 utility functions
│       └── logging_utils.py       # Logging utility functions
│
├── notebooks/                     # Jupyter notebooks for exploration and testing
│   ├── data_exploration.ipynb     # Data exploration notebook
│   └── spark_testing.ipynb        # Spark testing notebook
│
├── scripts/                       # Shell scripts for automation
│   ├── setup.sh                   # Setup script
│   └── run_pipeline.sh            # Script to run the full pipeline
│
├── sql/                           # SQL queries for Athena
│   ├── create_tables.sql          # Table creation queries
│   └── analysis_queries.sql       # Analysis queries
│
├── terraform/                     # Infrastructure as Code (optional)
│   └── main.tf                    # Terraform configuration
│
├── docs/                          # Documentation
│   └── images/                    # Images for documentation
│
├── .gitignore                     # Git ignore file
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Docker configuration (optional)
├── LICENSE                        # License file
└── README.md                      # Project README
```
## Setup and Installation

### Prerequisites

- AWS Account with appropriate permissions
- Python 3.8+
- Apache Spark 3.1+
- YouTube Data API key
- Tableau Desktop (for visualization)
### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/Youtube-Data-Analysis---Data-Engineering-Project.git
   cd Youtube-Data-Analysis---Data-Engineering-Project
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the AWS CLI and configure credentials:

   ```bash
   pip install awscli
   aws configure
   ```

4. Create a `config.ini` file in the `config` directory with the following structure (see the reading sketch after this list):

   ```ini
   [youtube_api]
   api_key = YOUR_YOUTUBE_API_KEY

   [aws]
   region = YOUR_AWS_REGION
   s3_bucket = YOUR_S3_BUCKET_NAME
   ```

5. Create the S3 bucket if it doesn't exist:

   ```bash
   aws s3 mb s3://YOUR_S3_BUCKET_NAME
   ```
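For reference, these settings can be read with Python's standard `configparser`. A minimal sketch, assuming the section and key names from the template in step 4:

```python
# Sketch: reading config/config.ini with the standard library.
# Section and key names follow the template shown in step 4.
from configparser import ConfigParser

config = ConfigParser()
config.read("config/config.ini")

api_key = config["youtube_api"]["api_key"]
region = config["aws"]["region"]
s3_bucket = config["aws"]["s3_bucket"]
```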
## Running the Pipeline

Run each stage individually:

1. Run the extraction script:

   ```bash
   python src/extraction/extract_data.py
   ```

2. Run the transformation script:

   ```bash
   python src/transformation/transform_data.py
   ```

3. Run the loading script:

   ```bash
   python src/loading/load_to_s3.py
   ```

Or run the entire pipeline with a single command:

```bash
./scripts/run_pipeline.sh
```
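To give a flavor of the transformation stage, here is a condensed PySpark sketch in the spirit of `src/transformation/transform_data.py`. The input/output paths, the bucket name, and the assumption that extraction stamps each record with top-level `region` and `trending_date` fields are all placeholders, not the project's confirmed implementation.

```python
# Condensed sketch of the transform step: flatten raw YouTube API JSON
# (assumed one record per video, per line) into the tabular schema below
# and write Parquet to S3. Paths, the bucket name, and the region/
# trending_date fields (assumed stamped at extraction time) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vidflow-transform").getOrCreate()

raw = spark.read.json("data/raw/trending_*.json")

videos = raw.select(
    F.col("id").alias("video_id"),
    F.col("snippet.title").alias("title"),
    F.col("snippet.channelId").alias("channel_id"),
    F.col("snippet.channelTitle").alias("channel_title"),
    F.col("snippet.publishedAt").cast("timestamp").alias("publish_time"),
    F.col("snippet.tags").alias("tags"),
    F.col("snippet.categoryId").alias("category_id"),
    F.col("statistics.viewCount").cast("long").alias("view_count"),
    F.col("statistics.likeCount").cast("long").alias("likes"),
    F.col("statistics.commentCount").cast("long").alias("comment_count"),
    F.col("region"),                                      # stamped at extraction
    F.col("trending_date").cast("date").alias("trending_date"),
)

(videos.write
       .mode("overwrite")
       .partitionBy("region", "trending_date")
       .parquet("s3a://YOUR_S3_BUCKET_NAME/processed/videos/"))
```

Partitioning by `region` and `trending_date` keeps Athena scans cheap when queries filter on those columns.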
## Data Schema

After processing, the data follows this schema in the data lake:

**Videos Table:**

- `video_id` (string): Unique identifier for the video
- `title` (string): Video title
- `channel_id` (string): Channel identifier
- `channel_title` (string): Channel name
- `publish_time` (timestamp): When the video was published
- `tags` (array): Video tags
- `category_id` (string): Category identifier
- `trending_date` (date): Date when the video was trending
- `view_count` (long): Number of views
- `likes` (long): Number of likes
- `dislikes` (long): Number of dislikes
- `comment_count` (long): Number of comments
- `thumbnail_link` (string): Link to the thumbnail
- `comments_disabled` (boolean): Whether comments are disabled
- `ratings_disabled` (boolean): Whether ratings are disabled
- `description` (string): Video description
- `region` (string): Region where the video is trending
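Expressed in code, the same schema as a PySpark `StructType` (a sketch mirroring the field list above; not necessarily verbatim from the project's Spark jobs):

```python
# The videos table schema above, expressed as a PySpark StructType (sketch).
from pyspark.sql.types import (
    ArrayType, BooleanType, DateType, LongType,
    StringType, StructField, StructType, TimestampType,
)

VIDEOS_SCHEMA = StructType([
    StructField("video_id", StringType()),
    StructField("title", StringType()),
    StructField("channel_id", StringType()),
    StructField("channel_title", StringType()),
    StructField("publish_time", TimestampType()),
    StructField("tags", ArrayType(StringType())),
    StructField("category_id", StringType()),
    StructField("trending_date", DateType()),
    StructField("view_count", LongType()),
    StructField("likes", LongType()),
    StructField("dislikes", LongType()),
    StructField("comment_count", LongType()),
    StructField("thumbnail_link", StringType()),
    StructField("comments_disabled", BooleanType()),
    StructField("ratings_disabled", BooleanType()),
    StructField("description", StringType()),
    StructField("region", StringType()),
])
```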
## Analytics Capabilities

With this pipeline, you can perform various analyses (an example query sketch follows this list):

1. **Trending Videos Analysis**:
   - Most viewed videos by category
   - Videos with the highest engagement (likes/views ratio)
   - Trending patterns over time

2. **Content Creator Analysis**:
   - Top channels by trending videos
   - Channel performance metrics

3. **Regional Analysis**:
   - Region-specific trending patterns
   - Content preferences by region

4. **Temporal Analysis**:
   - Day-of-week and time-of-day trends
   - Seasonal trending patterns
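As an example, the engagement analysis can be run against Athena programmatically with boto3. In this sketch, the database name `youtube_analytics`, the `videos` table, and the results location are assumptions, not the project's confirmed names:

```python
# Sketch: running an engagement query against Athena with boto3.
# Database, table, region, and output-location names are placeholders.
import boto3

QUERY = """
SELECT video_id,
       title,
       CAST(likes AS double) / NULLIF(view_count, 0) AS engagement_ratio
FROM videos
ORDER BY engagement_ratio DESC
LIMIT 10
"""

athena = boto3.client("athena", region_name="YOUR_AWS_REGION")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "youtube_analytics"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_S3_BUCKET_NAME/athena-results/"},
)
print("Query execution id:", execution["QueryExecutionId"])
```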
## Visualizations

Example visualizations you can create with Tableau:

1. **Trending Dashboard**:
   - Heatmap of trending videos by category and region
   - Time series of trending video metrics

2. **Engagement Analysis**:
   - Comparison of engagement metrics across categories
   - Correlation between video attributes and performance

3. **Content Creator Insights**:
   - Top channels by region and category
   - Channel growth and performance metrics
## Future Enhancements

- Implement real-time data processing with Kafka and Spark Streaming
- Add sentiment analysis on video comments
- Develop ML models to predict video performance
- Create an automated recommendation system
- Implement API for accessing processed data
## Troubleshooting

Common issues:

1. **YouTube API Quota Exceeded**
   - Solution: Implement rate limiting (see the sketch below) or use multiple API keys

2. **Spark Job Failures**
   - Solution: Check logs in the Spark UI and ensure sufficient resources

3. **AWS Athena Query Issues**
   - Solution: Verify that the AWS Glue Crawler has correctly cataloged the data
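For the quota issue above, a minimal client-side backoff sketch; the helper and retry policy are illustrative, not part of the project's code:

```python
# Sketch: retrying a YouTube API call with exponential backoff when a
# quota or rate limit is hit (HTTP 403/429). Policy is illustrative.
import time

from googleapiclient.errors import HttpError


def call_with_backoff(request, max_retries=5):
    """Execute a googleapiclient request, backing off on quota errors."""
    for attempt in range(max_retries):
        try:
            return request.execute()
        except HttpError as err:
            if err.resp.status not in (403, 429) or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```

Note that backoff only helps with burst rate limits; a fully exhausted daily quota resets on the API's own schedule, so rotating keys or reducing request volume may still be necessary.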
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.