# VidFlow: YouTube Data Automation

> **Note:** Project files are temporarily unavailable while some implementation issues are being resolved. The complete codebase will be uploaded soon. In the meantime, please refer to this README for the project architecture and technical details.
VidFlow is a comprehensive data engineering project that builds an ETL (Extract, Transform, Load) pipeline for YouTube trending videos data. The pipeline extracts data using the YouTube API, processes it with Apache Spark, stores it in AWS S3, and makes it available for analysis through AWS Athena and visualization in Tableau.
## Table of Contents

- [Architecture](#architecture)
- [Technologies Used](#technologies-used)
- [Data Source](#data-source)
- [Project Structure](#project-structure)
- [Setup and Installation](#setup-and-installation)
- [Running the Pipeline](#running-the-pipeline)
- [Data Schema](#data-schema)
- [Analytics Capabilities](#analytics-capabilities)
- [Visualizations](#visualizations)
- [Future Enhancements](#future-enhancements)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
## Architecture

This project follows a modern data engineering architecture:

1. **Data Extraction**: The YouTube API is used to fetch trending videos data
2. **Data Processing**: Apache Spark processes and transforms the raw data
3. **Data Storage**: Processed data is stored in AWS S3 in Parquet format
4. **Data Cataloging**: An AWS Glue Crawler catalogs the data for querying
5. **Data Querying**: AWS Athena provides SQL querying capabilities
6. **Data Visualization**: Tableau connects to Athena for creating dashboards
## Technologies Used

- **Apache Spark**: Distributed data processing
- **Python**: Core programming language
- **AWS Services**:
  - **S3**: Object storage for the data lake
  - **EC2**: Compute for data processing
  - **Glue**: Data catalog service
  - **Athena**: Serverless query service
  - **IAM**: Identity and access management
- **Tableau**: Data visualization and dashboards
- **Git**: Version control
- **Docker**: Containerization (optional)
## Data Source

The project uses the YouTube Data API v3 to collect trending videos data. The API provides access to various YouTube resources, including:
- Video metadata (title, description, publish date, etc.)
- Channel information
- Video statistics (views, likes, comments, etc.)
- Video categories
- Geographic trending data
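For illustration, here is a minimal extraction sketch built on `google-api-python-client`. The helper name `fetch_trending` and the inline API key are placeholders; the project's actual connector in `src/extraction/youtube_api.py` may be structured differently.

```python
# Minimal sketch: fetch the "mostPopular" (trending) chart for one region.
# fetch_trending and the literal API key are illustrative placeholders; the
# project's real connector (src/extraction/youtube_api.py) may differ.
from googleapiclient.discovery import build


def fetch_trending(api_key, region_code="US", max_results=50):
    """Return the raw API items for the current trending videos in a region."""
    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.videos().list(
        part="snippet,statistics",   # metadata plus view/like/comment counts
        chart="mostPopular",         # the trending chart
        regionCode=region_code,
        maxResults=max_results,      # the API caps a single page at 50
    ).execute()
    return response.get("items", [])


if __name__ == "__main__":
    items = fetch_trending("YOUR_YOUTUBE_API_KEY")
    print(f"Fetched {len(items)} trending videos")
```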
## Project Structure

```
youtube-data-analysis/
│
├── config/
│   └── config.ini                 # Configuration file for API keys and AWS settings
│
├── src/
│   ├── extraction/                # Data extraction scripts
│   │   ├── youtube_api.py         # YouTube API connector
│   │   └── extract_data.py        # Main extraction script
│   │
│   ├── transformation/            # Data transformation scripts
│   │   ├── spark_jobs/            # Spark transformation jobs
│   │   └── transform_data.py      # Main transformation script
│   │
│   ├── loading/                   # Data loading scripts
│   │   └── load_to_s3.py          # Script to load data to S3
│   │
│   └── utils/                     # Utility functions
│       ├── s3_utils.py            # S3 utility functions
│       └── logging_utils.py       # Logging utility functions
│
├── notebooks/                     # Jupyter notebooks for exploration and testing
│   ├── data_exploration.ipynb     # Data exploration notebook
│   └── spark_testing.ipynb        # Spark testing notebook
│
├── scripts/                       # Shell scripts for automation
│   ├── setup.sh                   # Setup script
│   └── run_pipeline.sh            # Script to run the full pipeline
│
├── sql/                           # SQL queries for Athena
│   ├── create_tables.sql          # Table creation queries
│   └── analysis_queries.sql       # Analysis queries
│
├── terraform/                     # Infrastructure as Code (optional)
│   └── main.tf                    # Terraform configuration
│
├── docs/                          # Documentation
│   └── images/                    # Images for documentation
│
├── .gitignore                     # Git ignore file
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Docker configuration (optional)
├── LICENSE                        # License file
└── README.md                      # Project README
```
## Setup and Installation

### Prerequisites

- AWS Account with appropriate permissions
- Python 3.8+
- Apache Spark 3.1+
- YouTube Data API key
- Tableau Desktop (for visualization)
### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/Youtube-Data-Analysis---Data-Engineering-Project.git
   cd Youtube-Data-Analysis---Data-Engineering-Project
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the AWS CLI and configure credentials:

   ```bash
   pip install awscli
   aws configure
   ```

4. Create a `config.ini` file in the `config` directory with the following structure (see the reading sketch after this list):

   ```ini
   [youtube_api]
   api_key = YOUR_YOUTUBE_API_KEY

   [aws]
   region = YOUR_AWS_REGION
   s3_bucket = YOUR_S3_BUCKET_NAME
   ```

5. Create the S3 bucket if it doesn't exist:

   ```bash
   aws s3 mb s3://YOUR_S3_BUCKET_NAME
   ```
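For reference, these settings can be read with Python's standard `configparser`. A minimal sketch, assuming the section and key names from the template in step 4:

```python
# Sketch: reading config/config.ini with the standard library.
# Section and key names follow the template shown in step 4.
from configparser import ConfigParser

config = ConfigParser()
config.read("config/config.ini")

api_key = config["youtube_api"]["api_key"]
region = config["aws"]["region"]
s3_bucket = config["aws"]["s3_bucket"]
```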
## Running the Pipeline

Run each stage individually:

1. Run the extraction script:

   ```bash
   python src/extraction/extract_data.py
   ```

2. Run the transformation script:

   ```bash
   python src/transformation/transform_data.py
   ```

3. Run the loading script:

   ```bash
   python src/loading/load_to_s3.py
   ```

Or run the entire pipeline with a single command:

```bash
./scripts/run_pipeline.sh
```
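To give a flavor of the transformation stage, here is a condensed PySpark sketch in the spirit of `src/transformation/transform_data.py`. The input/output paths, the bucket name, and the assumption that extraction stamps each record with top-level `region` and `trending_date` fields are all placeholders, not the project's confirmed implementation.

```python
# Condensed sketch of the transform step: flatten raw YouTube API JSON
# (assumed one record per video, per line) into the tabular schema below
# and write Parquet to S3. Paths, the bucket name, and the region/
# trending_date fields (assumed stamped at extraction time) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vidflow-transform").getOrCreate()

raw = spark.read.json("data/raw/trending_*.json")

videos = raw.select(
    F.col("id").alias("video_id"),
    F.col("snippet.title").alias("title"),
    F.col("snippet.channelId").alias("channel_id"),
    F.col("snippet.channelTitle").alias("channel_title"),
    F.col("snippet.publishedAt").cast("timestamp").alias("publish_time"),
    F.col("snippet.tags").alias("tags"),
    F.col("snippet.categoryId").alias("category_id"),
    F.col("statistics.viewCount").cast("long").alias("view_count"),
    F.col("statistics.likeCount").cast("long").alias("likes"),
    F.col("statistics.commentCount").cast("long").alias("comment_count"),
    F.col("region"),                                      # stamped at extraction
    F.col("trending_date").cast("date").alias("trending_date"),
)

(videos.write
       .mode("overwrite")
       .partitionBy("region", "trending_date")
       .parquet("s3a://YOUR_S3_BUCKET_NAME/processed/videos/"))
```

Partitioning by `region` and `trending_date` keeps Athena scans cheap when queries filter on those columns.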
## Data Schema

After processing, the data follows this schema in the data lake:

**Videos Table:**

- `video_id` (string): Unique identifier for the video
- `title` (string): Video title
- `channel_id` (string): Channel identifier
- `channel_title` (string): Channel name
- `publish_time` (timestamp): When the video was published
- `tags` (array): Video tags
- `category_id` (string): Category identifier
- `trending_date` (date): Date when the video was trending
- `view_count` (long): Number of views
- `likes` (long): Number of likes
- `dislikes` (long): Number of dislikes
- `comment_count` (long): Number of comments
- `thumbnail_link` (string): Link to the thumbnail
- `comments_disabled` (boolean): Whether comments are disabled
- `ratings_disabled` (boolean): Whether ratings are disabled
- `description` (string): Video description
- `region` (string): Region where the video is trending
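Expressed in code, the same schema as a PySpark `StructType` (a sketch mirroring the field list above; not necessarily verbatim from the project's Spark jobs):

```python
# The videos table schema above, expressed as a PySpark StructType (sketch).
from pyspark.sql.types import (
    ArrayType, BooleanType, DateType, LongType,
    StringType, StructField, StructType, TimestampType,
)

VIDEOS_SCHEMA = StructType([
    StructField("video_id", StringType()),
    StructField("title", StringType()),
    StructField("channel_id", StringType()),
    StructField("channel_title", StringType()),
    StructField("publish_time", TimestampType()),
    StructField("tags", ArrayType(StringType())),
    StructField("category_id", StringType()),
    StructField("trending_date", DateType()),
    StructField("view_count", LongType()),
    StructField("likes", LongType()),
    StructField("dislikes", LongType()),
    StructField("comment_count", LongType()),
    StructField("thumbnail_link", StringType()),
    StructField("comments_disabled", BooleanType()),
    StructField("ratings_disabled", BooleanType()),
    StructField("description", StringType()),
    StructField("region", StringType()),
])
```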
## Analytics Capabilities

With this pipeline, you can perform various analyses (an example query sketch follows this list):

1. **Trending Videos Analysis**:
   - Most viewed videos by category
   - Videos with the highest engagement (likes/views ratio)
   - Trending patterns over time

2. **Content Creator Analysis**:
   - Top channels by trending videos
   - Channel performance metrics

3. **Regional Analysis**:
   - Region-specific trending patterns
   - Content preferences by region

4. **Temporal Analysis**:
   - Day-of-week and time-of-day trends
   - Seasonal trending patterns
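As an example, the engagement analysis can be run against Athena programmatically with boto3. In this sketch, the database name `youtube_analytics`, the `videos` table, and the results location are assumptions, not the project's confirmed names:

```python
# Sketch: running an engagement query against Athena with boto3.
# Database, table, region, and output-location names are placeholders.
import boto3

QUERY = """
SELECT video_id,
       title,
       CAST(likes AS double) / NULLIF(view_count, 0) AS engagement_ratio
FROM videos
ORDER BY engagement_ratio DESC
LIMIT 10
"""

athena = boto3.client("athena", region_name="YOUR_AWS_REGION")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "youtube_analytics"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_S3_BUCKET_NAME/athena-results/"},
)
print("Query execution id:", execution["QueryExecutionId"])
```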
## Visualizations

Example visualizations you can create with Tableau:

1. **Trending Dashboard**:
   - Heatmap of trending videos by category and region
   - Time series of trending video metrics

2. **Engagement Analysis**:
   - Comparison of engagement metrics across categories
   - Correlation between video attributes and performance

3. **Content Creator Insights**:
   - Top channels by region and category
   - Channel growth and performance metrics
## Future Enhancements

- Implement real-time data processing with Kafka and Spark Streaming
- Add sentiment analysis on video comments
- Develop ML models to predict video performance
- Create an automated recommendation system
- Implement API for accessing processed data
## Troubleshooting

Common issues:

1. **YouTube API Quota Exceeded**
   - Solution: Implement rate limiting (see the sketch below) or use multiple API keys

2. **Spark Job Failures**
   - Solution: Check logs in the Spark UI and ensure sufficient resources

3. **AWS Athena Query Issues**
   - Solution: Verify that the AWS Glue Crawler has correctly cataloged the data
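For the quota issue above, a minimal client-side backoff sketch; the helper and retry policy are illustrative, not part of the project's code:

```python
# Sketch: retrying a YouTube API call with exponential backoff when a
# quota or rate limit is hit (HTTP 403/429). Policy is illustrative.
import time

from googleapiclient.errors import HttpError


def call_with_backoff(request, max_retries=5):
    """Execute a googleapiclient request, backing off on quota errors."""
    for attempt in range(max_retries):
        try:
            return request.execute()
        except HttpError as err:
            if err.resp.status not in (403, 429) or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```

Note that backoff only helps with burst rate limits; a fully exhausted daily quota resets on the API's own schedule, so rotating keys or reducing request volume may still be necessary.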
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.