This sentiment analysis project aims to classify YouTube comments into positive, neutral, or negative sentiments using advanced machine learning techniques. The project comprises the following key components:
Model Training:
- The project uses LightGBM as the base model for its high performance and interpretability.
- Employs feature extraction techniques such as TF-IDF and handles class imbalances through resampling techniques to improve model robustness.
Backend:
- A FastAPI application serves as the backend for real-time predictions, exposing RESTful API endpoints to interact with the trained model.
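For illustration, a client can query the backend over HTTP; the endpoint path and payload shape below are assumptions for the sketch, not the exact API contract:

```python
import requests

# Hypothetical route and schema; the real ones are defined in fastapi/app.py.
response = requests.post(
    "http://localhost:8000/predict",
    json={"comments": ["This video was really helpful!"]},
)
print(response.json())
```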
MLflow, DVC & Dagshub:
- MLflow is leveraged for model tracking and versioning, ensuring seamless experiment management.
- DVC enables efficient data and model version control, improving pipeline reproducibility and collaboration.
- Experiments and model tracking are integrated with Dagshub, providing an interactive dashboard for managing model lifecycle and monitoring metrics.
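Concretely, pointing MLflow at the Dagshub-hosted tracking server is a one-line configuration; the repository URI and experiment name below are placeholders:

```python
import mlflow

mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")  # placeholder URI
mlflow.set_experiment("sentiment-analysis")  # assumed experiment name

with mlflow.start_run():
    mlflow.log_param("model", "LightGBM")
    mlflow.log_metric("accuracy", 0.87)  # example value only
```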
Frontend Integration:
- A Chrome Plugin acts as the user-facing frontend, interacting with the backend for predictions.
- Explore the frontend repository here.
Continuous Integration and Deployment (CI/CD):
- Managed through a `cicd.yaml` workflow in GitHub Actions to ensure automated testing, building, and deployment.
AWS Integration:
- The backend is deployed on AWS EC2 instances using AWS CodeDeploy.
- An Elastic Load Balancer (ELB) is set up to distribute traffic efficiently.
- Auto Scaling Groups (ASG) ensure high availability and scalability under varying loads.
Containerization:
- The entire backend is containerized using Docker for consistency across development and production environments.
- A public Amazon Elastic Container Registry (ECR) hosts the container image for easy access and deployment.
You can pull the latest version of the Docker image from the public Amazon ECR repository using the following command:

```bash
docker pull public.ecr.aws/m3t3s7a1/yt-plugin:latest
```
- Watch a video demonstration of the project here.
- View all model experiments and their results on Dagshub here.
The project follows a well-defined structure that organizes code, data, models, and configurations in separate directories:
└── 📁sentiment_analysis
└── 📁.github
└── 📁workflows
└── ⚙️ cicd.yaml
└── 📁deploy
└── 📁scripts
└── download_env.sh
└── install_dependencies.sh
└── start_docker.sh
└── 📁data
└── 📁interim
└── test_processed.csv
└── train_processed.csv
└── 📁processed
└── train_target.csv
└── train_tfidf.csv
└── 📁raw
└── test.csv
└── train.csv
└── 📁visualizations
└── 🖼️confusion_matrix_Test_Data.png
└── 📁fastapi
└── app.py
└── 📜requirements.txt
└── 📁models
└── 💾lgbm_model.joblib
└── 💾tfidf_vectorizer.joblib
└── 📁notebooks
└── preprocessing_eda.ipynb
└── exp_1_baseline_model.ipynb
└── exp_2_bow_tfidf_word2vec.ipynb
└── exp_3_handling_imbalanced_data.ipynb
└── exp_4_tuning_ml_algo.ipynb
└── exp_4_tuning_ml_algo_2.ipynb
└── exp_5_lightGBM_final.ipynb
└── 📁scripts
└── load_model_test.py
└── performance_test.py
└── promote_model.py
└── fastapi_test.py
└── 📁src
└── data_ingestion.py
└── data_preprocessing.py
└── feature_extraction.py
└── model_building.py
└── model_evaluation.py
└── register_model.py
└── utils.py
└── .dvcignore
└── .env
└── .gitignore
└── 🐳Dockerfile
└── 📝appspec.yml
└── dvc.lock
└── 📝dvc.yaml
└── experiment_info.json
└── Makefile
└── 📝params.yaml
└── pyproject.toml
└── README.md
└── 📜requirements.txt
This notebook (`preprocessing_eda.ipynb`) focuses on Exploratory Data Analysis (EDA) and preprocessing steps. It includes:
- Data loading: Imported and explored the Reddit sentiment dataset.
- Cleaning and preprocessing: Removed missing values, duplicates, URLs, and non-English characters; converted text to lowercase; and applied lemmatization.
- Feature engineering: Added columns for word count, character count, and punctuation count.
- Visualization: Analyzed class distribution, word counts, and stop words for each sentiment category (the core cleaning steps are sketched below).
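A minimal sketch of the cleaning described above, assuming NLTK's WordNet lemmatizer (the exact implementation lives in the notebook):

```python
import re
from nltk.stem import WordNetLemmatizer

# Requires a one-time nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def clean_comment(text: str) -> str:
    text = text.lower()                            # lowercase
    text = re.sub(r"http\S+|www\.\S+", "", text)   # strip URLs
    text = re.sub(r"[^a-z0-9\s.,!?]", "", text)    # drop non-English characters
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.split())
```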
This notebook (`exp_1_baseline_model.ipynb`) builds a baseline Random Forest model for sentiment analysis. Key steps:
- Dataset preparation: Loaded preprocessed data (`preprocessed_data.csv`) and split it into training (80%) and testing (20%) sets using stratified sampling.
- Feature extraction: Vectorized comments using Bag of Words (CountVectorizer) with a max feature size of 10,000, then combined the vectorized features with the original dataset columns.
- Model training: Used a Random Forest Classifier (`n_estimators=200`, `max_depth=15`) as a baseline model, trained on the combined features.
- Evaluation: Calculated accuracy and detailed per-class metrics using a classification report, and visualized results with a confusion matrix.
- Logging with MLflow: Logged experiment parameters, metrics, model artifacts, and dataset details to MLflow. Saved confusion matrix as a plot artifact.
- Accuracy: Achieved baseline accuracy and recorded detailed class-wise metrics.
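A condensed sketch of the baseline setup, with column names (`clean_comment`, `category`) assumed for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("preprocessed_data.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_comment"], df["category"],
    test_size=0.2, stratify=df["category"], random_state=42,
)

# Bag of Words with a 10,000-feature vocabulary, as in the notebook
vectorizer = CountVectorizer(max_features=10_000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
model.fit(X_train_bow, y_train)
print(classification_report(y_test, model.predict(X_test_bow)))
```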
This notebook (`exp_2_bow_tfidf_word2vec.ipynb`) compares BoW, TF-IDF, and Word2Vec vectorization techniques for sentiment analysis, while optimizing hyperparameters using Optuna.
- Dataset preparation: Loaded preprocessed data (`preprocessed_data.csv`), dropped missing comments, and retained features like word count, character count, and average word length.
- Vectorization: Implemented CountVectorizer (BoW), TfidfVectorizer, and Word2Vec; Optuna tuned hyperparameters such as `vectorizer_type` (BoW, TF-IDF, Word2Vec), `ngram_range` and `max_features` for BoW and TF-IDF, and `vector_size` for Word2Vec.
- Model training: Trained a Random Forest Classifier (`n_estimators=200`, `max_depth=15`) on the combined vectorized and additional features for training and testing.
- Hyperparameter optimization: Ran 200 trials with Optuna and logged each trial's results to MLflow, including metrics, parameters, and confusion matrix plots.
- Evaluation: Identified the optimal vectorization method and hyperparameters, reporting the best accuracy achieved (a trimmed-down objective function is sketched below).
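A trimmed-down version of the Optuna objective, omitting Word2Vec and the per-trial MLflow logging for brevity; `X_train`/`y_train` are assumed to come from the data-preparation step:

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def objective(trial: optuna.Trial) -> float:
    vec_type = trial.suggest_categorical("vectorizer_type", ["BoW", "TF-IDF"])
    ngram_max = trial.suggest_int("ngram_max", 1, 3)
    max_features = trial.suggest_int("max_features", 1_000, 10_000, step=1_000)

    Vectorizer = CountVectorizer if vec_type == "BoW" else TfidfVectorizer
    vectorizer = Vectorizer(ngram_range=(1, ngram_max), max_features=max_features)
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)

    model = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
    model.fit(X_tr, y_train)
    return model.score(X_te, y_test)  # accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
print(study.best_params, study.best_value)
```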
This notebook (`exp_3_handling_imbalanced_data.ipynb`) explores various strategies for handling imbalanced datasets in sentiment analysis using TF-IDF vectorization and resampling techniques.
- Data preparation: Loaded preprocessed data (`preprocessed_data.csv`), extracted features like word count, character count, and average word length, split the dataset into training and testing sets, and vectorized the text using TF-IDF (`max_features=1006`, `ngram_range=(1, 2)`), combining the vectorized text with the additional features.
- Resampling: Experimented with multiple techniques: Random Undersampling, Tomek Links, Centroid Clustering, NearMiss Undersampling, Random Oversampling, SMOTE, ADASYN, Borderline SMOTE, SMOTETomek, SMOTEENN, and `class_weight='balanced'`.
- Training and evaluation: Trained a Random Forest Classifier (`n_estimators=200`, `max_depth=15`) on the resampled data for each technique and evaluated performance using accuracy, a classification report, and a confusion matrix, logging all parameters, metrics, and confusion matrix plots to MLflow with a tag for each technique.
- Analysis: Generated confusion matrices for visual inspection of predictions, logged artifacts for all runs, and analyzed the impact of each resampling technique on model performance (a SMOTE example is sketched below).
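For example, applying SMOTE (one of the techniques listed above) with imbalanced-learn before training, where `X_train_tfidf`/`y_train` are assumed to come from the TF-IDF step:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample minority classes so all classes have equal counts
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_tfidf, y_train)
print("Class counts after SMOTE:", Counter(y_resampled))

model = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)
model.fit(X_resampled, y_resampled)
```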
These notebooks (`exp_4_tuning_ml_algo.ipynb`, `exp_4_tuning_ml_algo_2.ipynb`) focus on hyperparameter tuning and improving sentiment analysis performance:
- Optuna Optimization: Implements Optuna for multi-objective hyperparameter tuning across various models, including Random Forest, Logistic Regression, Naive Bayes, SVM, XGBoost, and LightGBM.
- Resampling Techniques: Applies ADASYN for handling class imbalance in the dataset.
- Experiment Tracking: Uses MLflow with Dagshub integration for logging parameters, metrics, and artifacts like confusion matrices.
- Model Evaluation: Evaluates models on accuracy, F1-score, and classification metrics while tracking their performance across trials.
This notebook (`exp_5_lightGBM_final.ipynb`) demonstrates detailed hyperparameter tuning and evaluation of a LightGBM model for sentiment analysis on YouTube comments:
- Model Optimization: Optuna optimizes hyperparameters like `n_estimators`, `max_depth`, and `learning_rate` over 150 trials.
- Evaluation: The model is evaluated using accuracy, F1 score, and a confusion matrix.
- Prediction Function: A function predicts sentiment for new comments, returning the sentiment and its confidence.
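The prediction helper plausibly looks like the following; the file names match the models directory, while the label mapping is an assumption for illustration:

```python
import joblib

model = joblib.load("models/lgbm_model.joblib")
vectorizer = joblib.load("models/tfidf_vectorizer.joblib")

# Assumed label encoding; the actual mapping is fixed during training.
LABELS = {-1: "negative", 0: "neutral", 1: "positive"}

def predict_sentiment(comment: str) -> tuple[str, float]:
    features = vectorizer.transform([comment])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return LABELS[model.classes_[best]], float(probabilities[best])
```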
This directory contains all the source code and scripts for the different stages of the ML pipeline. The project uses DVC (Data Version Control) to track data and model versions and to build the pipeline, ensuring experiments are reproducible. The trained models are also registered in the MLflow Model Registry for version management and easier deployment.
- data_ingestion.py: Handles the process of loading and collecting data.
- data_preprocessing.py: Performs necessary preprocessing tasks such as text cleaning, handling missing values, etc.
- feature_extraction.py: Implements the `TF-IDF` feature extraction technique.
- model_building.py: Contains code for defining, training, and tuning `LightGBM` models.
- model_evaluation.py: Provides functions for evaluating the model's performance using metrics like accuracy, precision, recall, etc.
- register_model.py: Registers the trained model into MLflow Model Registry for version control and easy tracking.
- utils.py: Contains helper functions for tasks that are used across the pipeline (e.g., logging, exception handling, loading, etc.).
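register_model.py presumably follows the standard MLflow registration pattern, roughly like this (the run ID and registry name are placeholders; in this project the run details would come from experiment_info.json):

```python
import mlflow

run_id = "<mlflow-run-id>"                 # placeholder; read from experiment_info.json
model_uri = f"runs:/{run_id}/lgbm_model"   # assumed artifact path

result = mlflow.register_model(model_uri=model_uri, name="yt-sentiment-model")
print(f"Registered '{result.name}' as version {result.version}")
```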
dvc.yaml: Defines the DVC pipeline, describing how data and models flow through the different stages of the project.
dvc.lock: Locks the versions of the pipeline's dependencies and outputs, ensuring reproducibility across environments.
params.yaml: Contains the hyperparameters and settings for the different stages of the machine learning pipeline, including data ingestion, feature extraction, and model building parameters.
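Pipeline stages conventionally read these settings with a small helper; the keys below are illustrative, not the exact contents of params.yaml:

```python
import yaml

def load_params(path: str = "params.yaml") -> dict:
    """Load pipeline hyperparameters from the YAML config."""
    with open(path) as f:
        return yaml.safe_load(f)

params = load_params()
# Illustrative keys only; the real file defines its own structure.
test_size = params["data_ingestion"]["test_size"]
max_features = params["feature_extraction"]["max_features"]
```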
This directory contains Python scripts used for testing, evaluating, and promoting machine learning models, as well as for testing the FastAPI backend with `pytest`.
1. load_model_test.py: Tests loading the trained machine learning model and vectorizer from the model registry.
2. performance_test.py: Assesses the model's accuracy, precision, recall, and F1 score for evaluation.
3. promote_model.py: Handles model promotion, transitioning the model from the staging phase to production so that the best-performing model is deployed (a sketch follows below).
4. fastapi_test.py: Tests the FastAPI endpoints used for making predictions, ensuring the backend API is functional and model predictions are returned correctly.
Each of these scripts is critical for testing and deploying models and ensuring the API works correctly for real-time predictions.
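Promotion via the MLflow client is conventionally a stage transition, along these lines (the registry name is a placeholder):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "yt-sentiment-model"  # placeholder registry name

# Promote the newest Staging version to Production.
latest = client.get_latest_versions(model_name, stages=["Staging"])[0]
client.transition_model_version_stage(
    name=model_name,
    version=latest.version,
    stage="Production",
    archive_existing_versions=True,
)
```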
This directory contains the FastAPI application and its dependencies used for serving machine learning model predictions via a backend API.
1. app.py: The main FastAPI application file that defines the API endpoints. It handles incoming HTTP requests, loads the trained machine learning model, and serves predictions based on user input, exposing model functionality to external services and users.
2. requirements.txt: Lists the dependencies required to run the FastAPI application, including FastAPI, Uvicorn, and the other libraries needed for model serving.
This directory is key for deploying the machine learning model as a service and interacting with it through HTTP requests.
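A stripped-down sketch of what app.py plausibly contains; the route name and request schema are assumptions:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/lgbm_model.joblib")
vectorizer = joblib.load("models/tfidf_vectorizer.joblib")

class PredictRequest(BaseModel):
    comments: list[str]

@app.post("/predict")  # assumed route
def predict(request: PredictRequest):
    features = vectorizer.transform(request.comments)
    predictions = model.predict(features)
    return [
        {"comment": c, "sentiment": int(p)}
        for c, p in zip(request.comments, predictions)
    ]
```

In production the app would be served with Uvicorn, which requirements.txt lists alongside FastAPI.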
This `Dockerfile` defines a multi-stage build for creating a Docker image to deploy the FastAPI application with the machine learning model. The stages are:
1. Builder Stage
   - Base Image: `python:3.10-slim`
   - Purpose: Sets up the environment for building the application by installing the necessary dependencies and libraries.
2. Runtime Image
   - Base Image: `python:3.10-slim`
   - Purpose: The final image used to run the application, based on a slimmed-down Python image.
The use of a multi-stage build optimizes the Docker image by separating the build and runtime environments. The builder stage contains all the tools and dependencies needed to install libraries, while the runtime image only includes the essential libraries and application code, making the final image smaller and more efficient.
This GitHub Actions pipeline automates the deployment and testing of a machine learning model as a FastAPI service with integration to AWS services such as S3, ECR, and CodeDeploy.
- Checkout Code: Pulls the latest code from the GitHub repository using `actions/checkout@v3`.
- Set up Python: Configures Python 3.10 using `actions/setup-python@v2` for compatibility with the required dependencies.
- Cache pip Dependencies: Caches Python package installations for faster subsequent runs using `actions/cache@v3`.
- Install Dependencies: Installs all the required libraries listed in `requirements.txt`.
- Run Pipeline: Executes the DVC pipeline (`dvc repro`) to preprocess and prepare data, with AWS credentials and DAGsHub authentication passed as environment variables.
- Push DVC-tracked Data: Uploads processed data and outputs tracked by DVC to remote storage (AWS S3).
- Configure Git: Sets up Git user details for commits made by GitHub Actions.
- Add and Commit Changes: Stages and commits changes, such as updated DVC outputs, to the repository.
- Push Changes: Pushes committed changes back to the repository.
- Run Model Loading Test: Verifies the model loading script with `pytest` to ensure the model loads correctly.
- Run Model Performance Test: Validates the model's performance by executing the performance test script.
- Run FastAPI Tests: Tests the FastAPI endpoints for functionality using `pytest`.
- Start FastAPI App: Launches the FastAPI application in the background using `nohup`.
- Stop FastAPI App: Ensures the running FastAPI application is terminated after testing.
- Login to AWS ECR: Authenticates Docker with AWS Elastic Container Registry (ECR) for image uploads.
- Build Docker Image: Builds a Docker image named `yt-plugin` for the FastAPI application.
- Tag Docker Image: Tags the built Docker image for ECR.
- Push Docker Image to AWS ECR: Uploads the Docker image to AWS ECR for containerized deployment.
- Zip Files for Deployment: Compresses the deployment files (`appspec.yml` and the necessary scripts) into a `deployment.zip`.
- Upload ZIP and Secret to S3: Uploads the deployment package and environment variables to an S3 bucket.
- Deploy to AWS CodeDeploy: Initiates deployment through AWS CodeDeploy, specifying the application, deployment group, and configuration settings.
This pipeline ensures seamless integration and deployment by leveraging GitHub Actions, DVC, AWS services, and Docker, automating every step from preprocessing to deployment, while running rigorous tests at each stage.
The `deploy` directory includes a `scripts` folder containing all the configuration files necessary for deploying the application via AWS CodeDeploy.
- download_env.sh:
  - Downloads environment variables (`dagshub.env`) from an S3 bucket.
- install_dependencies.sh:
  - Installs Docker, the AWS CLI, and other necessary utilities.
  - Configures Docker to run without `sudo` and enables it as a service.
- start_docker.sh:
  - Loads environment variables and verifies them.
  - Logs in to AWS ECR and pulls the latest Docker image.
  - Stops and removes any existing container before starting a new one.
  - Cleans up sensitive files after deployment.
`appspec.yml` defines the deployment lifecycle hooks:
- BeforeInstall: Runs `install_dependencies.sh` and `download_env.sh` to set up dependencies and environment variables.
- ApplicationStart: Executes `start_docker.sh` to pull and run the Docker container.
This setup ensures a smooth and automated deployment process, adhering to security and efficiency standards.