# Text Classification using MLOps

This project demonstrates a complete MLOps pipeline for a text classification task, implementing end-to-end practices for model experimentation, tracking, packaging, and deployment. The project incorporates advanced features such as AWS CodeDeploy for automated blue-green deployment and Amazon Elastic Container Registry (ECR) for Docker image storage. To ensure scalability, reliability, and fault tolerance, it also utilizes **AWS Auto Scaling Groups (ASGs)**, **Load Balancers**, and **Launch Templates**.

---

## Project Overview

This repository includes:

- **Experiment Tracking**: Logs all training runs with parameters, metrics, and artifacts in MLflow.
- **Hyperparameter Tuning**: Uses MLflow to log and compare performance during hyperparameter optimization.
- **ML Pipeline with DVC**: Structures and manages machine learning pipelines, ensuring reproducibility.
- **Model Registration**: Registers the best-performing models for deployment using MLflow.
- **Data Versioning**: Tracks and versions datasets with DVC, storing them in Amazon S3.
- **Remote Experiment Tracking**: Hosts a centralized MLflow tracking server on DagsHub.
- **Automated CI/CD Pipelines**: Leverages GitHub Actions to automate testing, pipeline execution, and deployment.
- **Unit Testing**: Validates API endpoints, model loading, and configurations to ensure robust deployments.
- **AWS CodeDeploy with Blue-Green Deployment**: Deploys the application using AWS CodeDeploy to minimize downtime.
- **AWS ECR Integration**: Stores and retrieves Docker images for deployment.
- **Production Deployment**: Automates testing and model promotion to production, ensuring deployment readiness.
- **Scalability Features**:
  - **Auto Scaling Groups (ASGs)**: Automatically adjust the number of EC2 instances based on traffic and system load.
  - **Load Balancers**: Distribute traffic evenly across instances to ensure high availability and fault tolerance.
  - **Launch Templates**: Define instance configurations for easy scaling and reproducibility.

---

## Key Features

### 1. **Experiment Tracking with MLflow**
   - Logs all experiments, hyperparameters, metrics, and artifacts to an MLflow server hosted on DagsHub.
   - Simplifies comparison and selection of the best-performing models.
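
A minimal sketch of the logging pattern, assuming placeholder values for the DagsHub tracking URI, experiment name, and logged values:

```python
import mlflow

# Point MLflow at the DagsHub-hosted tracking server (placeholder URI).
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("text-classification")

with mlflow.start_run():
    mlflow.log_param("max_features", 5000)               # example hyperparameter
    mlflow.log_metric("accuracy", 0.91)                  # example metric
    mlflow.log_artifact("reports/confusion_matrix.png")  # example artifact path
```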

### 2. **Hyperparameter Tuning**
   - Uses MLflow’s tracking capabilities for hyperparameter tuning.
   - Tracks each experiment run and selects the best configuration for deployment.
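
A sketch of how tuning runs might be grouped under a parent MLflow run; the grid values and the training helper are illustrative, not the project's actual search space:

```python
import mlflow
from itertools import product

def train_and_evaluate(lr, n_estimators):
    """Hypothetical stand-in for the project's real training routine."""
    return 0.5 + lr  # placeholder score

with mlflow.start_run(run_name="grid_search"):
    for lr, n_estimators in product([0.01, 0.1], [100, 300]):
        with mlflow.start_run(nested=True):  # one nested run per configuration
            mlflow.log_params({"learning_rate": lr, "n_estimators": n_estimators})
            mlflow.log_metric("f1_score", train_and_evaluate(lr, n_estimators))
```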

### 3. **Structured ML Pipeline with DVC**
   - Employs DVC to define and manage an end-to-end ML pipeline from data ingestion to model training.
   - Tracks all pipeline stages, ensuring reproducibility and efficient updates.
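
Each DVC stage typically wraps a script whose parameters live in `params.yaml` so DVC can detect changes. A hypothetical training-stage sketch (file paths and parameter names are assumptions):

```python
import pickle

import pandas as pd
import yaml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# dvc.yaml would declare params.yaml, the input CSV, and the model file
# as params/deps/outs so that changes re-trigger this stage.
params = yaml.safe_load(open("params.yaml"))["train"]

df = pd.read_csv("data/processed/train.csv")
vectorizer = TfidfVectorizer(max_features=params["max_features"])
X = vectorizer.fit_transform(df["text"])
model = LogisticRegression(C=params["C"]).fit(X, df["label"])

with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)
```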

### 4. **Model Registration in MLflow**
   - Registers the best-performing models in the MLflow Model Registry.
   - Supports staging and production environments for model lifecycle management.
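
A sketch of registration and stage transition; the run ID and registry name are placeholders:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged by a finished run (placeholder run ID).
result = mlflow.register_model("runs:/<run_id>/model", "text-classifier")

# Move the new version to Staging; promotion to Production happens
# automatically in CI once all tests pass.
MlflowClient().transition_model_version_stage(
    name="text-classifier", version=result.version, stage="Staging"
)
```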

### 5. **Data Versioning with DVC and S3**
   - Tracks data changes and stores datasets securely in an Amazon S3 bucket.
   - Allows easy rollback and version comparison for datasets.
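
With the data under DVC control, any revision can be read back programmatically. A minimal sketch, assuming a hypothetical dataset path and a `v1.0` Git tag:

```python
import dvc.api

# Stream a specific version of the dataset straight from the S3 remote.
with dvc.api.open("data/raw/dataset.csv", rev="v1.0") as f:
    header = f.readline()
```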

### 6. **Scalable Deployment with AWS**
   - **Auto Scaling Groups (ASGs)**:
     - Dynamically adjust the number of EC2 instances based on predefined scaling policies (e.g., CPU or memory usage).
     - Ensure cost efficiency by scaling in during low traffic and scaling out during peak traffic.
   - **Load Balancers**:
     - An Elastic Load Balancer (ELB) distributes incoming traffic evenly across all running instances.
     - Provides fault tolerance by automatically routing traffic away from unhealthy instances.
   - **Launch Templates**:
     - Hold predefined configurations for EC2 instances, including AMIs, instance types, security groups, and networking settings.
     - Simplify instance management and ensure consistency across scaling operations.
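
A minimal boto3 sketch of wiring these pieces together; every name, the AMI ID, the subnet, and the target group ARN are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Launch template capturing the instance configuration.
ec2.create_launch_template(
    LaunchTemplateName="text-classification-lt",
    LaunchTemplateData={"ImageId": "ami-0123456789abcdef0", "InstanceType": "t3.micro"},
)

# ASG that keeps 1-4 instances registered with the load balancer's target group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="text-classification-asg",
    LaunchTemplate={"LaunchTemplateName": "text-classification-lt", "Version": "$Latest"},
    MinSize=1,
    MaxSize=4,
    VPCZoneIdentifier="subnet-0123456789abcdef0",
    TargetGroupARNs=["<target-group-arn>"],
)
```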

### 7. **Automated Deployment with AWS CodeDeploy**
   - Implements blue-green deployment for seamless updates to Auto Scaling Groups (ASGs) behind a load balancer.
   - Ensures minimal downtime and safe transitions between application versions.
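
A deployment could be triggered from CI with a call like the sketch below; the application and deployment-group names are assumptions, and the blue-green behavior itself is configured on the deployment group:

```python
import boto3

codedeploy = boto3.client("codedeploy")

# Kick off a deployment to the blue-green deployment group
# (revision details omitted for brevity).
codedeploy.create_deployment(
    applicationName="text-classification",
    deploymentGroupName="text-classification-dg",
    description="Promote newly built image",
)
```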

### 8. **Integration with AWS ECR**
   - After passing all tests, Docker images are built and pushed to Amazon Elastic Container Registry (ECR).
   - CodeDeploy pulls these images for deployment to the ASGs.

### 9. **CI/CD with GitHub Actions**
   - Automates the workflow for testing, building, and deploying updates.
   - Triggers deployment only after all unit tests and validations pass.

### 10. **Unit Testing**
   - Comprehensive unit tests cover:
     - Flask API endpoints
     - Model loading
     - Model signature validation
   - Ensures reliability before deployment.
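
A sketch of what one such endpoint test might look like; the inline Flask app and the `/predict` route are assumptions standing in for the project's real application module:

```python
import unittest

from flask import Flask, jsonify, request

# Hypothetical stand-in for the project's real Flask app.
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify({"prediction": "positive" if text else "unknown"})

class TestPredictEndpoint(unittest.TestCase):
    def setUp(self):
        self.client = app.test_client()

    def test_predict_returns_ok(self):
        # Assumes a /predict route accepting JSON with a "text" field.
        response = self.client.post("/predict", json={"text": "great product"})
        self.assertEqual(response.status_code, 200)
        self.assertIn("prediction", response.get_json())

if __name__ == "__main__":
    unittest.main()
```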

---

## Deployment Workflow

1. **Build and Push to AWS ECR**:
   - After successful testing, the application is containerized using Docker.
   - The Docker image is pushed to AWS ECR for centralized storage.

2. **Automated Deployment**:
   - AWS CodeDeploy retrieves the Docker image from ECR.
   - It deploys the application to the ASGs using blue-green deployment to minimize downtime.
   - Load balancers ensure high availability, routing traffic only to healthy instances.

3. **Scaling and Traffic Management**:
   - ASGs adjust the number of instances based on traffic patterns.
   - Load balancers distribute incoming requests across available instances, ensuring optimal performance.

4. **Continuous Integration/Delivery**:
   - GitHub Actions automatically triggers the deployment pipeline on new commits.

---

## Setup

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/2003HARSH/Text-Classification-using-MLOps.git
   cd Text-Classification-using-MLOps
   ```

2. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Set Up AWS Services**:
   - **Auto Scaling Groups (ASGs)**: Define scaling policies for EC2 instances.
   - **Load Balancers**: Configure an ELB to distribute traffic across instances.
   - **Launch Templates**: Create templates for consistent instance configurations.

4. **Configure AWS CodeDeploy**:
   - Set up a CodeDeploy application with blue-green deployment using ASGs and a load balancer.

5. **Push Docker Image to ECR**:
   ```bash
   aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com
   docker build -t text-classification .
   docker tag text-classification:latest <account_id>.dkr.ecr.<region>.amazonaws.com/text-classification:latest
   docker push <account_id>.dkr.ecr.<region>.amazonaws.com/text-classification:latest
   ```

---

## Testing

- Run unit tests locally:
  ```bash
  python -m unittest <test_file_name>.py
  ```
- CI/CD workflows execute these tests automatically.

---

## Future Enhancements

1. **Enhanced Deployment**:
   - Deploy the application with AWS Elastic Container Service (ECS) for improved scaling and fault tolerance.
   - Integrate AWS CodePipeline to orchestrate the end-to-end deployment process.

2. **Model Monitoring**:
   - Integrate tools for monitoring model performance in production and detecting drift.

---

## Contact

Feel free to reach out at [harshnkgupta@gmail.com](mailto:harshnkgupta@gmail.com) or create an issue in the repository for questions or collaboration opportunities!