This is a modular, production-ready machine learning project for detecting hate speech in text data using deep learning. It follows a clean, stage-wise architecture with components for data ingestion, validation, transformation, model training, evaluation, and deployment. The pipeline integrates with GCP for cloud storage and the model registry, and uses CircleCI for CI/CD with automated deployment to a GCP VM. The goal is a scalable, reproducible ML workflow suitable for real-world NLP applications.
A brief walkthrough of this project is available in a LinkedIn post.
- Data Ingestion
- Data Validation
- Data Transformation
- Model Training
- Model Evaluation
- Model Pusher
- Model Prediction
- Model Deployment
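The stages above run in sequence, with each stage building on the output of the one before it. The following is a minimal sketch of that chaining, not the project's actual code: `make_stage`, `run_pipeline`, and the stage names are hypothetical placeholders, and the prediction and deployment steps are left out because they run after the training pipeline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage type: takes the shared context and returns an updated copy.
Stage = Callable[[Dict[str, str]], Dict[str, str]]

# Assumed stage names, mirroring the list above (not the project's identifiers).
STAGE_NAMES: List[str] = [
    "data_ingestion",
    "data_validation",
    "data_transformation",
    "model_training",
    "model_evaluation",
    "model_pusher",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that records its completion in the context."""
    def stage(context: Dict[str, str]) -> Dict[str, str]:
        updated = dict(context)  # copy so earlier results are never mutated
        updated[name] = "done"
        return updated
    return stage


def run_pipeline(stages: List[Tuple[str, Stage]]) -> Dict[str, str]:
    """Run each stage in order, passing the accumulated context forward."""
    context: Dict[str, str] = {}
    for name, stage in stages:
        print(f"running {name}")
        context = stage(context)
    return context


if __name__ == "__main__":
    result = run_pipeline([(n, make_stage(n)) for n in STAGE_NAMES])
    print(list(result))
```

In a real pipeline each placeholder would be replaced by a component that reads the previous stage's artifact and writes its own.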
Note: You need to install and configure the gcloud SDK on your system to fetch the data from the Google Cloud Storage bucket.
- constants
- config_entity
- artifact_entity
- components
- pipeline
- main.py
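The workflow above can be read as a chain: constants feed a config entity, a component consumes the config and returns an artifact entity, and the pipeline wires components together. The sketch below illustrates that pattern for a single stage. All names and values in it (`ARTIFACTS_DIR`, `DataIngestionConfig`, `DataIngestionArtifact`, `DataIngestion`) are assumptions made for illustration and may not match the repository.

```python
import os
from dataclasses import dataclass

# constants: hypothetical values, not taken from the repository
ARTIFACTS_DIR = "artifacts"
DATA_INGESTION_DIR = "data_ingestion"
RAW_FILE_NAME = "raw.csv"


# config_entity: where a stage reads from and writes to
@dataclass(frozen=True)
class DataIngestionConfig:
    output_dir: str = os.path.join(ARTIFACTS_DIR, DATA_INGESTION_DIR)
    raw_file_name: str = RAW_FILE_NAME


# artifact_entity: what a stage hands to the next stage
@dataclass(frozen=True)
class DataIngestionArtifact:
    raw_file_path: str


# component: consumes a config and produces an artifact
class DataIngestion:
    def __init__(self, config: DataIngestionConfig) -> None:
        self.config = config

    def initiate(self) -> DataIngestionArtifact:
        # A real component would download and extract data here;
        # this sketch only computes the path it would write to.
        path = os.path.join(self.config.output_dir, self.config.raw_file_name)
        return DataIngestionArtifact(raw_file_path=path)


if __name__ == "__main__":
    # pipeline / main.py: build the config, run the component, use the artifact
    artifact = DataIngestion(DataIngestionConfig()).initiate()
    print(artifact.raw_file_path)
```

Keeping configs and artifacts as frozen dataclasses makes each stage's inputs and outputs explicit and prevents accidental mutation between stages.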
- Set up CircleCI
- Activate the 'Self-Hosted Runners' by confirming the terms
- Create a new project in CircleCI
- Link the project to your GitHub repository
- Configure VM instance
- Configure GCR in GCP
- Write the 'config.yml' file
- Set up the environment variables
- Generate a dedicated SSH key pair for your project (on your local machine):
ssh-keygen -t ed25519 -f ~/.ssh/project_key -C "your_email@example.com"
- This generates two files:
- ~/.ssh/project_key (private key)
- ~/.ssh/project_key.pub (public key)
- Add the public key to GitHub as a Deploy Key:
- Go to your GitHub repo → Settings → Deploy Keys
- Click "Add deploy key"
- Title: e.g., CircleCI Access
- Key: paste the contents of project_key.pub
- Enable "Allow write access" if needed (e.g., if CircleCI pushes code/tags)
- Add the private key to CircleCI:
- Go to CircleCI Project Settings → SSH Keys → "Add SSH Key"
- Choose: "Other"
- Paste the private key (project_key)
- Hostname: github.com
- Create a VM instance in GCP with a Linux OS (e.g., Ubuntu 22.04) and allow HTTP/HTTPS traffic.
- Add a firewall rule that allows inbound access to port 8080.
- Enable the Artifact Registry API in your GCP project so the VM can pull Docker images.
- Install Docker on the VM instance.
- Authenticate Docker with Artifact Registry:
gcloud auth configure-docker us-central1-docker.pkg.dev
- Give the VM access to pull from Artifact Registry (via service account permissions or using gcloud auth login).
- Install the Google Cloud SDK on the VM (if not using a service account).
- Add your user to the docker group so Docker can run without sudo:
sudo usermod -aG docker $USER
newgrp docker
- Restart the VM to apply the changes.
- Create an SSH key pair on your local machine (if you don’t already have one):
ssh-keygen -t rsa -f ~/.ssh/gcp-key -C youremail@gmail.com
- Add the public key to your GCP VM:
- Go to the VM instance details in GCP.
- Click "Edit" → scroll to SSH Keys.
- Click "Add Key" → paste the public key from ~/.ssh/gcp-key.pub.
- Add the private key to CircleCI:
- Go to your project in CircleCI → Project Settings → SSH Keys → Add SSH Key.
- Paste your private key (~/.ssh/gcp-key) there.
- Set the hostname as 35.xxx.xxx.xxx (your VM's external IP) or just *.
- Add the GCP required environment variables in CircleCI environment variables:
- Go to your project in CircleCI → Environment Variables → Add Environment Variable. e.g., for GCP VM IP:
- Name: GCP_VM_IP, Value: 35.xxx.xxx.xxx
- Name: SSH_PRIVATE_KEY, Value: your private key (~/.ssh/gcp-key, base64-encoded). Follow the config.yml file for this important step.
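Base64-encoding the private key turns a multi-line file into a single line that can be pasted into one environment variable and decoded again during the CI job. The sketch below shows that round trip under stated assumptions: `encode_key` and `decode_key` are hypothetical helpers, the key text is a placeholder rather than a real key, and the way the project's config.yml actually decodes the value is not shown here.

```python
import base64


def encode_key(key_text: str) -> str:
    """Return the key as a single-line base64 string for an env variable."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")


def decode_key(encoded: str) -> str:
    """Recover the original multi-line key text from its base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Placeholder text only; in practice, read the contents of ~/.ssh/gcp-key.
    placeholder = (
        "-----BEGIN OPENSSH PRIVATE KEY-----\n"
        "not-a-real-key\n"
        "-----END OPENSSH PRIVATE KEY-----\n"
    )
    encoded = encode_key(placeholder)
    print("single line:", "\n" not in encoded)
    print("round trip ok:", decode_key(encoded) == placeholder)
```

The encoded value is what would go into SSH_PRIVATE_KEY. Never commit the key or its encoded form to the repository.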
conda create -n nlp python=3.10 -y
conda activate nlp
pip install -r requirements.txt
python main.py
# Get the code onto the machine
sudo apt-get update
git clone https://github.com/razyousuf/NLP-Text-Classification-Pipeline
cd NLP-Text-Classification-Pipeline/
cat Dockerfile

# Build the image and push it to the registry
export PROJECT_ID=crypto-snow-432611-i2  # replace with YOUR_PROJECT_ID
docker build -t gcr.io/${PROJECT_ID}/hatespeech-app:v1 .
docker images
gcloud auth configure-docker gcr.io
docker push gcr.io/${PROJECT_ID}/hatespeech-app:v1

# Point kubectl at the cluster (us-central1 is a region, not a zone)
gcloud config set compute/region us-central1
gcloud container clusters get-credentials hatespeech-cluster --region us-central1

# Create the deployment first, then expose it (the service type is case-sensitive)
kubectl create deployment hatespeech-app --image=gcr.io/${PROJECT_ID}/hatespeech-app:v1
kubectl get pods
kubectl expose deployment hatespeech-app --type=LoadBalancer --port=80 --target-port=8080

# Verify the service and the cluster
kubectl get services
kubectl cluster-info
kubectl get nodes