DataTalks.Club's Capstone 2 project by Alexander D. Rios
Source: EarthDaily Analytics
This repository was created as part of the DataTalks.Club's Machine Learning Zoomcamp by Alexey Grigorev.
This project has been submitted as the Capstone 2 project for the course.
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g., disaster relief organizations and news agencies).
But it's not always clear whether a person's words are actually announcing a disaster. Take this example:
The author explicitly uses the word "ABLAZE" but means it metaphorically. This is clear to a human right away, especially with the visual aid, but it's less clear to a machine.
In this competition, we're challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren't. We'll have access to a dataset of 10,000 tweets that were hand-classified.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
This dataset was created by the company Figure Eight and originally shared on their "Data For Everyone" website here.
Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480
You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.
You can find the competition in the following link: https://www.kaggle.com/competitions/nlp-getting-started
This particular challenge is perfect for data scientists looking to get started with Natural Language Processing.
- π README.md
- π Pipfile
- π Pipfile.lock
- π analysis
- π dataset
- π 479k-english-words
- ποΈ english_words_479k.txt
- π english-word-frequency
- ποΈ unigram_freq.csv
- π nlp-getting-started
- ποΈ sample_submission.csv
- ποΈ test.csv
- ποΈ train.csv
- π 479k-english-words
- π etc
- π deploy.sh
- βοΈ gateway-deployment-service.yaml
- βοΈ gateway-deployment.yaml
- π gateway.dockerfile
- βοΈ kind-config.yaml
- βοΈ metallb-configmap.yaml
- βοΈ model-deployment.yaml
- βοΈ model-service.yaml
- βοΈ nginx-ingress.yaml
- π serving.dockerfile
- π models
- π€ model_base.h5
- π€ tokenizer.bin
- π scripts
- π load_data.py
- π model_conversor.py
- π model_serving.py
- π test.py
- π train.py
- π utils.py
- π disaster_tweets_model
- π fingerprint.pb
- π saved_model.pb
- π variables
- π variables.data-00000-of-00001
- π variables.index
- π streamlit_app
- π my_app.py
- π requirements.txt
- ποΈ train_clean.csv
In this project, I used the following dataset: Natural Language Processing with Disaster Tweets
You can download it with the following code:
!pip install kagglehub

import kagglehub

# Log in with your Kaggle account (opens an interactive prompt)
kagglehub.login()

# Download the competition data, the auxiliary word lists, and the base BERT model
nlp_getting_started_path = kagglehub.competition_download('nlp-getting-started')
rtatman_english_word_frequency_path = kagglehub.dataset_download('rtatman/english-word-frequency')
yk1598_479k_english_words_path = kagglehub.dataset_download('yk1598/479k-english-words')
keras_bert_keras_bert_small_en_uncased_2_path = kagglehub.model_download('keras/bert/Keras/bert_small_en_uncased/2')

print('Data source import complete.')
You need to log in with your Kaggle credentials (username and API key). For more help, refer to the kagglehub repository.
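Alternatively, kagglehub can authenticate non-interactively through the standard Kaggle environment variables:

```bash
# Set these before running the download script to skip the interactive login
export KAGGLE_USERNAME=<your_username>
export KAGGLE_KEY=<your_api_key>
```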
The training dataset contains 7,613 tweet records.
Each sample in the train and test set has the following information:
- The `text` of a tweet
- A `keyword` from that tweet (although this may be blank!)
- The `location` the tweet was sent from (may also be blank)
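For example, you can peek at these fields with pandas (the path assumes the repository's dataset layout shown above):

```python
import pandas as pd

# Load the training split and inspect the fields described above
train = pd.read_csv("dataset/nlp-getting-started/train.csv")
print(train.shape)  # (7613, 5): id, keyword, location, text, target
print(train[["keyword", "location", "text", "target"]].head())
print(train[["keyword", "location"]].isna().sum())  # keyword/location may be blank
```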
Submissions are evaluated using F1 between the predicted and expected answers. F1 is calculated as follows:
$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
where:
$\mathrm{precision} = \frac{TP}{TP+FP} \qquad \mathrm{recall} = \frac{TP}{TP+FN}$
and:
- True Positive [TP] = your prediction is 1, and the ground truth is also 1 - you predicted a positive, and that's true!
- False Positive [FP] = your prediction is 1, and the ground truth is 0 - you predicted a positive, and that's false.
- False Negative [FN] = your prediction is 0, and the ground truth is 1 - you predicted a negative, and that's false.
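As a quick sanity check of the metric, scikit-learn's f1_score matches this definition:

```python
# Illustrative F1 computation with scikit-learn
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 1]  # predictions: TP=2, FP=1, FN=1
# precision = 2/3, recall = 2/3 -> F1 = 2 * (2/3 * 2/3) / (2/3 + 2/3) = 0.667
print(f1_score(y_true, y_pred))
```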
The dataset analysis and model training were conducted in a Jupyter notebook; you can find them in the file nlp-with-disaster-tweets.ipynb.
The training script is available in the train.py script.
For deployment, I used the trained classification model model_base.h5. The script that serves it with Flask is model_serving.py.
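For orientation, a minimal sketch of such a gateway is shown below. It is not the repo's implementation: the real model_serving.py also applies the saved tokenizer (tokenizer.bin), and the tensor names ('text_input', 'output_0') and JSON schema here are assumptions; check the exported SavedModel signature for the real ones.

```python
# A minimal Flask gateway sketch that forwards a tweet to TF Serving over gRPC.
import os

import grpc
import tensorflow as tf
from flask import Flask, jsonify, request
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# TF Serving's gRPC endpoint (port 8500 in the docker commands below)
host = os.getenv('TF_SERVING_HOST', 'localhost:8500')
channel = grpc.insecure_channel(host)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

app = Flask('gateway')

@app.route('/predict', methods=['POST'])
def predict():
    tweet = request.get_json()['text']
    # Build the gRPC request for the model served as 'disaster_tweets_model'
    grpc_request = predict_pb2.PredictRequest()
    grpc_request.model_spec.name = 'disaster_tweets_model'
    grpc_request.model_spec.signature_name = 'serving_default'
    grpc_request.inputs['text_input'].CopyFrom(tf.make_tensor_proto([tweet]))
    grpc_response = stub.Predict(grpc_request, timeout=20.0)
    score = float(grpc_response.outputs['output_0'].float_val[0])
    return jsonify({'disaster': score >= 0.5, 'probability': score})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)
```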
Pipfile and Pipfile.lock set up the Pipenv environment.
First, you need to install from Pipfile:
pipenv install
The virtual environment can be activated by running
pipenv shell
Once in the virtual environment, you can run the following command:
python ./scripts/model_serving.py
Furthermore, you need to serve the model, in a separate terminal, with the following command:
docker run -it --rm -p 8500:8500 -v "$(pwd)/scripts/disaster_tweets_model:/models/disaster_tweets_model/1" -e MODEL_NAME=disaster_tweets_model tensorflow/serving:2.14.0
Then, you will be ready to test the model by running the following command:
python ./scripts/test.py
Don't forget to update the `url` variable in the test.py file to:
url = "http://localhost:9696/predict"
Also, you must update the `host` variable in the model_serving.py file to:
host = os.getenv('TF_SERVING_HOST', 'localhost:8500')
Once in the virtual environment, you can run the following command:
waitress-serve --listen=0.0.0.0:9696 scripts.model_serving:app
Before that, you need to serve the model using the following command:
docker run -it --rm -p 8500:8500 -v "$(pwd)/scripts/disaster_tweets_model:/models/disaster_tweets_model/1" -e MODEL_NAME=disaster_tweets_model tensorflow/serving:2.14.0
And then, you can test the model by running the following command:
python ./scripts/test.py
Don't forget to update the `url` variable in the test.py file to:
url = "http://localhost:9696/predict"
Also, you must update the `host` variable in the model_serving.py file to:
host = os.getenv('TF_SERVING_HOST', 'localhost:8500')
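For reference, test.py essentially posts a tweet to the gateway. A minimal client sketch follows; the exact JSON schema is an assumption, so check scripts/test.py for the real payload:

```python
import requests

url = "http://localhost:9696/predict"
# "Forest fire near La Ronge Sask. Canada" is a sample tweet from train.csv
tweet = {"text": "Forest fire near La Ronge Sask. Canada"}

response = requests.post(url, json=tweet)
print(response.json())
```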
To deploy the model with Kubernetes, you will create the following structures and connections:
First, you need to build:
- The TensorFlow Serving image.
- The Gateway image.
To achieve this, I created two separate Dockerfiles:
- serving.dockerfile -- Contains the instructions to serve the TensorFlow model in `saved_model` format (disaster_tweets_model).
- gateway.dockerfile -- Contains the instructions to deploy the model_serving.py gateway and install its dependencies.
To build them, you can use the following commands:
docker build -t tf-serving-disaster-tweets-model -f ./etc/serving.dockerfile .
docker build -t serving-gateway-disaster-tweets-model -f ./etc/gateway.dockerfile .
You must tag and push them to your repository:
docker tag tf-serving-disaster-tweets-model <YOUR_USERNAME>/tf-serving-disaster-tweets-model
docker tag serving-gateway-disaster-tweets-model <YOUR_USERNAME>/serving-gateway-disaster-tweets-model
docker push <YOUR_USERNAME>/tf-serving-disaster-tweets-model:latest
docker push <YOUR_USERNAME>/serving-gateway-disaster-tweets-model:latest
You can also pull them from my repository by using the following commands:
docker pull aletbm/tf-serving-disaster-tweets-model:latest
docker pull aletbm/serving-gateway-disaster-tweets-model:latest
To deploy locally using Docker, you must execute the following commands in two separate terminals:
- To serve the model:
docker run -it --rm -p 8500:8500 tf-serving-disaster-tweets-model:latest
- To deploy the gateway:
docker run -it --rm -p 9696:9696 serving-gateway-disaster-tweets-model:latest
Then, you will be ready to test the model by running the following command:
python ./scripts/test.py
Don't forget to update the `url` variable in the test.py file to:
url = "http://localhost:9696/predict"
To deploy locally using Kubernetes and Docker, you must replace my Docker username with your Docker username in:
- The model-deployment.yaml file configuration.
spec:
  containers:
    - name: tf-serving-disaster-tweets-model
      image: <YOUR_USERNAME>/tf-serving-disaster-tweets-model:latest
      ports:
        - containerPort: 8500
- The gateway-deployment.yaml file configuration.
spec:
  containers:
    - name: serving-gateway-disaster-tweets-model
      image: <YOUR_USERNAME>/serving-gateway-disaster-tweets-model:latest
      ports:
        - containerPort: 9696
Up to this point, you have built and pushed all the necessary images, and all configuration files have been corrected.
You need to install Kubernetes and Kind. I won't explain how to install them, but you can refer to their respective documentation pages.
Now, you need to create a Kubernetes cluster with Kind and apply all configuration files. To do this, you have two options:
- Do it manually by executing each command individually.
- Do it automatically by executing the deploy.sh script.
Manually, you must execute the following commands:
kind create cluster --config kind-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl apply -f model-deployment.yaml
kubectl apply -f model-service.yaml
kubectl apply -f gateway-deployment.yaml
kubectl apply -f gateway-deployment-service.yaml
kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission
kubectl apply -f nginx-ingress.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
kubectl apply -f metallb-configmap.yaml
kubectl get pods -n metallb-system --watch
Automatically, you must execute the following command in a bash terminal:
cd etc
./deploy.sh
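For reference, a deploy.sh along these lines would automate the manual steps above (a sketch; the script shipped in etc/ is authoritative):

```bash
#!/bin/bash
set -e  # abort on the first failing command

kind create cluster --config kind-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl apply -f model-deployment.yaml
kubectl apply -f model-service.yaml
kubectl apply -f gateway-deployment.yaml
kubectl apply -f gateway-deployment-service.yaml
kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission
kubectl apply -f nginx-ingress.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
kubectl apply -f metallb-configmap.yaml
kubectl get pods -n metallb-system --watch
```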
Once all pods are running, you can test the deployment by running the following command:
python ./scripts/test.py
Don't forget to update the `url` variable in the test.py file to:
url = "http://localhost:80/predict"
I've included a GIF that shows how to perform a deployment:
Additionally, I developed a very simple app using Streamlit to deploy my model, where you can enter a tweet and obtain a prediction.
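For orientation, a minimal front end along those lines is sketched below; my_app.py in the repo is the real implementation, and the gateway URL and JSON schema here are assumptions.

```python
import requests
import streamlit as st

st.title("Disaster Tweet Classifier")
tweet = st.text_area("Enter a tweet:")

# Forward the tweet to the serving gateway and show the prediction
if st.button("Predict") and tweet:
    response = requests.post("http://localhost:9696/predict", json={"text": tweet})
    st.write(response.json())
```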
(See streamlit.mp4 in the repository for a demo of the app.)
Here's the link to my Streamlit App.