This project builds a model, exposed through an API, that classifies disaster messages using data from Appen (formerly Figure Eight). The dataset contains real messages sent during disaster events. We build a machine learning pipeline to classify these messages so that they can be dispatched to the appropriate disaster relief agency.
We provide a web app where an emergency worker can input a new message and get classification results across the predefined categories, along with visualizations of the data distribution.
Demo: disaster_response.mp4 (click on the image below)
Note: You can view the app, but the model might not work due to the limited resources of render.com's Free tier.

The ETL pipeline, `process_data.py`, is used to prepare the data (a sketch of these steps follows the list below):
- Loads the messages and categories datasets
- Merges the two datasets
- Cleans the data
- Stores the clean data in a SQLite database
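A minimal sketch of what these ETL steps might look like, assuming the column layout of the standard Figure Eight CSVs (`id`, `message`, `genre`, and a semicolon-separated `categories` column) and an assumed table name; the actual `process_data.py` may differ in its details.

```python
import pandas as pd
from sqlalchemy import create_engine


def load_and_clean(messages_path, categories_path):
    """Load, merge, and clean the messages and categories datasets."""
    messages = pd.read_csv(messages_path)
    categories = pd.read_csv(categories_path)

    # Merge the two datasets on the shared message id
    df = messages.merge(categories, on="id")

    # Split the single 'categories' column ("related-1;request-0;...")
    # into one binary column per category
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for column in cats.columns:
        cats[column] = cats[column].str[-1].astype(int)

    df = pd.concat([df.drop(columns=["categories"]), cats], axis=1)
    return df.drop_duplicates()


def save(df, database_path):
    """Store the clean data in a SQLite database (table name assumed)."""
    engine = create_engine(f"sqlite:///{database_path}")
    df.to_sql("DisasterMessages", engine, index=False, if_exists="replace")
```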
The ML pipeline, `train_classifier.py`, is used to train and export a classifier (see the sketch after the list):
- Loads data from the SQLite database
- Tokenizes and lemmatizes text data
- Builds an ML pipeline using `CountVectorizer`, `TfidfTransformer`, and `MultiOutputClassifier`
- Adds engineered features such as `negation_counter`, `verb_counter`, `emotion_counter`, `punctuation_counter`, `text_length`, `capitalization_counter`, `subjectivity`, `polarity`, and `ner` (not in the light version)
- Trains and tunes a model using `GridSearchCV`
- Outputs results on the test set in `reports/classification_report.md`
- Exports the final model as a pickle file in `models/classifier.pkl`
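A simplified sketch of how such a pipeline could be assembled; the classifier type, grid parameters, and the single custom feature shown (`text_length`) are illustrative stand-ins for the full feature set used in `train_classifier.py`.

```python
# Requires nltk.download("punkt") and nltk.download("wordnet")
import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


def tokenize(text):
    """Tokenize and lemmatize a message (normalization details assumed)."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text.lower())]


class TextLength(BaseEstimator, TransformerMixin):
    """Example custom feature: number of characters per message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([len(text) for text in X]).reshape(-1, 1)


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([
            ("vect", CountVectorizer(tokenizer=tokenize)),
            ("tfidf", TfidfTransformer()),
        ])),
        ("text_length", TextLength()),
    ])),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

# Tune over a small, illustrative grid
parameters = {"clf__estimator__n_estimators": [50, 100]}
model = GridSearchCV(pipeline, param_grid=parameters, cv=3)
```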
A Flask web app shows data visualizations and runs inference with the trained model to classify disaster messages. The web app includes (a sketch follows the list below):
- A data visualization using Plotly in the `go.html` template file
- A Flask application, `app/run.py`, that serves the web app
- Multiple visualizations are accessible, such as:
* Distribution of Message Genres
* Distribution of Messages Across Categories and Genres
* Average Message Length by Genre
* Average Message Length by Category
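As an illustration, here is a minimal sketch of how `app/run.py` might load the data and build one of the Plotly figures; the route, table, template, and variable names are assumptions, not the exact code in the repository.

```python
import json

import pandas as pd
import plotly
from flask import Flask, render_template
from plotly.graph_objs import Bar
from sqlalchemy import create_engine

app = Flask(__name__)

# Database and table names assumed to match the ETL step
engine = create_engine("sqlite:///data/DisasterResponse.db")
df = pd.read_sql_table("DisasterMessages", engine)


@app.route("/")
def index():
    # Example figure: Distribution of Message Genres
    genre_counts = df.groupby("genre").count()["message"]
    graphs = [{
        "data": [Bar(x=list(genre_counts.index), y=genre_counts.tolist())],
        "layout": {"title": "Distribution of Message Genres"},
    }]
    graph_json = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder)
    # Template name assumed; the repository renders Plotly graphs in go.html
    return render_template("go.html", graphJSON=graph_json)
```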
- Python 3.9
- NumPy
- Pandas
- Scikit-Learn
- NLTK
- Flask
- Plotly
- SQLAlchemy
- SpaCy
- TextBlob
- gunicorn
- contractions
- Clone the repository
git clone https://github.com/naoufal51/disaster_response.git
- Create a virtual environment
python3 -m venv venv
- Activate the virtual environment
source venv/bin/activate
- Install the dependencies
pip install -r requirements.txt
To run the ETL pipeline, which cleans the data and stores it in a database, run the following command:
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
To run the ML pipeline, which trains the classifier and saves it, run the following command:
python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
You can choose to run the app with the model with or without NER features (`classifier_light.pkl` is the lighter version without NER). Edit `app/run.py` to select the model, for example as sketched below.
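An illustrative way that switch might look inside `app/run.py`; the variable name, loading library, and exact paths are assumptions rather than the repository's actual code.

```python
import joblib

# Full model, including the NER feature (heavier, needs more memory):
# model = joblib.load("models/classifier.pkl")

# Light model, without NER (better suited to constrained hosts such as
# render.com's free tier):
model = joblib.load("models/classifier_light.pkl")
```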
To run the Flask web app, run the following command:
gunicorn app.run:app
Then go to http://127.0.0.1:8000 or http://localhost:8000/
This dataset is highly imbalanced, as shown in the figure above: a large number of categories have very few samples. This imbalance can lead to a biased model that favors the majority classes, which is especially critical for disaster response messages, where we want every category to be well represented.
There are several techniques to mitigate this issue (a sketch of two of them follows the list):
- Resampling: Majority undersampling or minority oversampling.
- SMOTE (Synthetic Minority Over-sampling Technique): Instead of simply duplicating minority-class samples, SMOTE exploits the feature space to create new synthetic samples based on k-nearest neighbors.
- Data augmentation: Several strategies exist for generating new text, including synonym replacement, random insertion, and random deletion.
- Cost-Sensitive Learning: Apply a higher penalty to misclassified minority instances so the model minimizes the total misclassification cost.
- Ensemble Methods: Bagging or Boosting can mitigate class imbalance.
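To make two of these options concrete, here is a small sketch on a toy binary problem using `imbalanced-learn` and scikit-learn; it is not the code used in this project, and note that SMOTE does not handle multi-label targets directly, so in practice it would be applied per category.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset standing in for one rare disaster category
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Minority oversampling: SMOTE synthesizes new minority samples from
# k-nearest neighbors in feature space
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Cost-sensitive learning: weight errors on the rare class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_res, y_res)
```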
To evaluate the model fairly in the presence of class imbalance, we use the F1 score as the main performance metric, since it balances precision and recall.
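For example, the per-category scores written to `reports/classification_report.md` can be produced with scikit-learn; the arrays and category names below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Toy multi-label predictions for three example categories
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Weighted F1 balances precision and recall while accounting for support
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))

# Per-category precision, recall, and F1
print(classification_report(y_true, y_pred,
                            target_names=["related", "request", "offer"],
                            zero_division=0))
```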
In addition to the models mentioned above, we also explored leveraging state-of-the-art transformer models using the Hugging Face Transformers library in a Kaggle Notebook. This approach allows us to tap into powerful pre-trained models like BERT, which have achieved high performance across a variety of NLP tasks, including text classification.
- Loading: We employ `AutoTokenizer` and `AutoModelForSequenceClassification` to tokenize the data and load the `bert-base-uncased` model.
- Training & Evaluation: The model is trained using the `Trainer` class with a weighted loss function to handle class imbalance, and evaluated on a test set using metrics such as accuracy and F1 score.
- Experimentation: For those with ample computational resources, the notebook in the repository provides an optional, more sophisticated methodology, offering the possibility for further tuning and experimentation.
Given the resource-intensive nature of transformer models, consideration is required for deployment in constrained environments.
You can refer to the provided Kaggle notebook for a detailed walk-through and implementation.
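The sketch below illustrates the general shape of that approach with the Hugging Face Transformers library; the subclassed trainer, hyperparameters, and the assumed count of 36 output categories are illustrative, not the exact notebook code.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=36,  # one output per disaster category (count assumed)
    problem_type="multi_label_classification",
)


class WeightedTrainer(Trainer):
    """Trainer with a class-weighted loss to counter label imbalance."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.BCEWithLogitsLoss(pos_weight=self.class_weights)
        loss = loss_fct(outputs.logits, labels.float())
        return (loss, outputs) if return_outputs else loss


args = TrainingArguments(output_dir="bert-disaster",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
# Illustrative usage, assuming tokenized train/test datasets and per-label
# weights have been prepared:
# trainer = WeightedTrainer(class_weights=per_label_weights, model=model,
#                           args=args, train_dataset=train_ds,
#                           eval_dataset=test_ds)
```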