This project works with Olist's Brazilian E-Commerce Public Dataset. Our goal is to use this dataset to develop a classification model that identifies whether a customer review was positive or negative. Furthermore, we wish to extract explanations from the model, in this case by applying LIME and SHAP. We provide a Dockerfile and a Poetry configuration file for ease of running and reproducibility.
It is always recommended to have a separate Python environment for different projects. This project uses Python 3.11.5. We walk you through the environment configuration with Poetry and the highly recommended Docker image. pip and Conda failed to build the project due to unresolved dependency issues with Numba, so their usage is not recommended - but feel free to try.
We provide a Docker image which runs our training script and allows you to interact with the files. Running the docker build command below will build the Python 3.11 image, install Poetry and run train.py, which generates the .pkl models.
docker build -t bravium_heitor .
Running the docker run command below will let you interact with the container. The -v flags mount the container's output folders on your machine, so the generated files are persisted locally.
Inside the container, you can run poetry run python explainability.py to run LIME and SHAP and get the results. Beyond that, feel free to explore the files.
docker run -it --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/explainability:/app/explainability \
-v $(pwd)/metrics:/app/metrics \
-v $(pwd)/model:/app/model \
-v $(pwd)/processed_csvs:/app/processed_csvs \
bravium_heitor
Poetry is our preferred Python package manager and we recommend its use for this project. You should have it installed locally with pipx. There are plenty of guides available on this topic.
With Poetry installed, just run
poetry install --no-root
and the environment will be fully operational. The recommended order for running the scripts is:
- Getting the dataset from Kaggle -> get_kaggle_dataset.py
- Following the data_analysis.ipynb and data_cleaning.ipynb notebooks
- Running the train.py and explainability.py scripts
However, following that order is not necessary, since we've uploaded our processed .csv files to the processed_csvs folder.
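For reference, here is a minimal sketch of how the dataset can be downloaded with the official kaggle package; the dataset slug and output folder are assumptions, and get_kaggle_dataset.py may do this differently.

```python
# Requires a Kaggle API token at ~/.kaggle/kaggle.json
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Assumed dataset slug and output folder; get_kaggle_dataset.py may differ
api.dataset_download_files("olistbr/brazilian-ecommerce", path="data", unzip=True)
```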
The only file that depends on having a .pkl model in the /models folder is the explainability.py script. As such, if you're unable to run the train.py script but still want to explore the code (or just want to access our model), you can download the pickle files here.
During the EDA phase, our main goal is to understand the dataset's features and their relationships with each other. We exclude several files and records from the dataset, either because they are not suited for the analysis or because they have missing data, and save a much smaller sample of the dataset for the cleaning stage.
Using the .csv file resulting from our EDA, we apply essential pre-processing steps at this stage, such as removing trailing whitespace, emojis, and special characters, and applying stemming.
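As an illustration, here is a minimal sketch of that kind of cleaning, assuming a review_text column and NLTK's Portuguese RSLP stemmer; the notebook may use different column names, paths or tooling.

```python
import re

import nltk
import pandas as pd
from nltk.stem import RSLPStemmer  # Portuguese stemmer

nltk.download("rslp", quiet=True)
stemmer = RSLPStemmer()

def clean_review(text: str) -> str:
    """Lowercase, strip whitespace, drop emojis/special characters, and stem each word."""
    text = text.strip().lower()
    text = re.sub(r"[^a-záéíóúâêôãõçü\s]", " ", text)  # keep only letters (including accents)
    return " ".join(stemmer.stem(token) for token in text.split())

df = pd.read_csv("processed_csvs/reviews.csv")  # illustrative path
df["review_text"] = df["review_text"].astype(str).map(clean_review)
df.to_csv("processed_csvs/reviews_clean.csv", index=False)
```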
The model is defined in the train.py script. The goal is to automatically classify reviews as positive or negative based on their text content. At this stage, we first transform the text into numerical features with TF-IDF vectorization and then train the model.
The model is a Logistic Regression classifier, trained with GridSearchCV to find the best hyperparameters (C, penalty, class_weight). The training and test sets are split 80/20, and the model is optimized for F1-score, which balances performance across classes.
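The following is a minimal sketch of this setup with scikit-learn, assuming a cleaned CSV with a review_text column and a 0/1 sentiment column; column names, file names and the exact grid are illustrative, and the real values live in train.py.

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("processed_csvs/reviews_clean.csv")  # illustrative path
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["sentiment"],  # sentiment assumed encoded as 0 = negative, 1 = positive
    test_size=0.2, random_state=42, stratify=df["sentiment"],
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, solver="liblinear")),  # liblinear supports l1 and l2
])

param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

joblib.dump(search.best_estimator_, "model/sentiment_model.pkl")  # illustrative file name
```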
To evaluate the model, we generate a classification graph showcasing precision, recall and F1-score per class. We also save a confusion matrix (true vs. predicted labels). Both are saved as .png images under the /metrics directory. Our model presented the following results:
The blue (precision), orange (recall) and green (F1-score) bars show the metrics for each class. We can see that the model is overall more precise when detecting positive reviews, but shows solid scores in both cases.
As we see from the confusion matrix, the model's most frequent mistake is misclassifying negative reviews as positive. Besides the choice of model, this could also be a result of an imbalanced dataset - one with more positive reviews than negative ones.
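For reference, here is a minimal sketch of how such plots can be produced with scikit-learn and matplotlib, continuing from the training sketch above; file names and the 0 = negative / 1 = positive label order are assumptions, and train.py may do this differently.

```python
import matplotlib
matplotlib.use("Agg")  # render to files without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, precision_recall_fscore_support

y_pred = search.best_estimator_.predict(X_test)  # model and split from the training sketch

# Per-class precision / recall / F1 as a grouped bar chart
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred)
pd.DataFrame(
    {"precision": prec, "recall": rec, "f1-score": f1},
    index=["negative", "positive"],  # assumed label order: 0 = negative, 1 = positive
).plot(kind="bar")
plt.tight_layout()
plt.savefig("metrics/classification_report.png")

# Confusion matrix of true vs. predicted labels
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.tight_layout()
plt.savefig("metrics/confusion_matrix.png")
```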
For explainability, we want to know why a review was classified as positive or negative. We use two complementary tools. The explainability.py script loads the trained Logistic Regression model and the TF-IDF pipeline, samples reviews, and generates LIME and SHAP visualizations in the /explainability folder.
LIME works on individual predictions. For a given review, it identifies the top words that influenced the classification.
This graph shows which words pushed the model most strongly towards a specific prediction. Green words weigh towards a positive review and red ones towards a negative one.
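A minimal sketch of how such a per-review explanation could be generated with lime's LimeTextExplainer, assuming the pipeline saved in the training sketch; paths, class names and the sample review are illustrative, and explainability.py may differ.

```python
import joblib
from lime.lime_text import LimeTextExplainer

pipeline = joblib.load("model/sentiment_model.pkl")  # illustrative path
explainer = LimeTextExplainer(class_names=["negative", "positive"])

review = "entrega atrasada e produto veio com defeito"  # illustrative review text
explanation = explainer.explain_instance(
    review,
    pipeline.predict_proba,  # LIME perturbs the text and queries class probabilities
    num_features=10,
)
fig = explanation.as_pyplot_figure()
fig.savefig("explainability/lime_example.png", bbox_inches="tight")
```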
SHAP provides a more general view. Instead of only explaining one prediction, it highlights the most influential words across many reviews.
This is a SHAP summary plot. The Y-axis shows the most important words/features, sorted by overall importance, while the X-axis shows their SHAP value (impact on the model's output).
Red means the feature increased the chance of predicting “positive”, and blue means it pushed towards “negative”. Their position from left to right indicates how strong the impact was.
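A minimal sketch of how a summary plot like this could be produced with shap's LinearExplainer on the TF-IDF features (a reasonable choice for a Logistic Regression); paths, sample size and column names are assumptions, and explainability.py may use a different explainer.

```python
import joblib
import matplotlib.pyplot as plt
import pandas as pd
import shap

pipeline = joblib.load("model/sentiment_model.pkl")  # illustrative path
vectorizer = pipeline.named_steps["tfidf"]
clf = pipeline.named_steps["clf"]

# Sample a few hundred cleaned reviews and vectorize them
reviews = pd.read_csv("processed_csvs/reviews_clean.csv")["review_text"].sample(200, random_state=42)
X = vectorizer.transform(reviews)

explainer = shap.LinearExplainer(clf, X)  # linear explainer suits a Logistic Regression
shap_values = explainer.shap_values(X)

shap.summary_plot(
    shap_values,
    X.toarray(),
    feature_names=vectorizer.get_feature_names_out(),
    show=False,
)
plt.savefig("explainability/shap_summary.png", bbox_inches="tight")
```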
In this project, we built a sentiment analysis model using Olist’s Brazilian E-Commerce Public Dataset. Our objective was to automatically classify customer reviews as positive or negative and to further interpret the model’s decisions using explainability tools.
The final model is a Logistic Regression classifier, trained on TF–IDF features and optimized with GridSearchCV. The training/test split was 80/20, and the model was optimized for F1-score.
- Accuracy: ~88%
- F1-score: 0.82 (negative), 0.91 (positive)
- Precision/Recall: the model is more reliable at detecting positive reviews, but still achieves solid results for negative ones.
LIME demonstrated how individual words influence single predictions. SHAP provided a global perspective, ranking the most influential words across the dataset.
Overall, the project achieved its dual goal: building a performant model for sentiment classification and making its predictions transparent and interpretable.