I classify the sentiment of tweets within this dataset as positive, neutral, or negative using the fast.ai implementation of ULMFiT.
- I perform exploratory data analysis in this notebook.
  - Create a train / test split, looking only at the train set for the rest of the project until the very end (a split sketch follows this list)
  - Determine the distribution of classes
  - Get a sense of the defining characteristics of each class (the topic / tone of tweets of each sentiment)
  - Describe how I augmented the airline dataset with a different tweet dataset, making sure to keep the class distributions the same after the join
  - Begin to consider using yet another tweet dataset
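A minimal sketch of the stratified split, assuming the airline CSV is named `Tweets.csv` with `text` / `airline_sentiment` columns (the actual file and column names in the notebook may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Tweets.csv')  # assumed filename for the airline dataset

# Stratify on the label so train and test keep the same class
# distribution; the test set is then set aside until the very end.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['airline_sentiment'], random_state=42
)
```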
- Create a classifier based on unmodified airline data only
  - Create a language model that can produce useful embeddings for tweets (see the ULMFiT sketch after this list)
  - Use those embeddings to train a classifier
  - Train another classifier in which I oversample the minority classes in a bid to achieve a higher AUC-ROC score
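A sketch of the two ULMFiT stages using the fastai v2 API and the `train_df` from the split sketch above; the column names and hyperparameters are assumptions, not the exact values used in the notebooks:

```python
from fastai.text.all import *

# Stage 1: fine-tune the AWD_LSTM language model on the tweet text.
dls_lm = TextDataLoaders.from_df(train_df, text_col='text', is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(3, 2e-2)
learn_lm.save_encoder('ft_enc')  # the encoder is what produces the tweet embeddings

# Stage 2: train a classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(
    train_df, text_col='text', label_col='airline_sentiment',
    text_vocab=dls_lm.vocab
)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('ft_enc')
learn_clas.fine_tune(3, 2e-2)
```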
- Create a classifier based on oversampled airline data
  - Randomly oversample the minority classes (positive and neutral) until they both have the same cardinality as the majority class (negative); a sketch follows this list
  - The hope is that a classifier trained on this new dataset will achieve higher recall on the minority classes, since it now has a "better sense" of what they look like, though perhaps at the cost of lower recall on the majority class
  - Train the classifier exactly as before
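A sketch of the random oversampling step on the raw training frame (label column name assumed, as above):

```python
import pandas as pd

def oversample(df, label_col='airline_sentiment', seed=42):
    """Resample every minority class (with replacement) up to the
    cardinality of the majority class, then shuffle the result."""
    max_n = df[label_col].value_counts().max()
    parts = [
        grp if len(grp) == max_n
        else grp.sample(n=max_n, replace=True, random_state=seed)
        for _, grp in df.groupby(label_col)
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

train_df_over = oversample(train_df)
```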
- Create a classifier based on airline data + sentiment-140 data
  - Create a language model that can represent both airline and sentiment-140 data (see the combined-corpus sketch after this list)
  - Train a classifier using only airline data
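Since the language-model stage ignores labels, the two corpora can simply be concatenated before fine-tuning; a sketch, assuming a sentiment-140 frame with a `text` column (the column names in that download may differ):

```python
import pandas as pd

# Combine raw text from both corpora; labels are irrelevant at the
# language-model stage, so no class-distribution bookkeeping is needed here.
lm_df = pd.concat(
    [train_df[['text']], sentiment140_df[['text']]], ignore_index=True
)
# lm_df feeds the same TextDataLoaders.from_df(..., is_lm=True) pipeline
# as before; the classifier stage then uses train_df (airline data) only.
```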
- Evaluate the performance of the best classifiers against the test set
  - Compute the test accuracy of all models and compare it to the baseline
  - Compute the AUC-ROC score
  - Examine the confusion matrix to determine the most common types of mistakes the classifier makes
  - Explore those mistakes and try to determine whether any meaningful patterns exist (an evaluation sketch follows this list)
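A sketch of the evaluation step on the held-out `test_df`; `roc_auc_score` handles the three-class case with one-vs-rest averaging, assuming all three classes appear in the test labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Score the held-out test set with the trained fastai learner.
test_dl = learn_clas.dls.test_dl(test_df, with_labels=True)
probs, targets = learn_clas.get_preds(dl=test_dl)
preds = probs.argmax(dim=1)

print('accuracy:', accuracy_score(targets, preds))
print('AUC-ROC :', roc_auc_score(targets, probs, multi_class='ovr'))
print(confusion_matrix(targets, preds))
```

For the mistake exploration, fastai's `ClassificationInterpretation.from_learner` also provides `plot_confusion_matrix` and `plot_top_losses` out of the box.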
fast.ai challenge questions:
- What is the highest test accuracy achievable on this dataset?
- What type of visualization will help me grasp the nature of the problem / data?
  - Look at the frequency of words within each sentiment (a sketch follows this list)
  - Look at the distribution of "label confidence" for each sentiment
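A sketch of the word-frequency view, using crude whitespace tokenization (the notebook would more likely use fastai's tokenizer and drop stop words):

```python
from collections import Counter

def top_words(df, sentiment, n=20, text_col='text', label_col='airline_sentiment'):
    # Count whitespace-separated, lower-cased tokens within one class.
    counts = Counter()
    for tweet in df.loc[df[label_col] == sentiment, text_col]:
        counts.update(tweet.lower().split())
    return counts.most_common(n)

for s in ('negative', 'neutral', 'positive'):
    print(s, top_words(train_df, s))
```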
TODO:
- Describe why fast.ai's accuracy metrics were confusing
- Wasn't able to achieve reproducible runs in Google Colab
- Rather than randomly oversampling the raw text data directly, create language model embeddings of all the tweets and then perform oversampling on those (sketched below)
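A sketch of that TODO with imbalanced-learn, assuming the tweet embeddings have already been extracted from the fine-tuned encoder into a matrix (the extraction itself is the hard part and is not shown; the `.npy` filenames are hypothetical):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # SMOTE is the other obvious choice

# Hypothetical precomputed artifacts: one embedding row per tweet,
# plus the matching integer class labels.
X = np.load('tweet_embeddings.npy')
y = np.load('tweet_labels.npy')

X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
# SMOTE().fit_resample(X, y) would instead synthesize new minority-class
# points between existing neighbours in embedding space.
```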