bensenberner/airline-sentiment-prediction

I classify the sentiment of tweets within this dataset as positive, neutral, or negative using the fast.ai implementation of ULMFiT.

  1. I perform exploratory data analysis in this notebook.
    • Create a train / test split, looking only at the train set for the rest of the project until the very end (see the sketch after this step)
    • Determine the distribution of classes
    • Get a sense of the defining characteristics of each class (the topic / tone of tweets of each sentiment)
    • Describe how I augmented the airline dataset with a different tweet dataset, making sure to keep the class distributions the same after the join
    • Begin to consider using yet another tweet dataset
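
A minimal sketch of the stratified split and class-distribution check, assuming the standard Kaggle airline CSV ("Tweets.csv") with `text` and `airline_sentiment` columns; the file name and column names are my assumptions, not taken from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the airline tweets (file name and column names are assumptions).
df = pd.read_csv("Tweets.csv")

# Hold out a stratified test set so the class proportions are preserved;
# the test set is not touched again until the final evaluation.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["airline_sentiment"], random_state=42
)

# Class distribution of the training set (negative dominates this dataset).
print(train_df["airline_sentiment"].value_counts(normalize=True))
```
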
  2. Create a classifier based on the unmodified airline data only (see the sketch after this step)
    • Create a language model that can produce useful embeddings for tweets
    • Use those embeddings to train a classifier
    • Train another classifier in which I oversample the minority classes in a bid to achieve a higher AUC-ROC score
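
A rough sketch of the two-stage ULMFiT training, assuming the fastai v1 text API; the hyperparameters, column names, and the `valid_df` split are placeholders rather than the values used in the notebooks:

```python
from fastai.text import *  # fastai v1 text API

# valid_df is assumed to be a further split carved out of train_df.

# Stage 1: fine-tune an AWD-LSTM language model on the airline tweets.
data_lm = TextLMDataBunch.from_df(".", train_df, valid_df, text_cols="text")
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(1, 1e-3)
learn_lm.save_encoder("ft_enc")  # reuse this encoder in the classifier

# Stage 2: train the sentiment classifier on top of the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df(
    ".", train_df, valid_df,
    vocab=data_lm.vocab, text_cols="text", label_cols="airline_sentiment",
)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder("ft_enc")
learn_clf.fit_one_cycle(1, 1e-2)
```
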
  3. Create a classifier based on oversampled airline data (see the sketch after this step)
    • Randomly oversample the minority classes (positive and neutral) until they both have the same cardinality as the majority class (negative)
    • The hope is that the classifier trained on this new dataset will achieve higher recall on the minority classes, since it now has a "better sense" of what they look like, but perhaps at the cost of lower recall on the majority class
    • Train the classifier exactly as before
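
A sketch of the random oversampling step, assuming the same `train_df` / `airline_sentiment` naming as above:

```python
import pandas as pd

# Upsample each class (with replacement) to the size of the largest class.
max_n = train_df["airline_sentiment"].value_counts().max()
oversampled_df = pd.concat(
    [
        grp.sample(max_n, replace=True, random_state=42)
        for _, grp in train_df.groupby("airline_sentiment")
    ],
    ignore_index=True,
)

# Shuffle so the duplicated minority examples are not grouped together.
oversampled_df = oversampled_df.sample(frac=1, random_state=42).reset_index(drop=True)
print(oversampled_df["airline_sentiment"].value_counts())  # classes now equal
```
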
  4. Create a classifier based on airline data + sentiment-140 data (see the sketch after this step)
    • Create a language model that can represent both the airline data and the sentiment-140 data
    • Train a classifier using only the airline data
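
A sketch of how the language-model corpus could be pooled while the classifier still trains on airline labels only, again assuming the fastai v1 API and a hypothetical `sentiment140_df` with a `text` column:

```python
import pandas as pd
from fastai.text import *  # fastai v1 text API

# Pool unlabeled text from both corpora for language-model fine-tuning;
# sentiment-140 labels are ignored here, only its text is used.
lm_train_df = pd.concat(
    [train_df[["text"]], sentiment140_df[["text"]]], ignore_index=True
)
data_lm = TextLMDataBunch.from_df(".", lm_train_df, valid_df[["text"]], text_cols="text")
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder("ft_enc_combined")

# The classifier sees only labeled airline tweets, but reuses the vocabulary
# and encoder fine-tuned on the larger combined corpus.
data_clas = TextClasDataBunch.from_df(
    ".", train_df, valid_df,
    vocab=data_lm.vocab, text_cols="text", label_cols="airline_sentiment",
)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder("ft_enc_combined")
learn_clf.fit_one_cycle(1, 1e-2)
```
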
  5. Evaluate the performance of the best classifiers against the test set (see the sketch after this step)
    • Compute the test accuracy of each model and compare it to the baseline
    • Compute the AUC-ROC score
    • Examine the confusion matrix to determine the most common types of mistakes the classifier makes
    • Explore those mistakes and try to determine whether any meaningful patterns exist
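
A sketch of the evaluation metrics, assuming `probs` is an (n_samples, 3) array of predicted class probabilities on the test set and `y_true` holds the integer-encoded true labels (both names are placeholders):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_pred = probs.argmax(axis=1)

# Accuracy vs. a majority-class baseline (always predicting the most common class).
acc = accuracy_score(y_true, y_pred)
baseline = np.bincount(y_true).max() / len(y_true)
print(f"accuracy={acc:.3f}  majority-class baseline={baseline:.3f}")

# Multiclass AUC-ROC via one-vs-rest averaging over the three classes.
print("AUC-ROC:", roc_auc_score(y_true, probs, multi_class="ovr"))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```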

fast.ai challenge questions:

  • What is the highest test accuracy achievable on this dataset?
  • What types of visualization will help me grasp the nature of the problem / data? (a sketch of two candidates follows this list)
    • Look at the frequency of words within each sentiment
    • Distribution of "label confidence" for each sentiment
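
A sketch of those two visualizations, assuming the Kaggle airline CSV's `airline_sentiment_confidence` column (the column name is my assumption):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Most frequent tokens per sentiment class (naive whitespace tokenization).
for label, grp in train_df.groupby("airline_sentiment"):
    counts = Counter(w.lower() for tweet in grp["text"] for w in tweet.split())
    print(label, counts.most_common(10))

# Distribution of annotator label confidence for each sentiment.
for label, grp in train_df.groupby("airline_sentiment"):
    plt.hist(grp["airline_sentiment_confidence"], bins=20, alpha=0.5, label=label)
plt.xlabel("label confidence")
plt.ylabel("number of tweets")
plt.legend()
plt.show()
```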

TODO:

  • Describe why fast.ai's accuracy metrics were confusing
  • Wasn't able to achieve reproducible results in Google Colab
  • Rather than randomly oversampling the raw text data directly, create language-model embeddings of all the tweets and then perform oversampling on those (a rough sketch of this idea follows)
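
A rough sketch of that idea, assuming a hypothetical `embed(text)` helper that maps a tweet to a fixed-size vector using the fine-tuned ULMFiT encoder, and using imbalanced-learn for the resampling step:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
# SMOTE could be swapped in here to synthesize new embedding vectors
# by interpolation instead of duplicating existing ones.

# embed() is a hypothetical helper built around the fine-tuned encoder.
X = np.stack([embed(t) for t in train_df["text"]])
y = train_df["airline_sentiment"].values

# Oversample in embedding space rather than on the raw text.
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
```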
