I classify the sentiment of tweets within this dataset as positive, neutral, or negative using the fast.ai implementation of ULMFiT.
- I perform exploratory data analysis in this notebook.
  - Create a train / test split, looking only at the train set for the rest of the project until the very end (a split sketch follows this list)
  - Determine the distribution of classes
  - Get a sense of the defining characteristics of each class (the topic / tone of tweets of each sentiment)
  - Describe how I augmented the airline dataset with a different tweet dataset, making sure to keep the class distributions the same after the join
  - Begin to consider using yet another tweet dataset
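A minimal sketch of the stratified split, assuming the airline CSV is named `Tweets.csv` with `text` / `airline_sentiment` columns (the actual file and column names in the notebook may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Tweets.csv')  # assumed filename for the airline dataset

# Stratify on the label so train and test keep the same class
# distribution; the test set is then set aside until the very end.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['airline_sentiment'], random_state=42
)
```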
- Create a classifier based on unmodified airline data only
  - Create a language model that can produce useful embeddings for tweets (see the ULMFiT sketch after this list)
  - Use those embeddings to train a classifier
  - Train another classifier in which I oversample the minority classes in a bid to achieve a higher AUC-ROC score
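A sketch of the two ULMFiT stages using the fastai v2 API and the `train_df` from the split sketch above; the column names and hyperparameters are assumptions, not the exact values used in the notebooks:

```python
from fastai.text.all import *

# Stage 1: fine-tune the AWD_LSTM language model on the tweet text.
dls_lm = TextDataLoaders.from_df(train_df, text_col='text', is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(3, 2e-2)
learn_lm.save_encoder('ft_enc')  # the encoder is what produces the tweet embeddings

# Stage 2: train a classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(
    train_df, text_col='text', label_col='airline_sentiment',
    text_vocab=dls_lm.vocab
)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('ft_enc')
learn_clas.fine_tune(3, 2e-2)
```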
- Create a classifier based on oversampled airline data
  - Randomly oversample the minority classes (positive and neutral) until they both have the same cardinality as the majority class (negative); a sketch follows this list
  - The hope is that a classifier trained on this new dataset will achieve higher recall on the minority classes, since it now has a "better sense" of what they look like, though perhaps at the cost of lower recall on the majority class
  - Train the classifier exactly as before
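A sketch of the random oversampling step on the raw training frame (label column name assumed, as above):

```python
import pandas as pd

def oversample(df, label_col='airline_sentiment', seed=42):
    """Resample every minority class (with replacement) up to the
    cardinality of the majority class, then shuffle the result."""
    max_n = df[label_col].value_counts().max()
    parts = [
        grp if len(grp) == max_n
        else grp.sample(n=max_n, replace=True, random_state=seed)
        for _, grp in df.groupby(label_col)
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

train_df_over = oversample(train_df)
```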
- Create a classifier based on airline data + sentiment-140 data
  - Create a language model that can represent both airline and sentiment-140 data (see the combined-corpus sketch after this list)
  - Train a classifier using only airline data
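Since the language-model stage ignores labels, the two corpora can simply be concatenated before fine-tuning; a sketch, assuming a sentiment-140 frame with a `text` column (the column names in that download may differ):

```python
import pandas as pd

# Combine raw text from both corpora; labels are irrelevant at the
# language-model stage, so no class-distribution bookkeeping is needed here.
lm_df = pd.concat(
    [train_df[['text']], sentiment140_df[['text']]], ignore_index=True
)
# lm_df feeds the same TextDataLoaders.from_df(..., is_lm=True) pipeline
# as before; the classifier stage then uses train_df (airline data) only.
```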
- Evaluate the performance of the best classifiers against the test set
  - Compute the test accuracy of all models and compare it to the baseline
  - Compute the AUC-ROC score
  - Examine the confusion matrix to determine the most common types of mistakes the classifier makes
  - Explore those mistakes and try to determine whether any meaningful patterns exist (an evaluation sketch follows this list)
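A sketch of the evaluation step on the held-out `test_df`; `roc_auc_score` handles the three-class case with one-vs-rest averaging, assuming all three classes appear in the test labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Score the held-out test set with the trained fastai learner.
test_dl = learn_clas.dls.test_dl(test_df, with_labels=True)
probs, targets = learn_clas.get_preds(dl=test_dl)
preds = probs.argmax(dim=1)

print('accuracy:', accuracy_score(targets, preds))
print('AUC-ROC :', roc_auc_score(targets, probs, multi_class='ovr'))
print(confusion_matrix(targets, preds))
```

For the mistake exploration, fastai's `ClassificationInterpretation.from_learner` also provides `plot_confusion_matrix` and `plot_top_losses` out of the box.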
fast.ai challenge questions:
- What is the highest test accuracy achievable on this dataset?
- What type of visualization will help me grasp the nature of the problem / data?
  - Look at the frequency of words within each sentiment (a sketch follows this list)
  - Look at the distribution of "label confidence" for each sentiment
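A sketch of the word-frequency view, using crude whitespace tokenization (the notebook would more likely use fastai's tokenizer and drop stop words):

```python
from collections import Counter

def top_words(df, sentiment, n=20, text_col='text', label_col='airline_sentiment'):
    # Count whitespace-separated, lower-cased tokens within one class.
    counts = Counter()
    for tweet in df.loc[df[label_col] == sentiment, text_col]:
        counts.update(tweet.lower().split())
    return counts.most_common(n)

for s in ('negative', 'neutral', 'positive'):
    print(s, top_words(train_df, s))
```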
TODO:
- Describe why fast.ai's accuracy metrics were confusing
- Wasn't able to achieve reproducible runs in Google Colab
- Rather than randomly oversampling the raw text data directly, create language model embeddings of all the tweets and then perform oversampling on those (sketched below)
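A sketch of that TODO with imbalanced-learn, assuming the tweet embeddings have already been extracted from the fine-tuned encoder into a matrix (the extraction itself is the hard part and is not shown; the `.npy` filenames are hypothetical):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # SMOTE is the other obvious choice

# Hypothetical precomputed artifacts: one embedding row per tweet,
# plus the matching integer class labels.
X = np.load('tweet_embeddings.npy')
y = np.load('tweet_labels.npy')

X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
# SMOTE().fit_resample(X, y) would instead synthesize new minority-class
# points between existing neighbours in embedding space.
```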