## Overview

This project classifies movies into genres using their summarized plots. TF-IDF and BERT are used to generate two different sets of embeddings, which are then used for modelling.
## Table of Contents

- [Overview](#overview)
- [Dataset](#dataset)
- [Preprocessing](#preprocessing)
- [Training](#training)
- [Evaluation](#evaluation)
- [Dependencies](#dependencies)
## Dataset

The dataset used here has been taken from Kaggle; you can get it here. It contains details about ~35,000 movies, including plot summaries, release year, origin, etc.
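A minimal loading sketch, assuming this is the Kaggle Wikipedia Movie Plots dataset with its usual file name (adjust the path to your download):

```python
import pandas as pd

# File name assumed from the usual Kaggle download; adjust if yours differs.
df = pd.read_csv("wiki_movie_plots_deduped.csv")
print(df.shape)             # roughly 35,000 rows
print(df.columns.tolist())  # plot, release year, origin, genre, etc.
```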
The movie genre column had 2,265 unique values, including repetitions, alternative spellings, and other redundancies. These were condensed down to 11 major movie genres, each with enough data for classification (a sketch of this condensing step follows the table). They are listed below in decreasing order of the number of movies in each:
Genre | No. of Movies |
---|---|
Horror | 1167 |
Action | 1098 |
Thriller | 966 |
Romance | 923 |
Western | 865 |
Science Fiction | 639 |
Crime | 568 |
Adventure | 526 |
Musical | 467 |
Film Noir | 345 |
Mystery | 310 |
TOTAL | 7874 |
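The exact mapping from the 2,265 raw genre strings down to these 11 buckets is project-specific and not reproduced here; the sketch below, with a purely illustrative `genre_map` and an assumed `Genre` column name, shows the general shape of the step:

```python
# Purely illustrative: only a few of the raw value -> bucket entries are shown.
genre_map = {
    "horror": "Horror",
    "sci-fi": "Science Fiction",
    "science fiction": "Science Fiction",
    "romantic comedy": "Romance",
    # ... the remaining raw value -> bucket entries
}

df["Genre"] = df["Genre"].str.strip().str.lower().map(genre_map)
df = df.dropna(subset=["Genre"])     # drop movies outside the 11 buckets
print(df["Genre"].value_counts())    # counts should match the table above
```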
## Preprocessing

The genres are encoded as integers with a label encoder, and embeddings are then created with a TF-IDF vectorizer and the BERT base uncased model. Separate models are built for each set of embeddings.
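A minimal sketch of this step; the `Plot`/`Genre` column names, the TF-IDF `max_features` cap, and mean-pooling of the BERT hidden states are assumptions, so the project's exact settings may differ:

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from transformers import BertModel, BertTokenizer

# Encode the 11 genres as integers 0-10 (LabelEncoder sorts alphabetically).
le = LabelEncoder()
y = le.fit_transform(df["Genre"])

# TF-IDF embeddings of the plot summaries.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_tfidf = tfidf.fit_transform(df["Plot"])

# BERT embeddings: masked mean-pooling over bert-base-uncased hidden states.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    out = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=512, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state           # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        out.append((hidden * mask).sum(1) / mask.sum(1))   # mean over tokens
    return torch.cat(out).numpy()

X_bert = embed(df["Plot"].tolist())
```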
## Training

We have trained models on both the TF-IDF and the BERT embeddings. At first, we trained baseline models using each. The Train_Classifiers() class used to train the models is in the helper_functions.py file.
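The Train_Classifiers() interface itself isn't documented in this README; conceptually, the baseline step boils down to a loop like this sketch (the split ratio and random_state are assumptions):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_bert, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```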
Then we used class weighting by assigning different classes the following weights:
Class | Genre | Weight |
---|---|---|
0 | Action | 0.6519 |
1 | Adventure | 1.3609 |
2 | Crime | 1.2602 |
3 | Film Noir | 2.0748 |
4 | Horror | 0.6134 |
5 | Musical | 1.5328 |
6 | Mystery | 2.3091 |
7 | Romance | 0.7755 |
8 | Science Fiction | 1.1202 |
9 | Thriller | 0.7410 |
10 | Western | 0.8275 |

(Class indices follow the label encoder's alphabetical ordering of the genres; weights are rounded to four decimals, and the exact values come from the code below.)
The weights have been calculated using the following code:

```python
import numpy as np
from sklearn.utils import compute_class_weight

# 'balanced' assigns each class the weight n_samples / (n_classes * n_c)
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weights = {index: value for index, value in enumerate(weights)}
weights
```
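As a sanity check: 'balanced' assigns each class the weight n_samples / (n_classes × n_c). For class 0 (Action, 1098 movies) this gives 7874 / (11 × 1098) ≈ 0.6519, matching the table above, while rare classes like Mystery (310 movies) get correspondingly larger weights.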
The models were trained and tested again, and a select few were fine-tuned. Some of the promising models could not be fine-tuned because doing so was computationally too expensive. Note that helper_functions.py was modified partway through, so some results show AUC and training times while others do not.
## Evaluation

### Baseline models

The embeddings are used directly to train and test the models, setting a quick initial baseline.
With TF-IDF embeddings, the best performances by accuracy are:
- Logistic Regression: 64%
- SVC: 62.2%
- CatBoost: 61.8%
- LightGBM: 61%
With BERT embeddings, the best performances by accuracy are:
- Logistic Regression: 67.1%
- CatBoost: 65.7%
- SVC: 65.4%
- LightGBM: 65%
### Class weighting

Since the number of observations varies widely between classes, class weights are computed to help the models learn better. They are passed to the classifiers as in the sketch below.
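For scikit-learn estimators and LightGBM, the weight dictionary can be passed directly via `class_weight` (CatBoost takes its `class_weights` parameter instead); for example:

```python
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# `weights` is the {class index: weight} dict computed earlier.
log_reg = LogisticRegression(max_iter=1000, class_weight=weights)
svc = SVC(class_weight=weights)
lgbm = LGBMClassifier(class_weight=weights)
```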
With class weights applied, the best performances by accuracy using TF-IDF embeddings are:
- Logistic Regression: 64%
- SVC: 62.2%
- CatBoost: 61.8%
- LightGBM: 61%
With BERT embeddings, the best performances by accuracy are:
- CatBoost: 65.58%
- Logistic Regression: 64.6%
- LightGBM: 63.55%
- SVC: 63.2%
Using class weights seems to hurt overall accuracy, but since the classes are imbalanced, we must also check other metrics, as in the sketch below.
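For instance, the weighted F1 score and a per-class report can be computed as follows, reusing the test split and weighted model from the sketches above:

```python
from sklearn.metrics import classification_report, f1_score

log_reg.fit(X_train, y_train)
pred = log_reg.predict(X_test)
print(f1_score(y_test, pred, average="weighted"))
print(classification_report(y_test, pred, target_names=le.classes_))
```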
As performance is clearly better with the BERT embeddings, we will be using them going forward.

### Fine-tuning

Starting with logistic regression, we use cross-validation with the weighted F1 score as our scoring metric and get the following results:
Model | Weighted F1 Score |
---|---|
Unweighted + default parameters | 0.6709 |
Weighted + default parameters | 0.6601 |
Unweighted + best parameters | 0.6729 |
The best performance comes from fine-tuned hyperparameters found using GridSearchCV with no class weights. The model is:

```python
LogisticRegression(C=0.3, max_iter=2000, penalty='l2', solver='liblinear')
```

Its cross-validated accuracy comes out to 68%, compared to 67.42% for the default model.
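The README does not list the search grid, so the grid below is an assumption chosen to include the reported best parameters; the search itself looks like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 0.3, 1, 3],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
search = GridSearchCV(LogisticRegression(max_iter=2000), param_grid,
                      scoring="f1_weighted", cv=5, n_jobs=-1)
search.fit(X_bert, y)
print(search.best_params_, search.best_score_)
```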
With LightGBM, we once again use cross-validation with the weighted F1 score as our scoring metric and get the following results:
Model | Weighted F1 Score |
---|---|
Unweighted + default parameters | 0.6450 |
Weighted + default parameters | 0.6554 |
Unweighted + best parameters | 0.6604 |
Weighted + best parameters | 0.6635 |
The best performance comes from fine-tuned hyperparameters found using GridSearchCV with class weights. The model is:

```python
LGBMClassifier(class_weight=weights, colsample_bytree=0.8, max_depth=7,
               learning_rate=0.1, n_estimators=300, subsample=0.8)
```

Its cross-validated accuracy comes out to 66.9%, compared to 65.08% for the default model.
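A quick way to reproduce such a cross-validated accuracy figure for the tuned model (cv=5 is an assumption):

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

tuned = LGBMClassifier(class_weight=weights, colsample_bytree=0.8, max_depth=7,
                       learning_rate=0.1, n_estimators=300, subsample=0.8)
print(cross_val_score(tuned, X_bert, y, cv=5, scoring="accuracy").mean())
```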
SVC and CatBoost were not fine-tuned due to the lack of computational power needed. If you can, please do it 😊
## Dependencies

The dependencies for this project are listed in the requirements.txt file; install them with `pip install -r requirements.txt`. The Python version used is 3.10.11.