Sentiment Classification of Movie Reviews

Contact:
Pooria Daneshvar Kakhaki
Email: daneshvarkakhaki.p@northeastern.edu Department of Computer Science, Northeastern University, Boston, MA

Neda Ghohabi Esfahani
Email: ghohabiesfahani.n@northeastern.edu Department of Bioengineering, Northeastern University

⚠️ Important Notice

To demo the repository, please refer to the instructions provided below.
Before running the demo, please ensure that all required libraries are installed and the environment is properly set up by following the instructions provided in the Environment section. 🔴 Note: Failing to install the required packages or correctly set up the dataset may result in errors during the demo.

Demo Instructions

To quickly run a demo of this repository:

Run the demo script:

python demo.py [--text "I loved the movie"] [--cpu]

Arguments:

--cpu: Force CPU training if a GPU is not available or not desired.
--text: Classify a custom review text.
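
The two flags could be wired up roughly as follows. This is a minimal sketch, not the actual contents of demo.py: the helper names parse_args and pick_device are hypothetical, and the default text is simply the default review quoted in this README.

```python
import argparse

# Default sample review as stated in this README; the real demo.py may differ.
DEFAULT_TEXT = "I really liked the movie"


def parse_args(argv=None):
    """Parse the two documented flags (hypothetical re-creation of the CLI)."""
    parser = argparse.ArgumentParser(description="Sentiment demo (sketch)")
    parser.add_argument("--text", type=str, default=DEFAULT_TEXT,
                        help="custom review text to classify")
    parser.add_argument("--cpu", action="store_true",
                        help="force CPU even if a GPU is available")
    return parser.parse_args(argv)


def pick_device(force_cpu, cuda_available):
    """Return the device name to train on.

    In real code, cuda_available would typically come from
    torch.cuda.is_available(); it is a plain argument here so the
    sketch runs without PyTorch installed.
    """
    return "cpu" if force_cpu or not cuda_available else "cuda"


args = parse_args(["--text", "I loved the movie", "--cpu"])
print(args.text, pick_device(args.cpu, cuda_available=True))
```

Passing the argument list explicitly, as in the last two lines, makes the parser easy to exercise without touching the real command line.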

This script will:

Download and prepare the IMDb dataset.
Train a Logistic Regression model on TF-IDF representations.
Perform inference on a provided sample review (default: "I really liked the movie").
Train a Fully Connected Network (FCN) using pretrained Word2Vec embeddings.
Perform inference again using the FCN model.

Additional information, including logs and the files required to reuse the trained models, is saved in the ./results directory.
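
For orientation, the first two steps (TF-IDF features, Logistic Regression, then inference on one review) can be sketched with scikit-learn as below. This is an illustrative sketch under stated assumptions, not the code in demo.py: the eight toy reviews stand in for the IMDb training split, and the helper names build_classifier and classify, as well as the hyperparameters, are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the IMDb training split (1 = positive, 0 = negative).
train_texts = [
    "I loved this movie",
    "great acting and a wonderful story",
    "an excellent, enjoyable film",
    "I really liked it",
    "terrible plot and boring acting",
    "I hated this film",
    "awful, a waste of time",
    "bad movie, really boring",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_classifier():
    """TF-IDF features followed by Logistic Regression (assumed settings)."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True),
        LogisticRegression(max_iter=1000),
    )


def classify(model, text):
    """Return (label, probability of the positive class) for one review."""
    proba = model.predict_proba([text])[0]  # columns follow classes_: [0, 1]
    label = "positive" if proba[1] >= 0.5 else "negative"
    return label, float(proba[1])


model = build_classifier().fit(train_texts, train_labels)
label, p_pos = classify(model, "I really liked the movie")
print(label, round(p_pos, 3))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the fitted vocabulary and the model together, so the same object can be saved and reused for inference.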

Introduction:

This project focuses on sentiment analysis of movie reviews using the IMDb Large Movie Review Dataset, which consists of 50,000 reviews evenly split between positive and negative sentiments. We investigate a wide range of machine learning and deep learning approaches, including classical models (Logistic Regression, Random Forest with Bag-of-Words and TF-IDF), Fully Connected Networks and LSTMs with various word embeddings, and Transformer-based architectures such as DistilBERT and RoBERTa. Models were evaluated using accuracy, F1-score, precision, and recall, offering a comprehensive comparison across embedding strategies, fine-tuning techniques, and model complexities for sentiment classification.
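
As a reminder of how the four reported metrics relate to one another, here is a small worked example for the binary case. It is a sketch only: the function name binary_metrics is hypothetical, the labels are made up, and treating "positive" as the target class (rather than, say, macro-averaging over both classes) is an assumption, since this README does not state which averaging was used.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 2 true positives, 1 false negative,
# 1 false positive, 2 true negatives.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Because the IMDb labels are balanced, accuracy and F1 tend to track each other closely; on an imbalanced dataset they can diverge substantially.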

Environment

Prepare the virtual environment:

You can use the provided requirements.txt file:

pip install -r requirements.txt

Otherwise, you will need the following packages:

# Core Python Libraries
numpy
pandas
scikit-learn
argparse

# Deep Learning
torch
torchvision
torchaudio

# Transformers (Hugging Face)
transformers
datasets

# Tokenization Utils
sentencepiece

# Plotting and Visualization
matplotlib
seaborn
wordcloud

# Word2Vec
gensim

# Text Processing
nltk
joblib
symspellpy

# Training Utilities
tqdm

Dataset

We use the Large Movie Review Dataset (IMDb), which contains 100,000 movie reviews, including 50,000 labeled examples and 50,000 unlabeled examples. The labeled portion is evenly split into a training set and a test set, each with 25,000 reviews (12,500 positive and 12,500 negative). The unlabeled reviews, typically used for clustering or zero-shot tasks, are excluded from all experiments in this project. Since no predefined validation split is provided, we reserve 20% of the training data for validation. This results in 20,000 training samples (10,000 positive, 10,000 negative) and 5,000 validation samples (2,500 positive, 2,500 negative).
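
The stated counts (10,000/10,000 for training and 2,500/2,500 for validation) imply a class-balanced 80/20 split. One way to obtain such a split is sketched below in plain Python. The function name stratified_split, the seed, and the shuffling procedure are assumptions for illustration; prepare_dataset.py may implement the split differently.

```python
import random


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), preserving the class ratio in both parts."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Stand-in labels with the same balance as the 25,000-review training set.
labels = [1] * 12500 + [0] * 12500
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 20000 5000
```

Splitting per class, rather than shuffling the whole set once, guarantees the exact 50/50 balance in both parts instead of leaving it to chance.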

To download and prepare the dataset, run the following script:

python prepare_dataset.py

Model Training

The train.py script is the central entry point for model training. It automatically selects the appropriate trainer script based on the specified training strategy.

Usage:

python train.py <train_strategy> [additional arguments]

Supported Training Strategies:

Strategy	Description	Trainer Script
`ml_bow`	Train classical ML models on Bag-of-Words	`trainers/ml_bagofwords.py`
`ml_wv`	Train classical ML models on word embeddings	`trainers/ml_wordvectors.py`
`fcn_bow`	Train Fully Connected Networks on BoW	`trainers/fcn_bagofwords.py`
`fcn_wv`	Train Fully Connected Networks on word vectors	`trainers/fcn_wordvectors.py`
`lstm_wv`	Train LSTM models using word vectors	`trainers/lstm_wordvectors.py`
`transformer_wv`	Train custom Transformer models	`trainers/transformer_wordvectors.py`
`llm`	Fine-tune pretrained large language models (DistilBERT, RoBERTa)	`trainers/llm.py`

If an invalid strategy is provided, the script will exit with an error.
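
The dispatch behaviour described above can be sketched as a simple lookup table. This is a hypothetical illustration, not the contents of train.py: the strategy names and script paths are copied from the table, but the function names resolve_trainer and build_command, the error message, and the choice to launch the trainer as a separate Python process are all assumptions.

```python
import sys

# Strategy -> trainer script, copied from the table in this README.
TRAINERS = {
    "ml_bow": "trainers/ml_bagofwords.py",
    "ml_wv": "trainers/ml_wordvectors.py",
    "fcn_bow": "trainers/fcn_bagofwords.py",
    "fcn_wv": "trainers/fcn_wordvectors.py",
    "lstm_wv": "trainers/lstm_wordvectors.py",
    "transformer_wv": "trainers/transformer_wordvectors.py",
    "llm": "trainers/llm.py",
}


def resolve_trainer(strategy):
    """Return the trainer script for a strategy, or exit with an error."""
    try:
        return TRAINERS[strategy]
    except KeyError:
        valid = ", ".join(sorted(TRAINERS))
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Unknown train_strategy '{strategy}'. Choose one of: {valid}")


def build_command(strategy, extra_args):
    """Command that would launch the trainer with the remaining arguments."""
    return [sys.executable, resolve_trainer(strategy), *extra_args]


print(build_command("fcn_wv", [])[1])  # trainers/fcn_wordvectors.py
```

The resulting list could be handed to subprocess.run; whether the real script does that or imports the trainer module directly is not stated here.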

Each train_strategy has its own set of required arguments, which are explained in the next part.

Repository: PouriaDan/SentimentAnalysis_IMDBMovieReviews

Folders and files

Latest commit

History

Repository files navigation

Sentiment Classification of Movie Reviews

⚠️ Important Notice

Demo Instructions

Table of Contents

Introduction:

Environment

Dataset

Model Training

Usage:

Supported Training Strategies:

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages