This project demonstrates a sentiment analysis pipeline built from several natural language processing (NLP) techniques and machine learning models. The primary objective is to classify movie reviews as either Positive or Negative using the following text representation techniques:
- Bag of Words (BoW)
- TF-IDF
- N-gram
- Word2Vec
- FastText
Each model is trained to predict sentiment, and the final output is the majority vote across the models' predictions.
The dataset used in this project is the IMDB Movie Reviews Dataset, consisting of 50,000 movie reviews labeled as positive or negative.
You can download the dataset from Kaggle:
Place the dataset in your working directory or Google Drive to use in Colab.
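Once the file is in place, loading it might look like the minimal sketch below. It is not taken from `nlp.py`: the helper name `load_reviews` is hypothetical, and the column names `review` and `sentiment` are assumptions about the CSV header that you should check against your copy of the file.

```python
import csv

def load_reviews(path="IMDB Dataset.csv"):
    """Read (review text, label) pairs from the dataset CSV.

    Assumes the header row has 'review' and 'sentiment' columns.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [(row["review"], row["sentiment"]) for row in reader]
```

In Colab, pass the full path to wherever you placed the file instead of relying on the default.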
- Bag of Words (BoW): Converts text to feature vectors using word counts.
- TF-IDF: Text vectorization based on term frequency and inverse document frequency.
- N-gram: Uses word pairs (bigrams) as features to capture context.
- Word2Vec: Embedding model that learns vector representations of words based on their usage context.
- FastText: Similar to Word2Vec but captures sub-word information, making it useful for morphologically rich languages.
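To make the count-based features above concrete, here is a small pure-Python sketch. It is illustrative only and not the project's implementation: the function names are hypothetical, the TF-IDF formula is the plain unsmoothed variant (libraries usually add smoothing, so their values differ), and `char_ngrams` only shows the kind of sub-word units FastText works with, not how embeddings are trained.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)

def bigrams(tokens):
    """Return adjacent word pairs, e.g. 'not great' -> ('not', 'great')."""
    return list(zip(tokens, tokens[1:]))

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / len(doc) and idf = log(N / df), without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Toy corpus of already-tokenized reviews (made up for illustration).
docs = [["great", "movie"], ["not", "great", "movie"], ["boring", "plot"]]
bow = bag_of_words(docs[1])
pairs = bigrams(docs[1])
weights = tf_idf(docs)
subwords = char_ngrams("where")
```

In this toy corpus, "boring" appears in only one document, so it receives a higher TF-IDF weight than "great", which appears in two. The bigram `("not", "great")` keeps the negation that single-word counts lose.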
- The dataset is preprocessed by tokenizing, lowercasing, and removing stop words.
- Each model is trained on the preprocessed data.
- The input text is passed through each model to predict sentiment (positive or negative).
- The final sentiment is determined by majority voting from the models' predictions.
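The preprocessing and voting steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `nlp.py`: the stop-word set is a tiny made-up sample, the tokenizer is a simple regular expression, and the per-model predictions are hard-coded placeholders rather than real model outputs.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and", "to"}

def preprocess(text):
    """Lowercase the text, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

tokens = preprocess("This movie was a GREAT surprise!")

# Hypothetical per-model outputs for a single review.
votes = {
    "bow": "positive",
    "tfidf": "positive",
    "ngram": "negative",
    "word2vec": "positive",
    "fasttext": "negative",
}
final = majority_vote(votes.values())
```

With five models and two labels, a tie cannot occur, so the vote always yields a single label.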
├── nlp.py # Main script for training models and running predictions
├── IMDB Dataset.csv # Dataset (ensure it's downloaded and available)
└── README.md # This file