Movie Recommendation System

This project is a content-based movie recommendation system that suggests movies similar to a given movie. It uses movie metadata (genres, keywords, cast, crew, and overview) to find titles with similar content. The system is built with Python, Pandas, and Scikit-learn, and uses Streamlit to provide an interactive web interface.
- ✅ Data Preprocessing: Handles missing values and extracts relevant features like genres, keywords, and cast.
- ✅ Exploratory Data Analysis (EDA): Visualizes key patterns and insights in the movie dataset.
- ✅ Feature Engineering: Combines overview, genres, keywords, cast, and director into a single tags text field that can be vectorized.
- ✅ Machine Learning Model: Implements cosine similarity between TF-IDF vectors for movie recommendations.
- ✅ Web App with Streamlit: Allows users to input a movie title and get recommendations based on similarity.
MovieRecommendationSystem/
│── dataset/
│ ├── tmdb_5000_movies.csv # Movie metadata
│ ├── tmdb_5000_credits.csv # Cast and crew data
│── app.py # Streamlit app for prediction
│── movie_recommendation.py # Logic for movie recommendation
│── movies.pkl # Serialized movie data (used for prediction)
│── movies_dict.pkl # Serialized movie metadata dictionary
│── requirements.txt # Required Python libraries
│── README.md # Project documentation
- ✅ dataset/: Contains movie metadata and cast information.
- ✅ app.py: Streamlit app for generating movie recommendations.
- ✅ movie_recommendation.py: Core recommendation logic.
- ✅ movies.pkl: Serialized movie data for quick access.
- ✅ movies_dict.pkl: Serialized movie metadata.
- ✅ requirements.txt: List of dependencies for the project.
- ✅ README.md: Project documentation.
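As an illustration of how a pickle file such as movies_dict.pkl could be written and read back, the sketch below round-trips a tiny stand-in record set with Python's `pickle` module. The sample titles, the temporary directory, and the dict-of-columns layout are assumptions made for illustration; the actual contents of the two pickle files are not specified here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the processed movie table (column -> values).
movies_dict = {
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "tags": ["space adventure", "romantic comedy"],
}

# Write to a temporary directory so the real project files are not touched.
out_path = Path(tempfile.mkdtemp()) / "movies_dict.pkl"
with open(out_path, "wb") as fh:
    pickle.dump(movies_dict, fh)

# The app side would read the file back before serving recommendations.
with open(out_path, "rb") as fh:
    loaded = pickle.load(fh)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.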
The dataset used includes two CSV files from Kaggle:
- movies.csv (stored as dataset/tmdb_5000_movies.csv): Basic movie info
- credits.csv (stored as dataset/tmdb_5000_credits.csv): Cast and crew details
The dataset contains:
- movieId (Unique ID)
- title (Movie title)
- genres (Movie genres)
- keywords (Related keywords)
- cast (Movie cast)
- crew (Director, producers, etc.)
- overview (Movie description)
🔹 Handling Missing Data
- Keywords: Filled with empty strings where missing.
- Overview: Left untouched, as missing values indicate that no description is available.
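A minimal sketch of this missing-value handling with pandas, using a made-up three-row sample rather than the real CSV files (the column names follow the dataset description above; the rows are invented):

```python
import pandas as pd

# Tiny made-up sample; the real data comes from the Kaggle CSV files.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "keywords": ["space, alien", None, "heist"],
    "overview": ["A crew travels to Mars.", None, "A team plans a robbery."],
})

# Keywords: replace missing values with empty strings.
movies["keywords"] = movies["keywords"].fillna("")

# Overview: intentionally left as-is; count the gaps for reference.
missing_overviews = int(movies["overview"].isna().sum())
```

Counting the untouched gaps is useful later, because text vectorizers in scikit-learn reject missing values and need them replaced or dropped first.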
🔹 Encoding Categorical Variables
- Genres: One-hot encoding for different genres.
- Keywords: Vectorized using TF-IDF for textual information.
- Cast and Crew: Extracted and cleaned for use in the recommendation model.
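To illustrate the extraction step, the sketch below assumes that the genres, cast, and crew columns hold JSON-encoded lists of objects with a `name` field (and a `job` field for crew), which is how the TMDB 5000 files are commonly distributed. The helper names `extract_names` and `extract_director` and the sample rows are hypothetical.

```python
import json

def extract_names(raw, limit=None):
    """Return the "name" of each entry in a JSON-encoded list of objects."""
    names = [entry["name"] for entry in json.loads(raw)]
    return names if limit is None else names[:limit]

def extract_director(raw_crew):
    """Return the first crew member whose job is "Director", or None."""
    for entry in json.loads(raw_crew):
        if entry.get("job") == "Director":
            return entry["name"]
    return None

# Made-up rows in the assumed column format.
genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
cast = ('[{"name": "Actor One"}, {"name": "Actor Two"}, '
        '{"name": "Actor Three"}, {"name": "Actor Four"}]')
crew = ('[{"job": "Producer", "name": "Pat Producer"}, '
        '{"job": "Director", "name": "Dana Director"}]')

genre_names = extract_names(genres)
top_cast = extract_names(cast, limit=3)  # keep only the first three cast members
director = extract_director(crew)
```

With pandas, helpers like these would typically be applied per column, for example `movies["genres"].apply(extract_names)`.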
Algorithm Used: Cosine Similarity with TF-IDF Vectorization
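from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# `movies` is the merged DataFrame built from the two CSV files.
# TfidfVectorizer rejects NaN, so treat a missing overview as an empty string here.
overviews = movies['overview'].fillna('')
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(overviews)
# cosine_sim[i, j] is the similarity between movie i and movie j.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)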
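Once a similarity matrix like `cosine_sim` exists, a lookup along the following lines could turn it into recommendations. This is a sketch under assumptions: the three sample movies are invented, and `recommend` is a hypothetical helper, not necessarily the function used in movie_recommendation.py.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up overviews standing in for the real dataset.
movies = pd.DataFrame({
    "title": ["Space One", "Space Two", "Heist Night"],
    "overview": [
        "astronauts explore a distant planet",
        "astronauts repair a station orbiting a planet",
        "thieves plan a bank robbery",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
cosine_sim = cosine_similarity(tfidf_matrix)

def recommend(title, top_n=5):
    """Return up to top_n titles ranked by similarity to `title`."""
    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]
    # Sort all movies by similarity, highest first, and drop the query itself.
    ranked = cosine_sim[idx].argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]
    return movies["title"].iloc[ranked].tolist()

similar = recommend("Space One", top_n=1)
```

Excluding the query index matters because every movie has the highest similarity with itself.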
- Preprocessing
  - Merge movies.csv and credits.csv
  - Clean and combine relevant features: title, genres, overview, keywords, cast, crew
  - Convert text data into lowercase and remove spaces/special characters
- Feature Engineering
  - Create a new feature called `tags` which combines:
    - Overview
    - Genres
    - Keywords
    - Cast (top 3)
    - Director (only one)
- Vectorization
  - Use `CountVectorizer` from scikit-learn to convert tags into numeric vectors
- Similarity Calculation
  - Compute similarity using cosine similarity
  - Recommend top N movies with the highest similarity score
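The four steps above can be sketched end to end as follows. The sample rows, the helper names (`collapse`, `build_tags`, `top_n`), and the `max_features` value are assumptions for illustration. Removing spaces inside multi-word names is one reading of the "remove spaces" step: it keeps a name such as "Science Fiction" as a single token.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up, already-extracted features; real values come from the merged CSVs.
movies = pd.DataFrame({
    "title": ["Space One", "Space Two", "Heist Night"],
    "overview": ["A crew explores Mars", "A crew repairs a station", "Thieves rob a bank"],
    "genres": [["Science Fiction"], ["Science Fiction"], ["Crime"]],
    "keywords": [["mars"], ["space station"], ["bank robbery"]],
    "cast": [["Actor One"], ["Actor One"], ["Actor Two"]],
    "director": ["Dana Director", "Dana Director", "Sam Helmer"],
})

def collapse(names):
    """Remove inner spaces so a multi-word name stays a single token."""
    return [name.replace(" ", "") for name in names]

def build_tags(row):
    """Join overview words and collapsed metadata into one lowercase string."""
    parts = row["overview"].split()
    parts += collapse(row["genres"]) + collapse(row["keywords"])
    parts += collapse(row["cast"]) + collapse([row["director"]])
    return " ".join(parts).lower()

movies["tags"] = movies.apply(build_tags, axis=1)

# max_features is an arbitrary cap chosen for this sketch.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
vectors = vectorizer.fit_transform(movies["tags"])
similarity = cosine_similarity(vectors)

def top_n(title, n=5):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = movies.index[movies["title"] == title][0]
    order = similarity[idx].argsort()[::-1]
    return [movies["title"].iloc[i] for i in order if i != idx][:n]

closest = top_n("Space One", n=1)
```

Note that this sketch uses raw term counts, as the Vectorization step describes, whereas the snippet under "Algorithm Used" applies TF-IDF to the overview alone; both feed the same cosine similarity computation.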
- The model returns the movies whose content is most similar to the selected title.
- Cosine similarity measures the angle between two movies' feature vectors, so movies that share more metadata (genres, keywords, etc.) score closer to 1.
Run Streamlit App
streamlit run app.py
Alternative: Run as a Python Module
python -m streamlit run app.py
Access in Browser
Once running, open the local URL that Streamlit prints in the terminal (by default http://localhost:8501) in your browser.
- ✅ Implement collaborative filtering to enhance recommendations with user ratings.
- ✅ Add more machine learning models like Random Forest or SVM for better predictions.
- ✅ Deploy the app using Heroku or AWS for public access.
Feel free to contribute by submitting a pull request or reporting issues!
📧 Email: ramnrngupta@gmail.com
📌 GitHub: ram-narayan-gupta-02