Cine Suggest - A Smart Movie Recommender
Cine Suggest is a content-based movie recommendation app built with Streamlit on top of the TMDB 5000 Movies dataset. It helps users discover similar movies by comparing overview, genre, cast, crew, and keywords using TF-IDF vectorization and cosine similarity.
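The idea of ranking movies by cosine similarity over TF-IDF vectors can be sketched without any dependencies. This is an illustration only, not the app's code: the project lists Scikit-learn in its stack, while the sketch below does the arithmetic by hand with a smoothed IDF. The `MOVIES` catalogue and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-catalogue: title -> combined text (overview, genres, cast, ...).
MOVIES = {
    "Space Quest": "astronaut crew explores distant planet sciencefiction",
    "Star Voyage": "astronaut crew travels distant galaxy sciencefiction",
    "Love in Paris": "couple falls in love paris romance",
}


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: tf-idf weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(title, movies=MOVIES, top_n=2):
    """Return the top_n (title, score) pairs most similar to `title`."""
    titles = list(movies)
    vectors = dict(zip(titles, tfidf_vectors(list(movies.values()))))
    scores = [
        (other, cosine(vectors[title], vectors[other]))
        for other in titles
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(recommend("Space Quest"))
```

In this toy catalogue "Star Voyage" ranks first for "Space Quest" because the two share several terms, while "Love in Paris" shares none and scores zero.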
Features
- Search by movie title
- Content-based recommendations (overview, genres, cast, director)
- Cosine similarity with TF-IDF vectorization
- Posters fetched live from TMDB API
- Mobile-friendly UI with fuzzy search fallback
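The README does not say how the fuzzy search fallback is implemented. One possible approach, shown here as a sketch under that assumption, is to try an exact case-insensitive match first and fall back to the standard library's `difflib.get_close_matches`. The `TITLES` list and `find_title` helper are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical list of titles; the real app would take these from the dataset.
TITLES = ["Avatar", "The Dark Knight", "Inception", "Interstellar"]


def find_title(query, titles=TITLES, cutoff=0.6):
    """Return the best-matching title, or None if nothing is close enough."""
    lookup = {title.lower(): title for title in titles}
    normalized = query.strip().lower()
    # Exact (case-insensitive) match first.
    if normalized in lookup:
        return lookup[normalized]
    # Fuzzy fallback: closest title whose similarity ratio is at least `cutoff`.
    matches = get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(find_title("Intersteller"))
```

A misspelled query such as "Intersteller" still resolves to "Interstellar", while a query with no close match returns `None` so the UI can show a "not found" message.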
- Balanced Size + Richness: ~5000 movies, with overview, genres, keywords, release dates, and popularity, which is rich enough for a recommendation engine.
- Modular Structure: Split into two cleanly organized files, movies.csv and credits.csv, making merging easy via the shared movie ID (id in movies.csv, movie_id in credits.csv).
- Complete Metadata:
  - From movies.csv: title, overview, genres, keywords
  - From credits.csv: movie_id, plus cast, director, writer, etc., extracted from the JSON-formatted fields
- Realistic for ML/NLP tasks: Overview and genre fields are perfect for content-based recommendations.
- IMDb Top 1000 movies database (not chosen): 1,000 movies is a very small sample, given that the full IMDb database holds orders of magnitude more titles.
- IMDb official database (not chosen): A huge database (11,803,648 rows), which is unnecessary overhead for a project like this. It also lacks details such as overview and plot, and requires additional datasets for cast and crew information.
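The JSON-formatted fields mentioned in the list above store each cast or crew entry as an object inside a JSON array. A minimal sketch of pulling names out of them with the standard `json` module follows. The sample strings are trimmed to the keys the sketch uses (`name`, `job`), and the helper names are hypothetical.

```python
import json

# Sample field values, trimmed to the keys used below.
CAST = '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, {"name": "Sigourney Weaver"}]'
CREW = '[{"job": "Editor", "name": "Stephen E. Rivkin"}, {"job": "Director", "name": "James Cameron"}]'


def names(field, limit=None):
    """Parse a JSON array of objects and return their "name" values."""
    result = [item["name"] for item in json.loads(field)]
    return result if limit is None else result[:limit]


def director(crew_field):
    """Return the name of the first crew member whose job is "Director"."""
    for member in json.loads(crew_field):
        if member.get("job") == "Director":
            return member["name"]
    return None


print(names(CAST, limit=2))
print(director(CREW))
```

Limiting the cast to the first few names is a common choice for keeping the combined text focused on lead actors. Whether this project does so is not stated.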
- Lowercased & No Spaces: Fields like genres, crew, and cast are converted to lowercase, and multi-word values are joined by underscores so that each name becomes a single token. This prevents partial token overlap during vectorization and increases the cosine distance between the vectors of unrelated movies.
  Example: "Neal Caffery" and "Neal Frankenstine" would both contain the word "Neal", misleading the model into finding them similar.
- Result: Cleaned, deduplicated token space -> improved cosine distance between distinct vectors.
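The effect of joining names with underscores can be seen in a few lines. This sketch uses the example names from above; the `to_token` helper is hypothetical and only illustrates the idea.

```python
def to_token(name):
    """Lowercase a multi-word name and join its words with underscores."""
    return "_".join(name.lower().split())


a = "Neal Caffery"
b = "Neal Frankenstine"

# Naive whitespace tokenization: the two names share the token "neal".
naive_overlap = set(a.lower().split()) & set(b.lower().split())

# Joined tokens: each name is a single token, so nothing is shared.
joined_overlap = {to_token(a)} & {to_token(b)}

print(to_token(a), to_token(b))
print(naive_overlap, joined_overlap)
```

Because an underscore counts as a word character in regular expressions, a tokenizer that splits on word boundaries keeps `neal_caffery` as one token.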
- Frontend: Streamlit
- Backend: Python, Pandas, Scikit-learn, Requests
- Data: TMDB 5000 Movies Dataset via Kaggle
- API: TMDB API for fetching live posters
- Deployment: Streamlit Community Cloud
- Clone the repo
      git clone https://github.com/yourusername/cinesuggest.git
      cd cinesuggest
- Install dependencies
      pip install -r requirements.txt
- Set your TMDB API key
  Create a .streamlit/secrets.toml file:
      TMDB_API_KEY = "your_api_key_here"
- Run the app
      streamlit run app.py
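The TMDB_API_KEY configured in the steps above is what a poster lookup would need. The sketch below shows one way such a lookup could work. It is not the app's code, and the function names are hypothetical. It uses the standard library's `urllib` to stay dependency-free, whereas the project lists Requests. The movie-details endpoint and the `poster_path` field are part of TMDB's v3 API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.themoviedb.org/3"
IMAGE_BASE = "https://image.tmdb.org/t/p/w500"  # w500 is one of TMDB's poster widths


def movie_details_url(movie_id, api_key):
    """Build the TMDB v3 movie-details URL for a given movie id."""
    query = urlencode({"api_key": api_key})
    return f"{API_BASE}/movie/{movie_id}?{query}"


def poster_url(details):
    """Turn a movie-details response into a full poster URL, or None."""
    path = details.get("poster_path")
    return f"{IMAGE_BASE}{path}" if path else None


def fetch_poster(movie_id, api_key):
    """Fetch movie details over the network and return the poster URL."""
    with urlopen(movie_details_url(movie_id, api_key), timeout=10) as response:
        details = json.load(response)
    return poster_url(details)
```

Inside a Streamlit app the key can be read with `st.secrets["TMDB_API_KEY"]` and the resulting URL passed to `st.image`. Returning `None` for a missing `poster_path` lets the UI fall back to a placeholder image.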
This project is licensed under the GNU AGPL v3.0. You are free to use, modify, and distribute this software, but any derivative work must also be open-sourced under the same license, even if it's hosted as a web service.
https://cinesuggest-soumyadghosh.streamlit.app/
Built by Soumyadeep Ghosh as part of a content-based recommendation exploration project. TMDB data © TMDB and respective contributors.