This document describes a movie recommendation system designed to help users discover movies based on their preferences.
https://movie-recommend.azurewebsites.net
- Flask
- Azure
The system builds its dataset from the IMDb movie database (https://www.imdb.com/) in two phases:
- Data source: GroupLens: https://grouplens.org/datasets/
- Files used:
- movies.csv (movieId, title, genres)
- links.csv (movieId, imdbId, tmdbId)
- ratings.csv (userId, movieId, rating)
- Preprocessing:
- Removed unused files (genome-*.csv, tags.csv)
- Filtered ratings.csv:
- Removed users with fewer than 100 ratings.
- Removed movies with fewer than 100 ratings.
- Dropped userId column.
- Calculated average rating per movie.
- Fetches additional movie attributes using imdbId:
- plot
- cast
- crew
- director
- countries
- languages
- production companies
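The ratings preprocessing listed above (filter users, filter movies, drop userId, average per movie) can be illustrated with a small, dependency-free sketch. This is not the project's actual code: the function name `filter_and_average` and its parameters are hypothetical, the ratings are passed as in-memory tuples rather than read from ratings.csv, and counting movie ratings after the user filter has been applied is an assumption about the order of the steps.

```python
from collections import Counter, defaultdict


def filter_and_average(ratings, min_user_ratings=100, min_movie_ratings=100):
    """Return {movieId: average rating} after the two filtering steps.

    `ratings` is an iterable of (userId, movieId, rating) tuples,
    mirroring the columns of ratings.csv. Hypothetical helper.
    """
    ratings = list(ratings)

    # Step 1: keep only users with at least `min_user_ratings` ratings.
    per_user = Counter(user for user, _, _ in ratings)
    ratings = [r for r in ratings if per_user[r[0]] >= min_user_ratings]

    # Step 2: keep only movies with at least `min_movie_ratings` ratings.
    # Assumption: counts are taken after the user filter above.
    per_movie = Counter(movie for _, movie, _ in ratings)
    ratings = [r for r in ratings if per_movie[r[1]] >= min_movie_ratings]

    # Step 3: drop userId and average the remaining ratings per movie.
    by_movie = defaultdict(list)
    for _, movie, rating in ratings:
        by_movie[movie].append(rating)
    return {movie: sum(vals) / len(vals) for movie, vals in by_movie.items()}
```

The thresholds default to 100 to match the description, but they are parameters so the logic can be tried on a handful of rows, for example `filter_and_average(rows, 2, 2)`.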
The system follows a sequential pipeline:
- Downloads data from GitHub.
- Unzips and saves data to DataIngestionArtiacts folder.
- Cleans and prepares data.
- Generates two CSV files:
- movieId, title, genres, imdbId (saved in DataPreprocessingArtiacts)
- imdbId (used for further data collection)
- Fetches additional attributes using Cinemagoer library and imdbId.
- Retrieves data from GitHub by default (faster).
- Option to fetch live data by setting COLLECTION_FLAG to True (slower).
- Saves data to DataCollectionArtiacts folder.
- Combines data from both phases.
- Transforms data for model development.
- Saves data to DataTransformationArtiacts folder.
- Calculates movie similarities using CountVectorizer and cosine similarity.
- Uses title and imdbId for recommendations.
- Saves model artifacts to ModelDevelopmentArtiacts folder (including a Data folder).
- Flask framework is used to create a user interface.
- Users can select a movie and receive recommendations.
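The similarity step in the pipeline above can be illustrated with a minimal sketch that uses only the standard library. It reproduces the idea behind CountVectorizer plus cosine similarity (token counts per movie, then the cosine of the angle between count vectors) rather than calling scikit-learn, so it is not the project's implementation. The names `count_vector`, `cosine`, and `recommend`, the tag strings, and the placeholder titles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Bag-of-words counts: lowercase, tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(title, tags_by_title, top_n=5):
    """Return the `top_n` titles most similar to `title` (hypothetical helper)."""
    vectors = {t: count_vector(tags) for t, tags in tags_by_title.items()}
    target = vectors[title]
    scores = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != title
    ]
    # Highest similarity first; ties broken alphabetically by title.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [other for _, other in scores[:top_n]]


# Placeholder data: each movie is described by a combined string of its attributes.
tags = {
    "Movie A": "space crew alien rescue",
    "Movie B": "space alien invasion",
    "Movie C": "romance paris wedding",
}
print(recommend("Movie A", tags, top_n=1))
```

For a real catalogue, computing the full similarity matrix once and saving it as a model artifact, as the pipeline describes, avoids recomputing the vectors on every request.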
To execute this project and run the pipeline, install all the requirements and run the following command:
python main.py
Pipeline execution is recorded in the logs folder, which is created during the run. All component outputs are saved inside the artifacts folder. A Data folder is also created; it is used by the web application.
Make sure you have run main.py first. To start a demo of the project, run the following command:
python app.py
This project is deployed on Azure with continuous deployment.
https://movie-recommend.azurewebsites.net/
- Ravi Kumar