
- Problem Statement
- Overview
- Working
- Features
- Setup
- Folder Structure
- Challenges & Solutions
- Impact
- Future Improvements
- License
- With the rise of streaming services, viewers now have access to thousands of movies across platforms.
- As a result, many viewers spend more time browsing than actually watching.
- This can lead to frustration, lower satisfaction and less time spent on the platform, impacting both the user experience and business performance.
- A production-ready content-based movie recommender system, built with clean coding practices, modular design and proper version control, and deployed as a web application.
- It analyzes metadata of 5000+ movies such as genres, cast, crew, keywords and overview to recommend top 5 similar movies based on a user-selected movie.
- The system uses techniques like `CountVectorizer` for text vectorization and `cosine_similarity` to find similarity between movies.
- The project focuses not only on functionality but also on building a clean, scalable and production-ready solution, applying industry-standard practices.
- The dataset contains metadata for each movie including keywords, genres, cast, crew and overview.

- All these features are combined into a new column called `tags` to create a unified representation of each movie (a small sketch follows).
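A minimal sketch of this combination step, assuming each metadata column has already been parsed into a list of strings (the exact column handling in the project's notebooks may differ):

```python
# Illustrative only: assumes genres, keywords, cast and crew are
# lists of strings (e.g., ['action', 'thriller']) and overview is raw text.
movies['tags'] = (movies['overview'].apply(lambda x: x.split())
                  + movies['genres'] + movies['keywords']
                  + movies['cast'] + movies['crew'])

# Join the combined list back into a single text field per movie
movies['tags'] = movies['tags'].apply(lambda x: ' '.join(x))
```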

- Text preprocessing is applied to the `tags` column (a minimal sketch follows this list):
  - All text is converted to lowercase (e.g., `"Action, Thriller"` becomes `"action, thriller"`).
  - Spaces between words are removed (e.g., `"action movie"` becomes `"actionmovie"`).
  - Stemming is performed using `PorterStemmer` to reduce words to their root form.
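A minimal sketch of the lowercasing and stemming steps, assuming NLTK's `PorterStemmer`; space removal is typically applied to multi-word names (such as cast members) before the columns are combined:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def preprocess(text):
    # Lowercase the text, then stem every word to its root form
    return ' '.join(ps.stem(word) for word in text.lower().split())

movies['tags'] = movies['tags'].apply(preprocess)
```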

- `CountVectorizer` is used to convert the `tags` column into numerical feature vectors.
- Cosine similarity is used to calculate similarity between the vector representations of all the movies.
- The resulting similarity matrix is serialized and saved as a `.pkl` file for efficient loading during recommendation.
- A Streamlit web application provides an interactive interface for movie selection and recommendation (see the sketch after this list):
  - The user selects a movie from the dropdown list.
  - The system recommends the top 5 most similar movies based on the similarity score.
  - Movie posters are fetched using the TMDB API to enhance the visual appeal of the recommendations.
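A minimal sketch of this pipeline with scikit-learn; the `max_features` and `stop_words` settings and the `recommend` helper are illustrative assumptions, not necessarily the project's exact values:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert the 'tags' text into numerical count vectors
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags']).toarray()

# Compute pairwise cosine similarity between all movie vectors
similarity = cosine_similarity(vectors)

# Serialize the matrix for fast loading in the Streamlit app
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

def recommend(title):
    # Locate the selected movie, then return the five most
    # similar titles (index 0 is the movie itself, so skip it)
    idx = movies[movies['title'] == title].index[0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies.iloc[i]['title'] for i, _ in scores[1:6]]
```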
- The project follows a modular approach by organizing reusable code into a dedicated `utils/` directory.
- Each module in the `utils/` directory is responsible for a specific task and includes:
  - Clear docstrings explaining functionality, expected inputs/outputs, return values and raised exceptions.
  - Robust exception handling for better error tracing and debugging.
- Following the DRY (Don't Repeat Yourself) principle, this design:
  - Reuses functions across notebooks and scripts without rewriting code.
  - Saves development time and reduces redundancy.
- The `utils/` directory also includes an `__init__.py` file, which serves an important purpose in Python:
  - The `__init__.py` file tells Python to treat the directory as a package, not just a regular folder.
  - Without it, Python won't recognize the folder as a package.
- To access these utility modules anywhere in the project, add the following snippet at the top of your script:

```python
import sys, os
sys.path.append(os.path.abspath("../utils"))
```
- This is one of the functions I added to my project, in the `export_data.py` module inside the `utils/` directory.
```python
import os
import pandas as pd

def export_as_csv(dataframe, folder_name, file_name):
    """
    Exports a pandas DataFrame as a CSV file to a specified folder.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame to export.
        folder_name (str): Name of the folder where the CSV file will be saved.
        file_name (str): Name of the CSV file. Must end with '.csv' extension.

    Returns:
        None

    Raises:
        TypeError: If input is not a pandas DataFrame.
        ValueError: If file_name does not end with '.csv' extension.
        FileNotFoundError: If folder does not exist.
    """
    try:
        if not isinstance(dataframe, pd.DataFrame):
            raise TypeError("Input must be a pandas DataFrame.")
        if not file_name.lower().endswith('.csv'):
            raise ValueError("File name must end with '.csv' extension.")
        current_dir = os.getcwd()
        parent_dir = os.path.dirname(current_dir)
        folder_path = os.path.join(parent_dir, folder_name)
        file_path = os.path.join(folder_path, file_name)
        if not os.path.isdir(folder_path):
            raise FileNotFoundError(f"Folder '{folder_name}' does not exist.")
        dataframe.to_csv(file_path, index=False)
        print(f"Successfully exported the DataFrame as '{file_name}'")
    except (TypeError, ValueError, FileNotFoundError) as e:
        print(e)
```
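For example, when called from a notebook inside `notebooks/`, the following saves the file into the sibling `clean_data/` folder, since the function resolves paths relative to the parent of the current working directory (the DataFrame name here is illustrative):

```python
export_as_csv(movies, 'clean_data', 'movies.csv')
```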
- Instead of hardcoding file paths, the project uses Python's built-in `os` module to handle file paths dynamically.
- This improves code flexibility, ensuring the code runs smoothly across different systems and environments:
  - Automatically adapts to the system's directory structure.
  - Prevents `FileNotFoundError` caused by rigid, hardcoded paths.
  - Makes deployment and collaboration easier without manual path updates.
```python
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
folder_path = os.path.join(parent_dir, folder_name)
file_path = os.path.join(folder_path, file_name)
```
- Integrated `nbstripout` with Git to automatically remove Jupyter notebook outputs before committing.
- It helps maintain a clean and readable commit history by:
  - Avoiding large, unreadable diffs caused by cell outputs.
  - Keeping only code and markdown content under version control.
- This is especially useful when pushing to remote repositories, as it reduces clutter and improves code readability.
- Install it in your virtual environment using `pip`:

```bash
pip install nbstripout
```

- Then run the following, which sets up a Git filter to strip notebook outputs automatically on commit:

```bash
nbstripout --install
```

- Commit a notebook and observe that the outputs are removed from the committed version.
- The project uses Streamlit's `st.secrets` feature to handle the TMDB API key securely during local development.
- A `secrets.toml` file is placed inside the `.streamlit/` directory, storing the API key in the following format:

```toml
[tmdb]
api_key = "your_api_key_here"
```

- The API key is accessed in code using:

```python
api_key = st.secrets["tmdb"]["api_key"]
```

> [!CAUTION]
> The `secrets.toml` file should not be pushed to a public repository to avoid exposing sensitive credentials.
> You can add it to `.gitignore` to ensure it's excluded from version control.
> When deploying to Streamlit Community Cloud, the API key must be added via the GUI, not through the `secrets.toml` file.
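For illustration, a minimal sketch of how the key can then be used to fetch a movie poster; the `fetch_poster` name and `w500` image size are assumptions, while the `/movie/{id}` endpoint and `poster_path` field come from TMDB's v3 API:

```python
import requests
import streamlit as st

def fetch_poster(movie_id):
    # Read the TMDB API key from .streamlit/secrets.toml
    api_key = st.secrets["tmdb"]["api_key"]
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
    data = requests.get(url).json()
    # 'poster_path' is relative; prepend TMDB's image base URL
    return "https://image.tmdb.org/t/p/w500" + data["poster_path"]
```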

- In the project, a similarity matrix is computed to recommend movies.
- Due to its high dimensionality, the matrix file becomes very large and exceeds GitHub's size limits.
- GitHub rejects files larger than 100 MB, making it unsuitable for storing large files.
- While Git LFS (Large File Storage) is one option, it can be complex to configure and manage.
- To address this issue, the matrix file is:
  - Uploaded to Google Drive.
  - Downloaded at runtime using the `gdown` library.
  - Stored locally on the Streamlit server while the app runs.
- This approach ensures:
  - Compatibility with GitHub without needing Git LFS.
  - A hassle-free experience when cloning the repository or running the app across environments.
```python
import os
import gdown
import pickle

# Step 1: Define the Google Drive file ID
file_id = 'your_file_id_here'

# Step 2: Set the desired file name for the downloaded file
output = 'similarity.pkl'

# Step 3: Construct the direct download URL from the file ID
url = f'https://drive.google.com/uc?id={file_id}'

# Step 4: Download the file from Google Drive using gdown,
# but only if it doesn't already exist locally
if not os.path.exists(output):
    gdown.download(url, output, quiet=False)

# Step 5: Open the downloaded file in binary read mode
# and load the similarity matrix using pickle
with open(output, 'rb') as f:
    similarity = pickle.load(f)
```
Follow these steps carefully to set up and run the project on your local machine:
First, you need to download the project from GitHub to your local system.

```bash
git clone https://github.com/TheMrityunjayPathak/movie-recommender-system.git
```
To avoid version conflicts and keep your project isolated, create a virtual environment.

On Windows:

```bash
python -m venv .venv
```

On macOS/Linux:

```bash
python3 -m venv .venv
```
After setting up the virtual environment, activate it to begin installing dependencies.

On Windows:

```bash
.venv\Scripts\activate
```

On macOS/Linux:

```bash
source .venv/bin/activate
```
Now, install all the required libraries inside your virtual environment using the `requirements.txt` file.

```bash
pip install -r requirements.txt
```
> [!TIP]
> It's a good idea to upgrade `pip` before installing dependencies to avoid compatibility issues.
>
> ```bash
> pip install --upgrade pip
> ```
> [!NOTE]
> The `.streamlit/` folder contains Streamlit configuration settings.
> It's not necessary to include it in your project unless required.
- The `config.toml` file contains configuration settings such as server settings, theme preferences, etc.

```toml
[theme]
base="dark"
primaryColor="#FD3A84"
backgroundColor="#020200"
```
- The `secrets.toml` file contains sensitive information like API keys, database credentials, etc. (note the lowercase `[tmdb]` section name, matching the `st.secrets["tmdb"]` lookup used in the code).

```toml
[tmdb]
api_key = "your_tmdb_api_key_here"
```
> [!IMPORTANT]
> Make sure not to commit your `secrets.toml` to GitHub or any public repository.
> You can add it to `.gitignore` to ensure it's excluded from version control.
After everything is set up, you can run the Streamlit application:

```bash
streamlit run app.py
```
Once you're done working, you can deactivate your virtual environment:

```bash
deactivate
```
```text
movie_recommender_system/
├── .streamlit/        # Streamlit Configuration Files
├── raw_data/          # Original Datasets
├── clean_data/        # Preprocessed and Cleaned Datasets
├── notebooks/         # Jupyter Notebooks for Preprocessing and Vectorization
├── images/            # Images used in Streamlit Application
├── utils/             # Modular Python Scripts
├── app.py             # Main Streamlit Application
├── requirements.txt   # List of required libraries for the Project
├── README.md          # Detailed documentation of the Project
├── LICENSE            # License specifying permissions and usage rights
└── .gitignore         # Files and folders excluded from Git tracking
```
| Challenge | Solution |
|---|---|
| Keeping Commits Clean | Used `nbstripout` to remove notebook outputs before committing. |
| Managing Large Files | Used Google Drive with `gdown` to load large files effectively. |
| Hiding Sensitive API Keys | Used `st.secrets` to securely store and access sensitive information. |
| Reusability and Scalability | Structured the project with modular code in the `utils/` package. |
| Dynamic File Paths | Used the `os` module for dynamic, platform-independent path handling. |
If this system were scaled and integrated with a streaming service, it could:
- Reduce the time users spend choosing what to watch.
- Increase user engagement, watch time and customer satisfaction.
- Help streaming platforms retain users by offering better personalized content.
- Currently, tags are generated equally from cast, crew, keywords, genres and overview.
- We can improve this by applying feature importance or weighting certain features.
- This can be done by repeating tokens from important columns so they carry more weight in the count vectors (a hypothetical sketch follows).
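A hypothetical sketch of this idea, assuming each column has already been flattened to a space-separated string; repeating a column's tokens before vectorization multiplies their counts, and hence their weight, in the `CountVectorizer` features:

```python
# Hypothetical weighting: repeat tokens from higher-priority columns
# so they contribute more to the count vectors.
movies['tags'] = (
    movies['overview'] + ' '
    + (movies['genres'] + ' ') * 3    # genres weighted 3x
    + (movies['keywords'] + ' ') * 2  # keywords weighted 2x
    + movies['cast'] + ' ' + movies['crew']
)
```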
- Add user-based data to provide more personalized recommendations.
- Collaborative filtering can suggest movies based on similar user behaviour.
- This will make the recommender system more user-centric.
- Fetch movie data from external sources to keep the movie database up-to-date.
- This would allow the system to recommend the latest releases and remove outdated movies automatically.
- Instead of just cosine similarity, we can experiment with other similarity measures and representations,
- like Jaccard similarity, TF-IDF or Word2Vec, to capture semantic meaning in the movie descriptions (see the sketch below).
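For example, swapping in TF-IDF is a small change with scikit-learn; a sketch under the same `tags` setup, with illustrative parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF down-weights common words that raw counts treat equally
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
vectors = tfidf.fit_transform(movies['tags'])
similarity = cosine_similarity(vectors)
```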
- Enhance the user experience by providing filters to choose movies based on genres, actors or directors.
This project is licensed under the MIT License. You are free to use and modify the code as needed.