This project explores the IMDb movie database using graph theory and machine learning techniques. We analyze connections between actors, movies, and genres to derive insights and predict movie ratings.
The project processes and analyzes five IMDb data files:
- `actor_movies.txt`: Actors and the movies they've appeared in
- `actress_movies.txt`: Actresses and the movies they've appeared in
- `director_movies.txt`: Directors and their filmographies
- `movie_genre.txt`: Movie genres
- `movie_rating.txt`: IMDb ratings of movies
- Merge actor and actress lists.
- Retain only those with ≥5 movies.
- Build `act2movie_dict` and `movie2act_dict` for lookup.
- Store results using pickling for future steps.
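The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `(name, [movies])` record shape, the `build_lookup_dicts` helper, and the sample data are assumptions, and the real IMDb files would need their own parser. The pickle round-trip is done in memory to keep the example self-contained.

```python
import pickle
from collections import defaultdict

MIN_MOVIES = 5  # threshold stated above


def build_lookup_dicts(actor_records, actress_records, min_movies=MIN_MOVIES):
    """Merge both cast lists, keep people with >= min_movies, build both maps.

    Each record is assumed to be a (name, [movie titles]) pair.
    """
    act2movie_dict = {}
    for name, movies in list(actor_records) + list(actress_records):
        unique_movies = set(movies)
        if len(unique_movies) >= min_movies:
            act2movie_dict[name] = unique_movies

    # Invert the mapping so each movie points at its retained cast.
    movie2act_dict = defaultdict(set)
    for name, movies in act2movie_dict.items():
        for movie in movies:
            movie2act_dict[movie].add(name)
    return act2movie_dict, dict(movie2act_dict)


# Hypothetical sample data.
actors = [("Actor A", ["m1", "m2", "m3", "m4", "m5"]), ("Actor B", ["m1", "m2"])]
actresses = [("Actress C", ["m1", "m3", "m5", "m6", "m7", "m8"])]
act2movie_dict, movie2act_dict = build_lookup_dicts(actors, actresses)

# Round-trip through pickle, as later steps would when reloading results.
blob = pickle.dumps((act2movie_dict, movie2act_dict), protocol=2)
restored_act2movie, restored_movie2act = pickle.loads(blob)
```

"Actor B" is dropped because only two movies are listed, so neither dictionary references that name afterwards.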
- Create a directed weighted graph:
  - Nodes: Actors/Actresses
  - Edge: Shared movie appearances
  - Weight: Normalized count of shared movies
- Built using Python and `igraph`.
- Network stats: 243,989 nodes, 57.8M edges.
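One plausible reading of the "normalized count" weight is that the edge from actor i to actor j carries the number of shared movies divided by the number of movies i appeared in, which would explain why the graph is directed. The sketch below assumes that normalization; the `build_actor_edges` helper and the sample data are hypothetical, and the result is a plain edge dictionary rather than an `igraph` object.

```python
from collections import defaultdict
from itertools import permutations


def build_actor_edges(act2movie_dict, movie2act_dict):
    """Return {(src, dst): weight} with weight = shared movies / movies of src.

    The normalization by the source actor's movie count is an assumption.
    """
    shared = defaultdict(int)
    for cast in movie2act_dict.values():
        # Every ordered pair of co-stars in a movie shares that movie.
        for src, dst in permutations(sorted(cast), 2):
            shared[(src, dst)] += 1
    return {
        (src, dst): count / float(len(act2movie_dict[src]))
        for (src, dst), count in shared.items()
    }


# Hypothetical sample data.
act2movie = {"A": {"m1", "m2"}, "B": {"m1", "m2", "m3", "m4"}, "C": {"m3"}}
movie2act = {"m1": {"A", "B"}, "m2": {"A", "B"}, "m3": {"B", "C"}, "m4": {"B"}}
actor_edges = build_actor_edges(act2movie, movie2act)
```

Under this assumption the weights are asymmetric: A shares all of its movies with B, while B shares only half of its movies with A.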
- Run PageRank on the actor network to rank influence.
- Compare top-ranked nodes with well-known celebrities.
- Observed a discrepancy: PageRank is biased toward prolific actors, who accumulate many co-star links even through lesser-known roles.
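To show what the ranking step computes, here is a small pure-Python power iteration for weighted PageRank. It is a sketch for illustration only: the damping factor of 0.85, the iteration count, and the `weighted_pagerank` helper are assumptions, and on the full network a library routine (such as the PageRank method in `igraph`) would be the practical choice.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power iteration over {(src, dst): weight}; returns {node: rank}."""
    edges = {edge: w for edge, w in edges.items() if w > 0}
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    # Total outgoing weight per node, used to normalize transitions.
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        new_rank = {node: (1.0 - damping + damping * dangling) / n for node in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical three-actor network.
sample_edges = {("A", "B"): 1.0, ("B", "A"): 0.5, ("B", "C"): 0.25, ("C", "B"): 1.0}
ranks = weighted_pagerank(sample_edges)
top_actor = max(ranks, key=ranks.get)
```

In this toy network B ranks highest because both other actors point all of their weight at B, which mirrors how a well-connected but not necessarily famous actor can rise to the top.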
- Remove movies with <5 actors.
- Create an undirected movie network:
  - Nodes: Movies
  - Edge: Shared actors
  - Weight: Jaccard index of actor sets
- Network stats: 253,744 nodes, 62.2M edges.
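The Jaccard index of two movies is the size of the intersection of their actor sets divided by the size of the union. The sketch below filters small casts and computes the edge weights; the `build_movie_edges` helper and the sample casts are hypothetical. It compares all pairs for clarity, which is quadratic in the number of movies, so a real run on hundreds of thousands of movies would presumably enumerate candidate pairs through the actor-to-movie lookup instead.

```python
from itertools import combinations

MIN_CAST = 5  # movies with fewer actors are removed, as stated above


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def build_movie_edges(movie2act_dict, min_cast=MIN_CAST):
    """Return {(movie1, movie2): jaccard} for movies sharing at least one actor."""
    kept = {m: set(cast) for m, cast in movie2act_dict.items() if len(cast) >= min_cast}
    edges = {}
    for m1, m2 in combinations(sorted(kept), 2):
        weight = jaccard(kept[m1], kept[m2])
        if weight > 0:
            edges[(m1, m2)] = weight
    return edges


# Hypothetical sample casts.
sample_casts = {
    "X": {"a", "b", "c", "d", "e"},
    "Y": {"c", "d", "e", "f", "g"},
    "Z": {"a", "b"},                 # dropped: fewer than 5 actors
    "W": {"h", "i", "j", "k", "l"},  # kept, but shares no actors
}
movie_edges = build_movie_edges(sample_casts)
```

Here X and Y share 3 of 7 distinct actors, so their edge weight is 3/7; Z is filtered out and W stays isolated.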
- Run Newman's fast greedy modularity algorithm to detect communities.
- Assign genre tags if genre appears in ≥20% of a community.
- Most frequent genres: Drama, Short.
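The tagging rule can be illustrated independently of the community detection itself. The sketch below assumes a membership mapping (movie to community id) has already been produced, for example by a fast greedy routine in `igraph`, and that each movie has a single genre; the `tag_communities` helper and the sample data are hypothetical.

```python
from collections import Counter

GENRE_THRESHOLD = 0.20  # a genre must cover at least 20% of a community


def tag_communities(membership, movie2genre, threshold=GENRE_THRESHOLD):
    """Return {community id: sorted genres covering >= threshold of its movies}."""
    members = {}
    for movie, community in membership.items():
        members.setdefault(community, []).append(movie)

    tags = {}
    for community, movies in members.items():
        counts = Counter(movie2genre[m] for m in movies if m in movie2genre)
        tags[community] = sorted(
            genre for genre, n in counts.items() if n >= threshold * len(movies)
        )
    return tags


# Hypothetical sample: community 0 has ten movies, community 1 has two.
genres_0 = ["Drama"] * 6 + ["Short"] * 3 + ["Comedy"]
membership = {"c0_%d" % i: 0 for i in range(10)}
membership.update({"c1_0": 1, "c1_1": 1})
movie2genre = {"c0_%d" % i: genre for i, genre in enumerate(genres_0)}
movie2genre.update({"c1_0": "Short", "c1_1": "Short"})

community_tags = tag_communities(membership, movie2genre)
```

In community 0, Comedy covers only 10% of the movies and is not assigned, while Drama (60%) and Short (30%) both pass the threshold.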
- For selected movies:
  - Batman v Superman (2016)
  - Mission: Impossible - Rogue Nation (2015)
  - Minions (2015)
- Find top 5 neighbors based on edge weights and shared community tags.
- Predict movie ratings using average ratings of 10, 30, and 50 neighbors.
- Results:
  - Predictions fairly close to IMDb ratings.
  - Better performance with moderate training set size.
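A minimal sketch of neighborhood averaging: rank a movie's neighbors by edge weight, keep the k strongest that have a known rating, and average those ratings. The unweighted mean, the `predict_rating` helper, and the sample data are assumptions made for illustration.

```python
def predict_rating(movie, neighbors, ratings, k):
    """Average the ratings of the k highest-weight rated neighbors.

    neighbors maps movie -> {neighbor: edge weight}; returns None when no
    neighbor has a rating.
    """
    ranked = sorted(
        neighbors.get(movie, {}).items(),
        key=lambda item: (-item[1], item[0]),  # heaviest edge first, name breaks ties
    )
    rated = [ratings[name] for name, _weight in ranked if name in ratings][:k]
    if not rated:
        return None
    return sum(rated) / float(len(rated))


# Hypothetical sample data; "d" has no rating and is skipped.
sample_neighbors = {"T": {"a": 0.5, "b": 0.4, "c": 0.1, "d": 0.05}}
sample_ratings = {"a": 8.0, "b": 6.0, "c": 4.0}

pred_k2 = predict_rating("T", sample_neighbors, sample_ratings, 2)
pred_k10 = predict_rating("T", sample_neighbors, sample_ratings, 10)
```

With k = 2 only the two strongest neighbors contribute; with k = 10 every rated neighbor contributes because fewer than ten exist. Calling the function with k set to 10, 30, and 50 would reproduce the comparison described above.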
- Features:
  - Top 5 actor PageRank scores
  - Director `isTop100` boolean (1/0)
- Trained linear regression with scikit-learn.
- R² ≈ 0.005: the features explain almost none of the variance in ratings.
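The feature vector and the R² metric can be sketched without the regression itself. Everything below is illustrative: `movie_features`, `r_squared`, the zero-padding of short casts, and the set of top directors are assumptions, and the resulting six-element vectors would be what gets passed to a scikit-learn linear model.

```python
def movie_features(cast, director, pagerank, top_directors, n_actors=5):
    """Return [top n_actors PageRank scores..., director flag] for one movie.

    Casts shorter than n_actors are zero-padded (an assumption).
    """
    scores = sorted((pagerank.get(actor, 0.0) for actor in cast), reverse=True)
    scores = scores[:n_actors]
    scores += [0.0] * (n_actors - len(scores))
    return scores + [1.0 if director in top_directors else 0.0]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / float(len(actual))
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sample data.
sample_pagerank = {"A": 0.3, "B": 0.1, "C": 0.2}
features = movie_features(["A", "B", "C"], "Director D", sample_pagerank, {"Director D"})

# Predicting the mean rating for every movie gives R² = 0.
baseline_r2 = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

An R² of 0 corresponds to always predicting the mean rating, so a value near 0.005 means the fitted model is barely better than that baseline.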
- Bipartite graph: Actors ↔ Movies
- Actor score: Average of top 5 movie ratings
- Movie score: Average score of its actors
- Outperformed regression approach, especially for live-action films.
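The bipartite scoring scheme amounts to two averaging passes, sketched below. The `actor_scores` and `movie_score` helpers and the sample data are hypothetical, and the sketch assumes the rating of the movie being predicted has already been left out of `ratings`, so it cannot leak into its own prediction.

```python
def actor_scores(act2movie_dict, ratings, top_n=5):
    """Score each actor as the mean of their top_n highest-rated movies."""
    scores = {}
    for actor, movies in act2movie_dict.items():
        rated = sorted((ratings[m] for m in movies if m in ratings), reverse=True)
        rated = rated[:top_n]
        if rated:
            scores[actor] = sum(rated) / float(len(rated))
    return scores


def movie_score(cast, scores):
    """Score a movie as the mean score of its cast; None if no actor is scored."""
    known = [scores[actor] for actor in cast if actor in scores]
    return sum(known) / float(len(known)) if known else None


# Hypothetical sample data.
sample_act2movie = {
    "A": {"m1", "m2", "m3", "m4", "m5", "m6"},
    "B": {"m1", "m2"},
}
sample_ratings = {"m1": 9.0, "m2": 8.0, "m3": 7.0, "m4": 6.0, "m5": 5.0, "m6": 1.0}

scores = actor_scores(sample_act2movie, sample_ratings)
predicted = movie_score(["A", "B", "Unknown"], scores)
```

Actor A's weakest movie (rated 1.0) is ignored because only the top five ratings count, and the unscored cast member is skipped when averaging.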
| Method | Prediction Accuracy | Notes |
|---|---|---|
| PageRank Analysis | N/A | Good for network centrality, not fame |
| Neighborhood Averaging | Moderate | Best with 30–50 neighbors |
| Linear Regression | Low (R² ≈ 0.005) | Poor due to weak features |
| Bipartite Graph Averaging | Strong | Most accurate overall |
Install the Python dependencies with `pip2 install -r requirements.txt`, then run `run_all.sh`.
- Python 2.7
- igraph (Python + R)
- scikit-learn
- Regular expressions (re)
- Pickle
- Data cleaning and formatting were critical for network integrity.
- Genre tagging and network construction were memory-intensive and optimized using hashing and lookup tables.
- The study highlights the tradeoffs between algorithmic influence scoring and actual popularity or fame.
This project is released for academic and research purposes. Please credit the source if used in publications or derivative works. No commercial use of IMDb data is intended.