- Overview
- Dataset
- Key Questions Explored
- Tools and Libraries
- Installation
- Usage
- Results
- Future Improvements
This project is an exploratory data analysis (EDA) of the IMDb Top 1000 movies dataset. The analysis aims to uncover interesting insights about the top-rated movies, including trends related to genres, directors, movie ratings, and gross revenue. The project demonstrates skills in data cleaning, visualization, and interpretation using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn.
LINK TO DATASET: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
LINK TO POWER BI REPORT: https://app.powerbi.com/view?r=eyJrIjoiOWZhMDY3ZmEtOTU0ZC00OThiLWI1YmQtZGE5MjliZGJlYzRmIiwidCI6IjAwZmZkYmE4LWNlOTEtNGUxMy1iMjg4LTc4MmU1NjQ2OGU2ZSJ9&pageName=22d03c1a6a020db1c41e
The dataset used in this project is the IMDb Top 1000 Movies dataset, which contains information on the top-rated 1000 movies according to IMDb ratings. The dataset includes various attributes such as:
- Title: the name of the movie
- Release Year: the year the movie was released
- Runtime: the duration of the movie in minutes
- Genre: a list of genres associated with the movie
- IMDB Rating: the IMDb rating of the movie
- Director: the director of the movie
- Star: the list of top 4 cast members in the movie
- Gross: the gross revenue generated by the movie
The dataset was obtained from Kaggle and cleaned for the purposes of this analysis.
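As a rough illustration of what the cleaning step can involve, here is a minimal sketch. It assumes that the raw Kaggle file stores runtime as text like "142 min" and gross revenue as text with thousands separators like "28,341,469". The helper names `parse_runtime` and `parse_gross` are hypothetical and are not taken from the notebook.

```python
def parse_runtime(raw: str) -> int:
    """Convert a runtime string such as '142 min' to a number of minutes."""
    # Assumption: the first whitespace-separated token is the minute count.
    return int(raw.strip().split()[0])


def parse_gross(raw):
    """Convert a gross string such as '28,341,469' to a float.

    Returns None when the value is missing or empty.
    """
    if raw is None or not str(raw).strip():
        return None
    # Drop the thousands separators before converting.
    return float(str(raw).replace(",", ""))


print(parse_runtime("142 min"))
print(parse_gross("28,341,469"))
```

Helpers like these could be applied to a whole column with Pandas, for example through `Series.map`.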
- What are the most common movie genres in the top 1000 movies?
- How do IMDb ratings vary across different genres?
- Are there any directors whose movies consistently have higher ratings?
- Is there a correlation between movie duration and IMDb rating?
- Which genres tend to generate the highest average gross revenue?
- Is there a relationship between movie ratings and gross revenue?
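The genre questions above can be approached with a split-and-explode pattern in Pandas. The sketch below runs on a few made-up rows rather than the real data, and the column names (`Genre`, `IMDB_Rating`, `Gross`) are assumptions about how the cleaned file is laid out, so it may differ from what the notebook does.

```python
import pandas as pd

# Made-up toy rows; column names are assumed, not taken from the notebook.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Drama", "Crime, Drama", "Action, Adventure", "Adventure, Drama"],
    "IMDB_Rating": [9.0, 8.0, 8.5, 8.0],
    "Gross": [10.0, 20.0, 100.0, 60.0],
})

# Split the comma-separated genre string so each genre gets its own row.
by_genre = (
    movies.assign(Genre=movies["Genre"].str.split(", "))
          .explode("Genre")
)

# How many movies carry each genre label.
genre_counts = by_genre["Genre"].value_counts()

# Average rating and average gross revenue per genre.
genre_summary = by_genre.groupby("Genre")[["IMDB_Rating", "Gross"]].mean()

print(genre_counts)
print(genre_summary)
```

A movie with several genres is counted once per genre here, which is one reasonable choice among several.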
The following Python libraries and tools were used for the analysis:
- Pandas: for data manipulation and cleaning
- NumPy: for numerical operations
- Matplotlib: for data visualization
- Seaborn: for enhanced statistical visualizations
- Jupyter Notebook: for interactive coding and analysis
- Microsoft Power BI: for creating interactive visualizations and dashboards
To run this project locally, follow these steps:
- clone this repository:
git clone https://github.com/okashashuda/data-analysis-movies.git
cd data-analysis-movies
- install the required libraries (skip this step if using an Anaconda environment)
pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
- start Jupyter Notebook
jupyter notebook
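If you want to confirm that the libraries are available before opening the notebook, a small check along these lines may help. This is a sketch that is not part of the repository, and the function name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Libraries listed in the installation step above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required libraries are installed.")
```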
- Open the Jupyter notebook movie-analysis.ipynb to follow the data analysis process step by step
- If you're interested in the cleaned dataset, you can use imdb_movies_clean.csv for your own analysis or visualization
- The code is designed to be modular, so feel free to modify any of the analysis questions or visualizations to suit your needs
The analysis provided several key insights, including:
- movies in the Drama genre are the most common and also hold the highest individual IMDb ratings (not the highest average rating)
- although Drama movies hold the highest ratings, Adventure movies bring in the most money (both average and total revenue)
- the directors with the most movies in the top 1000 are Spielberg, Scorsese, Allen, Nolan, Tarantino, Fincher and Eastwood
- the actors who appear in the most movies in the top 1000 include Robert De Niro, Tom Hanks, Al Pacino, Brad Pitt and Leonardo DiCaprio
- no strong evidence to suggest that there is a relationship between a movie's rating and gross revenue
- no evidence of a correlation between movie runtime and rating, or between runtime and gross revenue
Feel free to check the visualizations and full insights in the movie-analysis.ipynb notebook.
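For readers who want to check the relationship findings themselves, one common approach is a Pearson correlation matrix. The sketch below uses made-up toy values and assumed column names (`Runtime`, `IMDB_Rating`, `Gross`), so its numbers say nothing about the real dataset and the notebook may use a different method.

```python
import pandas as pd

# Made-up toy values for illustration only; column names are assumptions.
movies = pd.DataFrame({
    "Runtime": [100, 120, 140, 160, 180],
    "IMDB_Rating": [8.1, 8.4, 8.0, 8.6, 8.2],
    "Gross": [50.0, 10.0, 200.0, 30.0, 120.0],
})

# Pairwise Pearson correlation between the numeric columns.
corr = movies.corr(method="pearson")

# Pull out a single pair, for example rating against gross revenue.
rating_vs_gross = corr.loc["IMDB_Rating", "Gross"]

print(corr.round(2))
```

Values near 0 suggest no linear relationship, while values near 1 or -1 suggest a strong one. A scatter plot is a useful companion, since Pearson correlation only captures linear trends.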
If I were to repeat a similar project, here is what I would do differently:
- Additional Questions: is there a correlation between the movie budget and revenue generated?
- Predictive Modeling: build a simple machine learning model to predict a movie’s IMDb rating based on its attributes
- Expand Dataset: include more movies beyond the top 1000 to get a broader view of trends