This project involves performing data exploration and visualization on the Netflix dataset to gain insights that can help Netflix decide what types of shows/movies to produce and how to grow the business in various countries.
Analyze the data to generate insights that could help Netflix decide which type of shows/movies to produce and how they can grow their business in different countries.
The dataset contains information about TV shows and movies available on Netflix, including the following attributes:
- Show_id: Unique ID for every Movie/TV Show
- Type: Identifier - A Movie or TV Show
- Title: Title of the Movie/TV Show
- Director: Director of the Movie
- Cast: Actors involved in the movie/show
- Country: Country where the movie/show was produced
- Date_added: Date it was added on Netflix
- Release_year: Actual Release year of the movie/show
- Rating: TV Rating of the movie/show
- Duration: Total Duration - in minutes or number of seasons
- Listed_in: Genre
- Description: Summary description
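
As a rough illustration of how a frame with these attributes might be inspected, the sketch below uses pandas on a few toy rows. The rows are invented for the example and only some of the columns are included. The capitalized column names follow the list above, which is an assumption, since the actual file may use different casing.

```python
import pandas as pd

# Toy rows invented for illustration; they are not taken from the real dataset.
# Column names follow the attribute list above; the real file's casing may differ.
df = pd.DataFrame(
    {
        "Show_id": ["s1", "s2", "s3"],
        "Type": ["Movie", "TV Show", "Movie"],
        "Title": ["Example A", "Example B", "Example C"],
        "Director": ["Director X", None, "Director Y"],
        "Country": ["India", "Japan", None],
        "Release_year": [2019, 2020, 2018],
        "Duration": ["95 min", "2 Seasons", "120 min"],
    }
)

# Structure: column dtypes and (rows, columns).
print(df.dtypes)
print(df.shape)

# Missing values per attribute.
missing = df.isna().sum()
print(missing)

# Non-graphical analysis: value counts and number of unique values per column.
type_counts = df["Type"].value_counts()
unique_counts = df.nunique()
print(type_counts)
print(unique_counts)

# Statistical summary of the numeric columns (here only Release_year).
print(df.describe())
```

On the real data, the toy frame would be replaced by the loaded dataset; the inspection calls stay the same.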
- Define the problem statement and analyze basic metrics.
- Analyze the data structure, detect missing values, and generate a statistical summary.
- Perform non-graphical analysis: value counts and unique attributes.
- Visual Analysis:
  - Univariate, Bivariate analysis using various plots (Distplot, Countplot, Boxplot, Heatmaps, Pairplots).
  - Missing Value and Outlier check.
- Derive business insights and make actionable recommendations.
- Focus on popular genres like Drama, Comedy, and International TV Shows/Movies.
- Release TV Shows in July/August and Movies at the end or start of the year.
- For the USA, produce movies of 80-120 minutes and Kids TV Shows.
- For the UK, maintain the same movie length and target mature audiences.
- In India, increase the number of movies, as the count has been declining since 2018.
- Create Anime content for Japan and Romantic TV Shows for South Korea.
- Consider popular actors/directors and their combinations while creating content.
- Multiple Directors Issue: Some movies have two directors, making it difficult to perform certain operations. To manage this, I reduced the granularity.
- Duration Column: The 'Duration' column had numerical values for movies (e.g., '120 min') but categorical values for series (e.g., '2 seasons'), requiring special handling.
- Date Column Handling: The 'Date_added' column was recognized as an object data type by Pandas, hindering date extraction. I converted it using `pd.to_datetime()`.
- Missing Value Imputation: Replaced missing values with the most appropriate estimates to improve analysis accuracy.
- EDA Challenges: Performed extensive univariate and bivariate analysis to extract meaningful insights from the data.
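
The data-cleaning challenges above can be sketched roughly as follows. This is a minimal example on invented toy rows, not the notebook's actual code. Only `pd.to_datetime()` is named in the write-up. The date format string, the mode-based imputation, the "Unknown" placeholder, and the one-row-per-director split are all assumptions, chosen as one plausible way to handle each issue.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the real dataset.
df = pd.DataFrame(
    {
        "Title": ["Example A", "Example B", "Example C"],
        "Type": ["Movie", "TV Show", "Movie"],
        "Director": ["Director X, Director Y", None, "Director Z"],
        "Country": ["India", None, "India"],
        "Date_added": ["September 25, 2021", "July 1, 2020", None],
        "Duration": ["120 min", "2 Seasons", "95 min"],
    }
)

# Date column: convert object strings to datetime. The format is an assumption;
# values that are missing or do not match become NaT instead of raising.
df["Date_added"] = pd.to_datetime(
    df["Date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
df["Month_added"] = df["Date_added"].dt.month

# Duration column: split "<number> <unit>" and keep minutes (movies) and
# season counts (TV shows) in separate numeric columns.
parts = df["Duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
df["Duration_value"] = pd.to_numeric(parts["value"])
is_movie = df["Type"] == "Movie"
df["Minutes"] = df["Duration_value"].where(is_movie)
df["Seasons"] = df["Duration_value"].where(~is_movie)

# Missing values: one simple option is the most frequent value (mode) for a
# categorical column, and a placeholder label where no estimate is sensible.
df["Country"] = df["Country"].fillna(df["Country"].mode().iloc[0])
df["Director"] = df["Director"].fillna("Unknown")

# Multiple directors: split the comma-separated field so each director gets
# its own row, which makes per-director counts straightforward.
directors = (
    df.assign(Director=df["Director"].str.split(","))
    .explode("Director")
    .assign(Director=lambda d: d["Director"].str.strip())
)
director_counts = directors["Director"].value_counts()
print(director_counts)
```

Keeping minutes and seasons in separate columns avoids mixing two units in one numeric field, so movie-length statistics are not distorted by season counts.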
- PDF Report: Netflix_case_study_2.0.pdf
- Jupyter Notebook: Netflix_case_study_2.0.ipynb
- Dataset: Netflix Data