Skip to content

Arpita-deb/netflix-movies-and-tv-shows

Repository files navigation

Netflix Data Analysis: A Deep Dive into Streaming Trends

Netflix logo

Introduction:

Netflix is an American subscription video on-demand over-the-top streaming service which was launched on January 16, 2007. The service primarily distributes original and acquired films and television shows from various genres, and it is available internationally in multiple languages.

Netflix is the most-subscribed video on demand streaming media service, with 238.39 million paid memberships in more than 190 countries. As of October 2023, Netflix is the 24th most-visited website in the world with 23.66% of its traffic coming from the United States, followed by the United Kingdom at 5.84% and Brazil at 5.64%.

Using Postgresql I cleaned, analyzed and normalized the Netflix Streaming data, which contains information about the titles, genres, ratings, and countries of the streaming content. I then used Datawrapper and Tableau to create interactive and engaging visualizations of the streaming trends, such as the most popular genres, the distribution of ratings, and the growth of content over time.

Screenshot (1047)

Objective:

The goal of this project is to explore the characteristics and patterns of the Netflix Streaming data and communicate the findings effectively using data visualization tools. By answering the following questions I've conducted an exploratory data analysis of the streaming data of Netflix.

  1. What is the total number of movies and TV shows available on Netflix?
  2. How has the distribution of content (movies and TV shows) changed over time? For example, how many movies and TV shows were released in each decade?
  3. What are the most common genres of movies and TV shows on Netflix?
  4. Which country produces the most movies and TV shows on Netflix?
  5. What is the average duration of movies and TV shows on Netflix?
  6. What are the top-rated movies on Netflix?
  7. What are the most popular ratings on Netflix?
  8. Which were the top years in terms of the number of titles released?
  9. Which actor/director has most films/series in Netflix?
  10. What are the percentage relative frequency for genre?

Tools used:

  1. Google Slides - to create the project proposal
  2. PostgreSQL (SQL Shell) - for data cleaning, data normalization and analysis process
  3. Datawrapper - to create data visualizations
  4. Tableau - to create dashboard
  5. GitHub - for documentation

Methodologies used:

  1. Data Cleaning
  2. Data Manipulation
  3. Database design and Normalization
  4. Exploratory Data Analysis
  5. Data Visualization
  6. Documentation

Deliverables:

  1. A Project Proposal
  2. A cleaned dataset
  3. Normalized data in PostgreSQL database
  4. A full documentation of data cleaning and analysis process
  5. Data visualizations and dashboard in Tableau

About the dataset:

The dataset contains 6 files in total, namely -

  • Best Movie by Year Netflix.csv
  • Best Movies Netflix.csv
  • Best Show by Year Netflix.csv
  • Best Shows Netflix.csv
  • raw_credits.csv
  • raw_titles.csv

For this analysis I'll be using 2 raw files raw_credits and raw_titles which contain 5806 entries of movies/shows in raw_title and 77,214 entries for actors/directors in raw_credits respectively.

Data Dictionary:

1. raw_titles

Column name Datatype Type Description
index integer NON NULLABLE index of the rows
id string NON NULLABLE unique id for each entry
title string NON NULLABLE The title of the movie or TV show
type string NON NULLABLE The type of the movie or TV show
release_year integer NON NULLABLE The year the movie or TV show was released
age_certification string NULLABLE The age certification of the movie or TV show
runtime integer NON NULLABLE The runtime of the movie or TV show
genres string NULLABLE The genres of the movie or TV show
production_countries string NULLABLE The production countries of the movie or TV show
seasons integer NULLABLE The number of seasons of the TV show
imdb_score float NON NULLABLE The IMDB score of the movie or TV show
imdb_votes integer NON NULLABLE The number of IMDB votes of the movie or TV show

2. raw_credits

Column name Datatype Type Description
index integer NON NULLABLE index of the rows
person_id integer NON NULLABLE unique id for each entry
id integer NON NULLABLE id of the movie/show
name string NON NULLABLE The name of the actor or actress
character string NULLABLE The character the actor or actress played in the movie or TV show
role string NON NULLABLE The role the actor or actress played in the movie or TV show

Data Integrity:

  • Reliability and Originality: The raw dataset is created and updated by Eduardo Gonzalez. It has 6 files. For this analysis only 2 of them are used, raw_titles which contains 5806 entries of movies/shows in raw_title and raw_credits contains 77,214 entries for actors/directors.

  • Comprehensiveness: This dataset contains information on all of the movies and TV shows available on Netflix as of May 2022.

  • Citation: There are citation available on Kaggle and data.world.

  • Current: The dataset is updated upto 2022. So it is quite current.

Data cleaning:

The following steps were taken to clean the netflix data -

  • Step 1 - Made a backup copy of the original data in csv format

  • Step 2 - Created a new database called 'netflix' in PostgreSQL Database system

  • Step 3 - Created the raw_titles and raw_credits tables to hold the data loaded from csv files

  • Step 4 - Loaded the data into the raw_titles and raw_credits table

  • Step 5 - Checked the size of the dataset

  • Step 6 - Checked the datatypes

  • Step 7 - Removed the redundant columns (index, imdb_id from raw_titles and index from raw_credits table) that will not help us in the analysis

  • Step 8 - Looked for null values by counting the number of rows with Null values, 0s or wrong values. There are null values in titles, runtime, genres, production_countries, imdb_score and imdb votes columns in raw_titles table and in character column in raw_credits table.

  • Step 9 - After identifying the null values, I decided to remove them from title, runtime, imdb_score and imdb_votes columns. Here is the explanation for removing the values.

    I have 1 null value in title column, which needs to be deleted as the title column needs to be unique and non nullable. There are 24 rows with 0 runtime which doesn't make sense, so they're also to be deleted. The imdb_score and imdb_votes have 539 null values altogether. I can do any of the following with these 2 columns -

    1. Replace null values with 0, but it may not be the best approach, as it could lead to inaccurate results.
    2. Replace null values with a special value such as -1.0 or -999.0. This will allow me to distinguish between missing data and actual scores of 0.
    3. Leave null values as they are. But it can affect the analysis if I have to perform calculations. Null values can cause errors in calculations and can also affect the accuracy of the results. For example, if I'm calculating the average imdb_score of a set of movies and some of the imdb_score values are null, the average will be skewed and may not be an accurate representation of the data.
    4. Another approach is to use a string value such as 'No information' or 'N/A' to represent missing data. However, this approach requires converting the column to a string datatype first, which may not be ideal if I need to perform calculations on the column.
    5. Or, removing the rows altogether. Removing these rows with null values will leave us with 5806 - 1- 24 - 532 = 5249 rows which will not affect the analysis much.

    So I decided to remove these rows with null values.

  • Step 10 - Checked for duplicate values. There were no duplicate values in both table. But there are entries with same id and person_id but with different character or role like the data shown below. There are multiple entries of one person acting in one particular show/movie but the data is recorded in 2 or more rows for different characters/roles.

Screenshot (912)

This increases the number of records in our table. The character column needed to be concatenated into one single row for each person and show_id, for either actor or director role.

  • Step 11 - Changed the cases of the texts into Proper case

  • Step 12 - Trimed the white, leading and trailing spaces from the categorical columns

  • Step 13 - String manipulation

    • 13.1 - Removed the [ ] and '' (quotation marks) from genres and production_countries columns
    • 13.2 - Replaced the null values with default values
    • 13.3 - Set the Null values in character column to 'No information'
    • 13.4 - Set the null values in seasons to 0 which corresponds to movies
    • 13.5 - Set the null values in age_certification to 'Others'
    • 13.6 - Set the values with [] in genres column with 'N/A'
    • 13.7 - Set the values with [] in production_countries column with 'N/A'
  • Step 14 - Modified some titles

    • 14.1 - Renamed title '30 March' to '30.March'
    • 14.2 - Replaced '#' from the beginning of some titles
  • Step 15 - Removed the extra columns production_contries and genres since I now have cleaned columns country and genre.

  • Step 16 - Concatenating multiple characters into a single row

    The raw_titles table holds the data of unique shows, while the raw_credits table holds the data of people who have played a certain character in a particular show. Actors and shows have a many-to-many relationship, meaning that one film/show might have more than one actor, and one actor may play more than one character both in one show or in multiple shows.

    Screenshot (908)

    To make it easier for users to look up any actor associated with a particular show and also get the information on the character they played, a table named credits was created with similar fields as the raw_credits table. The only difference between these two tables is that raw_credits sometimes holds information about characters played by a certain actor in a certain show more than once, while in the credits table, there is only one (unique) combination of show_id, person_id, character played, and role (i.e., either director or actor). This reduces the number of duplicates in the table.

    DROP TABLE IF EXISTS credits; CREATE TABLE credits (person_id INTEGER, id VARCHAR(20), name TEXT, role VARCHAR(8), character TEXT);

    INSERT INTO credits (person_id, id, name,role, character) SELECT person_id, id, name, role, STRING_AGG(character, ' / ') FROM raw_credits GROUP BY person_id, id, name, role;

    This query groups the rows in the raw_credits table by person_id,id,name and role and concatenate the values in the character column for each group using the string_agg function. The resulting table will have 5 columns: person_id,id, name, character, role where character contains the concatenated values of the character column for each group. This table now has 77122 entries.

    Screenshot (909)

  • Step 17 - Renamed the 'raw_titles' table to 'titles'

  • Step 18 - Saving the new tables and import them as cleaned csv files

Database Design:

After cleaning the Netflix data in Part 1, we obtained two tables - 'titles' containing information about unique shows/movies and 'credits' containing information about the castings in different shows/movies. The data is now distributed in these two tables.

initial data dist

When we counted the unique shows in each of the tables (since both have id column which corresponds to unique shows), we found out that the number of unique shows in credits table is higher than the titles table.

Show

To have a consistency in the entire analysis, I've chosen only the data that are present in both tables by creating a view with the common data from both titles and credits table.

Using the best practices of database management system, we can now easily split the data into different tables that will allow us - * Less data duplication and more efficient storage usage. * Increased data integrity, accuracy and consistency. * Improved query performance and organization. * Increased security and connection.

Database Design:

Database design is the organization of data according to a database model. The designer determines what data must be stored and how the data elements interrelate.

Like any design process, database and information system design begins at a high level of abstraction and becomes increasingly more concrete and specific. Data models can generally be divided into three categories, which vary according to their degree of abstraction. The process will start with a conceptual model, progress to a logical model and conclude with a physical model.

  • Conceptual data model:

They are also referred to as domain models and offer a big-picture view of what the system will contain, how it will be organized, and which business rules are involved. Conceptual models are usually created as part of the process of gathering initial project requirements.

conceptual model

  • Logical data model:

They are less abstract and provide greater detail about the concepts and relationships in the domain under consideration. These indicate data attributes, such as data types and their corresponding lengths, and show the relationships among entities. Logical data models don’t specify any technical system requirements.

logical model

  • Physical data model:

They provide a schema for how the data will be physically stored within a database. As such, they’re the least abstract of all. They offer a finalized design that can be implemented as a relational database, including associative tables that illustrate the relationships among entities as well as the primary keys and foreign keys that will be used to maintain those relationships.

physical model

The following steps were taken to create the database -

  • Step 1 - To give the data First Normal Form (1NF), I unnested the nested columns first by 'genres' column and then by 'production_countries' and saved the data in a view called country_nested. As for primary key, the id column is a unique identifier of the show/movie and can act as a primary key.

  • Step 2 - Then I created a table 'netflix_table' with all the data from country_nested view to further manipulate it easily. Due to unnesting of the data, the number of data has increased to 271234.

  • Step 3 - To keep the database uncluttered I removed the redundant views from the database.

  • Step 4 - Then to give the data in netflix_table Second Normal Form (2NF), I needed to make sure that each non-key attribute is functionally dependent on the primary key(show_id) only. The attributes that are functionally independent from the primary key, will have their separate tables with a primary key. So I grouped different categories into different tables according to the physical model of the database.

  • Step 5 - I created the leaf tables and inserted the values in respective tables from netflix_tables and provided the primary keys to each of the tables.

  • Step 6 - To link theses relations(tables) with each other I defined foreign keys.

  • Step 7 - There were four attributes genre, country, actor and director that had many-to-many relationship with the show table. So to join these tables with show table individually I needed 4 junction/linking tables with the foreign keys as columns.

  • Step 8 - To test that the keys work fine and link the tables properly, I've performed several queries with JOIN statements at each step.

  • Step 9 - In the country table to make the data more comprehensible, I decided to add country name along with country ISO code. So I joined a table that have been previously downloaded from Wikipedia and inserted the country names into the country_name column in country table.

  • Step 10- Finally, I checked for transitive dependencies among the leaf tables and their attributes. There were none. This satisfied the Third Normal Form (3NF).

  • Step 11 - To make the data clean, consistent and complete, I performed some string Manipulations, trimmed some parts of strings, used REGEX to filter out inconsistencies and changed them.

This database is now normalized upto 3rd Normal Form. It ensures that -

  1. Each table have columns that only store a single piece of data and that data is accessed through a unique key (Primary Key). [First Normal Form]

  2. Each Non-key attributes (i.e., columns other than primary key(s) are functionally dependent on the primary key only. [Second Normal Form]

  3. There is no transitional dependency of the non-key attributes i.e., each table has columns that are dependent only on the primary key. [Third Normal Form]

Data Analysis:

1. What is the total number of movies and TV shows available on Netflix?

total_content

2. What is the total number of contents per type (movie/show)?

show-movie-freq

distribution-of-movies-and-shows-on-netflix

3. How has the distribution of content (movies and TV shows) changed over time?

distribution-of-contents-on-netflix-over-the-years

4. Which were the top years in terms of the number of titles released?

top-years

5. How many movies and TV shows were released in each decade?

decade

rise-of-netflix-over-the-decades

6. What are the most common genres of movies and TV shows on Netflix?

top-5-genre-movies

top-5-genre-shows

popular-genre-on-netflix

7. Which country produces the most movies and TV shows on Netflix?

top-10-country-movie

top-10-country-show

total-movie-released-per-country

total-shows-released-per-country

distribution-of-contents-in-top-10-countries

8. Calculate descriptive statistics for imdb score, imdb votes and runtime of the shows.

imdb_score-dist

imdb_votes-dist

runtime-dist

9. Which shows/movies are of longest and shortest duration?

highest-runtime

lowest-runtime

10. Calculate the distribution of seasons for shows.

seasons

popular-seasons-on-netflix

11. What are the 10 top-rated movies and shows on Netflix?

top-10-movies

top-10-show

top-10-movies-in-netflix-based-on-imdb-scores

top-10-shows-in-netflix-based-on-imdb-scores

12. What are the most popular certifications on Netflix?

age-cert

distribution-of-contents-on-netflix-by-age-certification

13. List Top 10 Actors with number of shows/movies acted.

top-10-actors

14. List Top 10 Directors with number of shows/movies directed.

top-10-directors

15. Categorize the contents in 3 parts (Short, Medium and Long) in terms of duration and give their respective percentage frequency.

duration-freq

distribution-of-contents-on-netflix-based-on-runtime

16. Categorize the contents in 10 ratings based on the imdb_score.

rating-freq

what-do-people-think-of-the-contents-on-netflix-

17. What is the percentage frequency of genre?

genre-freq

18. Calculate the number of shows with runtime greater than the average duration?

num_content_greater_than_avg_runtime

19. Calculate the number of contents with imdb_score greater than average imdb_score.

num_content_greater_than_avg_imdb

Dashboard:

netflix

The tableau dashboard can be accessed here.

Summary:

  • The analysis revealed that Netflix has more movies than shows, mostly from the USA and India, and that its popularity has increased dramatically since 2010. Japan and North Korea are seen as emerging show producing countries.
  • Over the past ten years (2011–2021), Netflix’s popularity surged as the number of contents released increased almost tenfold and the trend seem to continue to the next decade.
  • It also shows that the most common genres are drama, comedy and action, and that the average runtime, IMDB score and IMDB votes are 81.61 minutes, 6.5269 and 24724.6467 respectively.
  • It also indicates that most of the contents are for mature audiences as 14% of the total contents have TV-MA certification followed by R (11%) and PG-13 (8%).
  • Most of the shows have one season.

Recommendations for various stakeholders:

  • Members: Netflix can make its members happier and loyal by suggesting contents that they like based on what they watch. Netflix can also use the data to find and make more contents that suit its global members' tastes and needs. For example, Netflix can invest more in producing original shows from Japan and North Korea, as they are seen as emerging markets for show production. Netflix can also create more contents in different languages and genres, such as comedy, drama, and action, which are the most popular among its members.

  • Investors: Netflix can use the data to show its investors how it is leading and growing in the streaming industry. Netflix can show its investors how it has become more popular and successful by releasing more contents in the past ten years, and how it will keep doing so.

  • Regulators: Netflix can use the data to follow the rules and standards of different countries and regions where it offers its service. Netflix can make sure that its contents are fit and respectful for the people who watch them, and that it follows the culture and law of each place. Netflix can also use the data to make decisions on how to license, tax, and censor its contents, and to deal with any issues or concerns from the regulators.

  • Communities and organizations: Netflix can use the data to connect with the communities and organizations that are important for its business and social impact. Netflix can work with independent producers and content creators to give them a chance to share their work with the world. Netflix can also work with schools, NGOs, and media to promote its contents and raise awareness on various topics and issues that are in its contents. Netflix can also use the data to improve its diversity and inclusion efforts, and to support the causes and initiatives that match its values and mission.

Limitation of the project:

  • One possible limitation of the Netflix dataset is that it only contains data up to May 2022, which may not reflect most current contents and trends.
  • Due to rigorous data cleaning of the dataset, many incorrect and inconsistent entries have been removed from the dataset. So, it does not cover the entire contents of Netflix.
  • Since there were no reviews, we couldn’t perform sentiment analysis to provide more insights into user behavior and preferences.

Some future project ideas:

  • Analyze the relationship between the director and the cast of the movies and shows.
  • Explore the variation of content in other countries as well as for different years.
  • Perform sentiment analysis on the descriptions of the movies and shows.
  • Create a recommendation system based on the user’s preferences and viewing history.

Resources:

  1. Dataset Used

  2. Data Cleaning in SQL

  3. Data Design

  4. Regular Expressions

  5. Data Analysis

  6. Dashboard

About

Exploratory Data Analysis of the Netflix data in SQL

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published