Exploratory Data Analysis: Netflix Data
Introduction:
For this project, I decided to explore and analyze a dataset of media content released on Netflix. The dataset used can be found at https://drive.google.com/file/d/1GFMqVUwat-ReYQ-Gb8GHUEqw8J0deswJ/view?usp=sharing.
To analyze the data, I came up with the following questions:
- What is the content makeup on Netflix?
- How many programs does each rating have?
- What are the top 10 content-producing countries?
- What are the top 5 genres on Netflix?
- What year had the most content released in the past 20 years?
- How much content has been released throughout the years?
Preparing the Data:
To prepare the data, I first checked for duplicate rows and found none. I then checked for null values and found missing data in the Rating, Country, Director, Cast, and Date_Added columns. I dropped the Director, Cast, and Date_Added columns because they were not relevant to the analysis, and I filled in the missing Rating and Country values using information found on Google. After cleaning, I was left with a data frame of 7,787 rows and 9 columns.
Preparation process:
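The steps described above can be sketched in pandas as follows. This is a minimal illustration on a three-row stand-in table, not the exact code used for this project. The Title and Type column names, the sample rows, and the two lookup dictionaries are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-in for the Netflix file. Title and Type are assumed column
# names; all values are made up for illustration.
raw = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Type": ["Movie", "TV Show", "Movie"],
    "Director": ["X", None, "Z"],
    "Cast": [None, "P, Q", "R"],
    "Country": ["United States", None, "India"],
    "Date_Added": ["January 1, 2020", None, "March 3, 2019"],
    "Rating": ["TV-MA", "PG", None],
})

# 1. Count rows that are exact copies of an earlier row.
n_duplicates = int(raw.duplicated().sum())

# 2. Count missing values per column.
missing = raw.isna().sum()

# 3. Drop the columns that are not needed for the analysis.
clean = raw.drop(columns=["Director", "Cast", "Date_Added"])

# 4. Fill the remaining gaps from manually researched values
#    (hypothetical lookups keyed by title).
manual_ratings = {"C": "TV-14"}
manual_countries = {"B": "Canada"}
clean["Rating"] = clean["Rating"].fillna(clean["Title"].map(manual_ratings))
clean["Country"] = clean["Country"].fillna(clean["Title"].map(manual_countries))

print(n_duplicates)
print(missing[missing > 0])
print(clean.shape)
```

Keeping the manual fixes in dictionaries makes it easy to see afterwards which values were researched by hand rather than taken from the file.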
Data Exploration:
1. What is the content makeup on Netflix?
This pie chart shows how the content on Netflix is split between movies and TV shows: 69.1% of the content is movies and the remaining 30.9% is TV shows.
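The percentages behind a chart like this can be computed with a normalized value count. The sketch below uses a made-up four-row sample and assumes the column is called Type, so the numbers it prints are not the real Netflix figures.

```python
import pandas as pd

# Hypothetical sample; "Type" is an assumed column name.
df = pd.DataFrame({"Type": ["Movie", "Movie", "Movie", "TV Show"]})

# normalize=True returns fractions of the total; scale to percentages.
share = (df["Type"].value_counts(normalize=True) * 100).round(1)

print(share)
```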
2. How many programs does each rating have?
This bar graph shows the number of programs for each rating. The graph contains two “TV-MA” bars, even though the preparation step found no duplicate rows. That check compares entire rows, so it cannot catch a rating label that is stored in two slightly different forms. This highlights a data quality issue in the Rating column.
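One possible cause of a doubled label, which I have not confirmed for this dataset, is a stray space or a difference in letter case. The sketch below shows how such a variant could be detected and merged. The sample values and the Rating_clean column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in which one "TV-MA" carries a trailing space.
df = pd.DataFrame({"Rating": ["TV-MA", "TV-MA ", "PG", "TV-MA", "PG"]})

# Counting the raw labels treats "TV-MA" and "TV-MA " as different ratings.
raw_counts = df["Rating"].value_counts()

# Normalize the labels before counting: trim whitespace and unify case.
df["Rating_clean"] = df["Rating"].str.strip().str.upper()
clean_counts = df["Rating_clean"].value_counts()

# repr() makes invisible differences such as trailing spaces visible.
print([repr(label) for label in raw_counts.index])
print(clean_counts)
```

If the labels still differ after this normalization, the variant has another cause and the raw values need to be inspected directly.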
3. What are the top 10 content-producing countries?
This bar graph shows the top 10 Netflix content-producing countries in descending order. The United States is the top content producer, followed by India, the United Kingdom, Japan, South Korea, Canada, Spain, France, Egypt, and Turkey.
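A ranking like this can be produced by counting the values in the Country column and keeping the first ten. The sketch below runs on a made-up six-row sample, so it only shows the mechanics.

```python
import pandas as pd

# Hypothetical sample of the Country column.
df = pd.DataFrame({
    "Country": ["United States", "India", "United States",
                "Japan", "India", "United States"]
})

# value_counts() sorts from most to least frequent, so head(10)
# returns the top 10 (or fewer, if there are fewer distinct values).
top_countries = df["Country"].value_counts().head(10)

print(top_countries)
```

If a row lists several countries in one cell, this approach counts the whole combination as its own value.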
4. What are the top 5 genres on Netflix?
This bar graph shows the top 5 genres of programs on Netflix. The top genre is Documentaries, followed by Stand-Up Comedy. The graph also shows that some entries combine several genres into a single category (e.g. “Dramas, International Movies”). Each combination is counted separately from its individual genres, which distorts the ranking.
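One way to reduce the effect of combined categories is to split each entry on commas and count every genre on its own. This is a sketch of that alternative, not the method used for the graph above. The Genre column name and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample; "Genre" is an assumed column name.
df = pd.DataFrame({
    "Genre": [
        "Dramas, International Movies",
        "Documentaries",
        "Dramas",
        "Stand-Up Comedy",
        "Documentaries, International Movies",
    ]
})

# Counting the raw strings treats every combination as its own genre.
combined_counts = df["Genre"].value_counts()

# Split on commas, give each genre its own row, and trim the spaces
# that are left over from the ", " separators.
genres = df["Genre"].str.split(",").explode().str.strip()
single_counts = genres.value_counts()
top_genres = single_counts.head(5)

print(combined_counts)
print(top_genres)
```

With this approach a program that lists two genres contributes to both counts, so the totals add up to more than the number of programs.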
5. What year had the most content released in the past 20 years?
This bar graph shows the number of programs released per year over the past 20 years, ordered from least to most. As we can see in the graph, 2018 had the most programs released, followed by 2017, 2019, and 2016.
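The filtering and ordering behind a graph like this can be sketched as follows. The Release_Year column name and the sample years are assumptions, and the 20-year window is taken to end at the latest year in the data.

```python
import pandas as pd

# Hypothetical sample; "Release_Year" is an assumed column name.
df = pd.DataFrame({
    "Release_Year": [1999, 2016, 2017, 2017, 2018,
                     2018, 2018, 2019, 2019, 2020]
})

# Keep the 20 most recent years; between() includes both endpoints.
last_year = int(df["Release_Year"].max())
window = df[df["Release_Year"].between(last_year - 19, last_year)]

# Count programs per year and order from least to most.
per_year = window["Release_Year"].value_counts().sort_values()
busiest_year = int(per_year.idxmax())

print(per_year)
print(busiest_year)
```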
6. How much content has been released throughout the years?
This line graph shows how many TV shows, movies, and programs overall were released each year from 1925 to 2020. As we can see in the graph, the period between 2010 and 2020 saw the highest number of releases.
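The three lines in a graph like this correspond to a table with one row per year and one column each for movies, TV shows, and the total. A minimal sketch of building that table, assuming columns named Type and Release_Year and using made-up rows:

```python
import pandas as pd

# Hypothetical sample; both column names are assumptions.
df = pd.DataFrame({
    "Type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "Release_Year": [2010, 2018, 2018, 2019, 2019, 2019],
})

# Rows are years, columns are content types, cells are counts.
by_year = pd.crosstab(df["Release_Year"], df["Type"])

# Add the overall total as a third column.
by_year["All"] = by_year.sum(axis=1)

print(by_year)
```

Calling `by_year.plot()` on the result draws one line per column, provided matplotlib is installed.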