For our project, we decided to use a Netflix dataset that includes information on TV shows and movies available on Netflix as of 2019. Our data comes from Flixable, a third-party Netflix search engine, and is available on Kaggle. The dataset covers 7787 movies and TV shows on Netflix, each with 12 columns of information. Each row represents one production. The primary variables we will be looking at are the type of content (movie or TV show), the genres, ratings, cast lists, durations, descriptions, and countries of production.
Before beginning our analysis, we will explore these 7 variables to make sure we understand the way they are represented. First, type is a categorical variable with values either “Movie” or “TV Show” representing the type of content. Our data contains 5377 Movies and 2410 TV Shows. We also have genres stored in a variable called listed_in, which gives us a string of all of the genres that each TV show and movie is listed under. Since each production can have many genres, these are separated by commas. Next, rating is a categorical variable that gives us either the Motion Picture Association film rating for movies (G, PG, PG-13, R) or the television content rating for TV shows (TV-Y, TV-Y7, TV-G, TV-PG, TV-14 or TV-MA). The variable cast holds a comma-separated string of cast members. Next, duration holds the length of the content. For movies, duration is the number of minutes in a string of the form “x min”, and for TV shows, duration is the number of seasons in the form “x Seasons”. The next variable, description, holds a string with the description of the content that is shown to viewers on Netflix. Finally, countries holds a comma-separated string of the names of the countries where the content is produced.
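Because several of these variables pack multiple values or units into a single string, they need to be parsed before analysis. The following is a minimal sketch of that step, not the code used for this report. The sample rows are hypothetical, the helper names `split_list` and `parse_duration` are ours, and accepting the singular form “1 Season” is an assumption beyond the “x Seasons” form described above.

```python
import re

# Hypothetical rows that mirror the columns described above (not real records).
rows = [
    {"type": "Movie", "listed_in": "Dramas, International Movies", "duration": "93 min"},
    {"type": "TV Show", "listed_in": "Crime TV Shows, TV Dramas", "duration": "2 Seasons"},
    {"type": "TV Show", "listed_in": "Kids' TV", "duration": "1 Season"},
]


def split_list(value):
    """Split a comma-separated field into a list of trimmed, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_duration(value):
    """Return (amount, unit), where unit is 'min' for movies or 'season' for TV shows."""
    # Assumption: the singular "Season" may also occur, so both forms are accepted.
    match = re.fullmatch(r"\s*(\d+)\s+(min|Seasons?)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    unit = "min" if match.group(2) == "min" else "season"
    return int(match.group(1)), unit


for row in rows:
    print(row["type"], split_list(row["listed_in"]), parse_duration(row["duration"]))
```

Keeping the unit alongside the number avoids accidentally comparing minutes with seasons later in the analysis.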
The first research question we are interested in is how movies and TV shows added to Netflix differ in content. Specifically, we are interested in the differences in their genres and descriptions. Answering these questions can help us understand which medium is more or less frequently used to present certain content.
We first examine the differences between the descriptions of movies and TV shows.
It seems that movies on Netflix tend to be more family-oriented than TV shows, with words such as “girlfriend”, “mother”, “father”, “couple”, “child”, “wife”, and “son” showing up frequently. TV shows seem to be darker in nature, with words such as “crime”, “power”, “drama”, “fight”, “evil”, and “survive” showing up frequently. It seems that producers tend to tell familial, coming-of-age stories in the format of movies, while crime, action, and fantasy-oriented content is more often produced in the format of a TV show.
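One way to surface words that are characteristic of each medium is to compare relative word frequencies between the two groups. The sketch below illustrates the idea under stated assumptions: the descriptions are invented, the stopword list is a small hypothetical one, and the report's figures may have been produced with a different tool and different preprocessing.

```python
import re
from collections import Counter

# Hypothetical descriptions; the dataset's actual text is not reproduced here.
descriptions = {
    "Movie": [
        "A father and son reunite after years apart.",
        "A couple faces a test when a child arrives.",
    ],
    "TV Show": [
        "A detective must fight crime in a city ruled by evil.",
        "Rivals fight for power and struggle to survive.",
    ],
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "in", "by", "for", "to", "of", "after", "when", "must"}


def word_rates(texts):
    """Relative frequency of each non-stopword across a list of texts."""
    words = [
        word
        for text in texts
        for word in re.findall(r"[a-z']+", text.lower())
        if word not in STOPWORDS
    ]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


movie_rates = word_rates(descriptions["Movie"])
tv_rates = word_rates(descriptions["TV Show"])

# Positive values lean toward movies, negative values toward TV shows.
vocabulary = set(movie_rates) | set(tv_rates)
difference = {w: movie_rates.get(w, 0.0) - tv_rates.get(w, 0.0) for w in vocabulary}

# Print the three words that lean most strongly toward TV shows.
for word, diff in sorted(difference.items(), key=lambda item: item[1])[:3]:
    print(f"{word}: {diff:+.3f}")
```

Using rates rather than raw counts matters here because the dataset contains more than twice as many movies as TV shows.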
Next, we examine the differences in genres between movies and TV shows.
From the comparison word cloud, “show” and “kid” are the most commonly used words across both movies and TV shows, suggesting that, out of all productions, kids’ TV series are the most common. The common genres in TV shows also include “docuseries”, “reality”, and “crime”, indicating that TV shows tend to cover a broad range of topics. Additionally, TV shows tend to be more globally diverse, with genres such as “korean”, “spanishlanguage”, and “british” showing up frequently as well. For movies, the most commonly used word is “documentary”, and other popular genres include “music”, “thriller”, “standup”, and “drama”. Like TV shows, movies cover a broad range of topics, and many of these topics are shared between the two mediums.
Besides the content, we also wanted to evaluate the diversity of different productions. Even though Netflix is a US-based company, it is important to have inclusive media representation. For this question, we based “diversity” on the countries of production, as well as the variety in casts.
We first looked at the number of productions from each country.
There are 3296 films from the US, 990 films from India, 722 films from the UK, and 412 films from Canada produced since 2008. In this dataset, most of the films produced were from the US. This makes sense because Netflix is a US company. The UK, India, and Canada are also active producers of Netflix films.
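Counts like these depend on how co-productions are handled, since the country field can list several countries. The sketch below shows one possible convention, in which a production counts once for every country it lists; this is an assumption for illustration, and the values are hypothetical rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical values of the country field; an empty string stands for a missing value.
country_field = [
    "United States",
    "India",
    "United States, Canada",
    "United Kingdom, United States",
    "",
]

counts = Counter()
for value in country_field:
    # Assumption: a co-production counts once for every country it lists.
    countries = {name.strip() for name in value.split(",") if name.strip()}
    counts.update(countries)

for country, n in counts.most_common():
    print(country, n)
```

Under a different convention, such as crediting only the first listed country, the totals for frequent co-producers would be lower.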
Next, we look at the total number of countries represented by the productions added to Netflix on given dates with a time series plot.
It seems that the diversity in countries of production has increased over time as Hollywood pushes for multicultural representation. One observation to note is that around January 2018 and January 2020, Netflix added movies that were produced in up to 18-20 different countries.
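The quantity plotted in such a time series can be sketched as the number of distinct countries among all productions added on the same date. The example below is illustrative only: the records are hypothetical, and it assumes the add dates have already been parsed into `date` objects.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date added, country field) pairs.
records = [
    (date(2018, 1, 1), "United States, Canada"),
    (date(2018, 1, 1), "India"),
    (date(2018, 1, 15), "United States"),
    (date(2020, 1, 1), "South Korea, United States"),
]

# Collect the set of countries seen on each add date.
countries_by_date = defaultdict(set)
for added, country_field in records:
    countries_by_date[added].update(
        name.strip() for name in country_field.split(",") if name.strip()
    )

# One point per date: how many distinct countries are represented.
series = sorted((added, len(names)) for added, names in countries_by_date.items())
for added, n in series:
    print(added.isoformat(), n)
```

Using a set per date ensures that a country appearing in several productions on the same day is counted only once.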
Lastly, we look at the names of cast members as a way to represent diversity of nationalities.
The above word cloud of cast members can give us a sense of the nationalities of the cast members of movies and TV shows on Netflix. We can see that the most common names are European names such as Michael, John, and James. We also see some names that are traditionally used by people from South Asia, such as Sharma, Singh, and Kapoor. This lines up with the heat map above of where TV shows and movies are produced, which shows that most of the content on Netflix is produced in the United States, the UK, and India.
Earlier, we pointed out that Netflix is a US-based company. We wonder if this affects the process of adding different productions to Netflix, especially international films. More specifically, do TV shows and movies produced in the US get added faster?
Table 1: Mean difference in release and add date (years) by country and film type
| | Movies | TV shows |
|---|---|---|
| US | 6.064324 | |
| Non-US | 5.453139 | |
Table 2: Variance of difference in release and add date (years) by country and film type
| | Movies | TV shows |
|---|---|---|
| US | 0.098 | |
| Non-US | 0.061 | |
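Group summaries of this kind can be sketched as follows. This is a simplified illustration rather than the computation behind the tables: the records are hypothetical, the lag is measured in whole years (the reported values suggest full dates were used), and how a multi-country production is classified as US or non-US is left as an assumption encoded in the `is_us` flag.

```python
from collections import defaultdict
from statistics import mean, variance

# Hypothetical records: (type, produced in US?, release year, year added).
records = [
    ("Movie", True, 2010, 2017),
    ("Movie", True, 1975, 2019),
    ("Movie", False, 2015, 2018),
    ("Movie", False, 2012, 2019),
    ("TV Show", True, 2018, 2018),
    ("TV Show", True, 2016, 2017),
    ("TV Show", False, 2014, 2018),
    ("TV Show", False, 2017, 2019),
]

# Group the release-to-add lag by content type and origin.
lags = defaultdict(list)
for kind, is_us, released, added in records:
    origin = "US" if is_us else "Non-US"
    lags[(kind, origin)].append(added - released)

for group, values in sorted(lags.items()):
    # statistics.variance is the sample variance (n - 1 in the denominator).
    print(group, f"mean={mean(values):.2f}", f"variance={variance(values):.2f}")
```

Even in this toy data, a single decades-old title pulls the US movie mean far above the others, which is the kind of skew that long-delayed additions can introduce.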
The distributions of the time in years between the release and add date appear similar for movies produced in the US and movies produced elsewhere. However, there are some differences in the distributions for TV shows. Around 0 years after the release date, the density for US TV shows is nearly 3 times higher than the density for non-US TV shows. This shows that a larger share of US TV shows than non-US TV shows is added to Netflix soon after release, which could be because Netflix is a US company. The density also decreases more quickly for US TV shows than for non-US TV shows after the initial peak. However, for both movies and TV shows, some productions are added to Netflix many years after their release date (over 50 years). This could be because some well-known older US movies are in high demand, which led Netflix to add them. These additions of older movies skew the mean difference between add date and release date for US films, which is a possible explanation for why US movies take longer to be added on average.
Even though the distributions of the time in years between the release and add date appear similar for US and non-US movies, the Kolmogorov–Smirnov (KS) test shows that the distributions are different. The distributions for US and non-US TV shows are also statistically different. The distributions also have different variances. The mean difference between the release and add date is statistically significantly different for movies, but not for TV shows. The following table shows the results from the statistical tests.
Table 3: Statistical Significance Tests Results for US and Non US Films (p-values)
| | Movies | TV shows |
|---|---|---|
| KS test | < 2.2e-16 | |
| F-test for variance | | |
| t-test for difference in means | | 0.2104 |
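To make the three tests concrete, the sketch below computes their test statistics from scratch on two small hypothetical samples. It is an illustration under assumptions, not the procedure used for the table: the p-values are omitted, the helper names are ours, and the t statistic shown is Welch's unequal-variance version, which may or may not match the variant used in this analysis.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def variance_ratio(sample_a, sample_b):
    """F statistic: ratio of the two sample variances."""
    return variance(sample_a) / variance(sample_b)


def welch_t(sample_a, sample_b):
    """Welch's t statistic for a difference in means with unequal variances."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical lags in years between release and add date.
us = [0, 0, 1, 1, 2, 8]
non_us = [1, 2, 3, 4, 5, 9]

print(f"KS D = {ks_statistic(us, non_us):.3f}")
print(f"F = {variance_ratio(us, non_us):.3f}")
print(f"Welch t = {welch_t(us, non_us):.3f}")
```

The KS statistic compares entire distributions, so it can flag a difference in shape even when the means are close, which is consistent with the tests disagreeing in the way described above.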
Now that we understand the differences between movies and TV shows, we will focus our investigation on movies and explore how they change over time. Specifically, we will explore the changes in the proportion of R-rated movies and in cast size throughout the years. This will give us insights on movie trends for the future.
We wonder whether, as people have become less conservative regarding media over time, an increasing proportion of the movies produced have been R-rated.
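The quantity of interest here is, for each release year, the share of movies whose rating is R. A minimal sketch of that calculation, using hypothetical (release year, rating) pairs rather than the dataset itself:

```python
from collections import defaultdict

# Hypothetical (release year, rating) pairs for movies only.
movies = [
    (1965, "G"), (1965, "PG"),
    (1995, "R"), (1995, "R"), (1995, "PG-13"),
    (2018, "R"), (2018, "PG-13"), (2018, "PG"), (2018, "PG"),
]

totals = defaultdict(int)
r_rated = defaultdict(int)
for year, rating in movies:
    totals[year] += 1
    if rating == "R":
        r_rated[year] += 1

# Share of each year's movies that are rated R.
proportion_r = {year: r_rated[year] / totals[year] for year in sorted(totals)}
for year, share in proportion_r.items():
    print(year, f"{share:.2f}")
```

Years with very few movies produce unstable proportions, which is one plausible source of outliers in the early part of such a series.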
This plot answers the above question by showing the proportion of R-rated movies against release year. The proportion of R-rated movies stayed low, with a few outliers, until about 1970. It then increased sharply, peaking around 1995, and has decreased steadily in recent years. So while a larger proportion of movies released in recent years are R-rated than before approximately 1960, that proportion has been falling since its peak.
Next, we look at whether there is a relationship between cast size and movie duration.
From the graph, we see that most movies have a cast size of fewer than 20 people. As the cast size increases, the movie duration tends to increase slightly. Furthermore, some movies from the 2000s and 2010s had bigger casts of 30-40 people. Since only a few movies have casts this large, they can be considered outliers.
To better understand the general trend, we investigate the graph without the outliers.
In this graph, we eliminated cast sizes greater than 25 so that we can examine the relationship without the influence of outliers. There appears to be a slight positive relationship between cast size and movie duration; that is, a bigger cast is associated with longer movies. Furthermore, movies from the 1970s-2000s tend to have medium cast sizes of around 10-15 people, whereas movies from the 2010s have a wide range of cast sizes, from 0 to 25 people.
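The filtering step and the strength of the relationship can be sketched as below. The movies are hypothetical, cast size is taken as the number of non-empty names in the comma-separated cast string, and the Pearson correlation is our choice of summary for illustration; the report itself relies on the graph rather than a reported coefficient.

```python
from math import sqrt

# Hypothetical (cast field, duration in minutes) pairs for movies.
movies = [
    ("A, B, C", 85),
    ("A, B, C, D, E, F", 98),
    ("A, B, C, D, E, F, G, H, I", 110),
    ("A", 70),
    ("A, B, C, D, E, F, G, H, I, J, K, L", 105),
    (", ".join(f"Actor {i}" for i in range(35)), 150),  # large-cast outlier
]

MAX_CAST = 25  # cutoff described in the text for the outlier-free graph


def cast_size(cast_field):
    """Number of non-empty names in a comma-separated cast string."""
    return len([name for name in cast_field.split(",") if name.strip()])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


points = [(cast_size(cast), minutes) for cast, minutes in movies]
kept = [(size, minutes) for size, minutes in points if size <= MAX_CAST]
sizes, durations = zip(*kept)

print(len(points) - len(kept), "outlier(s) removed")
print(f"r = {pearson(sizes, durations):.2f}")
```

Comparing the coefficient with and without the cutoff is a quick way to check how much a handful of very large casts drives the apparent trend.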
From our investigation, it seems that TV shows and movies share many of the same genres with TV shows containing more diverse productions compared to movies. Logically, it seems that some genres fit the medium of a TV show better than a movie or vice versa: crime, action and fantasy-related content may often be told over a long storyline fit for multiple seasons of a TV show, whereas a family drama can get resolved within the 1-2 hour limits of a movie.
Next, larger cast sizes seem to be associated with longer movie durations. Furthermore, more recent movies tend to have a wider range of cast sizes than older movies, which tend to have small and medium-sized casts. We also see that while the proportion of R-rated movies in the past few decades is clearly higher than it was before 1960, it has declined steadily over the past 20 years or so.
Furthermore, most Netflix productions are still very US/Western centric, but there has been a recent increase in national and ethnic diversity. Specifically, there seems to be a notable increase in South Asian representation over time.
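Lastly, most of the movies in this dataset are from the US. On average, movies produced in the United States take longer to be added to Netflix than those from other countries (about 6.1 versus 5.5 years after release), possibly because of older US titles added long after their release. For TV shows, a larger share of US productions is added soon after release, although the mean difference between US and non-US TV shows is not statistically significant.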
In our current investigation, we examined the common actors and genres of the movies, but we did not explore how these factors can change over time. In future work, we can see which actors and genres were popular during certain time periods and use this information to predict which actors/genres will be popular in the future. This information can be useful for estimating the future profits for Netflix since Netflix can invest in these genres. This information can also be useful to ensure that Netflix is actively investing in increased representation in media. Since Netflix is a widely used media platform, it is important to analyze these factors to provide the best experience for viewers.