Netflix is a subscription streaming service and production company. Founded in 1997, it currently has 221 million users worldwide.
This data set consists of metadata on all TV shows and Movies on the streaming service. Data is updated every month and contains the following variables
We will conduct a series of analyses and visualizations on this data to investigate Netflix’s overall release trends and how the COVID-19 pandemic may have affected these trends.
We begin with a general univariate analysis of the distribution of Netflix Movies vs. Netflix TV Shows.
We note that for the categorical “type” variable (which classifies each release as a ‘Movie’ or ‘TV Show’), there are about twice as many movies as TV shows. Additionally, the faceted histogram of release year shows similar left-skewed, unimodal distributions for both types, which start at around 2015 and have their mode at 2019.
We then analyze the distribution based on whether the show/movie was added pre- or post-Covid. To do this, we facet on a new binary variable called “precovid binary”. This variable indicates whether a particular movie or show was added to the streaming service before or after March 11, 2020 (the date the World Health Organization declared Covid-19 a global pandemic).
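For illustration, a flag of this kind can be sketched in Python as follows. This is a minimal sketch, not the code behind the report: it assumes the date-added field is a string such as "September 25, 2021", it treats titles added on the declaration date itself as post-Covid, and the names `PANDEMIC_DECLARED` and `precovid_binary` are hypothetical.

```python
from datetime import date, datetime

# Date the WHO declared Covid-19 a pandemic (March 11, 2020).
PANDEMIC_DECLARED = date(2020, 3, 11)

def precovid_binary(date_added: str) -> bool:
    """Return True if a title was added strictly before the declaration date.

    Assumes `date_added` looks like "September 25, 2021" (an assumption
    about the data set's format); surrounding whitespace is ignored.
    """
    added = datetime.strptime(date_added.strip(), "%B %d, %Y").date()
    return added < PANDEMIC_DECLARED

# Usage on made-up values
print(precovid_binary("March 10, 2020"))      # True
print(precovid_binary("September 25, 2021"))  # False
```

Whether the cutoff day itself counts as pre- or post-Covid is a design choice; the sketch puts it in the post-Covid group.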
After faceting on precovid binary, a brief observation reveals that post-pandemic counts appear lower than the increasingly high counts that occurred in pre-pandemic years.
We then examine the relationship between the release year of Netflix media and the date it was added to the Netflix platform, using basic clustering.
Here, the solid black line is the line y=x (i.e. dots on the black line represent shows/movies that were added to Netflix in the same year they were released). Almost all of the data lies above this line, which makes sense because most media is added to Netflix after its initial release date. The single cluster marks the highest-density region of the scatterplot and indicates that most media released in 2015-2020 was added between 2019 and 2021.
We conclude the EDA with a statistical test that will motivate the remainder of the visualizations. We conduct the following Pearson chi-squared test to check whether the type and precovid binary variables are independent:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Pearsons
## X-squared = 25.84, df = 1, p-value = 3.708e-07
From the Pearson chi-squared test output, the observed test statistic is 25.84 and the p-value is 3.708e-07. Since the p-value is less than an alpha of 0.05, we reject the null hypothesis of independence in favor of the alternative: the two variables are not independent.
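As a check on the test output above, the Yates-corrected statistic for a 2x2 table and its p-value on 1 degree of freedom can be reproduced by hand. The Python sketch below is illustrative only: the counts in the example table are made up (the report does not list its contingency table), and the function names are hypothetical. It relies on the fact that, for 1 degree of freedom, the upper-tail chi-squared probability equals erfc(sqrt(x/2)).

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_yates_2x2(table):
    """Pearson chi-squared test with Yates' continuity correction for a 2x2 table.

    `table` is [[a, b], [c, d]] of observed counts.
    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Yates' correction shrinks |ad - bc| by n/2, never below zero.
    diff = max(abs(a * d - b * c) - n / 2.0, 0.0)
    stat = n * diff ** 2 / (row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1])
    return stat, chi2_sf_df1(stat)

# Made-up counts, NOT the report's data
stat, p = chi2_yates_2x2([[10, 20], [30, 40]])
print(round(stat, 4))  # 0.4464

# p-value implied by the statistic printed in the test output above
print(chi2_sf_df1(25.84))  # roughly 3.7e-07
```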
This has interesting implications. The fact that there is some relationship between media type (show vs. movie) and whether that media was added before or after the onset of Covid suggests that the pandemic has had some observable impact on Netflix trends. The remainder of the report looks into three trends in order to analyze the potential impact of COVID. Specifically, we aim to answer the following research questions:
The research question that we are interested in for this part is “How has COVID-19 affected the geographical distribution of the producers of Netflix TV shows/movies?”
To understand and compare the spatial distribution of production pre-covid and post-covid, we group the data by country and precovid binary. We want to investigate how the producers of Netflix shows/movies are distributed around the world before and after covid. Therefore, we map each country’s proportion of total production pre- and post-covid. In other words, we are interested in how the distribution of producers changed after the outbreak of COVID-19.
The graph compares the spatial distributions of the number of productions by country before and after the outbreak of covid. Countries colored in grey have no productions added to Netflix at all; in other words, their total number of productions is 0. Countries with a darker color have a larger number of productions added to Netflix. We set the midpoint of the color gradient equal to the median of the proportion, which is 0.004504505.
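The grouping and the gradient midpoint described above can be sketched in Python. This is a minimal sketch under stated assumptions: "proportion" is read as each country's share of all titles within a period (pre- or post-covid), each title is credited to a single country, and the function name `production_shares` and the example rows are hypothetical.

```python
from collections import Counter
from statistics import median

def production_shares(rows):
    """Share of each period's titles produced by each country.

    `rows` is an iterable of (country, precovid) pairs, one per title.
    Returns {(country, precovid): share within that period}.
    Assumption: one country per title; titles listing several countries
    would need to be split before calling this.
    """
    rows = list(rows)
    per_cell = Counter(rows)                                # titles per (country, period)
    per_period = Counter(precovid for _, precovid in rows)  # titles per period
    return {key: n / per_period[key[1]] for key, n in per_cell.items()}

# Made-up rows, NOT the report's data
rows = (
    [("United States", True)] * 3
    + [("India", True)]
    + [("United States", False)] * 2
    + [("India", False)] * 2
)
shares = production_shares(rows)
print(shares[("United States", True)])  # 0.75

# Midpoint for a diverging color gradient: the median share
midpoint = median(shares.values())
print(midpoint)  # 0.5
```

Using the median as the midpoint keeps half of the plotted shares on each side of the gradient's center, which helps when one country's share is far larger than the rest.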
Looking at the graph, we can see how the spatial distribution changed after the outbreak of covid. First, the countries that dominated Netflix production remained dominant. For example, the US is colored dark blue in both the pre-COVID plot and the post-COVID plot. Similarly, India’s shade of blue stayed the same in the post-COVID plot. There are also countries, such as Iran, that had productions before the outbreak of covid but none after. Most of these countries are located in Central Asia, West Asia and Europe. Conversely, some countries in Northern and Southern Africa, for example Algeria, had no productions before covid but had productions added to Netflix after covid.
To better understand how the spatial distribution has changed, we animate the number of productions by country over time. The figure below shows how the total number of movies/shows added to Netflix by country has changed from year to year. The displayed year is included in the plot title. As in the graph above, countries with a darker color have a larger number of productions added to Netflix.
From the above visual, we can see that the total number of shows released by country per year gradually increased from 2010 to 2020. The US’s color darkened significantly and is much darker than that of other regions, which shows its dominance. The number of countries releasing films/shows on Netflix also increased. From 2020 onward, the most obvious change is that the total number of productions decreased dramatically in the US. However, given that many other countries stopped releasing shows on Netflix and that the number of productions is decreasing in general, the US continues to dominate the market.
Conclusions: Overall, the spatial distribution of production on Netflix did not change substantially after the outbreak of COVID-19. The dominant producers remained dominant, while some countries started and others stopped releasing shows on Netflix. However, there was a decrease in the total number of productions across countries following the outbreak of COVID-19.
To figure out how Covid has affected the content of Netflix’s Movies/TV Shows, we look into the descriptions of the movies and TV shows using text analysis.
After removing stop words and performing stemming on the ‘description’ of the Netflix TV shows and movies, we analyzed the overall sentiment to get an aggregate measure of how “positive” or “negative” the descriptions were before and after Covid. From our proportional bar chart, we see that the proportions of positive and negative words in the descriptions were about the same before and after Covid.
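A lexicon-based sentiment proportion of this kind can be sketched in Python. This is a simplified illustration, not the report's pipeline: the stop-word list and sentiment lexicon below are tiny hypothetical stand-ins for the real resources, the stemming step is omitted, and the function name `sentiment_shares` is an assumption.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "his", "her"}
LEXICON = {
    "love": "positive", "hero": "positive", "happy": "positive",
    "murder": "negative", "war": "negative", "lost": "negative",
}

def sentiment_shares(descriptions):
    """Share of positive vs. negative lexicon words across all descriptions.

    Words are lower-cased, stop words are dropped, and words missing from
    the lexicon are ignored. Stemming is intentionally left out here.
    """
    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in STOP_WORDS:
                continue
            label = LEXICON.get(word)
            if label is not None:
                counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Made-up descriptions, NOT taken from the data set
example = ["A hero finds love in the war.", "A happy family is lost."]
print(sentiment_shares(example))  # {'positive': 0.6, 'negative': 0.4}
```

Running the same function separately on pre-Covid and post-Covid descriptions gives the two sets of proportions that a proportional bar chart would compare.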
In addition, to get an idea of the common words expressing sentiment before and after Covid, we also created word clouds separating out the positive and negative words. Looking at the word clouds above, we see that most of the negative and positive words used before Covid were also being used to describe movies and TV shows that were uploaded after Covid.
Conclusions: From both graphics, we conclude that, based on text sentiment analysis, there was no noticeable change in the descriptions of the movies and TV shows after Covid.
A broader question we would like to analyze is whether or not there have been visible trends in the types of movies/shows that Netflix has released over time, and whether or not the Covid-19 pandemic has had any effect on these trends.
First, we take a look at Netflix releases based on age rating over time.
In this graph, the vertical grey line denotes the date the World Health Organization declared Covid-19 a global pandemic. Although the ratings of TV Shows do not seem to be affected by this event, there is a notable drop in the average number of Movie releases with ratings of PG and up coinciding with this date. In terms of the marginal distribution of ratings, it is consistent across both Movies and TV Shows that Netflix releases more media with higher age ratings, with ‘Mature/Rated R’ and ‘PG-13/TV-14’ rated media consistently having higher release counts than the rest. In terms of trends over time, the number of releases of Mature-rated TV Shows has been visibly increasing since 2017. For all other categories, there also appears to have been a slight increase in releases over time, up until the dip that occurs a little after 2020.
Now, we take a look at Netflix releases categorized by genre tags over time.
This graph shows the rolling average of releases for the top 5 Netflix tags over time (with the remaining tags still visible but in grey). As before, the vertical grey line denotes the start of the Covid-19 pandemic. As with the age ratings, Netflix Movies show a notable drop in the number of releases for almost all categories around the start of the pandemic; however, this drop does not appear to occur for TV Shows. For Netflix Movies, there appears to be a gap separating the top 3 tags from the rest, and just before 2020 their release counts converge before dropping. For TV Shows, there is also a notable gap between ‘international’ tagged shows and the rest. As before, most tags for both Movies and TV Shows appear to increase gradually, although Movies appear to peak in 2019 and TV Shows around 2020.
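For readers unfamiliar with the smoothing used in such plots, a trailing rolling average can be sketched in Python as below. The window length and the time unit of the counts are not stated in the report, so the values here are assumptions, and the function name `rolling_mean` is hypothetical.

```python
def rolling_mean(values, window):
    """Trailing moving average over `window` consecutive points.

    Positions that do not yet have `window` points behind them are None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the point leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out

# Made-up release counts per period, NOT the report's data
print(rolling_mean([2, 4, 6, 8], 2))  # [None, 3.0, 5.0, 7.0]
```

A longer window gives a smoother curve but delays and flattens abrupt changes, such as a drop around the start of the pandemic.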
Conclusions: In general, the ordering of the most common Netflix tags and age ratings stayed the same from 2017 to 2021. In addition, although the COVID-19 pandemic seems to have had no effect on TV Show releases, there is a noticeable drop in Movie releases coinciding with the time that COVID-19 was declared a pandemic.
This report covers many Netflix release trends. However, we recognize that the data set is not complete. For instance, many shows concluded production before the COVID-19 pandemic, and thus their release dates and dates added were not strongly affected by the pandemic. Additionally, this research can be improved as more post-COVID data is added to the data set. Finally, a significant set of variables involving user data (views, subscription numbers, etc.) is not in this data set. A study that tracks these variables could supplement our findings on the pandemic’s effect on overall Netflix production and usage trends, and is likely to reveal further pre/post-pandemic differences.