Describe the Data

Netflix is an extremely popular streaming service used across the globe. Subscriptions to the platform are increasing as more and more titles are added to its library. The dataset from Kaggle provides information about the thousands of TV shows and movies that are available to watch on Netflix. The following variables are included in the dataset:

Main Research Questions

Our main research questions include:

We wanted to learn which words are most common in Netflix descriptions overall, and whether the descriptions for TV shows and movies differ. This suggests that we look at the type and description variables.

To show this, we first look at the description variable overall.

The word cloud shows an emphasis on subject matter. Words such as life, world, and family are all heavily used in these descriptions. These findings back up our intuition that a description mostly summarizes what the title is about.

Next we will look at the description variable split by type to view specific descriptions for movies and TV shows.
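The counting behind a word cloud can be sketched as follows. This is a minimal illustration rather than the code used for the figure: the stop-word list and the sample descriptions are made up, and a real word cloud tool would filter far more words.

```python
import re
from collections import Counter

# Hypothetical stop-word list, much shorter than a real one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
              "his", "her", "with", "for", "is", "as", "their"}

def word_counts(descriptions):
    """Count lowercase words across descriptions, skipping stop words."""
    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

# Made-up descriptions for illustration only, not rows from the dataset.
sample = [
    "A family moves across the world to start a new life.",
    "Her life changes when the family business fails.",
]
counts = word_counts(sample)
print(counts.most_common(3))
```

The size of each word in the cloud would then be driven by these counts.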

The above graph compares the words used in the descriptions of movies and TV shows. For movies, there are a lot of words that describe a person, such as man, women, son, father, girlfriend, daughter, wife, and mother. TV show descriptions, by contrast, lean toward words that name the genre, for example adventure, docuseries, crime, anime, mystery, series, and reality. This shows that a movie description tends to focus on the people involved and who the main characters are, while a TV show description focuses more on what the show is about and its genre.
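One way to find words that are distinctive for one type is to compare relative frequencies between the two groups. The sketch below assumes this simple difference-of-frequencies approach, which may not match how the comparison graph was built, and uses made-up descriptions. A real analysis would also drop stop words first.

```python
import re
from collections import Counter

def relative_freq(descriptions):
    """Map each word to its share of all words in the given descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def distinctive_words(group_a, group_b, top_n=3):
    """Words used relatively more often in group_a than in group_b."""
    freq_a, freq_b = relative_freq(group_a), relative_freq(group_b)
    diff = {word: freq_a[word] - freq_b.get(word, 0.0) for word in freq_a}
    return sorted(diff, key=diff.get, reverse=True)[:top_n]

# Made-up descriptions for illustration only, not rows from the dataset.
movies = ["A father searches for his daughter.",
          "A man and his wife reunite with their daughter."]
shows = ["This crime docuseries follows a detective.",
         "An anime series about a mystery."]

print(distinctive_words(movies, shows, top_n=2))
print(distinctive_words(shows, movies, top_n=5))
```

Words shared equally by both groups, such as "a" here, fall to the bottom of the ranking.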

We also wanted to look at how rating would affect other attributes of TV shows and movies. However, TV shows and movies have separate naming conventions for identical age-appropriateness (for example, “PG” and “TV-PG”). So to consolidate the rating variable, we created 5 different categories of decreasing family-friendliness (G, PG, PG-13, R, then X).

Our “G” rating consists of titles with ratings of G, TV-G, TV-7, TV-Y7, and TV-Y7-FV. Our “PG” rating consists of PG and TV-PG. Our “PG-13” rating groups together PG-13 and TV-14. Our “R” rating consists of R and TV-MA. Finally, our “X” rating consists of Unrated, Not Rated, and NC-17.
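The grouping above can be written as a lookup table. This is a sketch of the mapping as described in the text, not the code used in the analysis. The raw labels are spelled as they appear above, which is an assumption: the raw file may abbreviate some of them, in which case the keys would need adjusting.

```python
# Consolidated category -> raw labels, copied from the description above.
RATING_GROUPS = {
    "G": ["G", "TV-G", "TV-7", "TV-Y7", "TV-Y7-FV"],
    "PG": ["PG", "TV-PG"],
    "PG-13": ["PG-13", "TV-14"],
    "R": ["R", "TV-MA"],
    "X": ["Unrated", "Not Rated", "NC-17"],
}

# Invert the table so each raw label points to its consolidated category.
RAW_TO_GROUP = {raw: group
                for group, raws in RATING_GROUPS.items()
                for raw in raws}

def consolidate_rating(raw_rating):
    """Return the consolidated category, or None for missing or unmapped labels."""
    if not raw_rating:
        return None
    return RAW_TO_GROUP.get(raw_rating.strip())

print(consolidate_rating("TV-MA"))
print(consolidate_rating("TV-14"))
```

Returning None for unmapped labels keeps unexpected values visible instead of silently assigning them to a category.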

In this graph, we first notice the distribution of release year given type. Release year has a longer tail toward earlier years (it is more left-skewed) and typically has a larger spread for movies than for TV shows. This means that, on average, movies are older than TV shows. Contextually, this makes sense, given that movies were available to public audiences long before TV shows.
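The center and spread that a boxplot displays can also be computed directly. The sketch below uses the median and interquartile range from Python's statistics module. The release years are made up for illustration, and the quartile method is the module's default, which may differ slightly from the one used by the plotting software.

```python
from statistics import median, quantiles

def spread_summary(years):
    """Return (median, interquartile range) for a list of release years."""
    q1, _, q3 = quantiles(years, n=4)
    return median(years), q3 - q1

# Made-up release years for illustration only, not values from the dataset.
movie_years = [1975, 1998, 2010, 2016, 2019]
show_years = [2012, 2016, 2018, 2019, 2020]

movie_median, movie_iqr = spread_summary(movie_years)
show_median, show_iqr = spread_summary(show_years)
print("Movie", movie_median, movie_iqr)
print("TV Show", show_median, show_iqr)
```

A lower median and a larger interquartile range for movies would correspond to the pattern described above.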

Looking at the release year given rating, however, doesn’t yield many obvious associations. The distributions look nearly identical across all ratings. The only slight exception is rating X, which appears to have a narrower spread than the other ratings, indicating that titles with this rating are concentrated among newer releases rather than older ones.

The next attribute we look at is the duration variable given rating and type. This allows us to see how the distributions of duration for TV shows and movies differ based on the rating.

This plot shows the distribution of duration given the type and rating. For TV shows, duration is the number of seasons, and most of the TV shows on Netflix run for only one season. The distribution is right-skewed, with a maximum of about 10 seasons. This is interesting because it suggests that many shows are not renewed, possibly because they are not popular. The most common TV show ratings are R and PG-13. We can also see that for G-rated TV shows the drop from one season to two seasons is not as drastic as for the other ratings, so G-rated TV shows look more likely to be renewed than shows with other ratings on Netflix.

For movies, duration is roughly normally distributed, centered around 100 minutes. This makes sense, as a typical movie runs about an hour and 40 minutes. The largest number of movies are rated R. What is interesting to see is that the distribution for G-rated movies, which are aimed at young children, is shifted to the left, peaking around the 60-80 minute mark. This makes sense given that children have shorter attention spans.
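Because duration is measured in minutes for movies and in seasons for TV shows, the two have to be separated before plotting. The sketch below assumes the raw values are strings such as "90 min" or "2 Seasons". That format is an assumption about the file, and the function name is ours for illustration.

```python
import re

def parse_duration(text):
    """Split a duration string into (value, unit), where unit is 'min' or 'season'.

    Returns None when the string does not match the assumed format.
    """
    match = re.fullmatch(r"\s*(\d+)\s+(min|Seasons?)\s*", text)
    if match is None:
        return None
    value = int(match.group(1))
    unit = "min" if match.group(2) == "min" else "season"
    return (value, unit)

print(parse_duration("90 min"))
print(parse_duration("1 Season"))
print(parse_duration("3 Seasons"))
```

With the unit separated out, movies and TV shows can each get their own facet and axis scale.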

Next, we want to see whether the number of releases for each rating differs over time for movies and TV shows. We will use a time series plot for both movies and TV shows.

The faceted plots above compare TV show and movie releases on Netflix over the past 100 years, conditional on the rating. The trends for each of the 5 ratings remained fairly constant until around the 2000s, when the number of releases increased for both movies and TV shows. This is likely because Netflix was founded in 1997, so its library includes more media released from that period onward. We can see a drastic increase in the number of releases of R-rated TV shows and movies compared to the other ratings. There is a decrease at the end of the time series, which most likely reflects the pandemic and fewer new releases than in the previous year. Overall, the largest differences across ratings are for R-rated media, in both movies and TV shows. For movies, PG-13 also shows a large jump in comparison to the other ratings. This could be because PG-13 and R-rated movies are more popular than other ratings, so Netflix favors that type of media.
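The counts behind a faceted time series like this one can be sketched as a tally per type, rating, and release year. The field names and the rows below are assumptions for illustration, not values from the dataset.

```python
from collections import Counter

def releases_per_year(titles):
    """Count titles for each (type, rating, release_year) combination."""
    return Counter((t["type"], t["rating"], t["release_year"]) for t in titles)

# Made-up rows for illustration only; ratings are the consolidated categories.
titles = [
    {"type": "Movie", "rating": "R", "release_year": 2018},
    {"type": "Movie", "rating": "R", "release_year": 2018},
    {"type": "Movie", "rating": "PG-13", "release_year": 2018},
    {"type": "TV Show", "rating": "R", "release_year": 2019},
]

counts = releases_per_year(titles)
for key in sorted(counts):
    print(key, counts[key])
```

Each facet would then plot one type, with one line per rating and release year on the horizontal axis.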

To build our intuition on how content has changed over time, we examine the type of content Netflix has added over the years and see whether there are any major differences between the dates movies and TV shows were added.

From the graph, we can see that Netflix favors adding movies over TV shows by a significant margin. Movies make up almost 70% of total content on Netflix. There has been a constant increase in content added over the years. However, there has been a decrease in content added in 2021, likely due to the pandemic and less content being produced last year.
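The share of each type and the number of titles added per year can be sketched as below. The field names and the date format (for example "September 25, 2021") are assumptions about the raw file, and the rows are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def type_share(titles):
    """Return the fraction of titles belonging to each type."""
    counts = Counter(t["type"] for t in titles)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

def added_per_year(titles):
    """Count titles for each (year added, type) pair."""
    counts = Counter()
    for t in titles:
        # Assumes dates are written like "September 25, 2021".
        year = datetime.strptime(t["date_added"].strip(), "%B %d, %Y").year
        counts[(year, t["type"])] += 1
    return counts

# Made-up rows for illustration only, not values from the dataset.
titles = [
    {"type": "Movie", "date_added": "September 25, 2021"},
    {"type": "Movie", "date_added": "January 1, 2020"},
    {"type": "TV Show", "date_added": "March 14, 2020"},
]

shares = type_share(titles)
yearly = added_per_year(titles)
print(shares)
print(sorted(yearly.items()))
```

The yearly counts are what a stacked histogram colored by type would display.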

Main Conclusion

In our research, we aimed to answer three main questions. First, we wanted to learn about the description variable and how it differs between TV shows and movies. To learn more, we created word clouds of the description variable, first alone, then split by the type of Netflix title. The latter was especially important, as it showed that movie descriptions tend to be more character-oriented, whereas TV show descriptions are more genre-focused.

Next we wanted to learn how the rating variable relates to other variables such as type (TV show vs. movie), duration, and release year. We consolidated the rating variable into 5 different categories of decreasing family-friendliness. From a series of side-by-side boxplots, we found that movies are older than TV shows on average, which makes sense in context. We also looked at the duration variable given rating and type via stacked histograms, faceted by type. Here we saw that TV shows typically have very short lifespans (only 1-2 seasons). We also saw that more family-friendly titles are often shorter, which makes sense given the short attention span of younger audiences.

Finally, we wanted to learn about how TV shows and movies changed over the years. First we viewed a stacked histogram, colored by type. This showed us that Netflix favors adding movies over TV shows by a significant margin (70% versus 30%, in favor of movies). The influx of new titles also slowed significantly in 2020, likely due to the COVID-19 pandemic. Finally, we see that over the years, PG-13 and R-rated titles were always the most commonly added to Netflix, and in recent years that margin has grown sharply.