Cinematic productions have been a prominent form of entertainment for decades. While films are mainly made to entertain, they also reflect and shape trends in how entertainment is produced and presented. We investigate specific film genres to see how much, or how little, movie production and trends have changed. These changes could be tied to the prominent events of each period. From the Cold War in the 1980s to the recession of the 2000s, these historical events had a large impact on society and possibly on the media it consumes. The data also offer a chance to explore the evolution of films from a technological standpoint.
The dataset used in this report was obtained from Kaggle and was originally scraped from IMDb. It is a collection of movies released from 1980 to 2020, and the report’s initial focus is movie revenue.
The original variables in the dataset are as follows:
budget
: the budget of a movie. Some movies don’t have
this, so it appears as 0.
company
: the production company
country
: country of origin
director
: the director
genre
: main genre of the movie
gross
: revenue of the movie
name
: name of the movie
rating
: rating of the movie (R, PG, etc.)
released
: release date (YYYY-MM-DD)
runtime
: duration of the movie
score
: IMDb user rating
star
: main actor/actress
votes
: number of user votes
writer
: writer of the movie
year
: year of release
To answer our main questions, our analysis focuses primarily on movies released in the United States between 1980 and 2019 that fall under the top ten genres. To do this, we created a new dataset called movies_clean, from which we filtered out any films released outside our 1980-2019 range as well as movies released outside the US. Within this dataset, new variables (listed below) were created to support our analysis. Another dataset, movies_top_genres, was then made by filtering movies_clean to keep only the ten genres with the highest number of movies.
The new variables of interest added are as follows:
date.released
: numerical release date
decade
: decade of release (80s, 90s, etc.)
month
: month of release (01, 02, 03, etc.)
season
: season of release (Winter, Spring, Summer,
Fall)
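The derived variables above can be sketched in a few lines. This is a minimal illustration in Python, not the code used for the report: the function name `derive_release_fields` is hypothetical, the season mapping follows the month-based definition given later in the report, and the decade label is written as "1980s" to match the tables.

```python
from datetime import date

# Month-to-season mapping, following the report's definition
# (winter = Dec-Feb, spring = Mar-May, summer = Jun-Aug, fall = Sep-Nov).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def derive_release_fields(released: str) -> dict:
    """Derive decade, month and season from a YYYY-MM-DD release date.

    Hypothetical helper: the name and the returned keys are assumptions
    made for this sketch.
    """
    d = date.fromisoformat(released)
    return {
        "decade": f"{d.year // 10 * 10}s",   # e.g. 1987 -> "1980s"
        "month": f"{d.month:02d}",           # zero-padded, e.g. "06"
        "season": SEASON_BY_MONTH[d.month],
    }


fields = derive_release_fields("1980-06-13")
```

Keeping the mapping in one dictionary makes the season definition easy to change if a different convention is preferred.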
Question 1: What genre of movies made the most money in each
decade?
Question 2: Do certain genres perform better in different seasons?
Question 3: How much has changed in the production of animated
movies?
From the bar graph above, we can see genres such as Action and Animation increase as time moves on, while Comedy and Crime have seen a decrease. These changes made us interested in exploring genres and changes throughout time in general.
To evaluate the profitability of movies, we look at two variables in the dataset, “gross” and “budget”. We consider profitability from two perspectives: the absolute profit and the profit ratio. We therefore introduce two new variables, “profit” and “ratio”, where “profit” is “gross” minus “budget” and “ratio” is “gross” divided by “budget”.
To answer the question “What genre of movies made the most money in each decade?”, we first look at profitability in general.
In Figure 2, we have a heat map of genre by decade, with color showing the average “profit”: the deeper the color, the higher the average profit for that genre. The gray cell for Mystery in the 1980s reflects the absence of Mystery movies from that decade in the dataset. One cell that really stands out is Animation in the 2010s. Some genres also appear more profitable throughout the years, such as Animation, Action, and Mystery. Interestingly, for the majority of genres, such as Action and Animation, the color deepens toward recent years. For Fantasy and Mystery, however, the color gets lighter as time progresses, suggesting that current audiences may be less interested in these genres.
The fact that the color deepens for the majority of genres points to a major problem in comparing profitability across time: inflation. Certainly, $100 in 1980 could buy many more eggs than it can in 2023, and also more than it could in 2010. To eliminate this effect, we use “ratio”, which, as its name suggests, is a proportion instead of an absolute value. This gives a new perspective on the profitability of movies.
In Figure 3, we look at the dataset more specifically. We select the 20 movies with the highest ratios in each decade from the 1980s to the 2010s. Note that some of these movies are missing from the figure because we only keep the 10 genres with the highest counts. We plot counts instead of the ratio values themselves because profit in the movie industry is extremely skewed: some movies have a ratio as high as 10000, while others have ratios smaller than 0.002, and plotting them on the same graph would make the lower ones nearly impossible to see. In Figure 3, we can see that Horror and Comedy are the two most profitable genres across all decades, which makes sense. Since we are not using absolute values, movies with lower budgets stand out, and Horror and Comedy are two genres known for how much they can achieve with an extremely small budget. The result for Action might seem counterintuitive at first, since most of the more famous Action movies are fairly recent. However, the budgets of Action movies have skyrocketed because of the expense of special effects, resulting in a lower ratio.
Drama has been fairly consistent, while Comedy has become less profitable and Horror much more profitable. This could reflect changes in the popularity of each genre as the years progress.
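As a small worked illustration of these two definitions, the sketch below computes both values for one movie. It is written in Python for illustration only and is not the report's code. The function name `profit_and_ratio` is hypothetical, and the choice to skip movies with a missing or zero budget is an assumption, motivated by the variable list noting that unknown budgets appear as 0.

```python
from typing import Optional, Tuple


def profit_and_ratio(gross: Optional[float],
                     budget: Optional[float]) -> Optional[Tuple[float, float]]:
    """Return (profit, ratio) for one movie, or None if either input is unusable.

    profit = gross - budget
    ratio  = gross / budget
    Assumption for this sketch: a budget of 0 means "unknown", so the movie
    is skipped rather than producing a division by zero.
    """
    if gross is None or budget is None or budget <= 0:
        return None
    return gross - budget, gross / budget


# Hypothetical figures, in dollars.
result = profit_and_ratio(gross=150_000_000.0, budget=50_000_000.0)
```

A movie that grosses three times its budget has a ratio of 3 regardless of the decade, which is why the ratio is less sensitive to inflation than the absolute profit.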
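The selection behind Figure 3 can be sketched as follows. This is a hedged Python illustration, not the report's code: the function name `top_ratio_genre_counts`, the dictionary keys, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def top_ratio_genre_counts(movies, n=20):
    """For each decade, take the n movies with the highest ratio and
    count how many of them belong to each genre.

    `movies` is an iterable of dicts with keys "decade", "genre", "ratio"
    (a hypothetical layout chosen for this sketch).
    """
    by_decade = defaultdict(list)
    for movie in movies:
        by_decade[movie["decade"]].append(movie)

    counts = {}
    for decade, group in by_decade.items():
        top = sorted(group, key=lambda m: m["ratio"], reverse=True)[:n]
        counts[decade] = Counter(m["genre"] for m in top)
    return counts


# Made-up records, only to show the call.
sample = [
    {"decade": "1980s", "genre": "Horror", "ratio": 50.0},
    {"decade": "1980s", "genre": "Comedy", "ratio": 10.0},
    {"decade": "1980s", "genre": "Action", "ratio": 2.0},
    {"decade": "1990s", "genre": "Horror", "ratio": 100.0},
]
sample_counts = top_ratio_genre_counts(sample, n=2)
```

Counting genre membership among the top movies avoids plotting the raw ratios, whose range spans several orders of magnitude.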
To evaluate the success of movies over time, we looked for patterns in movie releases by season across each decade. We wanted to see whether the time of year in which movies are released differs across decades and whether that is related to other variables in the dataset. For example, do movies released in the summer tend to have higher ratings on average, or are more family movies released in the winter? For the purposes of our analysis, we defined winter as December, January, and February; spring as March, April, and May; summer as June, July, and August; and fall as September, October, and November.
First, we created a mosaic plot with Pearson residuals as a visual representation to check if the season and decade of a movie’s release are independent.
The mosaic plot shows that the combination 2010s/summer has an observed count that is significantly higher than would be expected under independence. There are no cells with significantly lower counts. A chi-squared test on decade and season gives a test statistic of 13.475 and a p-value of 0.1423. Since the p-value is larger than \(\alpha = 0.05\), there is not enough evidence to reject the null hypothesis that decade and season are independent. The data are therefore consistent with a similar distribution of movie releases across seasons in every decade. The mosaic plot supports the conclusion from the chi-squared test, since only one cell had a count significantly different from what was expected.
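The chi-squared statistic for a decade-by-season table can be computed by hand, as sketched below in Python (an illustration, not the report's code). The counts in `hypothetical_counts` are invented, because the report only gives proportions, and the function name is an assumption. With 4 decades and 4 seasons the test has (4 - 1)(4 - 1) = 9 degrees of freedom, and the 5% critical value for 9 degrees of freedom is 16.919.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    two-way table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Invented counts (rows: decades, columns: Fall, Spring, Summer, Winter).
hypothetical_counts = [
    [260, 255, 265, 220],
    [400, 370, 350, 340],
    [430, 390, 385, 390],
    [380, 365, 440, 335],
]
stat, df = chi_squared_statistic(hypothetical_counts)

# 0.95 quantile of the chi-squared distribution with 9 degrees of freedom.
CRITICAL_VALUE_DF9 = 16.919

# The statistic reported in the text (13.475) is below the critical value,
# which agrees with its p-value of 0.1423 being larger than 0.05.
reported_is_significant = 13.475 > CRITICAL_VALUE_DF9
```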
Decade | Fall | Spring | Summer | Winter |
---|---|---|---|---|
1980s | 0.258 | 0.258 | 0.263 | 0.220 |
1990s | 0.273 | 0.253 | 0.240 | 0.234 |
2000s | 0.270 | 0.244 | 0.242 | 0.243 |
2010s | 0.249 | 0.240 | 0.290 | 0.221 |
The table above supports the conclusion that the two variables are independent: the proportion of movies released each season is relatively similar in every decade.
After evaluating the distribution of movie releases each season for each decade, we wanted to see if other variables differed within a particular season over time. Generally, within each season, other variables (like genre and rating) had similar distributions over the decades. However, there seems to be a slight difference in movie scores for fall releases.
The graph above shows overlaid density curves of the score distribution for movies released in the fall. Generally, movies released in the fall of the 1980s had lower scores than fall releases of other decades. We ran two-sample Kolmogorov-Smirnov (KS) tests to compare the score distribution of fall movies released in the 1980s with that of each of the other three decades. The KS test for the 1980s and 1990s gave a test statistic of 0.13154 and a p-value of 0.004097. The KS test for the 1980s and 2000s gave a test statistic of 0.13543 and a p-value of 0.003573. The KS test for the 1980s and 2010s gave a test statistic of 0.20522 and a p-value of 2.289e-06. All three p-values are smaller than \(\alpha = 0.05\), so we reject the null hypothesis that fall releases in the 1980s have the same score distribution as fall releases in the 1990s, the 2000s, or the 2010s. Because the KS test compares whole distributions rather than means, we conclude that the score distribution for fall releases in the 1980s differs from those of the other three decades, with the density curves indicating lower scores in the 1980s.
The chi-squared test does not give evidence against independence of the decade and season variables. However, the mosaic plot shows that 2010s/summer has an observed count that is significantly higher than would be expected. This makes sense within the pop culture context of the 2010s, a decade known for many big summer blockbuster releases. Generally, other variables in the data (such as runtime or rating) did not change significantly for a specific season over time. One statistically significant result was the difference in score distributions between fall movies of the 1980s and those of the other three decades: movies released in the fall of the 1990s, 2000s, and 2010s generally received better scores than fall releases of the 1980s.
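For readers unfamiliar with the KS test, its statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The Python sketch below illustrates that definition. It is not the report's code, the function name `ks_statistic` is hypothetical, and the sample scores are invented.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Invented scores, only to show the call.
fall_1980s = [5.8, 6.0, 6.1, 6.3, 6.5]
fall_2010s = [6.2, 6.4, 6.6, 6.9, 7.1]
d = ks_statistic(fall_1980s, fall_2010s)

# Two samples with the same mean (5.0) but different spread still give a
# non-zero statistic, because the test looks at the whole distribution.
same_mean_gap = ks_statistic([4, 5, 6], [1, 5, 9])
```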
For our last question, we are interested in how animation compares to other genres and how it has changed over time. As said in the introduction, animation is a particularly interesting genre because of its different production styles, the primary target audience, and the massive studios that produce it.
To do this, we first examine the content ratings given to animated movies throughout the decades compared to other genres.
When looking at animation, we see that, unsurprisingly, the majority of movies are rated G or PG. Surprisingly, while the share of G-rated movies has decreased, PG has had a significant surge. This is in line with the gradual decrease of G-rated movies in general. Both of these shifts began at the start of the 2000s and continued into the 2010s. Interestingly, we see that a proportion of animation movies in the 1980s were not rated. This could be due to a number of factors, such as the PG-13 rating not being created until 1984.
When looking at the other genres, it can be seen that the PG rating has gradually declined as the years have gone by, and PG-13 has taken its place. This means that animation has become the primary source of PG movies since the turn of the century.
To further investigate how animation differs from other movie genres, we examine how the relationship between gross and runtime differs by genre, using a scatterplot with fitted multiple linear regression lines.
This scatterplot can tell us many useful things about animation as it relates to runtime. One notable aspect is that across 40 years, not one animation movie exceeds 120 minutes; when looking at the data, there almost appears to be a sharp cutoff at the 120-minute mark. This is most likely because children, the primary audience, have shorter attention spans on average. It also seems that longer movies tend to gross more money at the box office. This is true for both animation and other genres, but animation has a steeper slope, meaning that each additional minute of runtime is associated with a larger increase in gross for animation than for other genres, despite its shorter runtimes.
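One way to compare the two slopes is to fit a separate least-squares line to each group, which yields the same fitted lines as a single regression with a group indicator and a runtime-by-group interaction. The Python sketch below assumes that setup. The report does not state the exact model used, and the function name and the data points are hypothetical.

```python
def ols_line(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (runtime in minutes, gross in millions of dollars) pairs.
animation_runtime = [80, 90, 100, 110]
animation_gross = [100.0, 160.0, 230.0, 280.0]
other_runtime = [90, 110, 130, 150]
other_gross = [40.0, 70.0, 95.0, 130.0]

animation_slope, _ = ols_line(animation_runtime, animation_gross)
other_slope, _ = ols_line(other_runtime, other_gross)

# A larger slope means gross rises faster per extra minute of runtime.
animation_is_steeper = animation_slope > other_slope
```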
Lastly, we wanted to see how similar animation movies are to each other based on our quantitative variables. To do this, we first looked at a distance matrix.
##           1         2         3         4         5         6         7
## 2 0.8084129
## 3 2.3611697 1.4633222
## 4 0.7826291 0.3914720 1.6539706
## 5 0.3127119 0.9298953 2.4730261 0.9527070
## 6 0.7542799 0.8544652 2.0845264 1.1096448 0.5882710
## 7 2.1568586 1.5473658 1.1693480 1.8436879 2.1319999 1.9030629
## 8 1.0105078 0.8431011 1.8607098 1.1084407 0.8797400 0.4369242 1.3087267
Based on this distance matrix, observations 1 and 2 are fairly close to each other (distance 0.81), although not as close as observations 1 and 5 (distance 0.31). The closeness of these points tells us that there are considerable similarities in the data.
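A distance matrix of this kind can be sketched as below. This Python illustration assumes Euclidean distance on standardized variables. The report does not state which distance or scaling was used, and the function names and the three sample movies are hypothetical.

```python
from math import dist
from statistics import mean, stdev


def standardize(rows):
    """Center each column at 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]
    return [
        [(value - c) / s for value, c, s in zip(row, centers, scales)]
        for row in rows
    ]


def distance_matrix(rows):
    """Full symmetric matrix of pairwise Euclidean distances."""
    return [[dist(a, b) for b in rows] for a in rows]


# Invented movies: (runtime in minutes, score, gross in millions of dollars).
movies = [
    [81.0, 8.3, 370.0],
    [92.0, 7.6, 250.0],
    [100.0, 8.1, 1070.0],
]
distances = distance_matrix(standardize(movies))
```

Standardizing first keeps a variable measured in large units, such as gross, from dominating the distances.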
To visualize this, we created an MDS plot and colored it by decade.
This graph shows very recognizable bands of decades. From this plot, we see that animation movies are most similar, in aspects like gross and review scores, to other animation movies released in the same decade. This suggests that animation as a genre is strongly influenced by when it was produced. This is a fair assumption to make, as changes in technology have enabled the faster production of longer movies, and growing general interest in the animation genre has driven its rise in box office success.
For the first question, we were curious about which genres are the most profitable across the years. Looking from two different perspectives, the absolute value “profit” and the proportion “ratio”, we notice that Animation, Mystery, and Action really stand out in absolute terms, while Horror and Comedy have consistently been among the most profitable genres from the “ratio” perspective, because they usually do well even with a low budget.
For the 2nd question, we found that the decade and season variables are independent. The distribution of movies released every season is relatively similar in each decade. One exception was that there were more movies released in the summer in the 2010s than expected under independence. This makes sense, since the 2010s were known for having a lot of big, summer movie releases. Generally, there are many similarities in the properties of movies (such as runtime or rating) for each season over time. One exception is the average score of movies released in the fall. The average score for movies released in fall of the 1980s was lower than the average score of fall movies from other decades.
For the 3rd question, we observed that animation as a genre has seen large growth in the number of movies produced since the 1980s. The trend also suggests that, as the decades go by, we may continue to see fewer animation movies rated G and more rated PG, and that the PG rating category may consist mostly of animated movies in the future. We can also see that animated movies, while having a shorter runtime than most non-animation movies, make more money on average. Lastly, animation movies are largely defined by the decade they come out in, as they are usually most similar to other animation movies released in that decade.
The original dataset, before cleaning, has only around 7600 observations. This is too few to include every movie produced around the world since 1980. Furthermore, the dataset only includes major movies and is missing many smaller releases. The standard used to select movies for this dataset is unknown, and we also noticed many NAs throughout the dataset, which could affect the analysis. For example, an NA in either “gross” or “budget” removes the observation from the study of profitability. It would have been useful to have more information on how movies were selected for this dataset. It would also be interesting to evaluate another movie dataset that includes more niche movies.
It would also be interesting to look deeper into the specific genres and the differences between countries that can be seen. For example, when looking at animation, it would be interesting to compare how the genre is different in Japan than in the US. Another example could be the difference in the romance genre in India compared to the US. There are a lot of geographic differences that we were not able to explore but it would be interesting to look into. Also, it may be interesting to evaluate how movies released in theaters compare to movies released on streaming services. Streaming has become increasingly popular and is now a big part of the entertainment industry. However, this dataset only included movies released in theaters. Movies released exclusively on streaming services may be very different from movies in theaters in terms of profitability or score.