The dataset that we chose to analyze consists of IMDb scores and related information for four decades of movies. The data was scraped from IMDb and can be found at https://www.kaggle.com/datasets/danielgrijalvas/movies?resource=download. It contains approximately 6,820 movies released between 1986 and 2016 and covers 5 quantitative and 10 qualitative variables: the budget, revenue, IMDb score and associated votes, the release date, genre, and rating, as well as production information such as company, cast, writers, directors, country of origin, and runtime. Data cleaning and preprocessing removed instances with missing values and created informative new variables such as profit, decade, and release season. The data and the analysis below make it possible to explore and learn more about the underlying trends in the movie industry.
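As a rough illustration of this preprocessing step, here is a minimal Python/pandas sketch. The column names (`score`, `votes`, `budget`, `gross`, `runtime`, `year`, `rating`, `genre`) are assumptions about the Kaggle file, and this is a sketch of the idea rather than the exact code behind the analysis below.

```python
import pandas as pd

# Load the Kaggle file (column names assumed from the dataset page:
# name, rating, genre, year, released, score, votes, budget, gross, runtime, ...)
movies = pd.read_csv("movies.csv")

# Drop instances with missing values in the variables used throughout the analysis
movies = movies.dropna(subset=["score", "votes", "budget", "gross",
                               "runtime", "rating", "genre", "released"])

# Derived variables: profit and decade (the release season is derived later)
movies["profit"] = movies["gross"] - movies["budget"]
movies["decade"] = (movies["year"] // 10) * 10
```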
Through our analysis of the data, we have identified three overarching questions that help to understand the dataset and the interesting relationships between the movie variables.
This question addresses the relationships between movie ratings, IMDb scores, and the overall distribution of movies, which shows how the global audience feels about specific types of movies. Understanding this will help identify truly good movies and may even help predict which upcoming movies will receive a high IMDb score.
This question compares movies across the decades spanned by the dataset, focusing on genres and titles. Both of these components of a movie drive its success, so understanding how they differ over time helps us grasp the evolution of movie production.
This question helps us better understand what movie producers and financiers consider when selecting a release date (season). Many types of movies are released at strategic times, and this question will help determine whether some release seasons make for a more successful movie. [The season variable was not originally in the data; it was created by decomposing the release date string and grouping the months into seasons.]
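For the season variable specifically, a minimal sketch of the string decomposition and grouping might look like the following; the `released` column name and its date format are assumptions about the raw file.

```python
import pandas as pd

def release_season(released: str) -> str:
    """Map a raw release-date string to a season based on its month."""
    # Strip any trailing parenthetical (e.g. a country) before parsing;
    # unparseable dates become NaT and are labeled "Unknown".
    date = pd.to_datetime(str(released).split("(")[0].strip(), errors="coerce")
    if pd.isna(date):
        return "Unknown"
    if date.month in (12, 1, 2):
        return "Winter"
    if date.month in (3, 4, 5):
        return "Spring"
    if date.month in (6, 7, 8):
        return "Summer"
    return "Fall"

movies["season"] = movies["released"].apply(release_season)
```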
Answering the question of how movie ratings relate to IMDb scores and the overall distribution of movies starts with an EDA to understand preliminary relationships in the data.
Our first chart, below, shows how the distribution of IMDb scores differs across ratings and the decade the movie came out.
The above chart shows the density of the IMDb score distribution and how it differs across the four most common ratings (G, PG, PG-13, and R) and across the four decades of release dates in the dataset. We would like to point out how differently the G and R ratings are distributed. In the 1980s, the G rating density spikes above all the other ratings. In the 2000s, there is also a notable difference between G and R: the G distribution is flatter and wider while the R distribution is taller and peaks at a higher IMDb score.
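A minimal seaborn sketch of this kind of faceted density plot, reusing the `movies` frame from the preprocessing sketch (the exact aesthetics of our chart differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Density of IMDb score for the four most common ratings, one panel per decade
top4 = movies[movies["rating"].isin(["G", "PG", "PG-13", "R"])]
sns.displot(data=top4, x="score", hue="rating", kind="kde",
            col="decade", col_wrap=2, common_norm=False)
plt.show()
```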
These two decades are quite interesting, so below we zoom in with a time series of only the G and R rated movies in each of them.
The above two time series show movie scores across the two decades, with separate colors for the G and R rated movies. The first thing to note is that G rated movies appear far more frequently than R rated movies in these plots. In the 1980s, R rated movies scarcely dipped below a score of 5.5, while G rated movies dropped to low scores frequently throughout the decade. Around 1988 and 1990, R rated movies also tended to score much higher than G rated movies. The 2000s plot looks quite different: the scores of R rated movies fluctuate more in this decade, although R movies still tend to score higher than G movies, which produce the lowest scores on the chart. The fluctuations of the G and R rated movies are also more in sync in the 2000s, whereas in the 1980s there were notable stretches where the highs and lows of the two ratings did not match.
Next, we analyze IMDb scores versus the four ratings more specifically. First, we show a simple box plot of IMDb score by rating, which helps us narrow down the relationship between rating and score without any other variables.
The main takeaways from this chart are that G rated movies tend to have higher scores, while the other ratings have many more low-score outliers. PG and PG-13 movies have similar score distributions, and R rated movies tend to have the second highest IMDb scores of the four ratings. R rated movies also appear to have the most variation in scores, as seen from the box plot and the outliers.
Next, we look at ratings versus scores again, but this time we add another variable: the number of votes. After examining which variables have the highest correlation with score through a pairs plot, we found that the number of votes has the highest correlation. The pairs plot is not shown here, but it was interesting to see how little budget and gross revenue correlate with IMDb score. The number of votes had a correlation of around 0.43 with the IMDb score, which makes sense since the IMDb score is calculated by aggregating users’ votes.
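A minimal sketch of that correlation check, continuing from the `movies` frame above (the 0.43 figure comes from our pairs plot, not from this snippet):

```python
# Correlation of each quantitative variable with the IMDb score
quant = movies[["score", "votes", "budget", "gross", "runtime"]]
print(quant.corr()["score"].sort_values(ascending=False))
```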
The above contour plots show the IMDb score on the x axis and the number of votes on the y axis, faceted by rating. From this plot, we can see the concentrations as well as the relationship between votes and IMDb score. G rated movies tend to cluster around 25,000 votes and a score of 6.5 to 6.75. PG rated movies tend to cluster around 5,000-10,000 votes and a score of 6. PG-13 rated movies tend to cluster around 30,000 votes and a score of around 6.25, and R rated movies around 25,000 votes and a score of around 6.25. It is also interesting that R rated movies are more heavily concentrated in their cluster than G and PG-13 rated movies, which show more variation. In addition, there appears to be more of a positive relationship between votes and scores for R and PG rated movies than for the other two ratings.
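A minimal sketch of such a faceted contour plot with seaborn, again reusing the `movies` frame from the earlier sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bivariate density contours of votes vs. IMDb score, one panel per rating
top4 = movies[movies["rating"].isin(["G", "PG", "PG-13", "R"])]
sns.displot(data=top4, x="score", y="votes", kind="kde",
            col="rating", col_wrap=2, fill=True)
plt.show()
```

Vote counts are heavily right-skewed, so putting the vote axis on a log scale usually makes the clusters easier to read.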
Next is a textual analysis of movie titles by rating, to see which kinds of titles correlate with which rating.
The figure above shows four word clouds built from movie titles, one for each rating: red for G, green for PG, blue for PG-13, and grey for R. It is interesting to see that G rated titles tend to use child-friendly words such as “princess”, “muppet”, and “toy”. PG titles are still child-friendly, with words such as “kid” and “adventure”. PG-13 titles use more mature concepts such as “war”, “dark”, “love”, and “night”, while R titles lean on adult concepts such as “kill”, “death”, “dead”, and “evil”. With further analysis, one could likely predict the rating from the title with some accuracy, since title words clearly differ across ratings.
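A minimal sketch of generating per-rating word clouds from the titles, assuming the title column is named `name` (an assumption about the Kaggle file) and using the `wordcloud` package:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud of movie titles per rating
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, rating in zip(axes.flat, ["G", "PG", "PG-13", "R"]):
    titles = " ".join(movies.loc[movies["rating"] == rating, "name"].astype(str))
    ax.imshow(WordCloud(background_color="white").generate(titles.lower()))
    ax.set_title(rating)
    ax.axis("off")
plt.show()
```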
In conclusion, these charts give some context on how movie ratings affect the distribution of movies as well as their IMDb scores. Next, it is important to examine how these movies compare with one another over time by focusing on popular genres and titles.
We first tried to divide the dataset into meaningful clusters. Before running principal component analysis and finding the right number of clusters, the quantitative variables were averaged across all movies released in each year from 1980 to 2020, because otherwise the number of observations would be too high for conducting PCA.
From the elbow plot created, we chose two principal components, since the proportion of total variation explained levels off after the second component (a clear “elbow” at 2).
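A minimal sketch of that yearly aggregation and elbow plot, assuming the `movies` frame from the preprocessing sketch:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# One observation per year: the average of each quantitative variable
yearly = movies.groupby("year")[["score", "votes", "budget", "gross", "runtime"]].mean()

# PCA on the standardized yearly averages, then an elbow plot of variance explained
pca = PCA().fit(StandardScaler().fit_transform(yearly))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```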
Another form of clustering, shown below, helps to visualize the relationship in question more clearly.
Based on this complete-linkage dendrogram, it can be seen that all the movies from the 1980s and 1990s and most from the 2000s fall into one cluster, while the rest (a few from the 2000s and all the movies from 2010 and beyond) fall into the other. Since one cluster contains leaf labels 1 to 27 while the other contains leaf labels 28 to 41, it can also be said that the dataset splits into movies from 1980 to 2006 and movies from 2007 and beyond. Since the variables taken into account are all the numerical variables (score, votes, budget, gross, runtime), we suspect that the general tendency for budgets and gross revenues to increase over the years accounts for this clustering.
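A minimal sketch of the complete-linkage clustering behind such a dendrogram, reusing the `yearly` table from the PCA sketch above:

```python
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Complete-linkage hierarchical clustering of the standardized yearly averages;
# labeling the leaves by year lets the two clusters be read off directly
Z = linkage(pdist(StandardScaler().fit_transform(yearly)), method="complete")
dendrogram(Z, labels=yearly.index.to_list())
plt.ylabel("Complete-linkage distance")
plt.show()
```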
Now it is important to look at the frequency of each genre group to better understand the changes over time, which is accomplished in the faceted plot below.
It appears that for most of the decades, Comedy is the most frequent genre, and Comedy, Action, and Drama are always the top 3 most frequent genres. In the most recent decade, we can witness the rise of the Animation genre, which was never among the top 5 genres in the previous decades. If we define popularity as abundance (by the simple logic that an increase in demand leads to an increase in supply), then the most popular genres in all the decades would be Comedy, Drama, and Action.
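A minimal sketch of how these per-decade genre frequencies could be tabulated and plotted, restricted to the most common genres for readability:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count movies per genre within each decade, keeping only the most common genres
top_genres = movies["genre"].value_counts().head(5).index
genre_counts = (movies[movies["genre"].isin(top_genres)]
                .groupby(["decade", "genre"]).size()
                .reset_index(name="count"))
sns.catplot(data=genre_counts, x="genre", y="count",
            col="decade", col_wrap=2, kind="bar")
plt.show()
```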
This can be better seen in the context of score, so now we will look at a time series plot for average IMDb score per genre over time.
We also take a brief look at how the average IMDb score for each of the top three popular genres differs by year. Drama has the highest average IMDb score across the years. For Comedy and Action, the average score for Comedy is usually higher than that for Action before 2000, but after that, Action tends to have a higher average IMDb score than Comedy.
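A minimal sketch of that time series of yearly average scores for the top three genres:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Average IMDb score per year for the three most frequent genres
top3 = movies["genre"].value_counts().head(3).index
avg_score = (movies[movies["genre"].isin(top3)]
             .groupby(["year", "genre"])["score"].mean()
             .reset_index())
sns.lineplot(data=avg_score, x="year", y="score", hue="genre")
plt.ylabel("Average IMDb score")
plt.show()
```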
While focusing on how movie characteristics differ over time, it is necessary to examine how naming movies has changed over the decades. This is accomplished via a wordcloud seen below.
We then examine whether there is any apparent trend in naming movies across decades. The word clouds from left to right correspond to the 1980s, 1990s, 2000s, and 2010s onward. The most frequent words in movie titles from the 1980s are “night” and “man”, followed by “part”, “little”, “love”, “dream”, and “adventure”. Titles from the 1990s contain the words “man”, “love”, “dead”, “day”, and “blue” the most, while those from the 2000s contain “love”, “man”, “day”, and “girl” the most. Movies released from 2010 onward also have “man”, “day”, and “love” most frequently in their titles, along with “dark”, “night”, “movie”, “time”, “world”, “last”, “life”, “star”, “boy”, “girl”, “house”, and “black”. It appears that no matter which decade a movie is from, it is most likely to contain the word “man” or “love” in its title.
This concludes the analysis on how movie characteristics differ over each decade, so now we can move on to our next overarching question of commonly viewed trends for movies released in the same viewing season (fall, winter, spring, summer).
Movie release dates matter, and especially the season in which a movie is first shown to the public. Many holiday and season-specific movies are released on particular dates, so it is important to see whether the release date affects how a movie is produced and how successful it is once it reaches the public.
It is necessary to start with a breakdown of how many movies are released in each season and whether this differs based on genre or rating.
Below we see two side-by-side bar plots showing the number of movies released in each season and how this breaks down even more for the top genres and ratings.
Here it is evident that the fall season has the most movie releases, but the trend for the top ratings is also visible. In every season, R rated movies are released the most by far, followed by PG-13 and PG. The release counts for each rating are very similar from season to season. One of the few differences is that significantly more G movies are released during the Summer than in other seasons, which makes sense since kids are out of school for summer and more likely to go to the theater. Overall, the Fall season has the most releases, and more R movies are released in each season than any other rating.
Now let’s examine this through the genre lens.
We see the same trend in the number of movies released in each season, although there are some interesting changes in releases per season by genre. The action, comedy, and drama genres dominate every season, but in a different combination depending on the season. In the Fall, there is a very large number of comedy releases (the most of any season) and a respectable number of action and drama releases. In the Spring, action and comedy are high. In the Summer, comedy follows a similar trend, but there is an unusually high number of action releases. Lastly, in the Winter, the most interesting difference is that the number of biography releases is more than double that of any other season. This shows how the mix of genres released shifts with the season.
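A minimal sketch of these side-by-side breakdowns of releases per season, by rating and by the most common genres:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Releases per season, split by rating (left) and by the most common genres (right)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
top4 = movies[movies["rating"].isin(["G", "PG", "PG-13", "R"])]
sns.countplot(data=top4, x="season", hue="rating", ax=axes[0])
top_genres = movies["genre"].value_counts().head(5).index
sns.countplot(data=movies[movies["genre"].isin(top_genres)],
              x="season", hue="genre", ax=axes[1])
plt.show()
```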
Knowing the trends for movie releases per season by rating and genre helps to answer the overarching question of how season affects movie production and type, as well as a movie’s eventual success with its intended audience.
Above we saw the total number of releases per season from 1986 to 2016, but now it is important to view how this may have changed over time.
Looking at how the dominant release season has changed over time will help us understand how it may change in the future, and whether it coincides with current events. This is accomplished by displaying the moving average of movie releases per season over time.
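A minimal sketch of that moving average, using a 3-year centered rolling window as an illustrative smoothing choice:

```python
import matplotlib.pyplot as plt

# Yearly release counts per season, smoothed with a 3-year centered rolling mean
season_counts = (movies.groupby(["year", "season"]).size()
                 .unstack(fill_value=0)
                 .sort_index())
season_counts.rolling(window=3, center=True).mean().plot()
plt.ylabel("Movies released (3-year moving average)")
plt.show()
```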
The number of movie releases in each season over time has been far from constant. We see drastic changes since 1986 that show how the typical movie release seasons developed. In the beginning, there is a huge summer release season and a very small winter release season, a pattern that held until about 1990. After this, the dominant release season shifted, resulting in a fall release season much larger than the rest until about 1995. Then spring releases dropped significantly relative to the other seasons while winter took the lead until 2000. Movie releases then drop in every season except Fall, which takes a massive lead from 2000 to 2010 while Spring remains the lowest-producing season. In the end, there is a lot of movement, but in 2016 Summer takes the lead as the season with the most movies released and Spring remains at the bottom of the pack.
This shows how movie release seasons have changed over time and where the most competition will be when releasing a movie. It answers the overarching question of how movie production changes by season and over time, and this information can help movie producers when planning a release date.
Next it is important to examine movie success by season, so we will look at the IMDb score split and profit by season.
Here we see a fairly even split of IMDb scores by release season, which suggests that voters do not necessarily take the release date into account when scoring. Although the box plots and violin densities look similar across seasons, there is a spike in the Winter density on the right-hand side, indicating a higher density of better scores, and in the Fall the median score and density sit slightly further to the right. Despite the scores being fairly similar for all seasons, the Fall and Winter seasons show slightly higher median scores.
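A minimal sketch of the score-by-season comparison using violin plots with inner box plots; the same call with `y="profit"` gives the profit comparison discussed below:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of IMDb scores by release season (violin with inner box plot)
sns.violinplot(data=movies, x="season", y="score", inner="box")
plt.show()
```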
Now we will examine profit within the scope of release season.
Above we can see profit by movie release season. Overall, the profits are fairly similar, but the box plot for Summer indicates a slightly larger median profit. Having examined the densities, we know that most profits are close to zero, but slightly higher profits are seen for movies released in the Summer.
Overall, we have seen how movie production decisions and success are reflected in a movie’s release season.
We have made many key takeaways relating to how movie ratings affect IMDb scores and movie distributions, how movies in each decade compare with one another in terms of popular genres and title words, and how movie production and success metrics differ by release season. We now understand the distinct ways in which ratings such as R and PG-13 shape a movie’s potential success and reach, how movies have evolved throughout the decades in terms of genre and title, and how the release date says much more about a movie than expected.
These conclusions are key to understanding how the movie industry operates and changes over time, but some questions remain unanswered, such as how movie production companies affect profits, whether “better” movie titles lead to greater success, and more. The analysis above and its main conclusions can help movie producers make better decisions based on historical data in order to create movies with high IMDb scores.