Abstract

The film industry, or motion picture industry, has shaped pop culture since the early 20th century and remains a major part of entertainment both nationally and internationally. The cinema of the United States produced the dominant style of classical Hollywood cinema and remains the largest in terms of box office gross revenue. In 2018, the global box office was worth $41.7 billion.

Given the socio-economic benefits of the film industry alone, we are interested in better understanding what constitutes a good movie, and how directors, producers, and screenwriters can position a movie to produce the most revenue based on past trends of successful movies. Also, by exploring other data points surrounding movies, including user (general public) ratings and critic ratings, we hope to reach conclusions that can influence how movie critics rate movies in the future, to better reflect the interests of the general public and be more relatable.

Data

Loading our Data

We output the head() of the data to get a first glimpse before going into the data description in the next section, where we explain what these columns and rows mean.
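A minimal sketch of the loading step, assuming the Kaggle export has been saved locally as IMDB-Movie-Data.csv (the file name is an assumption):

```r
library(readr)

# Read the Kaggle CSV; read_csv() infers a column specification automatically
movies <- read_csv("IMDB-Movie-Data.csv")  # file name is an assumption

# Peek at the first six rows
head(movies)
```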

## # A tibble: 6 x 12
##    Rank Title Genre Description Director Actors  Year `Runtime (Minut… Rating
##   <dbl> <chr> <chr> <chr>       <chr>    <chr>  <dbl>            <dbl>  <dbl>
## 1     1 Guar… Acti… A group of… James G… Chris…  2014              121    8.1
## 2     2 Prom… Adve… Following … Ridley … Noomi…  2012              124    7  
## 3     3 Split Horr… Three girl… M. Nigh… James…  2016              117    7.3
## 4     4 Sing  Anim… In a city … Christo… Matth…  2016              108    7.2
## 5     5 Suic… Acti… A secret g… David A… Will …  2016              123    6.2
## 6     6 The … Acti… European m… Yimou Z… Matt …  2016              103    6.1
## # … with 3 more variables: Votes <dbl>, `Revenue (Millions)` <dbl>,
## #   Metascore <dbl>

Data Description

For our project, we decided to pull data describing the world’s most popular IMDb movies from 2006 to 2016. IMDb is an online database and the world’s most popular and authoritative source for information on films, television programs, home videos, video games, and streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. Although this data comes from IMDb as the root source, it is hosted on Kaggle, and the specific dataset can be found here.

Our IMDb movies dataset contains data on the 1,000 most popular movies on IMDb from 2006 to 2016. This dataset contains 1,000 rows, or instances, and 12 columns, or attributes. Setting aside the Rank index, the attributes are a mix of quantitative and qualitative variables: to be exact, 6 quantitative attributes and 5 qualitative attributes.

The qualitative attributes are Title, Genre, Description, Director, and Actors. The quantitative attributes are Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), and Metascore. Title is the title of the film. Genre is a comma-separated list of genres used to classify the film. Description is a one-sentence movie summary. Director is the name of the film’s director. Actors is a comma-separated list of the main stars of the film. Year is the year the film was released, as an integer. Runtime (Minutes) is the duration of the film in minutes. Rating is the user rating for the movie, from 0 to 10. Votes is the number of user votes for the movie. Revenue (Millions) is the revenue of the movie in millions of dollars. Metascore is an aggregate average of critic scores; values are between 0 and 100, and higher scores represent positive reviews.

Understanding our Data

Before plotting and performing statistical analysis to address our research questions, it is important to understand the data we are working with in more depth. In particular, because we will be focusing on genre, movie performance, user and critic ratings, titles, and descriptions, it is important that the data in these columns is in a format we can work with.

We begin by looking at Genre, and see that our dataset has over 200 combinations of genres, many overlapping each other. The output below shows only the head of the table of all 207 genre combinations, and we already see many overlaps: these all include the Action genre, just in varied combinations.
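A sketch of how this tally can be produced from the tibble above:

```r
# Tally movies per genre combination; table() sorts alphabetically by default
genre_counts <- table(movies$Genre)
length(genre_counts)  # 207 combinations
head(genre_counts)
```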

## 
##                     Action           Action,Adventure 
##                          2                          3 
## Action,Adventure,Biography    Action,Adventure,Comedy 
##                          2                         14 
##     Action,Adventure,Crime     Action,Adventure,Drama 
##                          6                         18

We considered cleaning this data to group overlapping genres together, but recognized there was room for error and misclassification. Thus, for our plots, proceeding with the 10 genre combinations containing the most movies in this dataset gives us a good compromise: we explore all common genres while still having a representative sample.

Then, we output the data type that each attribute in our dataset was parsed with.
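A one-line sketch that reproduces this output, assuming the tibble is named movies:

```r
# Report the underlying storage type of each column
sapply(movies, typeof)
```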

##               Rank              Title              Genre        Description 
##           "double"        "character"        "character"        "character" 
##           Director             Actors               Year  Runtime (Minutes) 
##        "character"        "character"           "double"           "double" 
##             Rating              Votes Revenue (Millions)          Metascore 
##           "double"           "double"           "double"           "double"

Title, Genre, Description, Director, and Actors are of character type. Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), and Metascore are of double type. Because we will heavily use the user rating attributes (Rating and Votes), the critic rating (Metascore), and movie fiscal performance (Revenue (Millions)), paired with the year the movie was released (Year), it is convenient that they are all the same data type and quantitative, so we do not need to perform any cleaning here.

Now that we have understood our data, let’s move on to the research questions!

Does a movie’s genre play a role in the predicted popularity and revenue of that movie over time?

The first research question we are interested in learning more about is the relationship between genre and a movie’s popularity. Do certain genres consistently produce low-performing movies? Which genres does the public enjoy watching? Answering these questions will give critics insights when predicting the success of a certain movie, and give producers and writers pointers towards underrepresented genres in the industry or genres that are consistently not enjoyed by the general public.

Plots

When there is higher demand for a type of movie, more movies in the same genre are produced. By this logic, we can measure the popularity of a genre by counting how many movies of a certain genre combination are made per year. As we have 207 different genre combinations, we limit this graph to the top 10 combinations in our dataset, as these cover the majority of movies produced and are not too niche.
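A sketch of how such a plot can be built with dplyr and ggplot2, under the assumption that the figure is a per-year line chart of movie counts:

```r
library(dplyr)
library(ggplot2)

# Keep the 10 most frequent genre combinations
top_genres <- movies %>%
  count(Genre, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(Genre)

# Count movies produced per year within each top genre and plot the trends
movies %>%
  filter(Genre %in% top_genres) %>%
  count(Year, Genre) %>%
  ggplot(aes(x = Year, y = n, color = Genre)) +
  geom_line() +
  labs(y = "Number of movies produced",
       title = "Movies per year, top 10 genre combinations")
```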

It is interesting to note that the production of all movies increased relatively steeply, except for Romantic Comedies, which stayed the same, and Action, Adventure, Sci-Fi movies, which decreased. The number of Drama movies had the largest increase.

Now let’s look at the effect of Genre on Revenue (Millions).
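A possible sketch of this comparison, shown here as a boxplot of revenue across the same top 10 genre combinations (the exact chart type is an assumption):

```r
# Revenue distribution per top genre combination, ordered by median revenue
movies %>%
  filter(Genre %in% top_genres) %>%
  ggplot(aes(x = reorder(Genre, `Revenue (Millions)`, FUN = median, na.rm = TRUE),
             y = `Revenue (Millions)`)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Genre")
```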

By our graph, the Action, Adventure, Sci-Fi genre grosses the most money, while Action, Adventure, Fantasy and Animation, Adventure, Comedy are closely tied for second. The rest of the genres gross similar amounts. Since the top three genre combinations all contain Adventure, it could be that Adventure movies gross more than other types of movies. This makes intuitive sense, as many of the top grossing movies come from action and adventure, like the Marvel Cinematic Universe (MCU) series of American superhero films. Among the genres that gross the least revenue, drama is the common thread, along with romance, crime, and comedy mixed with drama. While this is not to say that movie lovers do not enjoy watching these types of movies, we can intuitively conclude that these genres might be best suited for small-screen productions released on services like Netflix or Hulu, rather than the box office. For producers and writers looking to make the most money at the box office, the Action, Adventure, Sci-Fi and Animation, Adventure, Comedy genres are the best to target, through partnerships with Marvel or Disney, for example.

Statistical Tests and Analysis

To support our plots, we will use a two-sample t-test for equal means to determine if Adventure movies really do perform better than movies not in the Adventure genre.

Let’s first partition our movies into two disjoint sets.
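A minimal sketch of the partition, matching the adventure and not_adventure names in the test output below:

```r
# Partition on whether the comma-separated genre list contains "Adventure"
adventure     <- movies %>% filter(grepl("Adventure", Genre))
not_adventure <- movies %>% filter(!grepl("Adventure", Genre))
```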

Then, we perform our test.
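The call, matching the data arguments shown in the output:

```r
# Welch two-sample t-test on mean revenue of the two groups
t.test(adventure$`Revenue (Millions)`, not_adventure$`Revenue (Millions)`)
```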

## 
##  Welch Two Sample t-test
## 
## data:  adventure$`Revenue (Millions)` and not_adventure$`Revenue (Millions)`
## t = 11.015, df = 290.44, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   84.1156 120.7161
## sample estimates:
## mean of x mean of y 
## 157.16004  54.74419

By our test, we conclude that there is a significant difference in mean Revenue (Millions) between adventure and non-adventure movies, as the p-value is very small (p < 2.2e-16) and less than our significance level (0.05). We reject our null hypothesis that the mean Revenue (Millions) for adventure movies and non-adventure movies is the same.

Learnings

By our analysis, we can conclude several things. First of all, adventure movies gross significantly more than movies of other genres, as shown by our graph as well as our statistical analysis. This would explain our second observation: the recent increase in the production of adventure movies, as shown by our first graph. Though we cannot speak to the causal relationship between these two, it is common knowledge that if there is high demand for a product (shown by increased revenue), the supply (the number of movies produced) will grow.

Do well-rated movies, based on critics and user (general public) reviews, do better fiscally?

Now that we have formed recommendations for which genres to pursue for the box office versus the small screen, our next interest is understanding the rating system behind movies and what might predict a movie’s fiscal success.

When we think about whether we would want to pay to watch a movie, we tend to think about the ratings: is the movie well-liked, or is it hated by the general public? We gather what others think about a movie to filter which ones to watch. So logically, the better-rated movies should be the movies that make the most money, right? Recently, though, there have been movies that are rated badly but have made millions of dollars, such as the Twilight Saga, which made us wonder whether ratings actually correlate with revenue.

So what are the general trends when comparing revenue with the general public’s ratings? Does revenue grow as the populace rates the movie higher?

Plots

As a pre-step before doing any of the work, we want to make sure that none of the rows have NA values, especially since we are comparing two quantitative variables.
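A sketch of the cleaning step; the movies.mod name comes from the regression call later in this section, and dropping every row with any NA is consistent with the degrees of freedom reported there:

```r
# Drop every row containing an NA before comparing the two quantitative variables
movies.mod <- na.omit(movies)
```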

It seems that if a movie is rated higher, it has the possibility of making more at the box office. The highest revenues in the dataset come from the higher-rated movies, which supports the idea that higher-rated movies are more likely to make more money.

However, in general it is hard to see a distinct, direct correlation between ratings and revenue in this chart. We thought it best to make additional charts plotting the average and median revenue for each rating, both truncated to whole digits and in the original decimal format, to see if that made a difference. We also filled these bar graphs with color based on movie revenue, to make it even easier for the reader to distinguish the fiscally best performing movies (and which rating category they fell under).
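A sketch of the whole-digit version, assuming the bars are built from per-bucket summaries (the decimal version is analogous, grouping on Rating directly):

```r
# Average and median revenue per whole-digit (truncated) rating bucket
rating_summary <- movies.mod %>%
  mutate(rating_floor = floor(Rating)) %>%
  group_by(rating_floor) %>%
  summarise(avg_revenue    = mean(`Revenue (Millions)`),
            median_revenue = median(`Revenue (Millions)`))

# Bar chart of average revenue, filled by revenue for readability
ggplot(rating_summary, aes(x = rating_floor, y = avg_revenue, fill = avg_revenue)) +
  geom_col() +
  labs(x = "Rating (truncated)", y = "Average revenue (millions)")
```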

From this chart, where we truncated the decimal digits, we can see that the average revenue and the median revenue do increase as the rating increases. The graph is left-skewed, with lower ratings indeed producing lower revenues.

This can even be seen through the decimal version as well below.

Although the changes in the decimal version are more drastic, it still shows that revenue increases with the general population’s aggregate ratings. The graph is still left-skewed, but now we can see some more bumps across the scale of ratings from left to right, where a low-rated movie still performed relatively well.

These charts made the data much clearer to interpret than the scatterplot, but both were valid ways of viewing the data.

Statistical Tests and Analysis

Now, it is important to see trends not only through graphs but also through statistical analysis. In this case we want to see if ratings and revenue do in fact correlate, so we perform linear regression.
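The fit, matching the Call shown in the output:

```r
# Fit a simple linear regression of revenue on user rating
fit <- lm(`Revenue (Millions)` ~ Rating, data = movies.mod)
summary(fit)
```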

## 
## Call:
## lm(formula = `Revenue (Millions)` ~ Rating, data = movies.mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126.94  -68.25  -26.35   31.05  818.83 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -91.60      27.62  -3.316 0.000951 ***
## Rating         25.85       4.02   6.431 2.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.1 on 836 degrees of freedom
## Multiple R-squared:  0.04713,    Adjusted R-squared:  0.046 
## F-statistic: 41.35 on 1 and 836 DF,  p-value: 2.134e-10

From this we can see that the p-values of the coefficients and the F-statistic are both less than our significance level (0.05), which means the null hypothesis is rejected. The null hypothesis here is that the coefficient on Rating is zero, meaning there is no linear relationship between Rating and Revenue (Millions). Because the null hypothesis is rejected, we conclude there is a relationship, though the modest R-squared (about 0.047) indicates that Rating explains only a small share of the variance in revenue.

Another way we could go about this is to test the average revenues of the rating groups against each other, since the ratings could technically be treated as an ordinal categorical grouping. To test these averages against each other, we use an ANOVA test to analyze variance.
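A sketch of the call, matching the as.factor(Rating) term in the output:

```r
# One-way ANOVA treating each distinct rating value as a group
summary(aov(`Revenue (Millions)` ~ as.factor(Rating), data = movies.mod))
```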

##                    Df  Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Rating)  49  998397   20375   1.971 0.000123 ***
## Residuals         788 8145391   10337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis in this test is that the average revenues across all rating groups are equal. Given that the p-value (0.000123) is less than 0.05, the null hypothesis is rejected: mean revenue differs for at least one rating group. This again indicates some level of relationship between Rating and Revenue (Millions).

Learnings

At the end of the day, it makes sense that user Rating and Revenue (Millions) would be correlated in some way. Because our world has so much data, we can be more efficient and frugal by just Googling what the best movies to watch are. Websites like Rotten Tomatoes aggregate reviews for film and television, so a quick Google search before going to the movies can tell you what others think about a movie. Even social media, like how friends review and talk about a newly released movie on platforms like Facebook or Instagram, greatly impacts our likelihood of going to watch it. We can spend money on a movie we know is good rather than take a chance, and that is most probably why there is a relationship between the two variables. User ratings drive us to the movie theater, or drive us away from it.

Do critics and users (general public) rate movies similarly, and thus usually agree on what constitutes a good movie?

As we have been on the trail of predicting good, fiscally well-performing movies, we next want to examine the relationship between user ratings and critic ratings. Film criticism is the analysis and evaluation of films and the film medium from a professional standpoint. Critics analyze themes, motifs, acting, plot, and much more, from a detailed background. Meanwhile, user ratings are usually, but not always, much more feel-good based, and do not dive as deep into the exact, intellectual details of a film. Does this mean that the relatively more lax reviews from users cannot predict a good movie? Are movie critics always aligned with the general public’s response to a movie’s success? We examine these questions in the following plots.

Plots

First, let’s get a better understanding of critic ratings (Metascore in the dataset) and user ratings (Rating in the dataset). Metascore ranges from 0 to 100 and user Rating ranges from 0 to 10. Later, when performing statistical analyses, these values will have to be rescaled so that they cover the same range.

How should user popularity be defined? We define it as the number of votes a movie received from users for its rating. To make tiers of popularity visible in the graph, let’s divide popularity into 5 tiers by quantiles. We also use a smoothing method, fit as a straight line, to help us see patterns in the presence of overplotting.
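A sketch of the tiering and the scatterplot; the popularity column appears in output later in the report, and dplyr’s ntile() is one way to produce quantile tiers:

```r
# Split vote counts into 5 quantile tiers of popularity
movies <- movies %>% mutate(popularity = ntile(Votes, 5))

# User vs. critic ratings, colored by popularity tier, with a straight trend line
ggplot(movies, aes(x = Metascore, y = Rating)) +
  geom_point(aes(color = factor(popularity))) +
  geom_smooth(method = "lm") +
  labs(color = "Popularity tier")
```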

From this graph, it appears users and critics generally tend to agree on which movies are good or bad barring a few outliers. It also appears that the better the movie, the more people rate it.

Now let’s see the general distribution of critic and user ratings. User ratings must first be converted to the 0 to 100 scale like metascores are.
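A sketch of the conversion and the density plot; the userratings name matches the test output further below:

```r
# Rescale user ratings from 0-10 to 0-100 to match Metascore
userratings <- movies$Rating * 10

# Stack the two groups into one data frame and overlay their distributions
ratings_long <- data.frame(
  score = c(movies$Metascore, userratings),
  group = rep(c("Critics", "Users"), each = nrow(movies))
)

ggplot(ratings_long, aes(x = score, fill = group)) +
  geom_density(alpha = 0.4) +
  labs(x = "Rating (0-100 scale)", fill = "")
```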

User ratings are more consistently in the middle of the pack, while critic ratings reach lower lows and higher highs more often. Both curves are fairly normal, but the critic curve is flatter and more spread out. It also appears the critics’ mode is lower than the users’ mode (60s versus 70s).

Lastly, in an earlier research question we plotted a scatterplot of user ratings against the revenue of movies by year. To further supplement this research question, we plotted a similar scatterplot, but now comparing critic ratings, or Metascore, against revenue of movies by year to see how accurate critics are in predicting movie success. By coloring by year, we can also see the years in which critics were on par with public opinion. This information can help with evaluating critic performance throughout the years.

From our resulting plot, we see that a high critic rating, or Metascore, most definitely does not guarantee high revenue. In fact, some of the poorly reviewed movies, or movies with a low Metascore, outdid the highly reviewed movies in total revenue. Moreover, we see that the movies making the most revenue were released in the most recent years, around 2016. However, movies released in the mid-2010s tracked more closely in terms of the relationship between Metascore and Revenue (Millions).

Statistical Tests and Analysis

We proceed with a two-sample Kolmogorov-Smirnov test to determine if the distributions of critic ratings and user ratings are similar. We chose this test because the Kolmogorov-Smirnov test is a useful and general non-parametric method for comparing two samples. It is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of critic and user ratings.
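The call, matching the output below (ks.test silently omits the missing Metascore values):

```r
# Two-sample Kolmogorov-Smirnov test on the two empirical distributions
ks.test(movies$Metascore, userratings)
```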

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  movies$Metascore and userratings
## D = 0.3389, p-value < 2.2e-16
## alternative hypothesis: two-sided

After running the test, we conclude that the distributions of critic ratings and user ratings are not similar, as the p-value is very small (p-value < 2.2e-16) and less than our significance level (0.05). We have rejected the null hypothesis that the critic ratings and user ratings were drawn from the same distribution.

Next, we use a t-test to determine if the average critic rating is different from the average user rating.
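Again matching the output:

```r
# Welch two-sample t-test on mean critic vs. mean user rating (0-100 scale)
t.test(movies$Metascore, userratings)
```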

## 
##  Welch Two Sample t-test
## 
## data:  movies$Metascore and userratings
## t = -12.993, df = 1255.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.861773 -7.274265
## sample estimates:
## mean of x mean of y 
##  59.57518  68.14320

From this test, we conclude that the true difference in the averages of critic ratings and user ratings is not 0. The 95% confidence interval is entirely negative (-9.86 to -7.27) and does not include 0, indicating that critics rate movies lower than users do.

Learnings

Although critic and user ratings follow a positive trend, indicating that the two groups rate movies similarly, the scatterplot of critic against user ratings does not tell the whole story. The density plot shows that the distributions of critic and user ratings are actually not the same. This is confirmed by the Kolmogorov-Smirnov test, and the t-test also confirms that there is a significant difference in how critics and users rate movies. Also, from our last scatterplot, we see that critic ratings were often not on par with public opinion. The total revenue a movie makes, seen in Revenue (Millions), is a form of indirect user rating: why would people keep paying to watch a movie if they did not hear good things about it or like it? We saw several instances where critics rated a movie poorly, but the movie still performed well in the public eye, as measured by revenue. Thus, this also supports our conclusions from the density plot.

Do title and description relate to user ratings and user votes, constituting a good movie?

We have already examined, in depth, what might predict a successful movie and how ratings relate to the fiscal success of a movie. Lastly, we chose to dive deeper into qualitative text measurements of the titles and descriptions of all the movies in our dataset, to pull out common phrases and compare and contrast them with the lowest user-rated, highest user-rated, lowest-voted, and highest-voted movies from this time period.

Plots

First, we plotted a word cloud of the most popular words in movie titles, colored and sized by frequency, with purple words being the least represented.
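A sketch of how such a cloud can be built; the tm and wordcloud packages are an assumption, as the report does not name its libraries (the description cloud uses the same recipe on Description):

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Tokenize titles into lowercase words, dropping short words and English stop words
title_words <- tolower(unlist(strsplit(movies$Title, "[^A-Za-z']+")))
title_words <- title_words[nchar(title_words) > 2 &
                             !(title_words %in% stopwords("en"))]

# Size and color words by how often they appear across all titles
freq <- sort(table(title_words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```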

From this word cloud, we see that man, love, rise, day, part, war, girl, dark, and street are some of the most frequently used words. Moreover, captain, hunger, monster, king, and evil, all seen in pink, make up part of the second layer of words.

Next, we plotted a word cloud of the most popular words in movie descriptions, colored and sized by frequency, with sea-green words being the least represented.

From this word cloud, we see that young, new, find, world, and life are some of the most common words, as well as words that prompt some form of eagerness for the reader to watch the movie, like discover, mysterious, and must.

Having identified the most commonly used words in titles and descriptions, we now want to check whether these words are present in the lowest and highest user-rated and user-voted movies.

We use the min() and max() functions to see that the minimum user rating, or Rating, in our dataset is 1.9 and the minimum number of votes, or Votes, is 61. The highest rating is 9, and that movie also received the highest number of votes, 1,791,916. Next, we filter our dataset to extract the two rows corresponding to the least- and most-voted movies.
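A sketch of the extraction; filtering on the vote extremes is an assumption about how the two rows were pulled:

```r
# Extremes of rating and votes across the dataset
min(movies$Rating); max(movies$Rating)  # 1.9 and 9
min(movies$Votes);  max(movies$Votes)   # 61 and 1,791,916

# Rows for the least- and most-voted movies
movies %>% filter(Votes == min(Votes))
movies %>% filter(Votes == max(Votes))
```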

## # A tibble: 1 x 13
##    Rank Title Genre Description Director Actors  Year `Runtime (Minut… Rating
##   <dbl> <chr> <chr> <chr>       <chr>    <chr>  <dbl>            <dbl>  <dbl>
## 1   251 Bonj… Come… Anne is at… Eleanor… Diane…  2016               92    4.9
## # … with 4 more variables: Votes <dbl>, `Revenue (Millions)` <dbl>,
## #   Metascore <dbl>, popularity <dbl>

The least-voted, low user-rated movie was titled “Bonjour Anne”, and had the following description: “Anne is at a crossroads in her life. Long married to a successful, driven but inattentive movie producer, she unexpectedly finds herself taking a car trip from Cannes to Paris with a … see full summary”

## # A tibble: 1 x 13
##    Rank Title Genre Description Director Actors  Year `Runtime (Minut… Rating
##   <dbl> <chr> <chr> <chr>       <chr>    <chr>  <dbl>            <dbl>  <dbl>
## 1    55 The … Acti… When the m… Christo… Chris…  2008              152      9
## # … with 4 more variables: Votes <dbl>, `Revenue (Millions)` <dbl>,
## #   Metascore <dbl>, popularity <dbl>

The highest user rated and voted movie was titled “The Dark Knight”, and had the following description: “When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, the Dark Knight must come to terms with one of the greatest psychological tests of his ability to fight injustice.”

Learnings

From these titles and descriptions, we see that both descriptions used words to prompt some form of eagerness for the reader to watch the movie, as “The Dark Knight” uses must and “Bonjour Anne” uses unexpectedly and finds. Overall, we conclude that the word choice used for titles and descriptions does not really dictate the success of a movie.

Conclusion

Through our research questions, consisting of plots, statistical analysis, and learnings at each step, we have discovered some interesting notions behind the most popular IMDb movies from 2006 to 2016.

Firstly, there is a clear winner for the most popular genre of movies in terms of total money made, measured by Revenue (Millions). Adventure and action movies are consistently watched in movie theaters, over perhaps lighter comedy, romance, and drama movies, which people seem to wait to watch until they are released on DVD or are available to stream online. We did not find this surprising, as we expect adventure and action movies, like the Marvel Cinematic Universe (MCU) and Star Wars franchises, to be more enjoyable at movie theaters, with the full surround sound system and large screen, rather than at home after social media has spoiled the excitement over the series.

The bulk of our remaining conclusions arrived from the relationship between user (general public) ratings and critic ratings. Based on the intuitive notion that if the general public likes a movie its total revenue will be higher, it made sense that we reached the statistically backed conclusion that higher user ratings are positively correlated with higher revenue. User ratings drive us to the movie theater, and thus drive revenue, so movie directors and producers should truly invest in circulating the best public press and marketing for all movies released. Critic ratings play far less of a role in revenue, as we saw that a high critic rating most definitely does not guarantee high revenue. In fact, some movies reviewed poorly by critics outperformed the highly critic-rated movies in total revenue. From our density plot, we also learned that the distributions of critic and user ratings are not the same, and that the two groups rate movies differently, which should perhaps change. What is the purpose of critics if they are not pleasing the public and on par with what the public looks for in a movie? Overall, we see that eyeing the best critic rating is not nearly as important as eyeing the general public’s approval.

Future Work

It is important to note that our dataset consisted of only 1,000 of the most popular IMDb movies. Many movies are not considered, especially movies that performed poorly; these could give us even more insight into what makes a movie “flop”, because, relatively speaking, every movie in this dataset still made revenue, even if just a little.

Moreover, the column attributes in this dataset were somewhat limited, and really only provided insight into surface-level details of a movie. The movie timeline, from casting to release, is a long, lengthy process that includes many more factors that could be useful, including budget, time of year of release, demographic breakdown of viewers, and more. With such information, we could explore more data plotting methods, like a choropleth map to deduce where in the nation a movie grossed the most revenue from movie theaters, and what the demographic and age breakdown of the population is like there.

Although it might be impossible to perfectly predict what will constitute a successful movie, shaping movie critic guidelines and aiding directors, producers, and screenwriters from an earlier stage, with characteristics of past successful movies and tips for marketing (like when in the year to release, and where geographically to focus advertising), can help the industry prosper, especially when it will need to make a quick comeback after COVID-19.