Intro

This dataset features over 10 thousand movies from IMDB and includes information on numerous variables such as popularity, revenue, budget, and cast. The variables used in this analysis are revenue, popularity, time, budget, tagline, and genres. All of them are numerical except time, tagline, and genres: genres is a categorical variable, and tagline is text.

Research Questions

Using our dataset, we explored our research questions in two steps: first by examining the relationship of revenue and popularity to other variables in the dataset, and then by investigating the trends of movies over time.

EDA

Looking at the dataset, we decided that revenue and popularity are the best available measures of a movie's success. However, on inspecting the dataset, we noticed that the median revenue is 0. Since revenue is measured in USD, it seemed strange that there are so many 0 values. In the histogram, these 0 values make the entire distribution very skewed, to the point that the bars for higher revenues are not even visible.

Even though this means losing over half of the data, we decided to remove the rows with revenue equal to 0 whenever the revenue variable is used in our study.
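The effect of the zero values on the median can be illustrated with a minimal sketch. It is written in Python for illustration only (the report's output comes from R), and the `revenues` list is made-up data, not values from the dataset.

```python
from statistics import median

# Hypothetical revenues in USD; zeros stand in for the unexplained 0 entries.
revenues = [0, 0, 0, 0, 1_200_000, 0, 45_000_000, 0, 310_000_000, 0, 8_500_000]

# With more than half of the values at 0, the median is 0.
median_all = median(revenues)

# Keep only rows with positive revenue, as done in the analysis.
nonzero = [r for r in revenues if r > 0]
median_nonzero = median(nonzero)

# Share of rows dropped by the filter.
share_dropped = 1 - len(nonzero) / len(revenues)

print(median_all, median_nonzero, round(share_dropped, 2))
```

The median only becomes informative once the zero rows are excluded, at the cost of discarding the share of rows reported by `share_dropped`.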

The data here was filtered to show only budgets and revenues that did not exceed $1e+8 (100 million USD). The graph shows that, in this range, low-popularity movies far outnumber medium- and high-popularity movies. It also suggests a relationship between popularity and revenue, which we examine in more detail in the next section.

What variables affect the success of a movie?

First, we started by examining the relationship between revenue and popularity in our dataset as hinted in our EDA.

## 
## Call:
## lm(formula = movies.popularity ~ movies.revenue_adj, data = adj_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9000 -0.4131 -0.1827  0.1802 27.0449 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.041e-01  1.862e-02   32.44   <2e-16 ***
## movies.revenue_adj 3.833e-09  8.107e-11   47.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.122 on 4848 degrees of freedom
## Multiple R-squared:  0.3156, Adjusted R-squared:  0.3154 
## F-statistic:  2235 on 1 and 4848 DF,  p-value: < 2.2e-16

The graph above plots the revenue of each movie against its popularity. To clean up the data, we removed any rows that had revenue = 0, as explained above, and limited the popularity axis (y-axis) to the range 0 to 10. We then fitted a linear regression to see whether there was a linear relationship between the two variables. The slope is positive and statistically significant, but the adjusted R^2 is only 0.3154, so revenue explains roughly a third of the variation in popularity. This means that movies can have low revenue and still be popular (perhaps movies that are popular among a niche group of people), and vice versa.
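To make the R^2 figure concrete, here is a minimal least-squares sketch in Python (for illustration only; the model in this report was fitted with R's lm). The helper names `fit_line` and `adjusted_r2` and the sample points are assumptions introduced for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2, which penalizes R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical (revenue, popularity) points, not taken from the dataset.
revenue = [1e7, 5e7, 1e8, 2e8, 4e8]
popularity = [0.4, 1.1, 0.7, 2.0, 1.6]
slope, intercept, r2 = fit_line(revenue, popularity)
print(slope, intercept, r2)
```

With one predictor and several thousand observations, as in the output above, the adjusted R^2 is almost identical to the multiple R^2, which is why the two reported values differ only in the fourth decimal place.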

Another hypothesis we made was that the sentiment of the movie taglines could be a good predictor for movie popularity.

## # A tibble: 5,230 × 5
##    original_title             popularity sentiment n_words total_assigned_words
##    <fct>                           <dbl> <chr>       <int>                <int>
##  1 (500) Days of Summer            3.24  negative        1                    3
##  2 (500) Days of Summer            3.24  positive        2                    3
##  3 *batteries not included         0.683 positive        1                    1
##  4 $5 a Day                        0.299 negative        1                    1
##  5 10                              0.245 negative        1                    1
##  6 10 Things I Hate About You      1.77  negative        1                    1
##  7 10 Years                        1.11  negative        1                    3
##  8 10 Years                        1.11  positive        2                    3
##  9 10,000 BC                       1.84  positive        1                    1
## 10 100 Degrees Below Zero          0.266 negative        2                    2
## # … with 5,220 more rows

The graph above shows the distribution of tagline sentiment across the dataset against the popularity of the movie. The x-axis shows popularity, taken from the “popularity” column of the dataset and grouped by its integer part. The y-axis shows the percentage of taglines with negative or positive sentiment in each popularity group, and the bars are filled accordingly. If we define “lower-ranking” movies as those with a popularity value below 5 and “higher-ranking” movies as those above 5, we can see that, for the most part, “lower-ranking” movies have more positive sentiment in their taglines, while “higher-ranking” movies generally have more negative sentiment. This pattern should be read with caution: in the counts below, the split in groups 0 to 4 is close to even, and the groups above 4 contain very few movies.

## # A tibble: 21 × 3
##    `as.integer(popularity)` sentiment     n
##                       <int> <chr>     <int>
##  1                        0 negative   2333
##  2                        0 positive   1895
##  3                        1 negative    336
##  4                        1 positive    353
##  5                        2 negative     89
##  6                        2 positive     85
##  7                        3 negative     36
##  8                        3 positive     29
##  9                        4 negative     13
## 10                        4 positive     15
## # … with 11 more rows
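The percentages plotted on the y-axis can be recomputed from the counts in the table. The sketch below is in Python for illustration (the report itself was produced in R) and uses only the ten rows displayed above; the remaining 11 rows are not shown in the output and are therefore left out.

```python
# Counts copied from the displayed rows of the table (popularity groups 0 to 4).
counts = {
    0: {"negative": 2333, "positive": 1895},
    1: {"negative": 336, "positive": 353},
    2: {"negative": 89, "positive": 85},
    3: {"negative": 36, "positive": 29},
    4: {"negative": 13, "positive": 15},
}

# Share of positive sentiment within each popularity group.
positive_share = {
    group: c["positive"] / (c["positive"] + c["negative"])
    for group, c in counts.items()
}

for group, share in positive_share.items():
    total = counts[group]["positive"] + counts[group]["negative"]
    print(f"group {group}: n = {total}, positive = {share:.1%}")
```

Printing the group size next to each percentage makes it easy to see how much weight each bar in the plot deserves.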

Finally, we wanted to see how the number of movies a director has in the dataset relates to revenue.

The directors in this dataset have been grouped by the number of movies on the list that they have directed. Directors with fewer than 10 movies are classified as “small”, between 10 and 18 as “medium”, and more than 18 as “large”. The graph suggests that the general density of revenues is approximately the same across groups, with more frequently appearing directors claiming slightly higher revenues. We also see that the highest density occurs with small-scale directors.
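The grouping rule can be sketched as follows, in Python for illustration. The function name `director_class` and the director names are hypothetical; only the thresholds (fewer than 10, 10 to 18, more than 18) come from the text.

```python
from collections import Counter


def director_class(n_movies):
    """Map a director's movie count to the size class used in the text."""
    if n_movies < 10:
        return "small"
    if n_movies <= 18:
        return "medium"
    return "large"


# Hypothetical list with one entry per movie, naming its director.
movie_directors = ["Director A"] * 22 + ["Director B"] * 12 + ["Director C"] * 3

# Count movies per director, then classify each director.
movies_per_director = Counter(movie_directors)
classes = {name: director_class(n) for name, n in movies_per_director.items()}
print(classes)
```

Note that 10 and 18 both fall in the “medium” class under this reading of “between 10 and 18”; whether the boundaries were inclusive in the original analysis is an assumption.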

How did the popularity and revenue of movies change over time?

In this research question, we were interested in how the revenue and popularity of movies developed over time. As we examined our dataset, we noticed numerous observations with revenue = 0, which would distort our graphs. Hence, we used a refined portion of the dataset in which revenue > 0.

Since our dataset includes movies released from 1960 to 2015, we decided to focus on “revenue_adj”, which accounts for inflation over the decades and makes comparisons across such a long time period more accurate. We created a revenue summary of the data with three variables: “release_year”, “revenue_adj”, and “count”. Grouping by release_year, we summed the revenue of all movies released in each year and counted how many movies were released that year, so that we could also analyze a secondary time series of the average revenue per movie over time.

## # A tibble: 56 × 3
##    release_year revenue_adj count
##           <int>       <dbl> <int>
##  1         1960 1069117146.     7
##  2         1961 2463621899.    10
##  3         1962 1553996299.     9
##  4         1963 1334357137.     7
##  5         1964 2397193109.     8
##  6         1965 3170184648.     5
##  7         1966  569262322.     5
##  8         1967 4823050701.    14
##  9         1968 1659601419.    12
## 10         1969 1450145313.     5
## # … with 46 more rows

This time series plot, with total revenue (adjusted for inflation) on the y-axis and release_year on the x-axis, shows the total movie revenue per year from 1960 to 2015. Although there are fluctuations between consecutive years, the total movie revenue per year has increased overall, and the total movie revenue in 2015 is nearly 5 times the total movie revenue in 1960.

This time series plot, with average revenue (adjusted for inflation) on the y-axis and release_year on the x-axis, shows the average movie revenue per year from 1960 to 2015. Unlike in the previous plot, the average movie revenue in 2015 is lower than in 1960, with the peak of average movie revenue occurring in 1965. Although this contrasts with the previous plot, it is plausible once we look at the “count” variable. In the 1960s, at the beginning of our time frame, the refined data contains only about 5 to 14 movies per year, compared with around 200 movies per year in the 2000s and 2010s. Because the average divides the total revenue of a year by the number of movies released that year, the average revenue per movie can drop even while the total revenue is generally increasing.
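The relationship between the two series can be checked on the ten years displayed in the revenue summary table. The sketch below is in Python for illustration (the report was produced in R); the values are copied from the displayed rows, which the table prints rounded to whole dollars.

```python
# (release_year, total revenue_adj, count) from the displayed rows, 1960 to 1969.
revenue_summary = [
    (1960, 1069117146, 7),
    (1961, 2463621899, 10),
    (1962, 1553996299, 9),
    (1963, 1334357137, 7),
    (1964, 2397193109, 8),
    (1965, 3170184648, 5),
    (1966, 569262322, 5),
    (1967, 4823050701, 14),
    (1968, 1659601419, 12),
    (1969, 1450145313, 5),
]

# Average revenue per movie = yearly total / number of movies that year.
average_revenue = {year: total / count for year, total, count in revenue_summary}

# Year with the highest average among the displayed rows.
peak_year = max(average_revenue, key=average_revenue.get)
print(peak_year, round(average_revenue[peak_year]))
```

Among these ten years, 1967 has the largest total but 1965 has the largest average, because its total is spread over only 5 movies. The same mechanism lets the average fall in later decades while the total rises.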

## # A tibble: 56 × 3
##    release_year popularity count
##           <int>      <dbl> <int>
##  1         1960       7.13     7
##  2         1961       7.88    10
##  3         1962       7.35     9
##  4         1963       6.95     7
##  5         1964      10.3      8
##  6         1965       5.27     5
##  7         1966       1.98     5
##  8         1967      12.5     14
##  9         1968       9.15    12
## 10         1969       4.92     5
## # … with 46 more rows

This time series plot, with total popularity on the y-axis and release_year on the x-axis, shows the total movie popularity per year from 1960 to 2015. Compared with our revenue time series plots, there are fewer and smaller fluctuations between consecutive years, and the total movie popularity per year has increased. There is a huge upward surge in movie popularity in the 2000s and 2010s, and the total movie popularity in 2015 is nearly 70 times that of 1960. This is very likely a result of our society's rapid transition into a technological era and the development of more advanced filmmaking technology.

This time series plot, with average popularity on the y-axis and release_year on the x-axis, shows the average movie popularity per year from 1960 to 2015. There is a series of bumps in a downward trend until the mid 1990s, after which the average popularity per year turns upward and gradually increases. This is plausible despite the significantly greater number of movies released in the 2000s and 2010s, because more people are watching movies than before, and more diverse age groups, especially younger audiences, are now actively doing so.

Conclusion

Through our analysis of numerous variables and patterns via diverse plotting methods, we were able to draw these conclusions about our research questions:

-What variables affect the success of a movie? The linear regression of popularity on revenue had an adjusted R^2 of 0.3154, which is not very high. Hence, movies can have low revenue and still be popular. “Lower-ranking” movies have more positive sentiment in their taglines, while “higher-ranking” movies generally have more negative sentiment in their taglines. The general density of revenues is approximately the same among the classes of directors, but directors with more movies in the dataset claimed slightly higher revenues.

-How has the popularity and revenue of movies changed over time? The total movie revenue per year increased from 1960 to 2015, while the average movie revenue per year decreased. Both the total movie popularity per year and the average movie popularity per year have seen an upward trend over this period, with a huge surge in the 2010s.

-How are the genres of the movies related to their release dates? From the very high standardized Pearson residual, we can observe a significant association between Horror and October. Other strong positive associations are Family movies in November and Drama movies in September. On the other hand, there are negative associations for Adventure and Family movies in September and for Horror and Thriller movies in December. It seems likely that, because of Thanksgiving in November and Christmas in December, there is an unusually high number of Family movies and a low number of Horror and Thriller movies during these two months.
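For readers unfamiliar with standardized Pearson residuals, the sketch below shows how they are computed for a table of counts. It is written in Python for illustration and uses made-up counts; the function name `standardized_residuals` and the numbers are assumptions, not the values behind the conclusion above.

```python
from math import sqrt


def standardized_residuals(table):
    """Standardized Pearson residuals for a contingency table of counts.

    For each cell: (observed - expected) /
    sqrt(expected * (1 - row_total / n) * (1 - col_total / n)),
    where expected = row_total * col_total / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            scale = sqrt(expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out_row.append((observed - expected) / scale)
        residuals.append(out_row)
    return residuals


# Hypothetical counts. Rows: Horror, Family. Columns: October, all other months.
genre_by_month = [
    [30, 10],
    [10, 30],
]
residuals = standardized_residuals(genre_by_month)
print(residuals)
```

A large positive residual marks a genre and month combination that occurs more often than expected if genre and month were independent; a large negative residual marks one that occurs less often.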

A further question we could look into is the critical success of production companies. For example, we could take the position of an agent who is trying to book their client for a critically successful movie (perhaps to win an Oscar). To analyze that, we could look at the budget and revenue of movies for the production companies listed in the dataset. However, many movies have a low budget or poor box office revenue and are still loved by the critics. Even the popularity column would not be sufficient for a close analysis, since that data is collected from the number of people who simply searched for a movie or added it to their watchlist. There would need to be a separate “critical popularity” column that captures how critics perceived the movies in the database.