Data

Our dataset contains 20 variables for each of 4,803 movies taken from The Movie Database (TMDb), a popular online database for movies. Specifically, we use a curated version of the dataset obtained from Kaggle, which removes 197 movies from the initial TMDb 5000 due to inaccuracy. The variables include quantitative measures such as the revenue, runtime, and average rating of each movie; categorical variables such as genre and production company; and text data in the form of keywords and plot synopses. The extensive information in the TMDb dataset allows us to answer a number of interesting questions about how different types of films are categorized and how they perform across time.

Although TMDb categorizes films into 20 different genres, many of these are vague and highly overlapping (for instance, many Crime movies are also Action movies). We therefore restrict our analysis to five distinct genres: Horror, Science Fiction, Western, Romance, and Crime.

genre count
Drama 2297
Comedy 1722
Thriller 1274
Action 1154
Romance 894
Adventure 790
Crime 696
Science Fiction 535
Horror 519
Family 513
Fantasy 424
Mystery 348
Animation 234
History 197
Music 185
War 144
Documentary 110
Western 82
Foreign 34
TV Movie 8

Of the 32 possible combinations of these five genres, 19 are represented in TMDb. Five of the six most common combinations are single genres, supporting our aim of finding mostly non-overlapping categories; the exception, which is the fifth most common, is Horror & Science Fiction.
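As a rough illustration of how such a tally could be made, the sketch below counts how often each combination of the five selected genres occurs. The `(title, genres)` row layout, the function name `combination_counts`, and the sample rows are assumptions for illustration, not the original analysis code or real TMDb records. Note that 32 = 2^5 counts every subset of the five genres, including the empty one (movies with none of them).

```python
from collections import Counter

# The five genres selected in the text.
FOCUS = {"Horror", "Science Fiction", "Western", "Romance", "Crime"}

def combination_counts(movies):
    """Count how often each subset of the FOCUS genres occurs.

    `movies` is an iterable of (title, genres) pairs; this layout is an
    assumption made for the sketch.
    """
    counts = Counter()
    for _title, genres in movies:
        # Keep only the genres of interest; a frozenset is hashable and
        # ignores the order in which genres were listed.
        combo = frozenset(genres) & FOCUS
        counts[combo] += 1
    return counts

# Toy rows for illustration only (not real TMDb records).
sample = [
    ("A", ["Horror"]),
    ("B", ["Horror", "Science Fiction", "Thriller"]),
    ("C", ["Romance", "Comedy"]),
    ("D", ["Drama"]),
    ("E", ["Horror"]),
]

counts = combination_counts(sample)
possible = 2 ** len(FOCUS)   # every subset of five genres, empty one included
represented = len(counts)    # subsets that actually occur in the data

for combo, n in counts.most_common():
    print(sorted(combo) or ["(none of the five)"], n)
```

Whether the empty combination should count as "represented" is a judgment call; dropping the `frozenset()` key before taking `len(counts)` would exclude it.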

The combinations that show up only once each in the database belong to some particularly strange films:

Genres Movie
Science Fiction and Western Wild Wild West (1999)
Western and Horror Ravenous (1999)
Crime, Horror, and Science Fiction Split Second (1992)
Romance and Horror Return of the Living Dead III (1993)
Romance, Crime, and Horror The Wraith (1986)
Romance, Crime, and Western Lone Wolf McQuade (1983)

What’s going on with genre?

How do these different types of genre films perform? Are some more popular than others?

Here we see the average revenue and rating for each of the 20 genres, with the five that we selected highlighted in red. We chose these genres because they tend to encompass more specific types of films, as opposed to broader genres like “Foreign” that are composed of many different kinds of movies.

The distributions of revenue for our five genres of interest are fairly similar: each is heavily right-skewed, with most films clustered at low revenues and a long tail of high earners, which makes sense given the massive variability in the scope of funding films can receive. Science Fiction has the highest values (mostly thanks to Avatar and the Avengers series), and Westerns are especially low-revenue, even in more modern times.

What’s the talk around genretown?

In addition to the quantitative variables we examined, TMDb provides keywords (or “tags”) for each film, along with plot synopses. How do these textual descriptions vary for each genre? Are certain words and/or tones more common for each genre?

These wordclouds of the keywords for each genre show clearly distinct sets of words. In order (from left to right, top to bottom), they are: Horror, Science Fiction, Western, Romance, and Crime, and the most common words (as well as the less common ones) reflect those. Users of TMDb could probably get a good impression of the genre of a film just by skimming its keywords.

Distinguishing between types of film based on their content is exactly the function keywords are supposed to provide to users. But what information is represented in the plot synopses? Is there a substantial difference in tone between the different genres’ plot descriptions?

Performing sentiment analysis on the plot synopses of each genre, we find that the proportion of positive words differs somewhat between genres, but none of the differences are statistically significant.
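The text does not say which lexicon or significance test was used, so the following is only a minimal sketch of one way such a comparison could be set up. The tiny `POSITIVE` and `NEGATIVE` word sets, the example synopses, and the choice of a pooled two-proportion z-test are all assumptions made for illustration.

```python
import math
import re

# Hypothetical toy lexicon; a real analysis would use a published one.
POSITIVE = {"love", "hero", "happy", "win"}
NEGATIVE = {"kill", "death", "fear", "murder"}

def sentiment_counts(text):
    """Return (positive, negative) word counts for one synopsis."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg

def two_proportion_z(pos_a, n_a, pos_b, n_b):
    """z statistic for the difference between two proportions (pooled SE).

    Undefined (division by zero) when the pooled proportion is 0 or 1.
    """
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Made-up synopses, one per genre, for illustration only.
romance_pos, romance_neg = sentiment_counts("A happy couple fall in love and win.")
horror_pos, horror_neg = sentiment_counts("Fear and death follow a hero.")

# Proportion of positive words among all sentiment-bearing words.
z = two_proportion_z(
    romance_pos, romance_pos + romance_neg,
    horror_pos, horror_pos + horror_neg,
)
print(round(z, 3))
```

With several genres compared pairwise, a correction for multiple comparisons would be a sensible addition before declaring any difference significant.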

They don’t make them like they used to…

Are movies more popular than they used to be? Do they make more money? How do these features vary across time? We don’t split up movies by genre for this analysis, to simplify things, so we’re considering the full dataset, not just the five main genres we looked at previously.

## 
## Break Even or Profit                 Loss 
##                 3477                 1326

We computed the profit of each movie by subtracting its budget from its revenue. We found that 3477 movies broke even (profit = 0) or made a profit, while 1326 movies made a loss (profit < 0). There is a general, though slight, increasing trend in the profit of movies over the years. In the earlier years, most movies broke even or made a profit, but from the 1980s onward we see an increase both in the number of movies that made a loss and in the number that broke even or made a profit.
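The classification behind the table above can be sketched as follows, taking profit as revenue minus budget. The function name `profit_label` and the sample rows are assumptions for illustration; the numbers are made up rather than taken from the dataset.

```python
from collections import Counter

def profit_label(revenue, budget):
    """Label a movie using the two categories from the table above."""
    profit = revenue - budget
    return "Loss" if profit < 0 else "Break Even or Profit"

# (year, revenue, budget) rows with made-up values, for illustration only.
sample = [
    (1990, 50, 20),    # profit of 30
    (1995, 10, 30),    # loss of 20
    (2005, 0, 0),      # profit of exactly 0 counts as breaking even
    (2010, 200, 150),  # profit of 50
]

labels = Counter(profit_label(revenue, budget) for _year, revenue, budget in sample)
for label, n in labels.items():
    print(label, n)
```

One thing this makes visible: if missing revenue and budget values are recorded as 0 in the data, those movies would land in the "Break Even or Profit" group, so it may be worth checking how many rows have both fields equal to zero.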

The plot of vote_average over years shows an overall decreasing trend, and the range of vote_average values widens over the years. In the later years (2000 and after), there are more movies with a vote_average of 0, but also more movies with a vote_average of 10. None of the movies released before 1983 had a vote_average of 0, and none of the movies released before 1998 had a vote_average of 10.

Conclusions

The main takeaways:

Perhaps the biggest concern here is selection bias. These movies are the top 5000(ish) in the database, ranked by an arcane and time-sensitive metric of TMDb’s own devising that heavily favors movies with more activity on their website. This naturally leads to greater representation of contemporary movies and of movies that received a wide release. Conversely, the older movies that do appear here are more likely to be canonized classics, skewing their average score higher.

There are other concerns as well, such as uncertainty about whether the revenue numbers are global or domestic and the lack of adjustment for inflation. We are less concerned about these, as they seem to affect the genres evenly and would only affect the scale of profit, not the positive/negative distinction.