Our dataset contains 20 variables for each of 4,803 movies taken from The Movie Database (TMDb), a popular online database for movies. Specifically, we use a curated version of the dataset obtained from Kaggle, which removes 197 movies from the initial TMDb 5000 because of inaccuracies. The variables include quantitative measures such as the revenue, runtime, and average rating of each movie; categorical variables such as genre and production company; and text data in the form of keywords and plot synopses. The extensive information in the TMDb dataset allows us to answer a number of interesting questions about how different types of films are categorized and how they perform across time.
Although TMDb categorizes films into 20 different genres, many of these are vague and highly overlapping (for instance, many Crime movies are also Action movies). We therefore restrict our analysis to five relatively distinct genres: Horror, Science Fiction, Western, Romance, and Crime.
Genre | Count |
---|---|
Drama | 2297 |
Comedy | 1722 |
Thriller | 1274 |
Action | 1154 |
Romance | 894 |
Adventure | 790 |
Crime | 696 |
Science Fiction | 535 |
Horror | 519 |
Family | 513 |
Fantasy | 424 |
Mystery | 348 |
Animation | 234 |
History | 197 |
Music | 185 |
War | 144 |
Documentary | 110 |
Western | 82 |
Foreign | 34 |
TV Movie | 8 |
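A table like the one above can be produced by tallying genre labels per movie. The sketch below is illustrative only: it assumes the Kaggle file stores each movie's genres as a JSON list of objects with a `name` field, and the sample rows and the `count_genres` helper are hypothetical, not the code used for this analysis.

```python
import json
from collections import Counter

# Hypothetical rows mimicking a `genres` column that holds a JSON list of
# {"id", "name"} objects per movie (an assumption about the file format).
rows = [
    '[{"id": 28, "name": "Action"}, {"id": 80, "name": "Crime"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[{"id": 80, "name": "Crime"}, {"id": 18, "name": "Drama"}]',
    '[]',  # a movie with no genre labels
]


def count_genres(genre_strings):
    """Count how many movies carry each genre label."""
    counts = Counter()
    for raw in genre_strings:
        # Use a set so a label repeated within one movie is counted once.
        names = {genre["name"] for genre in json.loads(raw)}
        counts.update(names)
    return counts


genre_counts = count_genres(rows)
for name, n in genre_counts.most_common():
    print(name, n)
```

Because a movie can carry several genre labels, the counts add up to more than the number of movies, which is why the table's total exceeds 4,803.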
Of the 32 possible combinations of these five genres, 19 appear in TMDb. Five of the six most common combinations are single genres, which supports our aim of finding mostly non-overlapping categories; the exception, the fifth most common, is Horror & Science Fiction.
The combinations that appear only once in the database belong to some particularly strange films:
Genres | Movie |
---|---|
Science Fiction and Western | Wild Wild West (1999) |
Western and Horror | Ravenous (1999) |
Crime, Horror, and Science Fiction | Split Second (1992) |
Romance and Horror | Return of the Living Dead III (1993) |
Romance, Crime, and Horror | The Wraith (1986) |
Romance, Crime, and Western | Lone Wolf McQuade (1983) |
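The combination counts can be sketched as follows. The figure of 32 is 2^5, the number of subsets of five genres, which includes the empty combination (a movie with none of the five). The movie titles and the `focus_combo` helper here are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter
from itertools import combinations

FOCUS = ("Horror", "Science Fiction", "Western", "Romance", "Crime")

# Every subset of the five genres, including the empty one: 2**5 = 32.
all_subsets = [
    frozenset(combo)
    for size in range(len(FOCUS) + 1)
    for combo in combinations(FOCUS, size)
]

# Hypothetical movies and their full genre sets, for illustration only.
movies = {
    "Movie A": {"Horror", "Thriller"},
    "Movie B": {"Horror", "Science Fiction"},
    "Movie C": {"Western", "Horror"},
    "Movie D": {"Drama"},
    "Movie E": {"Horror"},
}


def focus_combo(genres):
    """Keep only the genres of interest from a movie's genre set."""
    return frozenset(genres) & frozenset(FOCUS)


combo_counts = Counter(focus_combo(genres) for genres in movies.values())

# Non-empty combinations that occur exactly once.
singletons = [set(combo) for combo, n in combo_counts.items() if n == 1 and combo]

print(len(all_subsets))
for combo, n in combo_counts.most_common():
    print(sorted(combo), n)
```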
How do these different types of genre films perform? Are some more popular than others?
Here we see the average revenue and rating for each of the 20 genres, with the five that we selected highlighted in red. We chose these genres over others because they tend to encompass more specific types of films, as opposed to broader genres like “Foreign” that are composed of many different kinds of movies.
The distributions of revenue for our five genres of interest are fairly similar (heavily right-skewed, with most films clustered at low revenues and a long tail of high earners, which makes sense given how widely the scope of film funding varies), but Science Fiction has the highest values (mostly thanks to Avatar and the Avengers series), and Westerns earn especially low revenue, even in more modern times.
In addition to the quantitative variables we examined, TMDb provides keywords (or “tags”) for each film, along with plot synopses. How do these textual descriptions vary for each genre? Are certain words and/or tones more common for each genre?
These word clouds of the keywords for each genre show clearly distinct sets of words. In order (left to right, top to bottom), the genres are Horror, Science Fiction, Western, Romance, and Crime, and both the most common and the less common words reflect those genres. Users of TMDb could probably get a good impression of the genre of a film just by skimming its keywords.
Distinguishing between types of film based on their content is exactly the function keywords are supposed to provide to users. But what information is represented in the plot synopses? Is there a substantial difference in tone between the different genres’ plot descriptions?
Performing sentiment analysis on the plot synopses of each genre, we find that the proportions of positive words differ somewhat between genres, but none of the differences are statistically significant.
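To make the comparison concrete, here is a minimal sketch of one way to compute the proportion of positive words and test a difference between two genres. Everything in it is an assumption for illustration: the tiny word lists stand in for a real sentiment lexicon, the synopses are invented, and the pooled two-proportion z-test is one plausible choice of test, not necessarily the one used for the result above.

```python
import math
import re

# Tiny hypothetical lexicons, for illustration only; a real analysis would
# use an established sentiment lexicon.
POSITIVE = {"love", "hope", "happy", "friend", "win"}
NEGATIVE = {"kill", "death", "fear", "murder", "lose"}


def sentiment_counts(synopses):
    """Return (positive, negative) word counts across a list of synopses."""
    pos = neg = 0
    for text in synopses:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return pos, neg


def two_proportion_z_test(pos_a, n_a, pos_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all words share one polarity; no difference to test
    z = (pos_a / n_a - pos_b / n_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented synopses, for illustration only.
romance = [
    "Two friends fall in love and hope to win.",
    "A happy couple fear they will lose each other.",
]
horror = [
    "A murder brings fear and death to a small town.",
    "Survivors hope to escape before the creatures kill again.",
]

r_pos, r_neg = sentiment_counts(romance)
h_pos, h_neg = sentiment_counts(horror)
z, p_value = two_proportion_z_test(r_pos, r_pos + r_neg, h_pos, h_pos + h_neg)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With word counts this small the normal approximation behind the z-test is unreliable; the toy data only shows the mechanics. Comparing all five genres pairwise would also call for a multiple-comparison correction.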
Are movies more popular than they used to be? Do they make more money? How do these features vary across time? We don’t split up movies by genre for this analysis, to simplify things, so we’re considering the full dataset, not just the five main genres we looked at previously.
Outcome | Count |
---|---|
Break even or profit | 3477 |
Loss | 1326 |
We investigated the profit of movies by subtracting budget from revenue. We found that 3477 movies broke even (profit = 0) or made a profit, while 1326 movies made a loss (profit < 0). There is a general, though slight, increasing trend in the profit of movies over the years. In the earlier years, most movies broke even or made a profit; from the 1980s onward, both the number of movies that made a loss and the number that broke even or made a profit increase.
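The classification described above can be sketched in a few lines. The rows and the `profit_status` helper are hypothetical and only illustrate the rule profit = revenue - budget with the break-even case grouped with profit.

```python
from collections import Counter

# Hypothetical (title, budget, revenue) rows, for illustration only.
movies = [
    ("Movie A", 100, 250),  # profit
    ("Movie B", 50, 50),    # breaks even
    ("Movie C", 80, 20),    # loss
    ("Movie D", 0, 0),      # zero budget and revenue: counted as break even
]


def profit_status(budget, revenue):
    """Classify a movie by the sign of profit = revenue - budget."""
    profit = revenue - budget
    return "Break Even or Profit" if profit >= 0 else "Loss"


status_counts = Counter(profit_status(budget, revenue) for _, budget, revenue in movies)
print(status_counts)
```

One caveat of this rule: if the data records a missing budget or revenue as 0 (an assumption worth checking), such rows land in the break-even group or are classified by the one value that is present, which could inflate either count.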
The plot of `vote_average` over years shows an overall decreasing trend. The range of `vote_average` values also increases over the years. It’s also interesting to note that in the later years (2000 and after), there are more movies with a `vote_average` value of 0 but, at the same time, more movies with a `vote_average` value of 10. None of the movies released before 1983 had a `vote_average` value of 0, and none of the movies released before 1998 had a `vote_average` value of 10.
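Observations like these can be checked directly from the data. The sketch below uses hypothetical (release year, `vote_average`) pairs and two illustrative helpers; it is not the code behind the plot.

```python
from collections import defaultdict

# Hypothetical (release_year, vote_average) pairs, for illustration only.
ratings = [
    (1975, 7.8), (1983, 0.0), (1990, 6.1), (1998, 10.0),
    (2005, 0.0), (2005, 5.5), (2010, 10.0),
]


def score_range_by_year(pairs):
    """Map each year to the (min, max) vote_average observed that year."""
    by_year = defaultdict(list)
    for year, score in pairs:
        by_year[year].append(score)
    return {year: (min(scores), max(scores)) for year, scores in sorted(by_year.items())}


def first_year_with_score(pairs, target):
    """Earliest release year with a given vote_average, or None if absent."""
    years = [year for year, score in pairs if score == target]
    return min(years) if years else None


ranges = score_range_by_year(ratings)
print(ranges)
print(first_year_with_score(ratings, 0.0), first_year_with_score(ratings, 10.0))
```

Scores of exactly 0 or 10 may simply reflect films with very few or no votes; checking them against the vote count would show whether that is the case here.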
The main takeaways:
There are differences in the financial variables across genres, but we don’t see a marked difference in the shape of the distributions.
The keywords of the five genres of interest do a good job distinguishing between them, but there is not a significant difference in sentiment between the various synopses.
More recent movies in the dataset tend to be slightly more profitable, but receive lower scores.
Perhaps the biggest concern here is selection bias. These movies are the top 5000(ish) in the database, ranked by an arcane and time-sensitive metric of TMDb’s own devising that heavily favors movies with more activity on the website. This naturally leads to greater representation of contemporary movies and of movies that received a wide release. Conversely, the older movies appearing here are more likely to be canonized classics, which skews their average score higher.
There are also some other concerns, such as uncertainty about whether the revenue numbers are global or domestic and the lack of adjustment for inflation, but we are less concerned about these, as they seem to affect the genres evenly and would only affect the scale of profit, not the positive/negative distinction.