Description of Dataset

This dataset comes from the movie-ratings website IMDb and describes the top 250 movies of all time, according to that website. We downloaded the data from Kaggle. It contains 250 entries and originally came with 23 columns describing each movie. The data was raw and required some preprocessing to make our analysis as efficient as possible. Our new columns are as follows:

  • Title: The title of the movie
  • Year: The year in which the movie was released
  • Genre: Originally text data but made into binary columns for individual movie genres
  • Group: A tier-based system that sorts the movies into categories based on their placement on the list, in increments of 50
  • Director: Main director of the movie
  • Actors: The top 4 actors in each movie were listed in the data set, and thus, four separate columns were made
  • Production: The main production company
  • Languages: The number of languages the movie was made in
  • Countries: The number of countries the movie was made in
  • Awards: The number of awards the movie won
  • Metascore: The Metascore rating, a rating based on different criteria and judged by well-respected movie critics
  • IMDb rating: The rating from the website the dataset was derived from
  • IMDb votes: The number of user votes that produced the IMDb rating
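The Genre and Group columns described above can be sketched as follows. This is a minimal illustration, not the preprocessing code we actually ran: the field names (rank, title, genre), the comma-separated genre format, and the three sample rows are assumptions made up for the example.

```python
# Hypothetical raw rows; field names and values are illustrative only,
# not taken from the actual Kaggle file.
raw_movies = [
    {"rank": 1, "title": "Movie A", "genre": "Crime, Drama"},
    {"rank": 51, "title": "Movie B", "genre": "Adventure, Drama"},
    {"rank": 250, "title": "Movie C", "genre": "Comedy"},
]


def tier_group(rank, size=50):
    """Map a list position to a tier: 1 for ranks 1-50, 2 for 51-100, and so on."""
    return (rank - 1) // size + 1


def genre_indicators(genre_text, all_genres):
    """Turn a comma-separated genre string into 0/1 indicator columns."""
    present = {g.strip() for g in genre_text.split(",")}
    return {g: int(g in present) for g in all_genres}


# Collect every genre that appears anywhere, so each movie gets the same columns.
all_genres = sorted(
    {g.strip() for m in raw_movies for g in m["genre"].split(",")}
)

processed = [
    {
        "title": m["title"],
        "group": tier_group(m["rank"]),
        **genre_indicators(m["genre"], all_genres),
    }
    for m in raw_movies
]

for row in processed:
    print(row)
```

With this numbering, group 1 holds ranks 1-50 and group 5 holds ranks 201-250.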

This dataset is extremely descriptive, and the variables span many different characteristics of the movies themselves as well as different audience ratings. It has qualitative, quantitative, and text data. Most importantly, we are interested in this dataset because we all love to watch movies, so we wanted to see the different factors that drive specific movies to be so popular.

Overarching Research Questions

Our original idea when choosing the dataset was to look into what makes a movie successful, or what places it on this list of the best movies of all time. This general question can be approached in many ways, and we chose the three paths we were most interested in exploring.

  • We have two different rating systems and a count for votes on one of the rating systems. We were curious to explore what the differences were in these two columns. What is the best way to rate movies?
  • There are many different genres of movies in the original list. As we preprocessed the genre data into binary columns, this number came to over 20. Most of the movies on our list are award winners as well. Which genres make great award-winning movies?
  • Movies continue to evolve over time, whether through inclusivity, improved technology, or accessibility. We are interested in exploring which years and decades were top performers in regard to IMDb’s top 250 movies. Which years and decades were the most successful?

Overall, we believe these research questions provide a thoughtful means of analyzing the dataset. Our answers to these questions are provided below.

Research Question 1: What are the Differences in Rating Scores?

In this research question, we mainly look at two different rating systems and the vote count for one of them. We are curious to explore how these columns differ. We have no preconceived notions heading into these analyses, because we know the dataset contains only the top movies of all time.

We first looked at the relationship between the IMDb rating and the number of votes that contributed to it. We also colored each point on the scatter plot by the tier-based group variable described above. We wanted to see whether higher rated movies are rated by more users on the IMDb platform.

The graph shows no clear relationship between a movie's number of votes and its rating; the points are widely scattered with no visible trend. Ultimately, the number of votes will not help in our analysis.
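One way to put a number on the visual impression from a scatter plot like this is the Pearson correlation coefficient. The sketch below is only an illustration of that calculation: the vote counts and ratings are invented placeholder values, not rows from our dataset, and we did not compute this statistic in the analysis above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Placeholder values for illustration only (not from the dataset).
votes = [250_000, 900_000, 1_400_000, 600_000, 2_000_000]
ratings = [8.3, 8.1, 8.6, 8.5, 8.2]

print(f"r = {pearson_r(votes, ratings):.3f}")
```

A value near 0 would agree with the scattered pattern described above, while a value near 1 or -1 would indicate a strong linear relationship. The same function could be applied to the Metascore and IMDb rating columns.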

Next, we looked at the two different ratings in our dataset: Metascore and IMDb rating. These are two independent rating sources, and we wanted to examine the relationship between them.

Our results are limited because we are looking at only the top movies of all time, so we expect both scores to be high. We do see some relationship between the two. Movies in the higher IMDb tiers do not receive low Metascores. Metascore spans a wider range than the IMDb rating, but both ratings are consistently favorable.

To further this idea, we want to explicitly compare the tiered groups against Metascore. This is done with a series of boxplots of Metascore, one for each group.

Our results show that the higher a group's placement on the list, the higher its Metascore. This appears as a step relationship, where groups 3-5 hover around the same distribution, and the top 2 groups have higher average Metascores. It is worth noting that each boxplot has wide tails; however, the top group has the shortest tails of all the distributions. This adds evidence to the accuracy of the groups we have made.
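The numbers behind a set of boxplots like these can be sketched as a per-group summary. This is a rough illustration rather than the code used for the figure: the (group, Metascore) pairs are placeholder values, and the quartiles use the default method of Python's statistics.quantiles, which may differ slightly from the quartile rule a plotting library applies.

```python
from collections import defaultdict
from statistics import median, quantiles


def metascore_by_group(rows):
    """Summarize Metascore per tier group from (group, metascore) pairs.

    Each group needs at least two scores so that quartiles can be computed.
    """
    by_group = defaultdict(list)
    for group, score in rows:
        by_group[group].append(score)

    summary = {}
    for group in sorted(by_group):
        scores = by_group[group]
        q1, _, q3 = quantiles(scores, n=4)  # three quartile cut points
        summary[group] = {
            "n": len(scores),
            "min": min(scores),
            "q1": q1,
            "median": median(scores),
            "q3": q3,
            "max": max(scores),
        }
    return summary


# Placeholder (group, Metascore) pairs for illustration only.
sample = [(1, 90), (1, 100), (1, 95), (2, 80), (2, 70), (2, 90)]

for group, stats in metascore_by_group(sample).items():
    print(group, stats)
```

Comparing the medians across groups would show the step pattern described above, and the distance between min and max corresponds to the length of the tails.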

Research Question 2: What Genres Are Most Common in the Top 250 Movies? What Genres Correlate with Winning Awards?

To explore the first question, we used a word cloud to see which genres appear most often within the genre variable, since each movie can have multiple genres. To explore the second question, we used a simple bar chart to see the distribution of awards among the genres.

This word cloud depicts the genres and their frequency within the top 250. Each movie was categorized into multiple genres, but “drama” appears to be the most common genre among movies in the top 250. “Crime” and “adventure” look to be the next most common. From this, we can conclude that the “drama” genre is associated with a movie being in the top 250. Our next step is to look at the number of Oscars won by genre to see if there is a common genre that makes a movie successful.

This bar chart shows the number of awards (Oscars) won by genre. It is very clear that movies categorized as “drama” have won more awards than any other genre. This makes sense, as drama is also the genre most often used to categorize a movie. Adventure has the next most awards, which makes sense for the same reason.
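The counts behind the word cloud and bar chart can be sketched with a simple tally. Because a genre that labels more movies will naturally accumulate more awards, the sketch also divides awards by the number of movies in each genre. That per-movie figure is an extra illustration we did not include in the charts above, and the movie records here are made-up placeholders.

```python
from collections import Counter


def genre_summary(movies):
    """Tally movies and awards per genre.

    Each movie is a dict with a list of genres and an award count, so a
    movie with several genres contributes to each of them.
    """
    movie_counts = Counter()
    award_counts = Counter()
    for movie in movies:
        for genre in movie["genres"]:
            movie_counts[genre] += 1
            award_counts[genre] += movie["awards"]

    return {
        genre: {
            "movies": movie_counts[genre],
            "awards": award_counts[genre],
            "awards_per_movie": award_counts[genre] / movie_counts[genre],
        }
        for genre in movie_counts
    }


# Placeholder records for illustration only (not from the dataset).
sample_movies = [
    {"genres": ["Drama", "Crime"], "awards": 3},
    {"genres": ["Drama"], "awards": 1},
    {"genres": ["Adventure", "Drama"], "awards": 2},
    {"genres": ["Adventure"], "awards": 4},
]

summary = genre_summary(sample_movies)
for genre, stats in sorted(summary.items(), key=lambda kv: -kv[1]["movies"]):
    print(genre, stats)
```

If drama still led on awards per movie, that would be stronger evidence than the raw totals alone.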

Research Question 3: Are There Different Years That Performed Better in Regard to Top 250 Movies?

For our final research question, we were interested in exploring the number of top 250 IMDb movies released by decade, with respect to the movie rankings. We first created a time series graph displaying the number of movies released by date, colored by the movie ranking group.

We can see that, between 1990 and 2020, there was a spike in top 250 IMDb movies. Specifically, around the 1990s, one year saw 5 top 50 movies released, making it a great year for movie releases. It also seems that the top 250 movies released in earlier decades (like the 1950s) ended up lower in the rankings, suggesting that recent decades have produced higher ranked movies.
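The decade-by-group counts underlying this kind of comparison can be sketched as below. The (year, group) pairs are placeholders chosen for the example, not entries from our dataset, and the function names are hypothetical.

```python
from collections import Counter


def decade_of(year):
    """Round a release year down to its decade, e.g. 1994 -> 1990."""
    return (year // 10) * 10


def decade_group_counts(movies):
    """Build a decade x group count table from (year, group) pairs.

    Returns the sorted decade labels, the sorted group labels, and a table
    with one row per decade and one column per group.
    """
    counts = Counter((decade_of(year), group) for year, group in movies)
    decades = sorted({decade for decade, _ in counts})
    groups = sorted({group for _, group in counts})
    table = [[counts[(d, g)] for g in groups] for d in decades]
    return decades, groups, table


# Placeholder (release year, ranking group) pairs for illustration only.
sample = [(1994, 1), (1994, 1), (1999, 2), (1957, 4), (2008, 1)]

decades, groups, table = decade_group_counts(sample)
print("groups:", groups)
for decade, row in zip(decades, table):
    print(decade, row)
```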

To visualize this further, we made a mosaic plot comparing the movies released by decade and the ranking group.

We can see that, within ranking groups 1 and 2, or rankings 1-100 on the top 250 list, the standardized residuals are significant. This indicates that there were significantly more top 100 movies released in the 1990s, which is also shown in our time series plot.
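For readers unfamiliar with the residuals a mosaic plot shades by, the sketch below computes adjusted standardized residuals for a table of counts. It is an illustration under assumptions: the 2x2 table is invented, and we are assuming the adjusted form of the residual, which may not be the exact variant our plotting tool used.

```python
from math import sqrt


def adjusted_residuals(table):
    """Adjusted standardized residuals for a table of observed counts.

    For each cell: (observed - expected) / sqrt(expected * (1 - row share)
    * (1 - column share)), where expected = row total * column total / N.
    Assumes at least two non-empty rows and two non-empty columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = (
                expected
                * (1 - row_totals[i] / n)
                * (1 - col_totals[j] / n)
            )
            out.append((observed - expected) / sqrt(variance))
        residuals.append(out)
    return residuals


# Invented counts for illustration: rows are decades, columns are groups.
example = [
    [10, 20],
    [20, 10],
]

for row in adjusted_residuals(example):
    print([round(value, 2) for value in row])
```

As a rule of thumb, a cell with a residual beyond roughly 2 in absolute value holds more (positive) or fewer (negative) movies than independence between decade and group would predict.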

Main Conclusions

Overall, we think the two rating systems are similar, and compare well to each other. There is no real way to tell which rating system is “better”, but we see that they are well correlated, and therefore, both suffice as accurate rating systems for the data.

From visualizing the movie genres and the number of Oscars won, we can tell that movies in the top 250 are often categorized in the “drama” genre. This relates to our overall question because being in the “drama” genre is a common factor among movies on the list. The same applies to our secondary question of what makes a movie successful: the “drama” genre is correlated with winning awards.

When we looked at the release dates of the movies, sectioned off by ranking group in increments of 50, we found that the 1990s was a great decade for top movies worldwide, with significantly more top 100 movies than any other decade. It was closely followed by the 2000s, which also outperformed the other decades.

In the future, we plan on exploring the different rating metrics over time to see if the difficulty of getting a higher score has changed. This would help us answer our first research question because a metric that has more stable scores, independent of time, would better compare movies.