Video Games: An Analysis of Sales, Platform, and Genre

Authors: Xander Brick, Emily Ford, and Sophia Hill

Data

Our dataset is titled “Video Game Sales” and was found on Kaggle, at https://www.kaggle.com/datasets/gregorut/videogamesales.

The dataset contains a list of video games with sales greater than 100,000 copies. It was generated via web scraping.

The following is a description of each of the variables/columns in the dataset:

  • Rank - Ranking of overall sales
  • Name - The game’s name
  • Platform - Platform of the game’s release (e.g., PC, PS4)
  • Year - Year of the game’s release
  • Genre - Genre of the game
  • Publisher - Publisher of the game
  • NA_Sales - Sales in North America (in millions)
  • EU_Sales - Sales in Europe (in millions)
  • JP_Sales - Sales in Japan (in millions)
  • Other_Sales - Sales in the rest of the world (in millions)
  • Global_Sales - Total worldwide sales (in millions)

Each row/observation in the dataset records one video game on a specific platform, along with its year of release. Each row also includes the game’s sales ranking, genre, and publisher, as well as its sales per region and globally. It is also important to note that the vast majority of data points occur before 2017, as scraping generally stopped in 2016.

Research Questions

We are interested in answering four research questions concerning our dataset.

  1. Which platforms are most popular in each region (determined by sales) and how does this change over time?
  2. Which genres are most popular in each region (determined by sales) and how does this change over time?
  3. How does platform use (determined by games produced) change over time?
  4. How does genre preference (determined by games produced) change over time?

These questions are motivated by our interest in video games, particularly how variables like genre, platform, region, and time affect their sales. By answering these questions, we can gain insight into what makes popular video games sell more. It would also be interesting to see where current trends are heading, so as to direct future work on the subject.

Graphs

EDA

To begin with, we wanted to get an overall sense of the distribution of games over time. This suggests examining the variable Year. We can transform Year into a variable Decade, which takes on the discrete values 1980s, 1990s, 2000s, and 2010s.

From this graph, we can see that most video games were released in the 2000s. The distribution is left-skewed, with more games being released in the 2000s and 2010s than in the 1980s and 1990s. Comparatively, a very small number of games were released in the 1980s. It is also interesting to note that the number of games released does not increase steadily with time: it peaked in the 2000s and decreased from the 2000s to the 2010s. The decrease in the 2010s can be partly attributed to the fact that we have few data points from 2017 onwards.
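The Year-to-Decade transformation described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the function name `year_to_decade` and the sample years are hypothetical, and we assume rows with a missing Year are skipped when counting.

```python
from collections import Counter


def year_to_decade(year):
    """Return a decade label such as '2000s' for a release year, or None if the year is missing."""
    if year is None:
        return None
    decade_start = (int(year) // 10) * 10
    return f"{decade_start}s"


# Hypothetical release years for illustration; these are not rows from the dataset.
sample_years = [1985, 1999, 2000, 2007, 2016, None]

decades = [year_to_decade(y) for y in sample_years]

# Count games per decade, skipping rows with a missing year (an assumption).
games_per_decade = Counter(d for d in decades if d is not None)
print(games_per_decade)
```

Counting the Decade labels in this way gives the bar heights of a decade-level distribution plot.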

To look further into the distribution, we can look at it by Year instead of Decade.

Here, we can see that most games were indeed released between 2000 and 2010, with a peak around 2007-2008. We can also confirm that the distribution of game releases is left-skewed, increasing before the 2000s and decreasing after, even taking our missing data from 2017 onward into account.

RQ 3: How does platform use change over time?

In order to answer this question, we focus on the main console lines, for two reasons. First, if we treated each console individually, much of the data would be uninformative, as many consoles have only a few games listed in our dataset. Second, the nature of the game industry favors grouping: there is little incentive to release games on an older system once a new one launches, so we rarely see new games for a console after a directly superior successor has been released. Because hardware improves quickly, the largest producers (the main relevant parties, as hardware production is expensive) release a new console in each line every few years. Thus, we merge each “line” of consoles into a single category, on which it is more reasonable to look at changes over time. We use four main categories: one for Nintendo home consoles, one for Nintendo portable consoles, one for PlayStation consoles, and one for Xbox consoles. We split Nintendo into two groups because games for the two groups are produced fairly independently, and each group has a substantial number of games, rivaling the other categories.
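The grouping of platforms into console lines can be sketched as below. The report does not list which platform codes were assigned to each line, so the mapping here is an assumption made for illustration. In particular, whether handhelds such as PSP belong in the PlayStation line is left open. The names `CONSOLE_LINES` and `console_line` and the sample rows are hypothetical.

```python
from collections import Counter

# Assumed assignment of platform codes to console lines. The report does not
# state which codes went into each line, so treat this mapping as illustrative.
CONSOLE_LINES = {
    "Nintendo home": ["NES", "SNES", "N64", "GC", "Wii", "WiiU"],
    "Nintendo portable": ["GB", "GBA", "DS", "3DS"],
    "PlayStation": ["PS", "PS2", "PS3", "PS4"],
    "Xbox": ["XB", "X360", "XOne"],
}

# Invert the mapping so each platform code points to its console line.
PLATFORM_TO_LINE = {
    code: line for line, codes in CONSOLE_LINES.items() for code in codes
}


def console_line(platform):
    """Return the console line for a platform code, or None if it is outside the four lines."""
    return PLATFORM_TO_LINE.get(platform)


# Hypothetical (platform, year) pairs for illustration; not rows from the dataset.
sample_rows = [
    ("Wii", 2007), ("DS", 2007), ("PS2", 2004),
    ("X360", 2007), ("PC", 2007), ("Wii", 2007),
]

# Count games produced per (console line, year), dropping other platforms.
games_per_line_year = Counter(
    (console_line(platform), year)
    for platform, year in sample_rows
    if console_line(platform) is not None
)
print(games_per_line_year)
```

Counts of this form, one series per console line, are the kind of input a games-produced-over-time plot would use.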

The first notable trend in these plots is that the two Nintendo lines look different from the PlayStation and Xbox lines: the Nintendo lines appear to have a much longer left tail before ballooning up like the others. This is because Nintendo was making consoles well before PlayStation and Xbox existed, and before gaming became as popular, so it spent a longer period making consoles for a smaller audience. In each plot we can also see a seasonal effect, likely caused by console lifespans. Early in a console’s lifespan, there may be a couple of major hits intended to sell the console, but most developers have not yet had the opportunity to make games that capitalize on the new features. After a year or so, the console enters its prime, when many games are produced, resulting in the uptick for each season. Lastly, as developers exhaust their ideas for the console and wait for the next one to be released, game production falls again, producing the down season until the prime of the next console. This cycle repeats, giving us the seasonality that we see here.

RQ 4: How does genre preference change over time?

To test whether there is a significant association between the genre of games produced and the decade in which they were produced, we perform a chi-squared test of independence. Our null hypothesis is that genre and decade are independent (that is, the proportion of games produced in each genre is the same in every decade), and our alternative is that there is some dependence between the two variables. The results of the test are as follows:

Test Statistic        Degrees of Freedom    P-Value
\(\chi^2 = 1038\)     \(df = 33\)           \(p < 2.2\times10^{-16}\)
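For reference, the statistic and degrees of freedom in a chi-squared test of independence can be computed as in the sketch below. This is an illustration on a small made-up table, not the analysis code behind the reported values. The function name `chi_square_independence` is hypothetical. The degrees of freedom are (rows - 1) × (columns - 1), so a value of 33 is what a table of 12 genres by 4 decades would give.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up 3x2 table for illustration; these are not counts from the dataset.
toy_table = [
    [10, 20],
    [20, 10],
    [15, 15],
]

statistic, dof = chi_square_independence(toy_table)
print(statistic, dof)

# Degrees of freedom for a hypothetical 12-genre by 4-decade table.
print((12 - 1) * (4 - 1))
```

A library routine such as SciPy's `chi2_contingency` would also return the p-value. The pure-Python version is used here only to keep the arithmetic visible.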

Overall, we have significant evidence to reject our null hypothesis that the genre of games produced and the decade are independent, and thus we wish to further study the link between these two variables. To do so, we produce a mosaic plot with Pearson residuals to see which combinations of decade and genre differ significantly from what we would expect under the null hypothesis.

Looking at this plot, we have a few main takeaways. The most obvious is that nearly every cell is colored, which means that most cells have counts that differ from those expected under a null hypothesis of independence between genre and decade. This supports our conclusion from the hypothesis test.
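The Pearson residual behind each cell's coloring is (observed - expected) / sqrt(expected), where the expected count assumes independence. The sketch below computes these residuals for a small made-up table. The function name `pearson_residuals` is hypothetical, and the cutoff of 2 for flagging a cell is an assumed convention, since the report does not state the plot's shading thresholds.

```python
import math


def pearson_residuals(table):
    """Return Pearson residuals (observed - expected) / sqrt(expected) for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            residual_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(residual_row)
    return residuals


# Made-up 3x2 table for illustration; these are not counts from the dataset.
toy_table = [
    [30, 10],
    [10, 30],
    [20, 20],
]

residuals = pearson_residuals(toy_table)

# Flag cells whose residual exceeds an assumed cutoff of 2 in absolute value.
CUTOFF = 2.0
flagged = [
    (i, j)
    for i, row in enumerate(residuals)
    for j, r in enumerate(row)
    if abs(r) > CUTOFF
]
print(flagged)
```

The squared residuals sum to the chi-squared statistic, which is why a plot with many strongly colored cells goes together with a large test statistic.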

Now, we analyze the overall trends that appear in this graph. The first thing we notice is that Action games appear to have made a massive recovery in the last decade compared to the two prior. While they were never unpopular, the proportion of Action games was lower in the 1990s and 2000s than in the 1980s, but in the past decade Action games have taken over once again, becoming the most popular genre by far. Another interesting trend occurs with Adventure games, which basically did not exist in the 1980s but became one of the most popular genres in the past decade. The last trend that we will comment on specifically is Sports games, which were among the more popular genres in the 1980s and only grew in popularity moving into the 1990s and 2000s. In the 2010s, though, we see a turnaround: the Sports genre, which had been going strong in the prior decades, suddenly lost a lot of steam. It is still one of the more popular genres, but whereas in earlier decades its proportion was higher than expected, we see fewer Sports games than expected in the 2010s.

Conclusion

Main takeaways

From our EDA, we can see that the distribution of games is left-skewed, with a peak in games released in the 2000s.

In addressing research question 1, we discovered that several platforms have a majority of their sales in North America. PC is an outlier, with the majority of its sales in Europe, but there are no clear trends or patterns otherwise. Also, PS has the highest total sales, and Other has the lowest. In considering platform sales by region over time, we can see that Europe and North America are the regions with the most overall sales, and both have PS and Wii as their highest-selling platforms over time. Further, the time series graph is consistent with our EDA findings, as platform sales generally peaked in the 2000s, increasing before then and decreasing after.

For research question 2, we have statistical evidence to conclude that the proportion of each genre produced has changed over the decades. The main changes appear to be the resurgence of the action genre, the birth of the adventure genre, and the decrease of the sports genre in recent years.

The main takeaway from research question 3 is that, for all of our major platform lines, far more games were produced throughout the 2000s and 2010s than in the past. We see a bit of a decrease in more recent years, but believe that this reflects the dataset having less information on recent games rather than an overall downward trend in the number of games produced. Within each platform line, we see a general explosion of games produced around the turn of the century. We also see a fair amount of seasonality within each of the platform lines, which we attribute to the general lifetime of the platforms and their nature of being replaced as new hardware is developed.

For research question 4, we see that there is a fairly strong link between the decade in question and the types of games being produced. The link is statistically significant, and the largest overall changes appear to occur within the Action, Adventure, and Sports genres. Action was very popular, faded for a bit, and has made a comeback in recent years. Adventure basically did not exist 30-40 years ago and is now one of the most popular genres. Sports, while it had been going strong, appears to be on the decline.

Limitations

  • Overall lack of data from more recent years. We have few data points from 2017 onwards, which makes it difficult to interpret recent trends.
  • Only three regions in the data aside from the all-encompassing “other”. Having more region-specific data would allow us to make claims about how popular games are outside their larger markets.
  • No data on launch timing aside from year. We do not have more specific information about launch dates, such as months or days, which would allow an analysis of how game releases change within individual years.

Potential directions for future work

  • What are the trends in game production over the last 10 years?
  • Do different countries prefer different game genres/titles?
  • How do seasons affect game sales?
  • Can we predict future game sales?

The first three directions for future work could be pursued if we had more data. As mentioned in our limitations, we did not have access to data from 2017 onward, for individual countries, or for time periods more specific than years. In the future, we could also try to predict future game sales using this type of data, but we have not learned that statistical technique yet.

Given our motivations, these questions for future work would enable us to find more specific, nuanced trends in video game data, and would additionally allow us to make predictions about future sales.

Citations

Theme source: http://vrl.cs.brown.edu/color