Data
Our dataset is titled “Video Game Sales” and was found on Kaggle, at
https://www.kaggle.com/datasets/gregorut/videogamesales.
The dataset contains a list of video games with sales greater than
100,000 copies. It was generated via web scraping.
The following is a description of each of the variables/columns in
the dataset:
- Rank - Ranking of overall sales
- Name - The game’s name
- Platform - Platform of the game’s release (e.g. PC, PS4)
- Year - Year of the game’s release
- Genre - Genre of the game
- Publisher - Publisher of the game
- NA_Sales - Sales in North America (in millions)
- EU_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Global_Sales - Total worldwide sales (in millions)
Each row/observation in the dataset describes one video game released
in a specific year on a specific platform. Each row also includes the
game’s sales ranking, genre, and publisher, as well as its sales in
each region and globally. It is also important to note that the vast
majority of data points occur before 2017, as scraping generally
stopped in 2016.
Graphs
EDA
To begin with, we wanted to get an overall sense of the distribution
of games over time. This suggests examining the variable Year. We can
transform Year into a variable Decade, which takes on the discrete
values 1980s, 1990s, 2000s, and 2010s.
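One way to sketch the Year-to-Decade transformation is shown below. This is a minimal illustration, not necessarily how the graph here was produced; the function name `decade_label` and the handling of missing years (returning `None`) are assumptions made for the example.

```python
from typing import Optional


def decade_label(year: object) -> Optional[str]:
    """Map a release year to a label such as '2000s'.

    Returns None when the year is missing or not numeric
    (how missing years are encoded is an assumption here).
    """
    try:
        y = int(float(year))
    except (TypeError, ValueError, OverflowError):
        return None
    # Integer-divide by 10 to drop the last digit, then rebuild the decade.
    return f"{y // 10 * 10}s"


# Toy values for illustration only, not taken from the dataset.
years = [1985, 1999, "2007", 2016, "N/A"]
labels = [decade_label(y) for y in years]
print(labels)
```

Rows whose label is `None` would need to be dropped or handled separately before counting games per decade.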

From this graph, we can see that most video games were released in
the 2000s. The distribution is left-skewed, with more games released in
the 2000s and 2010s than in the 1980s and 1990s. Comparatively, very few
games were released in the 1980s. It is also interesting to note that
the number of games released does not increase steadily with time: it
peaked in the 2000s and decreased from the 2000s to the 2010s. The
decrease in the 2010s can be partly attributed to the fact that we have
few data points from 2017 onwards.
To look further into the distribution, we can look at it by Year
instead of Decade.

Here, we can see that most games were indeed released between 2000 and
2010, with a peak around 2007-2008. We can also confirm that the
distribution of releases is left-skewed, increasing before the 2000s
and decreasing after, even taking our missing data from 2017 onwards
into account.
RQ 1: Which platforms are most popular in each region (determined by
sales) and how does this change over time?
Since we’re interested in examining how platform use differs by
region, in terms of sales, we want to look at the variables Platform,
NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales. First, we
map each platform name to its broader super-platform group (e.g. PS,
PS2, PS3, PS4, PSV, and PSP all become PS). This lets us work with 6
Platform values (PS, Wii, DS, Xbox, PC, Other) instead of 31.
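The grouping step can be sketched as a simple lookup table. Only the PS group is spelled out in the text; the members assigned to Wii, DS, and Xbox below are assumptions made for illustration, and every platform code not listed falls into Other.

```python
# Lookup from raw platform code to super-platform group.
# The PS entries follow the text; the other groupings are assumptions.
SUPER_PLATFORM = {
    "PS": "PS", "PS2": "PS", "PS3": "PS", "PS4": "PS", "PSV": "PS", "PSP": "PS",
    "Wii": "Wii", "WiiU": "Wii",                      # assumed grouping
    "DS": "DS", "3DS": "DS",                          # assumed grouping
    "XB": "Xbox", "X360": "Xbox", "XOne": "Xbox",     # assumed grouping
    "PC": "PC",
}


def super_platform(platform: str) -> str:
    """Return the super-platform group, defaulting to 'Other'."""
    return SUPER_PLATFORM.get(platform.strip(), "Other")


groups = sorted(set(SUPER_PLATFORM.values()) | {"Other"})
print(groups)                     # the 6 Platform values used in the analysis
print(super_platform("PS3"))      # a PlayStation console maps to PS
print(super_platform("GB"))       # an unlisted platform maps to Other
```

Keeping the mapping in one dictionary makes it easy to audit which consoles were folded into which group.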
Next, we create a variable Region so that we can draw the following
stacked bar chart:

Here, for each Platform, we can see the proportion of sales by Region
(NorthA, Japan, EU, Other). We have also sorted the bars by total sales.
From this, we can see that PS has the highest total sales and Other has
the lowest. We can also see that for half of the platforms (Wii, Xbox,
and Other), the highest proportion of sales is in North America. For all
platforms, the region Other corresponds to the smallest proportion of
sales. There are also no clear increasing or decreasing trends in the
proportion of sales for any region when the platforms are considered in
order of total sales (highest to lowest). So, from this graph we can say
that several platforms have their largest share of sales in North
America, but there are no clear trends/patterns otherwise. However, it
is also interesting to note that PC is the only Platform that sells the
most in Europe.
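The Region variable and the per-platform proportions behind a stacked proportional bar chart can be sketched as follows. This is an illustrative outline rather than the code used for the graph: the helper names `to_long` and `region_shares` and the toy rows are assumptions, while the column names and Region labels follow the text.

```python
from collections import defaultdict

# Wide sales columns and the Region label each one becomes.
REGION_COLUMNS = {
    "NA_Sales": "NorthA",
    "EU_Sales": "EU",
    "JP_Sales": "Japan",
    "Other_Sales": "Other",
}


def to_long(rows):
    """Turn one wide row per game into one record per (game, Region)."""
    for row in rows:
        for column, region in REGION_COLUMNS.items():
            yield {"Platform": row["Platform"], "Region": region,
                   "Sales": float(row[column])}


def region_shares(rows):
    """Proportion of each Platform's total sales that comes from each Region."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in to_long(rows):
        totals[rec["Platform"]][rec["Region"]] += rec["Sales"]
    shares = {}
    for platform, by_region in totals.items():
        platform_total = sum(by_region.values())
        shares[platform] = {
            region: (sales / platform_total if platform_total else 0.0)
            for region, sales in by_region.items()
        }
    return shares


# Toy rows for illustration only (sales in millions), not real data.
toy_rows = [
    {"Platform": "PC", "NA_Sales": 1.0, "EU_Sales": 2.0, "JP_Sales": 0.0, "Other_Sales": 1.0},
    {"Platform": "Wii", "NA_Sales": 3.0, "EU_Sales": 2.0, "JP_Sales": 1.0, "Other_Sales": 0.0},
    {"Platform": "Wii", "NA_Sales": 1.0, "EU_Sales": 0.0, "JP_Sales": 1.0, "Other_Sales": 0.0},
]
shares = region_shares(toy_rows)
for platform, by_region in shares.items():
    print(platform, by_region)
```

Each platform's shares sum to 1, which is what makes the bars comparable even when total sales differ widely between platforms.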
We can also examine how platform sales by region have changed over
time. To do this, we bring in the variable Year. We can plot 4 time
series graphs, one for each Region, showing the change in sales over
time for each Platform.

Looking at this graph, we can first note that sales for each Platform
span a wider range in North America and Europe than in Japan and Other.
North America has the greatest peaks in its sales for most Platforms.
Further, we can see that across all Regions there is generally a peak in
sales in the 2000s. In fact, across all Regions, all Platforms show an
increase in sales from before 2000 into the 2000s, and a decline from
around 2010 onwards. These findings are consistent with our EDA, which
also shows that the number of games released increased prior to the
2000s and decreased after the 2000s. Also note that PS and Wii have the
largest ranges over the longest periods in North America and Europe,
which is consistent with our findings in the previous graph.
RQ 2: Which genres are most popular in each region (determined by
sales) and how does this change over time?
Next, we want to transition from how Platform varies by Region to how
Genre varies by Region. To begin with, we want to make a similar stacked
proportional bar chart as the one we had for the last research question,
but instead this time focus on the variables Genre and Region.

This graph shows which regions each genre is most popular in. While
the proportions remain fairly steady among most genres, there are some
notable exceptions. One example is the role-playing genre, for which
Japan makes up a noticeably larger proportion of sales than it does for
other genres. It is also interesting to see that shooter games do not
get many of their sales from Japan, but they do get around half of their
sales from North America.
We can also examine how these trends have changed over time by
creating a time series for each Genre, looking at Region over time.

While the previous graph was useful for comparing the popularity of
each genre within the different regions, it does not let us compare
the overall level of sales for each genre across regions, or how that
level changes over time. This graph fills that gap by showing the
average sales of each genre in each region over time. North America
appears to have the most sales for many of the genres. This is
especially notable with action, miscellaneous, sports, and shooter
games. Some genres stand out as ones that no region is particularly
fond of, such as adventure, puzzle, and strategy. Finally, although
platform and role-playing games never have the highest number of sales,
their sales remain the most stable, while still fairly high, in each
region over time.
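The averaging behind a time series like this one can be sketched as a group-by over Genre, Region, and Year. This is a minimal illustration under the assumption that the data has already been reshaped to one record per game and Region; the function name `mean_sales_by_year` and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_sales_by_year(records):
    """Average Sales for every (Genre, Region, Year) combination."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["Genre"], rec["Region"], rec["Year"])].append(rec["Sales"])
    # Sorting the keys gives each Genre/Region series in year order.
    return {key: mean(values) for key, values in sorted(groups.items())}


# Toy long-format records for illustration only (sales in millions).
records = [
    {"Genre": "Shooter", "Region": "NorthA", "Year": 2007, "Sales": 1.0},
    {"Genre": "Shooter", "Region": "NorthA", "Year": 2007, "Sales": 3.0},
    {"Genre": "Shooter", "Region": "Japan", "Year": 2007, "Sales": 0.5},
    {"Genre": "Puzzle", "Region": "EU", "Year": 2008, "Sales": 0.2},
]
averages = mean_sales_by_year(records)
for key, value in averages.items():
    print(key, value)
```

Using the mean rather than the sum keeps years with many releases from dominating the picture, at the cost of being noisy in years with only a handful of games.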
RQ 4: How does genre preference change over time?
In order to see whether there is a significant association between the
genre of the games produced and the decade they are produced in, we
perform a chi-squared test of independence. Our null hypothesis is that
genre and decade are independent, and our alternative is that there is
some dependence between the two variables. The results of the test are
as follows:
\(\chi^2 = 1038\) | \(df = 33\) | \(p < 2.2\times10^{-16}\)
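For readers who want to see how a statistic and degrees of freedom of this kind arise, here is a minimal sketch of the chi-squared computation on a contingency table of counts. The function name `chi_squared_independence` and the toy table are assumptions for illustration; the sketch does not reproduce the reported values, and it omits the p-value, which requires the chi-squared distribution (a routine such as `scipy.stats.chi2_contingency` returns it alongside the statistic).

```python
def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    `table` is a list of rows, e.g. one row per genre and one column per decade.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Toy 2 x 2 table for illustration only, not the genre-by-decade counts.
toy_table = [[10, 20],
             [20, 10]]
statistic, df = chi_squared_independence(toy_table)
print(round(statistic, 4), df)
```

Under the same formula, a table with 12 genre rows and 4 decade columns has (12 - 1)(4 - 1) = 33 degrees of freedom, which is consistent with the value reported above.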
Overall, we have significant evidence to reject our null hypothesis
that genre and decade are independent, and thus we wish to further
study the link between these two variables. To do so, we produce a
mosaic plot shaded by Pearson residuals to see which combinations of
decade and genre deviate significantly from what we would expect under
independence.

Looking at this plot, we have a few main takeaways. The most obvious
is that nearly every cell is colored, which means that most cells
contain counts that differ from what a null hypothesis of independence
between genre and decade would predict. This supports our conclusion
from the hypothesis test.
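The residuals that drive the coloring can be sketched as follows: each cell's Pearson residual is (observed - expected) / sqrt(expected), where the expected count comes from the independence assumption. The function name `pearson_residuals`, the toy table, and the cutoff of 2 used to flag cells are assumptions for illustration; plotting tools may use different shading thresholds.

```python
from math import sqrt


def pearson_residuals(table):
    """Pearson residual for every cell of a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            out_row.append((observed - expected) / sqrt(expected))
        residuals.append(out_row)
    return residuals


# Toy 2 x 2 table for illustration only.
toy_table = [[30, 10],
             [10, 30]]
residuals = pearson_residuals(toy_table)

CUTOFF = 2.0  # assumed threshold for calling a cell notable
flagged = [(i, j) for i, row in enumerate(residuals)
           for j, r in enumerate(row) if abs(r) > CUTOFF]
print(residuals)
print(flagged)
```

A positive residual means more games than independence would predict for that genre and decade, a negative one means fewer, and the squared residuals sum to the chi-squared statistic.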
Now, we analyze the overall trends that appear in this graph. The
first thing we notice is that action games appear to have made a massive
recovery in the last decade compared to the two prior. While they were
never unpopular, the proportion of action games was lower in the 1990s
and 2000s than in the 1980s, but in the 2010s action games have taken
over once again, being the most popular genre by far. Another
interesting trend occurs with Adventure games, which basically did not
exist in the 1980s but are now one of the most popular genres. The last
trend that we will comment on specifically is Sports games, which were
another of the more popular genres in the 1980s and only grew in
popularity moving into the 1990s and 2000s. Moving into the 2010s,
though, the Sports genre loses a lot of steam. It is still one of the
more popular genres, but whereas in earlier decades its proportion was
higher than expected, we see fewer sports games than expected in the
2010s.
Conclusion
Main takeaways
From our EDA, we can see that the distribution of games is
left-skewed, with a peak in games released in the 2000s.
In addressing research question 1, we discovered that several
platforms have their largest share of sales in North America, while PC
is an outlier with its largest share of sales in Europe; there are no
clear trends/patterns otherwise. Also, PS has the highest total sales,
and Other has the lowest. In considering platform sales by region over
time, we can see that Europe and North America are the Regions with the
most overall sales, and both have PS and Wii as their highest-selling
Platforms over time. Further, the time series graph is consistent with
our EDA findings, as Platform sales generally peaked in the 2000s,
increasing before then and decreasing after.
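For research question 2, we found that the regional split of sales is
fairly similar across most genres, with a few exceptions: Japan makes up
a noticeably larger proportion of role-playing sales than of other
genres, while shooter games get few of their sales from Japan and around
half from North America. Over time, North America has the most sales for
many genres, especially action, miscellaneous, sports, and shooter
games, whereas adventure, puzzle, and strategy games sell comparatively
little in every region.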
The main takeaway from research question 3 is that overall, games
became a lot more popular in the 2000s and 2010s than they were in the
past, on every one of our major platforms. We see a bit of a decrease in
more recent years, but believe that this reflects the dataset having
less information on more recent games rather than an overall downward
trend in the number of games produced. Within each platform, we see a
general explosion of games produced around the turn of the century. We
also see a fair amount of cyclical rise and fall within each of the
platforms, which we attribute to the general lifetime of the platforms
and their nature of being replaced as new hardware is developed.
For research question 4, we see that there is a clear link between
the decade in question and the types of games being produced. The link
is statistically significant, and the largest overall changes appear to
occur within the Action, Adventure, and Sports genres. Action was very
popular, faded for a bit, and has made a comeback in recent years.
Adventure basically did not exist 30-40 years ago and is now one of the
most popular genres. Sports, while it was going strong, appears to be
on the decline.
Limitations
- Overall lack of data from more recent years. We have few data points
from 2017 onwards, which makes it difficult to interpret recent
trends.
- Only three regions are available in the data aside from the
all-encompassing “Other”. More region-specific data would allow us to
make claims about how popular games are outside their largest
markets.
- No data on launch timing aside from year. We don’t have more specific
information about launch dates, such as months or days, which would
allow for an analysis of how game releases vary within individual
years.
Potential directions for future work
- What are the trends in game production over the last 10 years?
- Do different countries prefer different game genres/titles?
- How do seasons affect game sales?
- Can we predict future game sales?
The first three directions for future work could be addressed if we
had more data. As mentioned in our limitations, we did not have access
to data from 2017 onwards, for individual countries, or for times more
specific than years. In the future, we could also try to predict future
game sales using this type of data, but we have not learned the
statistical techniques needed for that yet.
Given our motivations, these questions for future work would enable
us to find more specific, nuanced trends in video game data, and would
additionally allow us to make predictions about future sales.