Introduction
As collegiate soccer players and avid fans of the sport (plus one guy
who’s athletically challenged), we grew up playing the FIFA franchise of
soccer video games. The most recent edition, FIFA 22, is extremely
popular and played by millions across the globe. We decided to take a
deeper dive into the relationships among the components that may play a
larger role in determining a player’s overall rating.
Dataset Description
This dataset consists of 19,239 observations, representing each
player in the game. For each player, there are 110 variables, ranging
from quantitative measurements of their abilities to categorical traits
like nationality and special traits. Other relevant variables include
the main attributes that comprise overall rating, sub-attributes, wages,
league played in, position, and more.
It is important to note that some subsets of
players may be more interesting or offer greater insight than others.
Among the players included in our data, some play for “top” leagues
around the world, while many others don’t. At times, we focus only
on players from top leagues in order to reduce noise from lower-valued
players and to help with random sampling for certain
visualizations.
Research Questions
We seek to answer the following questions:
- Where are players from?
  - Among the top soccer leagues, it would be interesting to learn more
    about players’ countries of origin, and which countries are net
    importers/exporters of soccer talent.
- How are various attributes related to a player’s pay and overall
  rating?
  - We know that the overall rating is formed generally by the player’s
    six main attribute statistics, but it is worthwhile to investigate
    whether there are trends in how much effect each main attribute has,
    and if there are other correlations with these “overalls”.
- How do players differ by position?
  - We want to examine how distinguishable players are by position on
    paper, by looking at their performance data and other attributes.
Where are Top Players From?
With data on thousands of players playing in leagues all across the
world, we want to learn where the top players play and where they are
recruited from. To do this, we create side-by-side choropleth maps which
display (1) the number of players in each country’s top national
league(s) (denoted in the data as having a league_level
value of 1), and (2) these players’ countries of origin.


In the first map, we observe that the
largest leagues are in the United Kingdom (which actually comprises both
the top English and Scottish leagues), the United States, and
Argentina. Many of the top European leagues are also large, with over
500 players per league. On the smaller side, the top Russian league has
under 250 players, as do most South American leagues and the South
African league. In the second map, we see that the countries with top
soccer leagues don’t pull entirely from their home countries. A
disproportionate number of players come from Argentina and Brazil, while
most African countries - all but one of which are without a top-level
league - have one or more citizens who venture to play in foreign
leagues. The US appears to be the biggest net importer of soccer talent:
its top league is among the largest, yet slightly fewer than 500 US
players play in top soccer leagues around the world. The UK is another
large net importer.
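One way to make the importer/exporter comparison concrete is to subtract, for each country, the number of its nationals playing in top leagues anywhere from the number of players in its own top league. The sketch below is only an illustration of that bookkeeping: the rows are made up, and the `(nationality, league_country)` layout and the `net_imports` helper are assumptions, not the dataset's actual columns or the method used for the maps.

```python
from collections import Counter

# Hypothetical records: (nationality, country of the top-level league played in).
# These rows are invented for illustration and are not taken from the dataset.
players = [
    ("Argentina", "Argentina"),
    ("Argentina", "United States"),
    ("Brazil", "United States"),
    ("United States", "United States"),
    ("Brazil", "Brazil"),
    ("Argentina", "Brazil"),
]

def net_imports(records):
    """Per country: players in its top league minus its nationals in any top league."""
    league_sizes = Counter(league for _, league in records)
    nationals = Counter(nation for nation, _ in records)
    countries = set(league_sizes) | set(nationals)
    # Positive values mean the league hosts more players than the country supplies.
    return {c: league_sizes[c] - nationals[c] for c in countries}

balance = net_imports(players)
for country, value in sorted(balance.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {value:+d}")
```

Under this definition the balances sum to zero across countries, so every net importer is offset by net exporters elsewhere.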
To narrow our focus, we turn to look
specifically at what are widely regarded as the best soccer
leagues, the “Big Five”: England’s Premier League, Germany’s Bundesliga,
Spain’s La Liga, Italy’s Serie A, and France’s Ligue 1. Among these
leagues, we split players into those with an overall rating of 80 or
higher and those with an overall rating below 80, and sample 150 players
from each group. We then create a comparison word cloud that juxtaposes
the two groups’ countries of origin. The split by overall rating reduces
bias from younger players, who are more likely to be lower-rated and who
are more often from the league’s home country.
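The split-then-sample step can be sketched as follows. This is a minimal illustration on synthetic players, assuming each player is a dict with `overall` and `nationality` keys; the `sample_nationalities` helper and the fixed seed are hypothetical choices, not part of the original analysis.

```python
import random
from collections import Counter

def sample_nationalities(players, threshold=80, n=150, seed=0):
    """Split players at a rating threshold, sample up to n per group, count nationalities."""
    rng = random.Random(seed)  # fixed seed so repeated runs draw the same sample
    groups = {
        "high": [p for p in players if p["overall"] >= threshold],
        "low": [p for p in players if p["overall"] < threshold],
    }
    counts = {}
    for label, group in groups.items():
        chosen = rng.sample(group, min(n, len(group)))
        counts[label] = Counter(p["nationality"] for p in chosen)
    return counts

# Synthetic players for illustration only (not real FIFA 22 rows).
toy = [
    {"nationality": "Chile" if i % 5 == 0 else "Italy", "overall": 60 + i % 30}
    for i in range(600)
]
counts = sample_nationalities(toy)
print(counts["high"].most_common(3))
print(counts["low"].most_common(3))
```

Fixing the seed is one way to keep the word cloud stable between runs; without it, each run draws a different 150 players per group.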

The comparison word cloud is interesting not
for the more common nationalities, but for the rarer ones. Because we
randomly sample 150 players from each rating group, the larger
countries may switch color from one run to the next, although the
general results make sense: countries like Portugal, Brazil, and
Argentina have many highly rated players (which is not to say that
well-known countries like Italy do not). The less frequent countries of
birth are more stable; for example, Chile (near the middle of the red
portion, on the far left) contributes a few players with an overall
rating of at least 80. This could hint that certain smaller countries
are associated with higher-rated players.
How Do Various Attributes Relate to a Player’s Pay and Overall Rating?
Our next question is whether the relationship between player
skill - measured by performance rating and proxied by pay - and various
individual variables (age, dominant foot, and more specific performance
statistics) differs by league. Two questions are embedded here: first,
whether there is some sort of linear relationship between player skill
and these attributes, and second, whether this relationship differs by
league. To inspect this relationship, we look at players’ Value, Age,
League, Preferred Foot, Wage, Potential, Overall, Pace, Shooting,
Passing, Dribbling, Physical, Defending, Height, and Weight.
First, we want to check our assumption that
there is indeed a relationship between player overall rating and pay. To
do so, we consider a scatterplot of the two variables,
colored by age.

This plot demonstrates a clear trend:
players’ salaries generally increase as overall rating increases. A loess
curve fits the data well, which leads us to conclude that the
relationship between overall rating and salary is non-linear rather
than linear. Additionally, we notice that most higher-paid players
(paid more than 100,000 euros a week) are over 26 years old,
suggesting that there is a premium on player experience.
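To illustrate what a loess-style fit does, the sketch below fits a tricube-weighted straight line around each point using only its nearest neighbors. It is a simplified stand-in for a full loess routine (no robustness iterations), and the `loess_fit` helper, the `frac` parameter, and the exponential wage curve are all assumptions made up for the example rather than the fit or data used in the plot.

```python
import math

def loess_fit(xs, ys, frac=0.5):
    """Local linear smoother: tricube-weighted line through the nearest frac of points."""
    n = len(xs)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbors used for each local fit
    fitted = []
    for x0 in xs:
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - x0))[:k]
        h = max(abs(xs[j] - x0) for j in nearest) or 1.0  # local bandwidth
        w = [(1 - (abs(xs[j] - x0) / h) ** 3) ** 3 for j in nearest]  # tricube weights
        sw = sum(w)
        mx = sum(wi * xs[j] for wi, j in zip(w, nearest)) / sw
        my = sum(wi * ys[j] for wi, j in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[j] - mx) ** 2 for wi, j in zip(w, nearest))
        sxy = sum(wi * (xs[j] - mx) * (ys[j] - my) for wi, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # weighted line evaluated at x0
    return fitted

# Hypothetical ratings and a made-up convex wage curve, for illustration only.
ratings = list(range(50, 91))
wages = [math.exp(0.1 * r) for r in ratings]
smooth = loess_fit(ratings, wages, frac=0.3)
```

Because each fitted value uses only nearby points, the smoother can follow curvature that a single straight line would miss, which is the sense in which a good loess fit suggests a non-linear relationship.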
In addition to looking at the relationship
between player pay and rating, we take a look at how player pay varies
among the “Big Five” leagues. To do so, we plot side-by-side boxplots of
log-pay by league and preferred foot.

We observe clear differences in pay
by league, with clubs in England and Spain paying their players more on
average than those in France and Germany. Interestingly, there appear to
be more highly paid right-footed players than left-footed players
in the Italian Serie A, while differences in pay by foot
appear negligible in the other leagues.
We turn now to principal component
analysis (PCA). In the first of two graphs, we compress the six
main components of overall performance rating (pace, shooting,
dribbling, passing, defending, and physical) using PCA to evaluate
whether a certain category could have a greater impact on overall
rating. We color the graph by ranges of overall performance rating
values.

The above PCA uses the subsetted player data
mentioned in the dataset description to examine how the six
main attributes load on their principal components and relate to
overall rating. The lack of any extremely long arrows seems to confirm
that each attribute carries a similar weight in the
overall rating, which makes sense. Just because a
player has one very highly rated statistic, like pace (speed), doesn’t
mean that the player’s overall rating should be skewed.
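The arrows in a PCA biplot are the variable loadings on the first two components, so "no extremely long arrows" can be checked numerically. The sketch below runs PCA via SVD on standardized columns and computes each arrow's length; the data are synthetic (a shared skill factor plus noise), and the `pca_loadings` helper and attribute labels are illustrative assumptions, not the code or data behind the figure.

```python
import numpy as np

def pca_loadings(X):
    """Standardize columns, then return (explained variance ratios, loadings) via SVD."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return explained, vt  # row i of vt holds the loadings of component i

# Synthetic attributes for illustration: one shared skill factor plus noise.
attrs = ["pace", "shooting", "passing", "dribbling", "defending", "physical"]
rng = np.random.default_rng(0)
skill = rng.normal(size=(500, 1))
X = 65 + 8 * skill + 6 * rng.normal(size=(500, 6))

explained, loadings = pca_loadings(X)
# Length of each variable's arrow in the PC1-PC2 plane of a biplot.
arrow_lengths = np.sqrt(loadings[0] ** 2 + loadings[1] ** 2)
for name, length in zip(attrs, arrow_lengths):
    print(f"{name}: {length:.2f}")
```

Arrow lengths of similar size indicate that no single attribute dominates the first two components; plotting libraries may rescale the arrows, but their relative lengths are what matter.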
In our second PCA, we again consider
players’ dominant foot, this time in relation to value, wage, weight,
height, age, overall performance, potential and our six main components
of performance.

The graph above shows that value, wage,
overall, potential, and passing all appear to point in very similar
directions, implying that they are all correlated. Looking at the
vectors with relation to value, we can see that most vectors appear to
positively contribute to value except for weight and maybe height. It is
also interesting to note that there is no distinct grouping based on
dominant foot, so it may not make sense to continue to explore this
variable.
Are Players at Different Positions Characterized Differently in Terms of Their Performance, and How Are They Described?
With plenty of data focused on performance-based metrics, we want to
examine how these different metrics may co-vary. In particular, we want
to see how players at different positions may perform differently from
one another. Generally, we would expect a positive
relationship between a player’s performance score for a specific action
- for instance, agility - and that player’s overall performance score.
We examine the validity of this by looking at the relationship between
players’ overall score and agility score, broken down by
position.

In general, we see that our intuitive
hypothesis holds: players who are more agile also have better overall
performance. However, this does not hold strongly - or at all - for
every position. For some positions (CB, LCB, RCB, RF), the relationship
between agility and overall score is almost flat or even negative.
Additionally, some positions have a higher average agility
than others: among positions with more than a couple dozen
observations, CAM, LB, LM, LW, RCM, and RW have the highest agility
scores, while positions like CB, GK, LCB, and RCB appear to have
noticeably lower average agility.
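The per-position comparison amounts to fitting a separate line of overall score on agility within each position and comparing slopes. The sketch below does this with ordinary least squares on a handful of invented rows; the `slope_by_group` helper and the `(position, agility, overall)` layout are assumptions for illustration.

```python
from collections import defaultdict

def slope_by_group(rows):
    """OLS slope of overall on agility within each position.

    rows: iterable of (position, agility, overall) tuples.
    """
    groups = defaultdict(list)
    for position, agility, overall in rows:
        groups[position].append((agility, overall))
    slopes = {}
    for position, pts in groups.items():
        n = len(pts)
        mx = sum(a for a, _ in pts) / n
        my = sum(o for _, o in pts) / n
        sxx = sum((a - mx) ** 2 for a, _ in pts)
        sxy = sum((a - mx) * (o - my) for a, o in pts)
        # A group with no variation in agility has no defined slope.
        slopes[position] = sxy / sxx if sxx > 0 else float("nan")
    return slopes

# Invented rows for illustration only (not real FIFA 22 players).
rows = [
    ("RW", 60, 62), ("RW", 70, 68), ("RW", 80, 74), ("RW", 90, 80),
    ("CB", 40, 72), ("CB", 50, 71), ("CB", 60, 73), ("CB", 70, 72),
]
slopes = slope_by_group(rows)
print(slopes)
```

In this toy data the winger slope is clearly positive while the center-back slope is close to flat, mirroring the kind of contrast described above.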
Having observed differences in quantitative
measurements between players of different positions, we seek now to also
observe what traits are attributed to players of different positions. To
do this, we perform multi-dimensional scaling on all player performance
statistics and display the resulting distances, colored by aggregated
position.

This scatterplot shows two MDS dimensions,
computed from players’ overall stats, with points colored by position.
The subs, who can belong to any position, appear throughout the graph,
while each other position tends to cluster together. The clusters
overlap substantially because, within each position class, players
tend to serve multiple roles on a team. Defenders and strikers, for
example, tend to be more physical while midfielders and wide forwards
tend to be more technical and less physical.
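For readers unfamiliar with MDS, the sketch below implements classical (Torgerson) MDS: it double-centers the squared Euclidean distance matrix and keeps the top eigenvectors as coordinates. Which MDS variant and distance the original analysis used is not stated, so the variant, the `classical_mds` helper, and the random stand-in data are all assumptions.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical MDS on Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared pairwise distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ d2 @ J                         # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

# Random stand-in for standardized player statistics (illustration only).
rng = np.random.default_rng(1)
stats = rng.normal(size=(30, 6))
coords = classical_mds(stats)
print(coords.shape)
```

The resulting two columns would be the x and y coordinates of the scatterplot, with each point then colored by the player's aggregated position.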
In addition to player skills, we create word
clouds by aggregated position - “Forward” (top left), “Defender” (top
right), “Midfield” (bottom left), and “Sub” (bottom right) - by using
the player_traits column of our data. This column contains
different key phrases which describe each player.

We see that Forwards, Midfielders, and Subs
are all often referred to as “technical dribblers” and are noted as
having “flair”, while these labels are not attributed as frequently to
Defenders. On the other hand, Defenders are most often described as
“long passers” and players who “dive into tackles”, labels which
are largely specific to them. “Long shot taker” is a label commonly
associated with Midfielders and Subs, but less so with Defenders or
Forwards. Generally, the contrasts in which labels go with which
positions seem to follow our intuition for the role of each
position.
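The counts behind such word clouds can be sketched as a per-position tally of trait phrases. The example below assumes the player_traits entries are comma-separated strings and that some players have no traits recorded; both the separator and the `trait_counts` helper are assumptions, and the rows are invented.

```python
from collections import Counter, defaultdict

def trait_counts(rows, sep=","):
    """Count trait phrases per aggregated position.

    rows: iterable of (position, traits) where traits is a string or None.
    """
    counts = defaultdict(Counter)
    for position, traits in rows:
        if not traits:  # skip players with no recorded traits
            continue
        for phrase in traits.split(sep):
            phrase = phrase.strip().lower()
            if phrase:
                counts[position][phrase] += 1
    return counts

# Invented rows for illustration only.
rows = [
    ("Forward", "Flair, Technical Dribbler"),
    ("Forward", "Technical Dribbler"),
    ("Defender", "Long Passer, Dives Into Tackles"),
    ("Defender", None),
    ("Midfield", "Long Shot Taker, Flair"),
]
counts = trait_counts(rows)
for position, counter in counts.items():
    print(position, counter.most_common(2))
```

Each position's counter can then be passed to a word cloud routine, with phrase frequency controlling word size.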
Conclusions And Future Work
We observed that soccer talent at the top echelons is imported from
all around the world, with the US and the UK being among the biggest net
importers and Argentina and Brazil being among the largest net exporters
of soccer talent. We see generally positive trends linking age and
overall rating to players’ wages. Additionally, FIFA appears to weight
the six main attributes roughly equally when determining a player’s
overall rating.
Foot dominance has only a weak connection to a player’s value,
potentially ruling it out as a covariate in a regression model.
Similarly, weight and height are weakly correlated with player value or
not correlated at all, suggesting that a player’s height may not be an
informative metric for evaluating value. Additionally, there are
positional differences between players that correlate with differences
in the relationships between their overall performance and measures of
specific performance (specifically, we observed this for agility), along
with differences in the frequency of descriptive phrases used to
characterize their strengths.
This exploration lends itself to a wide variety of future avenues. One
highly interesting extension would be to compare this dataset with the
similar female player dataset for FIFA 22. Additionally, statistical
inference could be used to evaluate the role of sub-attributes and how
they are weighted together to form the six main attributes that
contribute to overall rating.