Introduction

As collegiate soccer players and avid fans of the sport (plus one guy who’s athletically challenged), we grew up playing the FIFA franchise of soccer video games. The most recent edition, FIFA 22, is extremely popular and is played by millions across the globe. We decided to take a deeper dive into the game’s player data, exploring which components may play a larger role in determining a player’s overall rating.

Dataset Description

This dataset consists of 19,239 observations, representing each player in the game. For each player, there are 110 variables, ranging from quantitative measurements of their abilities to categorical traits like nationality and special traits. Other relevant variables include the main attributes that comprise overall rating, sub-attributes, wages, league played in, position, and more.

    It is important to note that some subsets of players may be more interesting or offer greater insight than others. Among the players included in our data, some play in “top” leagues around the world, while many others do not. At times, we focus only on players from top leagues in order to reduce noise from lower-valued players and to help with random sampling for certain visualizations.

Research Questions

We seek to answer the following questions:

  1. Where are players from?
  2. How are various attributes related to a player’s pay and overall rating?
  3. How do players differ by position?

Where Are Top Players From?

With data on thousands of players playing in leagues all across the world, we want to learn where the top players play and where they are recruited from. To do this, we create side-by-side choropleth maps which display (1) the number of players in each country’s top national league(s) (denoted in the data as having a league_level value of 1), and (2) these players’ countries of origin.

    In the first map, we observe that the largest leagues are in the United Kingdom (which actually comprises both the top English and Scottish leagues), the United States, and Argentina. Many of the top European leagues are also large, with over 500 players per league. On the smaller side, the top Russian league has under 250 players, as do most South American leagues and the South African league. In the second map, we see that the countries with top soccer leagues don’t pull entirely from their home countries. A disproportionate number of players come from Argentina and Brazil, while most African countries - all but one of which are without a top-level league - have one or more citizens who venture to play in foreign leagues. The US appears to be the biggest net importer of soccer talent: although its top league is among the largest, slightly fewer than 500 American players go on to play in top soccer leagues around the world. The UK is another large net importer.
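The counts behind the two maps, and the notion of a "net importer", can be sketched as follows. This is a minimal illustration on made-up records, not our actual pipeline: only `league_level` is a column name taken from the data, while `league_country` and `nationality` are hypothetical field names chosen for the example.

```python
from collections import Counter

# Toy records (invented). Only `league_level` is a real column name;
# `league_country` and `nationality` are hypothetical.
players = [
    {"league_country": "England", "nationality": "England", "league_level": 1},
    {"league_country": "England", "nationality": "Brazil", "league_level": 1},
    {"league_country": "England", "nationality": "Argentina", "league_level": 2},
    {"league_country": "United States", "nationality": "Argentina", "league_level": 1},
    {"league_country": "United States", "nationality": "United States", "league_level": 1},
    {"league_country": "Argentina", "nationality": "Argentina", "league_level": 1},
]

# Keep only players in a top national league.
top = [p for p in players if p["league_level"] == 1]

# Map 1: players per league country. Map 2: players per country of origin.
players_by_league = Counter(p["league_country"] for p in top)
players_by_origin = Counter(p["nationality"] for p in top)

# Net imports: players hosted minus players supplied (positive = net importer).
countries = set(players_by_league) | set(players_by_origin)
net_imports = {c: players_by_league[c] - players_by_origin[c] for c in countries}
```

Under this definition, a country with a large league but relatively few citizens in top leagues has a positive value, while a country that supplies more players than it hosts has a negative one.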
    To narrow our focus, we turn to what are widely regarded as the best of the best soccer leagues, the “Big Five”: England’s Premier League, Germany’s Bundesliga, Spain’s La Liga, Italy’s Serie A, and France’s Ligue 1. Among these leagues, we sample 150 players with an overall rating of 80 or higher and 150 players with an overall rating below 80, then create a comparison word cloud that juxtaposes the two groups’ countries of origin. The split by overall rating reduces bias from younger players, who are more likely to be lower-rated and who are more often from the league’s home country.

    The comparison word cloud is interesting not for the more common nationalities but for the rarer ones. Because we randomly sample 150 players from each rating group, the larger countries may switch color from one draw to the next, although the general results make sense: countries like Portugal, Brazil, and Argentina have highly rated players (this is not to say that well-known countries like Italy do not). The less frequent countries of birth are more stable; for example, Chile (near the middle height of the red portion, all the way on the left) contributes a few players with an overall rating of at least 80. This could hint that certain smaller countries are linked to higher-rated players.
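The sampling step that feeds the comparison word cloud can be sketched roughly as below. The threshold of 80 and the sample size of 150 come from our description; the field names `overall` and `nationality`, the function name, and the synthetic players are assumptions made for illustration. Fixing the seed is one way to keep the colors of the larger countries from changing between runs.

```python
import random
from collections import Counter

def sample_nationalities(players, threshold=80, n=150, seed=0):
    """Split players at `threshold`, sample up to `n` per group, count nationalities."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    groups = {
        "high": [p for p in players if p["overall"] >= threshold],
        "low": [p for p in players if p["overall"] < threshold],
    }
    counts = {}
    for label, group in groups.items():
        picked = rng.sample(group, min(n, len(group)))
        counts[label] = Counter(p["nationality"] for p in picked)
    return counts

# Synthetic players (invented) just to exercise the function.
_rng = random.Random(1)
nations = ["France", "Spain", "Brazil", "Chile"]
players = [
    {"nationality": _rng.choice(nations), "overall": _rng.randint(60, 92)}
    for _ in range(2000)
]
counts = sample_nationalities(players)
```

Each `Counter` holds the word frequencies for one side of the cloud; words that appear on only one side are the "rarer" nationalities discussed above.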

How Do Various Attributes Relate to a Player’s Pay and Overall Rating?

Our next question is whether the relationship between player skill - measured by performance rating and proxied by pay - and various individual variables (age, dominant foot, and more specific performance statistics) differs by league. This is really one question embedded in another: first, is there some sort of linear relationship between player skill and these attributes, and second, does that relationship differ by league? In order to inspect this relationship, we need to look into players’ Value, Age, League, Preferred Foot, Wage, Potential, Overall, Pace, Shooting, Passing, Dribbling, Physical, Defending, Height, and Weight.

    First, we want to check our assumption that there is indeed a relationship between a player’s overall rating and pay. To do so, we consider a scatterplot of the two variables, colored by age.

    This plot demonstrates a clear trend: players’ salary generally increases as overall rating increases. A loess curve fits the data well, which leads us to conclude that the relationship between overall rating and salary is non-linear rather than linear. Additionally, we notice that most higher-paid players (paid more than 100,000 euros a week) are older than 26, suggesting that there is a premium on player experience.
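To illustrate why a loess-style fit can track this kind of curve better than a single straight line, here is a minimal sketch of local linear regression with tricube weights. It is a simplified stand-in for a full loess implementation (no robustness iterations), and the wage curve below is synthetic, chosen only to be convex in overall rating.

```python
def tricube(u):
    """Tricube kernel: weight falls smoothly to 0 as |u| reaches 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def weighted_line_at(xs, ys, ws, x0):
    """Fit a weighted least-squares line and evaluate it at x0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess_at(xs, ys, x0, frac=0.3):
    """Local linear fit at x0 using the nearest `frac` share of the points."""
    k = max(3, round(frac * len(xs)))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = max(sorted(abs(x - x0) for x in xs)[k - 1], 1e-9)
    ws = [tricube((x - x0) / h) for x in xs]
    return weighted_line_at(xs, ys, ws, x0)

# Synthetic, convex wage curve (invented numbers, not our data).
overall = list(range(60, 91))
wage = [1000 * 1.15 ** (o - 60) for o in overall]

ones = [1.0] * len(overall)  # equal weights give the ordinary global line
sse_linear = sum((weighted_line_at(overall, wage, ones, o) - w) ** 2
                 for o, w in zip(overall, wage))
sse_loess = sum((loess_at(overall, wage, o) - w) ** 2
                for o, w in zip(overall, wage))
```

On this synthetic curve, the local fit leaves a smaller sum of squared errors than the global line, which is the pattern we read off the scatterplot.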
    In addition to looking at the relationship between player pay and rating, we look at how player pay varies among the “Big Five” leagues. To do so, we plot side-by-side boxplots of log-pay by league and preferred foot.

    We observe notable differences in pay by league, with clubs in England and Spain paying their players more on average than those in France and Germany. Interestingly, there appear to be more higher-paid players with a dominant right foot than a dominant left foot in the Italian Serie A, while differences in pay by foot appear negligible in the other leagues.
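The quantities a boxplot of log-pay displays can be sketched as below: wages are log-transformed, grouped by league and preferred foot, and summarized by their quartiles. The field names `league`, `preferred_foot`, and `wage_eur` and the toy wages are assumptions for illustration; the log transform is used because raw wages are heavily right-skewed.

```python
import math
from collections import defaultdict
from statistics import quantiles

def log_wage_quartiles(players):
    """Quartiles of log10(wage) for each (league, preferred foot) group."""
    groups = defaultdict(list)
    for p in players:
        if p["wage_eur"] > 0:  # the log is undefined for non-positive wages
            key = (p["league"], p["preferred_foot"])
            groups[key].append(math.log10(p["wage_eur"]))
    # statistics.quantiles needs at least two observations per group.
    return {k: quantiles(v, n=4) for k, v in groups.items() if len(v) >= 2}

# Toy records (invented).
players = [
    {"league": "Premier League", "preferred_foot": "Right", "wage_eur": 1000},
    {"league": "Premier League", "preferred_foot": "Right", "wage_eur": 10000},
    {"league": "Premier League", "preferred_foot": "Right", "wage_eur": 100000},
    {"league": "Serie A", "preferred_foot": "Left", "wage_eur": 20000},
    {"league": "Serie A", "preferred_foot": "Left", "wage_eur": 40000},
    {"league": "Serie A", "preferred_foot": "Left", "wage_eur": 0},
]
summary = log_wage_quartiles(players)
```

Each value in `summary` is a list of three cut points (first quartile, median, third quartile), which correspond to the box edges and center line of one boxplot.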
    We turn now to principal component analysis (PCA). In the first of our two graphs, we compress the six main components of overall performance rating (pace, shooting, dribbling, passing, defending, and physical) using PCA to evaluate whether any one category has a greater impact on overall rating. We color the graph using ranges of overall performance rating values.

    The above PCA uses the subsetted player data mentioned in the dataset description to examine how the six main attributes load on their principal components and relate to overall rating. The lack of any extremely long arrows seems to confirm that each attribute carries a similar weight in the overall rating, which makes good sense logically: just because a player has one very highly rated statistic, like pace (speed), doesn’t mean that the player’s overall rating should be skewed.
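For readers who want to see where the arrows in such a plot come from, the following is a minimal sketch of PCA on six standardized attributes. It uses power iteration with deflation in place of a library routine, which is a substitute technique for illustration, and the players are synthetic: four attacking attributes share one made-up latent factor and two defensive attributes share another.

```python
import math
import random

ATTRS = ["pace", "shooting", "dribbling", "passing", "defending", "physical"]

def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]

def correlation_matrix(z):
    """Covariance of standardized columns, i.e. the correlation matrix."""
    n, d = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_components(m, k=2, iters=500, seed=0):
    """Leading k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    d = len(m)
    m = [row[:] for row in m]
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found before finding the next.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs

# Synthetic players (invented numbers).
rng = random.Random(42)
rows = []
for _ in range(300):
    attack, defense = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([70 + 8 * (attack + rng.gauss(0, 0.5)) for _ in range(4)]
                + [65 + 8 * (defense + rng.gauss(0, 0.5)) for _ in range(2)])

(lam1, pc1), (lam2, pc2) = top_components(correlation_matrix(standardize(rows)))
explained = [lam1 / len(ATTRS), lam2 / len(ATTRS)]  # share of total variance
# Arrow length of each attribute in the PC1-PC2 plane of a biplot.
arrow_length = {a: math.hypot(pc1[j], pc2[j]) for j, a in enumerate(ATTRS)}
```

An attribute whose arrow is much longer than the rest would be one that the first two components represent unusually strongly; arrows of comparable length are what we describe above.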
    In our second PCA, we again consider players’ dominant foot, this time in relation to value, wage, weight, height, age, overall performance, potential and our six main components of performance.

    The graph above shows that value, wage, overall, potential, and passing all point in very similar directions, implying that they are positively correlated. Relative to value, most vectors appear to contribute positively, except for weight and perhaps height. It is also interesting to note that there is no distinct grouping based on dominant foot, so it may not make sense to continue exploring this variable.

Do Players at Different Positions Differ in Their Performance and in How They Are Described?

With plenty of data focused on performance-based metrics, we want to examine how these different metrics co-vary. In particular, we want to see how players at different positions perform differently from one another. Generally, we would expect a positive relationship between a player’s score for a specific skill - for instance, agility - and that player’s overall performance score. We test this by looking at the relationship between players’ overall score and agility score, broken down by position.

    In general, we see that our intuitive hypothesis holds: players who are more agile also have better overall performance. However, this does not hold strongly - or at all - for every position. For some positions (CB, LCB, RCB, RF), the relationship between agility and overall score is almost flat or even negative. Additionally, some positions have higher average agility than others: the positions with the highest agility scores (and more than a couple dozen observations) are CAM, LB, LM, LW, RCM, and RW, while positions like CB, GK, LCB, and RCB appear to have noticeably lower average agility.
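One simple way to quantify "flat" versus "positive" relationships is to fit a least-squares slope of overall score on agility within each position. The sketch below does this on invented toy records; the field names `position`, `agility`, and `overall` and the `min_obs` cutoff are assumptions for illustration.

```python
from collections import defaultdict

def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def slopes_by_position(players, min_obs=3):
    """Slope of overall on agility per position, skipping tiny or constant groups."""
    groups = defaultdict(list)
    for p in players:
        groups[p["position"]].append((p["agility"], p["overall"]))
    out = {}
    for pos, pairs in groups.items():
        xs, ys = zip(*pairs)
        if len(pairs) >= min_obs and len(set(xs)) > 1:
            out[pos] = slope(xs, ys)
    return out

# Toy records (invented): RW improves with agility, CB does not.
players = (
    [{"position": "RW", "agility": a, "overall": o}
     for a, o in [(60, 65), (70, 72), (80, 79), (90, 86)]]
    + [{"position": "CB", "agility": a, "overall": 75}
       for a in (40, 50, 60, 70)]
)
slopes = slopes_by_position(players)
```

A slope near zero corresponds to the flat panels we describe for the center-back positions, while a clearly positive slope corresponds to positions where agility and overall score rise together.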
    Having observed differences in quantitative measurements between players of different positions, we now look at which traits are attributed to players of different positions. To do this, we perform multi-dimensional scaling (MDS) on all player performance statistics and display the resulting two-dimensional configuration, colored by aggregated position.

    This scatterplot of the two MDS dimensions is colored by position, with the dimensions determined from players’ overall stats. The plot shows that subs, who can belong to any position, are spread throughout the graph, while players at each of the other positions tend to cluster together. There are large regions of overlap because, within each position class, players tend to serve multiple roles on a team. Defenders and strikers, for example, tend to be more physical, while midfielders and wide forwards tend to be more technical and less physical.
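As a rough illustration of what MDS does, the sketch below implements classical (Torgerson) MDS: it double-centers the squared distance matrix and uses its leading eigenvectors as coordinates. This is one common variant and may differ from the routine behind our plot. The six players and their two attributes are invented, and power iteration stands in for a library eigen-solver.

```python
import math
import random

def classical_mds(points, k=2, iters=300, seed=0):
    """Embed rows of `points` in k dimensions from their Euclidean distances."""
    n = len(points)
    # Squared Euclidean distances between every pair of players.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    # Double centering: B = -1/2 * J * D2 * J, with J the centering matrix.
    row = [sum(r) / n for r in d2]
    total = sum(row) / n
    b = [[-0.5 * (d2[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        # Power iteration for the current leading eigenvector of B.
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][c] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Toy players as (physical, technical) scores; values are invented.
stats = [(80, 55), (78, 60), (82, 70), (60, 85), (58, 80), (62, 82)]
embedding = classical_mds(stats)
```

Because these toy players have only two attributes, the two-dimensional embedding reproduces their pairwise distances up to rotation and reflection; with many attributes, the plot is instead a best two-dimensional approximation, which is one reason clusters can overlap.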
    In addition to player skills, we create word clouds by aggregated position - “Forward” (top left), “Defender” (top right), “Midfield” (bottom left), and “Sub” (bottom right) - by using the player_traits column of our data. This column contains different key phrases which describe each player.

    We see that Forwards, Midfielders, and Subs are all often referred to as “technical dribblers” and are noted as having “flair”, while these labels are not attributed as frequently to Defenders. On the other hand, Defenders are most often described as “long passers” and players who “dive into tackles”, labels that are relatively distinctive to them. The label “long shot taker” is commonly associated with Midfielders and Subs, but less so with Defenders or Forwards. Generally, the contrasts in which labels go with which positions seem to follow our intuition for the role of each position.
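The word frequencies behind such clouds can be sketched as follows. The `player_traits` column is from our data, but we assume here that it holds comma-separated phrases and that missing values are empty; `position_group` is a hypothetical field holding the aggregated position, and the records are invented.

```python
from collections import Counter, defaultdict

def trait_frequencies(players):
    """Count how often each trait phrase appears within each position group."""
    freq = defaultdict(Counter)
    for p in players:
        traits = p.get("player_traits") or ""  # treat missing traits as empty
        for trait in traits.split(","):
            trait = trait.strip().lower()
            if trait:
                freq[p["position_group"]][trait] += 1
    return freq

# Toy records (invented).
players = [
    {"position_group": "Forward", "player_traits": "Flair, Technical Dribbler"},
    {"position_group": "Forward", "player_traits": "Flair"},
    {"position_group": "Defender", "player_traits": "Long Passer, Dives Into Tackles"},
    {"position_group": "Defender", "player_traits": None},
    {"position_group": "Midfield", "player_traits": "Long Shot Taker, Flair"},
]
freq = trait_frequencies(players)
top_forward = freq["Forward"].most_common(1)
```

Each group's `Counter` gives the relative sizes of the phrases in that group's word cloud.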

Conclusions And Future Work

We observed that soccer talent at the top echelons is imported from all around the world, with the US and the UK being among the biggest net importers and Argentina and Brazil being among the largest net exporters of soccer talent. We see generally positive trends among age, overall rating, and the resulting wages for players. Additionally, it seems that FIFA weighs the 6 main attributes roughly equally to determine a player’s overall rating.
    Foot dominance has only a weak connection to a player’s value, potentially ruling it out as a covariate in a regression model. Similarly, weight and height are weakly correlated with player value or not correlated at all, suggesting that a player’s height and weight may not be informative metrics in evaluating that player’s value. Additionally, there are positional differences between players that correlate with differences in the relationships between their overall performance and measures of specific performance (specifically, we observed this for agility), along with differences in the frequency of the descriptive phrases used to characterize their strengths.
    This exploration lends itself to a wide variety of future avenues. One highly interesting extension would be to compare this dataset with the similar female player dataset for FIFA 22. Additionally, statistical inference could be used to evaluate the role of sub-attributes and how they are weighted together to form the 6 main attributes that contribute to overall rating.