Dataset Description

The dataset is sourced from the 2021-2022 European Leagues Player Stats and covers 2,921 male soccer players across five prominent leagues (the Premier League, Ligue 1, Bundesliga, Serie A, and La Liga) during the 2021-2022 season. There are a total of 143 variables. These variables encompass player demographic information, match statistics, passing accuracy, offensive and defensive actions, as well as detailed metrics on aerial duels, tackles, dribbles, and other key aspects of individual and team performance in soccer. Together, they support a detailed, holistic analysis of player performance.

The research questions that we will be exploring are:

  1. Which geographical areas are under- and over-represented in UEFA’s European leagues, and how do these representations break down across different positions?

  2. Do any trends of higher passing performance emerge within a specific position and league?

  3. What defensive factors are predictors of whether a player is able to score a goal, and does this potential relationship vary by position?


Question 1: Which geographical areas are under- and over-represented in UEFA’s European leagues, and how do these representations break down across different positions?

Graph 1.1a: World Map of Players’ Nationalities, Colored by League and Sized by Aggregate Number of Players

(Note: 108 players were removed because they could not be matched to countries in the world database.) This map of the players’ nationalities shows which countries are represented in the five leagues in the dataset. The color of a point represents the league the players belong to, and the size of a point represents the aggregate number of players of a given nationality. While a diverse range of countries is represented, a large portion of players have a European nationality, which we’d expect given that the leagues are European. Very few players have nationalities in Oceania or North America, with slightly more in South America and even more in Africa, likely due to its geographical proximity to Europe. We can also see that Serie A and the Premier League draw players from a multitude of countries, whereas the non-European players in Ligue 1 and La Liga tend to have South American and African nationalities, and the Bundesliga seems to be composed almost exclusively of players from a few select European countries.

Graph 1.1b: Europe Map of Players’ Nationalities, Colored by League and Sized by Aggregate Number of Players

Narrowing the map down to countries in Europe gives a clearer picture of the number of players here than the world map does. It is apparent that La Liga players tend to primarily have a Spanish nationality, Ligue 1 players a French nationality, Premier League players an English nationality, and Bundesliga players a German, Austrian, or Hungarian nationality. More notably, we see that while the majority of European Serie A players have an Italian nationality, the remainder are spread out across many European countries.


Graph 1.2: Mosaic Plot of Player Position, Separated by Continent of Player Nationality

The above mosaic plot represents the distribution of player position across five continents, with Antarctica excluded and North and South America grouped into one continent. The proportions of defensive players and midfielders do not seem to be significantly different across the five continents; however, differences emerge when we inspect the forward and goalkeeper positions. Most notably, players with African and Asian nationalities are overrepresented among forwards, and players with European nationalities are underrepresented. Similarly, players with African nationalities are underrepresented among goalkeepers. One potential explanation is differences in athleticism, genetic build, and training across continents.
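One way to put a number on over- and under-representation is a representation ratio: the share of a position within a continent divided by that position's share across all players. The sketch below is illustrative only. The counts, the continent subset, and the position labels are hypothetical placeholders, not values from the dataset, and `representation_ratios` is a helper name introduced here.

```python
# Hypothetical counts for illustration only; these are NOT values from the dataset.
counts = {
    "Africa": {"DF": 100, "MF": 100, "FW": 120, "GK": 10},
    "Europe": {"DF": 600, "MF": 550, "FW": 350, "GK": 150},
}

def representation_ratios(counts):
    """Share of each position within a continent divided by its overall share."""
    positions = sorted({p for row in counts.values() for p in row})
    grand_total = sum(sum(row.values()) for row in counts.values())
    # Overall share of each position across all continents combined.
    overall = {
        p: sum(row.get(p, 0) for row in counts.values()) / grand_total
        for p in positions
    }
    ratios = {}
    for continent, row in counts.items():
        total = sum(row.values())
        ratios[continent] = {
            p: (row.get(p, 0) / total) / overall[p] for p in positions
        }
    return ratios

ratios = representation_ratios(counts)
for continent, row in ratios.items():
    print(continent, {p: round(r, 2) for p, r in row.items()})
```

A ratio above 1 means the position is overrepresented within that continent relative to the pooled player population, and a ratio below 1 means it is underrepresented. This is the same comparison the mosaic plot makes visually through tile widths.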


Question 3: What defensive factors are predictors of whether a player is able to score a goal, and does this potential relationship vary by position?

Graph 3.1: Density Plot of Average Shot Distance Taken by Players Across Different Positions

The above graph plots the density curves of players’ average shot distance, grouped by position. At a high level, players shoot from broadly similar distances regardless of position: there are three main peaks, all between roughly 10 and 20 yards. It may seem intuitive that shooting accuracy increases closer to the goal, so the absence of a concentration of shots right near the goal seems surprising. However, the opposing defense may prevent a player from getting as close to the goal as they want, or the player may strategically choose to shoot from further away, so the large peaks further out are not too surprising. What is curious, however, is the distribution of defender shot distance. Generally, we expect defenders to shoot less often than their counterparts, yet their curve is more spread out and even peaks at a slightly shorter distance than that of forwards. This indicates that defenders take shots from varying distances, whereas midfielders’ shots are concentrated in a narrow range.

Graph 3.2: Grouped Bar Graph of Total Touches and Total Tackles Across Player Position and Zone of Field

This grouped and faceted bar graph examines the number of touches and tackles that players of different positions get in the three zones of a soccer field. Here, a tackle is defined as a player taking the ball away from the feet of an opponent, so the number of tackles is included in the number of touches a player gets. While one purpose of this graph is to examine how touches are distributed across the different positions, it also shows which types of players dominate in certain zones of the field. As expected, the players who are meant to play primarily in a given third of the field tend to get the most touches there. However, the breakdown of tackles is slightly more interesting. Most notably, midfielders make a comparatively high number of tackles in all areas of the field, whereas the number of tackles that defensive and forward players make tends to decrease the further they get from their assigned zone. Still, while the number of tackles made by defensive players in the attacking third is quite close to that of forward players, the number of tackles made by forward players in the defensive zone is much smaller than that of defensive players. One additional piece of information gleaned from this graph is that the number of touches in the attacking zone is very similar for forwards and midfielders, but the number of tackles by midfielders is higher, indicating that midfielders tend to play a more defensive role in the attacking zone.

Graph 3.3: Shot Distance vs. Number of Goals by Player Position

We can observe that there is a positive linear association between the number of goals scored and the distances that the shots were taken from for players in the forward position. The regression lines for each player position have differing slopes, suggesting that there is an interaction between shot distance and player position.

We can conduct a partial F-test to determine which defensive metrics are significant predictors of the number of goals a player scores. The full model uses shot distance, position, the interaction between shot distance and position, the number of interceptions, the number of tackles, and the number of fouls (Fls) as predictors of goals. We test this full model against a reduced model that includes only the number of fouls and the number of interceptions. The null hypothesis is that the reduced model is adequate, that is, that the coefficients on all of the additional terms in the full model are zero; the alternative hypothesis is that at least one of those coefficients is nonzero. The test will use a significance level of alpha = 0.05. If we reject the null hypothesis, we can conclude that at least one of shot distance, player position, their interaction, and the number of tackles contributes significantly to predicting the number of goals beyond fouls and interceptions.
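Written out, the comparison looks as follows. The coefficient numbering and the symbols \(\mathrm{SSE}_F\) and \(\mathrm{SSE}_R\) (residual sums of squares of the full and reduced models) are notation introduced here for illustration. FW and MF denote the position indicator variables, and the degrees of freedom follow from the 2695 residual degrees of freedom reported in the regression output for the nine-coefficient full model.

```latex
\[
\begin{aligned}
\text{Full: } \text{Goals} &= \beta_0 + \beta_1\,\text{ShoDist} + \beta_2\,\text{FW} + \beta_3\,\text{MF}
  + \beta_4\,\text{Int} + \beta_5\,\text{Tkl} + \beta_6\,\text{Fls} \\
&\quad + \beta_7\,(\text{ShoDist}\times\text{FW}) + \beta_8\,(\text{ShoDist}\times\text{MF}) + \varepsilon \\
\text{Reduced: } \text{Goals} &= \beta_0 + \beta_4\,\text{Int} + \beta_6\,\text{Fls} + \varepsilon \\
H_0 &: \beta_1 = \beta_2 = \beta_3 = \beta_5 = \beta_7 = \beta_8 = 0 \\
H_a &: \text{at least one of these } \beta_j \neq 0 \\
F &= \frac{(\mathrm{SSE}_R - \mathrm{SSE}_F)/6}{\mathrm{SSE}_F/2695}
  \sim F_{6,\,2695} \text{ under } H_0
\end{aligned}
\]
```

The numerator has 6 degrees of freedom because the reduced model drops six coefficients. A large value of \(F\) is evidence against the reduced model.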

Now that we have determined the significant predictors, we can conduct individual \(\textit{t}\)-tests, each with 2695 degrees of freedom, on the coefficients \(\beta_i\) in our model to determine the nature of this association. Using each estimate \(\hat{\beta}_i\), we will test the null hypothesis that \(\beta_i = 0\) against the alternative that \(\beta_i \neq 0\), at a significance level of alpha = 0.05.

## 
## Call:
## lm(formula = Goals ~ ShoDist * Pos_simplified + Int + Tkl + Fls, 
##     data = noGoalie)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3680 -0.0849 -0.0347  0.0412  4.7572 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.0519446  0.0162431   3.198   0.0014 ** 
## ShoDist                   0.0009145  0.0008670   1.055   0.2916    
## Pos_simplifiedFW          0.1494937  0.0256989   5.817 6.69e-09 ***
## Pos_simplifiedMF          0.0329268  0.0228894   1.439   0.1504    
## Int                      -0.0071958  0.0038362  -1.876   0.0608 .  
## Tkl                      -0.0078002  0.0032996  -2.364   0.0181 *  
## Fls                       0.0043884  0.0031099   1.411   0.1583    
## ShoDist:Pos_simplifiedFW  0.0039499  0.0016270   2.428   0.0153 *  
## ShoDist:Pos_simplifiedMF  0.0007627  0.0012928   0.590   0.5553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2218 on 2695 degrees of freedom
## Multiple R-squared:  0.1528, Adjusted R-squared:  0.1503 
## F-statistic: 60.77 on 8 and 2695 DF,  p-value: < 2.2e-16
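The t values and confidence intervals can be checked by hand from the estimates and standard errors in the output above. The sketch below is a rough check under one stated assumption: with 2695 residual degrees of freedom the t distribution is nearly indistinguishable from the standard normal, so a normal quantile stands in for the exact t critical value. The helper name `summarize` is introduced here for illustration.

```python
from statistics import NormalDist

# (estimate, standard error) pairs copied from the regression output above.
coefs = {
    "(Intercept)": (0.0519446, 0.0162431),
    "Pos_simplifiedFW": (0.1494937, 0.0256989),
    "Tkl": (-0.0078002, 0.0032996),
    "ShoDist:Pos_simplifiedFW": (0.0039499, 0.0016270),
}

# Assumption: the normal quantile approximates the t critical value,
# which is reasonable here because df = 2695 is large.
crit = NormalDist().inv_cdf(0.975)

def summarize(estimate, std_error):
    """Return the t value, approximate two-sided p-value, and approximate 95% CI."""
    t_value = estimate / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(t_value)))
    ci = (estimate - crit * std_error, estimate + crit * std_error)
    return t_value, p_value, ci

results = {name: summarize(est, se) for name, (est, se) in coefs.items()}
for name, (t_value, p_value, (low, high)) in results.items():
    print(f"{name:26s} t = {t_value:6.3f}  p ~ {p_value:.4f}  "
          f"95% CI ~ [{low:.4f}, {high:.4f}]")
```

The t values agree with the `t value` column to three decimals, and the intervals are on the same scale as the coefficient estimates.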

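Defensive players are predicted to score approximately 0.052 goals when all other variables equal zero (95% CI[0.020, 0.084], p-value = 0.0014). This intercept increases by approximately 0.149 to about 0.201 goals for forward players when all other variables equal zero (95% CI for the increase [0.099, 0.2], p-value < 0.001). Additionally, holding the other predictors constant, an increase of one tackle made by the player is associated with a decrease of about 0.008 goals (95% CI[-0.014, -0.001], p-value = 0.0181); because the model contains no interaction between tackles and position, this estimate applies to every position. The slope of goals on shooting distance is about 0.0039 per yard steeper for forward players than for defensive players (95% CI[0.0007, 0.007], p-value = 0.0153), so for forward players an increase of one yard of shooting distance is associated with an increase of about 0.0049 goals (0.00091 + 0.00395), whereas the slope for defensive players alone (0.0009) is not significantly different from zero (p-value = 0.2916). All of these quantities are in the units of the Goals variable used in the model. Overall, forward players are expected to score more goals than defensive players when all other predictors equal zero. We also see that the shooting distance and number of goals have a positive linear association for forward players when all other variables are held constant.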
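Because the model includes a ShoDist-by-position interaction, each position has its own intercept and its own shot-distance slope, obtained by adding the relevant terms to the baseline coefficients. The short sketch below does that arithmetic with the estimates from the regression output. The label "DF" for the baseline (defensive) level and the helper name `line_for` are assumptions made for illustration.

```python
# Coefficient estimates copied from the regression output. Defenders are
# the baseline level of Pos_simplified; the label "DF" is assumed here.
b = {
    "(Intercept)": 0.0519446,
    "ShoDist": 0.0009145,
    "Pos_simplifiedFW": 0.1494937,
    "Pos_simplifiedMF": 0.0329268,
    "ShoDist:Pos_simplifiedFW": 0.0039499,
    "ShoDist:Pos_simplifiedMF": 0.0007627,
}

def line_for(position):
    """Return (intercept, ShoDist slope) for a position with Int, Tkl, Fls at zero."""
    intercept = b["(Intercept)"]
    slope = b["ShoDist"]
    if position != "DF":
        # Non-baseline positions add their main effect and interaction term.
        intercept += b[f"Pos_simplified{position}"]
        slope += b[f"ShoDist:Pos_simplified{position}"]
    return intercept, slope

for pos in ("DF", "MF", "FW"):
    intercept, slope = line_for(pos)
    print(f"{pos}: intercept = {intercept:.4f}, slope per yard = {slope:.4f}")
```

The interaction coefficients are differences in slope relative to defenders, so the slope for forwards is the sum of the ShoDist and ShoDist:Pos_simplifiedFW estimates rather than the interaction estimate alone.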


Conclusion and Main Takeaways

The analysis revealed that the most common geographical areas represented in UEFA’s European leagues are Europe and Africa, with forwards tending to be overrepresented in the latter and underrepresented in the former. Goalkeepers from Africa were also underrepresented. We found that defenders tended to have a higher passing performance than other positions. Additionally, the position, league, and interaction between position and league of a player are all significant predictors of passing completion percentage, PasTotCmp%. PCA showed trends of within-group similarity for positions when considering various passing metrics, but there did not seem to be any trends for leagues. Finally, we found that player position, shooting distance, and the number of tackles are significant predictors of the number of goals a player scores in one game. In particular, shooting distance is positively and linearly associated with the number of goals scored for forward players.


Future Work

Potential areas to explore in the future include:

  1. How team tactics influence individual passing performance and whether certain approaches enhance passing capabilities of players in specific positions.

Understanding how team tactics impact individual passing, and identifying strategies that improve passing for specific positions, can help optimize performance. Additional team-dynamics variables not provided in this dataset might be useful here, along with PasTotCmp% and PasProg.

  2. How player fatigue affects performance.

Exploring how player fatigue, measured by minutes played (Min), affects performance provides insights for injury prevention and player management.

  3. Which teams have the most effective set-piece strategies.

Investigating which teams excel in set-piece strategies using variables like corner kicks (CK) and completed dead-ball passes leading to a goal (GcaPassDead) can guide match strategies.

Sources