The dataset is sourced from the 2021-2022 European Leagues Player Stats and covers 2,921 male soccer players across five prominent leagues (the Premier League, Ligue 1, Bundesliga, Serie A, and La Liga) during the 2021-2022 season. It contains 143 variables in total. These variables include player demographic information, match statistics, passing accuracy, offensive and defensive actions, and detailed metrics on aerial duels, tackles, dribbles, and other key aspects of individual and team performance. Together, they support a detailed, holistic analysis of player performance.
The research questions that we will be exploring are:
Which geographical areas are under- and overrepresented in UEFA’s European leagues, and how does this representation break down across positions?
Do any trends of higher passing performance emerge within a specific position and league?
Which defensive factors predict whether a player scores a goal, and does this potential relationship vary by position?
(Note: 108 players had to be removed because they could not be matched to countries in the world database.) This map of the players’ nationalities shows which countries are represented in the five leagues in the dataset. The color of a point represents the league the players belong to, and the size of a point represents the total number of players of a given nationality. While a diverse range of countries is represented, a large portion of players have a European nationality, which we would expect given that the leagues are European. Very few players have nationalities in Oceania or North America, with slightly more in South America and even more in Africa, likely due to its geographical proximity to Europe. We can also see that Serie A and the Premier League draw players from a multitude of countries, whereas Ligue 1 and La Liga tend to be composed mainly of players with South American and African nationalities, and the Bundesliga seems to be composed almost exclusively of players from a select few European countries.
Narrowing the map to countries in Europe gives a clearer picture of player counts than the world map. La Liga players are primarily Spanish, Ligue 1 players are primarily French, Premier League players are primarily English, and Bundesliga players are primarily German, Austrian, or Hungarian. More notably, while the majority of European Serie A players have an Italian nationality, the remainder are spread across many European countries.
The above mosaic plot shows the distribution of player positions across five continents, with Antarctica excluded and North and South America grouped into one continent. The proportions of defenders and midfielders do not appear to differ meaningfully across the five continents; however, differences emerge for the forward and goalkeeper positions. Most notably, players with African and Asian nationalities are overrepresented among forwards, and players with European nationalities are underrepresented. Similarly, players with African nationalities are underrepresented among goalkeepers. These differences might stem from differences in athleticism, genetic build, and training across continents, although the data here cannot confirm this.
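As a rough illustration of how over- and underrepresentation in a position-by-continent table can be quantified, the sketch below draws a shaded mosaic plot and computes standardized residuals from a chi-squared test of independence. The data frame `players` and its counts are simulated for illustration only and are not taken from the dataset; the column names `continent` and `position` are assumptions.

```r
# Toy example: simulated counts, NOT the real player data.
set.seed(1)
n <- 500
players <- data.frame(
  continent = sample(c("Africa", "Americas", "Asia", "Europe", "Oceania"), n,
                     replace = TRUE, prob = c(0.15, 0.12, 0.03, 0.68, 0.02)),
  position  = sample(c("DF", "FW", "GK", "MF"), n,
                     replace = TRUE, prob = c(0.35, 0.25, 0.08, 0.32))
)

# Contingency table of continent by position
tab <- table(players$continent, players$position)

# Shaded mosaic plot: cell colors follow the residuals of an independence model
mosaicplot(tab, shade = TRUE, main = "Position by continent (toy data)")

# Chi-squared test; the p-value is simulated because some cells are small
test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

# Standardized residuals: large positive values suggest overrepresentation,
# large negative values suggest underrepresentation
round(test$stdres, 2)
```

Cells with standardized residuals beyond roughly 2 in absolute value are the ones a reader would usually flag as over- or underrepresented.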
The box plot displays the distribution of passing completion percentages (PasTotCmp%) across player positions, color-coded by league to show variation between leagues. Passing completion percentages seem highest for defenders, whereas forwards seem to have lower passing completion percentages. This might be because defenders often make more controlled passes, while forwards may attempt riskier passes to create goal-scoring opportunities.
The boxplots for different positions tend to overlap. Visually, there also does not seem to be a large difference between the leagues in passing accuracy by position, as the boxplots overlap for any given position. The plot is reasonably informative because it lets us compare the range of passing accuracy by position and league, which helps answer the research question; however, it has some limitations. There appear to be many outliers, especially at lower passing percentages, for all positions other than goalkeeper.
##                       Df Sum Sq Mean Sq F value   Pr(>F)
## Pos_simplified         3  41498   13833  94.623  < 2e-16 ***
## Comp                   4   6916    1729  11.827 1.57e-09 ***
## Pos_simplified:Comp   12   4830     403   2.754 0.000999 ***
## Residuals           2901 424093     146
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After checking the Q-Q plot and the residuals vs. fitted plot, the data seem to reasonably meet the assumptions for an ANOVA test. At the standard alpha level of 0.05, the ANOVA results above indicate significant differences in passing accuracy among player positions (Pos_simplified), with an F-statistic of 94.623 (p-value < 2e-16). The factor Comp (representing soccer leagues) also significantly influences passing accuracy, with an F-statistic of 11.827 (p-value 1.57e-09), suggesting variation in passing performance across leagues. Furthermore, there is a significant interaction between player position and league (Pos_simplified:Comp), with an F-statistic of 2.754 (p-value 0.000999), indicating that the effect of player position on passing accuracy differs across leagues. With roughly 2,900 players, even modest differences can reach statistical significance, which is consistent with the heavy visual overlap of the boxplots. In summary, both player position and league contribute significantly to differences in passing accuracy, and their interaction should be considered when interpreting the variation.
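A two-way ANOVA with an interaction term of the kind summarized above can be sketched in R as follows. This is a minimal illustration on simulated data, not the report's actual code: the data frame `toy`, the response name `pass_pct` (standing in for PasTotCmp%), and the effect sizes are all assumptions.

```r
# Toy example: simulated passing percentages, NOT the real player data.
set.seed(42)
toy <- expand.grid(
  Pos_simplified = c("DF", "FW", "GK", "MF"),
  Comp = c("Bundesliga", "La Liga", "Ligue 1", "Premier League", "Serie A"),
  rep = 1:30                      # 30 simulated players per position/league cell
)

# Assumed position effects (made up for illustration)
pos_shift <- c(DF = 5, FW = -5, GK = -3, MF = 2)
toy$pass_pct <- 75 +
  unname(pos_shift[as.character(toy$Pos_simplified)]) +
  rnorm(nrow(toy), sd = 12)

# Two-way ANOVA: main effects of position and league plus their interaction
fit <- aov(pass_pct ~ Pos_simplified * Comp, data = toy)
summary(fit)

# Diagnostic plots used to check the ANOVA assumptions
plot(fit, which = 1)  # residuals vs. fitted
plot(fit, which = 2)  # normal Q-Q plot
```

The `*` in the formula expands to both main effects and the `Pos_simplified:Comp` interaction, which is why the summary table has three model rows plus residuals.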
Using seven passing variables (PasTotCmp, PasTotDist, PasTotPrgDist, PasAss, PasProg, PasInt, and PasBlocks), PCA projected the data into the lower-dimensional space defined by the first two principal components. There appears to be some grouping by player position, suggesting that the principal components capture patterns in passing performance that differ across positions. However, there appears to be limited distinction by league in the projection, suggesting that the variability explained does not strongly differentiate passing styles between the leagues considered.
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     1.7327 1.2618 0.9368 0.82631 0.69654 0.5748 0.17217
## Proportion of Variance 0.4289 0.2274 0.1254 0.09754 0.06931 0.0472 0.00423
## Cumulative Proportion  0.4289 0.6563 0.7817 0.87926 0.94857 0.9958 1.00000
As shown above, the first two principal components explain 65.63% of the variability in the data. This indicates that the majority of the original information is captured in these two components, so the dimensionality can be reduced substantially while retaining a considerable amount of the variability present in the seven passing variables.
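The proportions in the table follow directly from the reported standard deviations: each component's share is its variance divided by the total variance. The short check below recomputes them from the printed values. The standard deviations are copied from the output above; the observation that the variances sum to about 7 suggests, but does not confirm, that the seven variables were standardized before the PCA.

```r
# Standard deviations of the seven principal components, copied from the output
sdev <- c(1.7327, 1.2618, 0.9368, 0.82631, 0.69654, 0.5748, 0.17217)

# Variance of each component and the total variance
variance  <- sdev^2
total_var <- sum(variance)   # close to 7, the number of variables

# Proportion of variance and cumulative proportion
prop     <- variance / total_var
cum_prop <- cumsum(prop)

round(rbind(prop, cum_prop), 4)

# Share of variability captured by the first two components (about 0.656)
cum_prop[2]
```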
The above graph plots the density curves of the average distance players shot from, grouped by position. At a high level, players tend to shoot from similar distances regardless of position. There are three main peaks around 10-20 yards. It seems intuitive that shot accuracy would increase the closer a player gets to the goal, so the absence of a concentration of shots right near the goal may seem surprising. However, opposing defenses may prevent a player from getting as close to the goal as they want, or the player may strategically choose to shoot from further away, so large peaks further from the goal are not too surprising. What is curious, however, is the distribution of defender shot distance. Generally, we expect defenders to shoot less often than their counterparts, yet their curve is more spread out and even peaks slightly earlier than that of forwards. This indicates that defenders take shots from varying distances, whereas midfielders have a high density of shots within a narrow range.
This grouped and faceted bar graph examines the number of touches and tackles that players of different positions get in the three zones of a soccer field. Here, a tackle is defined as a player taking the ball away from the feet of an opponent, so the number of tackles is included in the number of touches a player gets. While one purpose of this graph is to examine how touches are distributed across positions, it also shows which types of players dominate in certain zones of the field. As expected, the players who are meant to play primarily in a given third of the field tend to get the most touches on the ball there. The breakdown of tackles is more interesting. Most notably, midfielders perform comparatively well in all areas of the field, whereas the number of tackles made by defensive and forward players tends to decrease the further they get from their assigned zone. Still, while the number of tackles made by defensive players in the attacking third is quite close to that of forward players, the number of tackles made by forward players in the defensive zone is much smaller than that of defensive players. Finally, the number of touches in the attacking zone is very similar for forwards and midfielders, but the number of tackles is higher for midfielders, indicating that midfielders tend to play a more defensive role in the attacking zone.
We can observe that there is a positive linear association between the number of goals scored and the distances that the shots were taken from for players in the forward position. The regression lines for each player position have differing slopes, suggesting that there is an interaction between shot distance and player position.
We can conduct a partial F-test to determine which defensive metrics are significant predictors of the number of goals a player scores. Our full model uses shot distance, position, the interaction between shot distance and position, the number of interceptions, the number of tackles, and the number of fouls drawn as predictors of goals. We test this full model against a reduced model that includes only the number of fouls and interceptions. The null hypothesis is that the reduced model is adequate (the coefficients on the additional terms are all zero), and the alternative hypothesis is that the full model is needed (at least one of those coefficients is nonzero). The test uses a significance level of alpha = 0.05. If we reject the null hypothesis in favor of the full model, we can conclude that at least one of shot distance, player position, and the number of tackles is a significant predictor of the number of goals.
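In R, a partial F-test between nested linear models is typically run by passing the reduced and full fits to `anova()`. The sketch below shows the mechanics on simulated data. The data frame `toy` and its effect sizes are made up for illustration, while the variable names mirror those in the model formula reported in this analysis.

```r
# Toy example: simulated data, NOT the real player data.
set.seed(7)
n <- 300
toy <- data.frame(
  ShoDist = runif(n, 5, 30),
  Pos_simplified = factor(sample(c("DF", "FW", "MF"), n, replace = TRUE)),
  Int = rpois(n, 1),
  Tkl = rpois(n, 2),
  Fls = rpois(n, 1)
)
# Assumed data-generating process (made up for illustration)
toy$Goals <- 0.05 + 0.15 * (toy$Pos_simplified == "FW") -
  0.008 * toy$Tkl + rnorm(n, sd = 0.2)

# Reduced model: only fouls and interceptions
reduced <- lm(Goals ~ Int + Fls, data = toy)

# Full model: adds shot distance, position, their interaction, and tackles
full <- lm(Goals ~ ShoDist * Pos_simplified + Int + Tkl + Fls, data = toy)

# Partial F-test: a small p-value is evidence against the reduced model,
# i.e. at least one of the added terms improves the fit
ftest <- anova(reduced, full)
ftest
```

With three position levels, the full model adds six coefficients to the reduced one, so the F-statistic in this sketch has 6 numerator degrees of freedom.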
Now that we have determined the significant predictors, we can conduct individual \(\textit{t}\)-tests, with 2695 degrees of freedom, on each coefficient \(\beta_i\) in our model to determine the nature of this association. For each coefficient, we test the null hypothesis that \(\beta_i = 0\) against the alternative that \(\beta_i \neq 0\), using the estimate \(\hat{\beta}_i\) and a significance level of alpha = 0.05.
##
## Call:
## lm(formula = Goals ~ ShoDist * Pos_simplified + Int + Tkl + Fls,
##     data = noGoalie)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3680 -0.0849 -0.0347  0.0412  4.7572
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               0.0519446  0.0162431   3.198   0.0014 **
## ShoDist                   0.0009145  0.0008670   1.055   0.2916
## Pos_simplifiedFW          0.1494937  0.0256989   5.817 6.69e-09 ***
## Pos_simplifiedMF          0.0329268  0.0228894   1.439   0.1504
## Int                      -0.0071958  0.0038362  -1.876   0.0608 .
## Tkl                      -0.0078002  0.0032996  -2.364   0.0181 *
## Fls                       0.0043884  0.0031099   1.411   0.1583
## ShoDist:Pos_simplifiedFW  0.0039499  0.0016270   2.428   0.0153 *
## ShoDist:Pos_simplifiedMF  0.0007627  0.0012928   0.590   0.5553
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2218 on 2695 degrees of freedom
## Multiple R-squared:  0.1528, Adjusted R-squared:  0.1503
## F-statistic: 60.77 on 8 and 2695 DF,  p-value: < 2.2e-16
Defensive players are predicted to score approximately 0.052 goals when all other variables equal zero (95% CI [0.020, 0.084], p-value = 0.0014). This intercept increases by approximately 0.149, to about 0.201 goals, for forward players when all other variables equal zero (95% CI [0.099, 0.200], p-value < 0.001). Because the model includes no interaction between tackles and position, the tackle effect is the same for every position: one additional tackle is associated with a decrease of about 0.008 goals, holding the other variables constant (95% CI [-0.014, -0.001], p-value = 0.0181). For forward players, the slope on shooting distance is about 0.0039 goals per yard steeper than for defensive players (95% CI [0.0007, 0.007], p-value = 0.0153), giving an estimated slope of roughly 0.0049 goals per additional yard for forwards. Overall, forward players are expected to score more goals than defensive players when all other predictors equal zero. We also see that shooting distance and number of goals have a positive linear association for forward players when all other variables are held constant.
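The confidence intervals quoted above can be reproduced from the estimates and standard errors in the regression output as estimate ± t-critical × standard error, with 2695 degrees of freedom. The sketch below does this for the four significant coefficients; the short row labels are ours, and the numbers are copied from the printed coefficient table.

```r
# Estimates and standard errors copied from the regression output
est <- c(Intercept    = 0.0519446,
         FW           = 0.1494937,
         Tkl          = -0.0078002,
         `ShoDist:FW` = 0.0039499)
se  <- c(0.0162431, 0.0256989, 0.0032996, 0.0016270)

# Two-sided 95% critical value of the t distribution with 2695 df
tcrit <- qt(0.975, df = 2695)

# 95% confidence intervals: estimate +/- tcrit * standard error
ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
round(ci, 4)

# t statistic for the intercept (estimate divided by its standard error)
est["Intercept"] / se[1]

# Estimated shot-distance slope for forwards:
# baseline (defender) slope plus the forward interaction term
slope_fw <- 0.0009145 + 0.0039499
slope_fw
```

The interval for `ShoDist:FW` describes the difference in slope between forwards and defenders, not the forward slope itself, which is the sum computed on the last lines.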
The analysis revealed that the most common geographical areas represented in UEFA’s European leagues are Europe and Africa, with forwards tending to be overrepresented in the latter and underrepresented in the former. Additionally, goalkeepers from Africa were underrepresented. We found that defenders tended to have higher passing performance than other positions. Additionally, the position, league, and interaction between position and league of a player are all significant predictors of passing completion percentage, PasTotCmp%. PCA showed within-group similarity for positions when considering various passing metrics, but there did not seem to be any trends for leagues. Finally, we found that player position, the number of tackles, and shooting distance (through its interaction with position) are significant predictors of the number of goals a player will score in one game. In particular, shooting distance is positively and linearly associated with the number of goals scored for forward players.
Potential areas to explore in the future include:
Understanding how team tactics impact individual passing, and identifying strategies to improve passing for specific positions, could help optimize performance. Additional variables on team dynamics that are not provided in this dataset might be useful alongside PasTotCmp% and PasProg.
Exploring how player fatigue, measured by minutes played (Min), affects performance could provide insights for injury prevention and player management.
Investigating which teams excel in set-piece strategies, using variables like corner kicks (CK) and completed dead-ball passes leading to a goal (GcaPassDead), could guide match strategies.