In this project, we mainly used team stats data from the National
Women’s Soccer League for the 2016-2019 seasons, supplemented with data
from some external sources. The GitHub repository supplied the overall
dataset, which includes stats like wins, goals scored, corner_kicks,
etc. This repository also includes game-by-game data and advanced team
stats, such as “big chances created” or “accurate forward zone passes”.
As a group, we wanted to answer the following research questions related to factors that may contribute to winning throughout a season:
As a follow-up to our previous question, we also wanted to learn how big chances created per game correlated with points earned per game.
From the graph, we can see that big chances created per game is highly correlated with goals per game, but it can potentially explain team attacking strength and consistency better over an entire season, since scoring goals involves an element of luck and variance. The graph shows a positive, linear relationship between big chances created per game and points per game. In a linear regression model with big chances created per game as the explanatory variable and points per game as the response, the p-value is nearly 0, so the relationship is statistically significant. Based on the model, we expect teams that create 1 more big chance per game to earn 0.48 more points per game, which adds up to quite a few more points over an entire season. Simply put, if a team creates more chances per game, we expect it to earn more points per game, which, aggregated over an entire season, means a higher finish in the table.
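To make the regression reading concrete, here is a minimal sketch of an ordinary least squares fit and of how a per-game slope scales up to a season. The data values, the `fit_line` helper, and the 24-game season length are illustrative assumptions, not figures from the dataset; only the 0.48 slope comes from the model described above.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical team-season values, for illustration only (not NWSL data).
big_chances_per_game = [1.0, 1.5, 2.0, 2.5, 3.0]
points_per_game = [0.9, 1.2, 1.3, 1.7, 1.9]
slope, intercept = fit_line(big_chances_per_game, points_per_game)

# Scaling the slope reported in the text to a full season.
REPORTED_SLOPE = 0.48     # points per game per extra big chance per game
GAMES_PER_SEASON = 24     # assumption; replace with the actual schedule length
extra_points_per_season = REPORTED_SLOPE * GAMES_PER_SEASON
```

Under the assumed season length, one extra big chance per game would correspond to roughly 11.5 additional points over a season.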
When trying to build a model for performance in soccer, including
location data is challenging. Instead, we addressed the question of
whether location had a relationship with soccer performance by
calculating the average difference between goals scored by the “home”
team and goals scored by the “away” team, and visualizing that
difference on a map.
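The per-team quantity described above could be computed along these lines. This is a sketch only: the field names (`home_team`, `home_goals`, `away_goals`) and the sample games are hypothetical, not the repository's actual schema.

```python
from collections import defaultdict

# Hypothetical game records; field names are assumptions for illustration.
games = [
    {"home_team": "Team A", "home_goals": 2, "away_goals": 0},
    {"home_team": "Team A", "home_goals": 1, "away_goals": 1},
    {"home_team": "Team B", "home_goals": 0, "away_goals": 1},
    {"home_team": "Team B", "home_goals": 3, "away_goals": 2},
]

def average_home_goal_difference(games):
    """Mean of (home goals - away goals) over each team's home games."""
    diffs = defaultdict(list)
    for game in games:
        diffs[game["home_team"]].append(game["home_goals"] - game["away_goals"])
    return {team: sum(values) / len(values) for team, values in diffs.items()}

avg_diff = average_home_goal_difference(games)
```

Each team's value can then be attached to its home city's coordinates for mapping.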
This graph shows the average goal difference (home team - away team) for each team in the NWSL between 2016 and 2019. There is no clear pattern here, largely because there are only twelve cities. Furthermore, this does not control for team quality, so it is difficult to tell whether a large difference reflects location and home-field advantage or simply a strong team. However, we see that teams in the Northeast have very small (and even negative) goal differences, while teams in the West have positive and relatively large goal differences. These observations are consistent with studies of other sports leagues showing that home advantage is linked to elevation (as in Salt Lake City) and travel time (such as the trips that teams visiting Portland or Seattle must endure). From this, we may hypothesize that the same applies to the NWSL.
To further study this hypothesis, we used the k-means algorithm to cluster the twelve locations into four regions. The “Southeast” region contains Orlando’s and North Carolina’s teams; the “Northeast” region contains the teams in New York, New Jersey, Washington, D.C. and Boston; the “Northwest” region contains the teams in Utah, Seattle and Portland; and the “Midwest and South” region contains the teams in Chicago, Kansas City and Houston. Then, we tested whether the mean goal difference between home and away teams differed between any pair of these four regions, using statistical tests and the boxplot below.
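For readers unfamiliar with k-means, the sketch below shows the basic assign-and-update loop (Lloyd's algorithm) on twelve made-up 2D points. It is not the implementation used for the analysis: the coordinates are hypothetical, the farthest-first initialisation is chosen here only to keep the example deterministic, and plain Euclidean distance on coordinates is a simplification for geographic data.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means on a list of coordinate tuples; returns (labels, centroids)."""
    # Farthest-first initialisation: deterministic, spreads the starting centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Twelve hypothetical locations in four loose groups (not real coordinates).
points = [
    (0.0, 10.0), (1.0, 11.0), (0.5, 12.0),
    (10.0, 5.0), (11.0, 6.0), (10.5, 4.0),
    (20.0, 12.0), (21.0, 11.0), (20.5, 13.0), (19.5, 12.5),
    (20.0, 0.0), (21.0, 1.0),
]
labels, centroids = kmeans(points, k=4)
```

The returned labels are arbitrary integers; region names such as “Northwest” have to be assigned by inspecting which locations end up in each cluster.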
In the boxplot, we can see that all boxes overlap, but the boxes for the “Southeast” and “Northwest” regions sit slightly further to the right, which means those regions have a slightly greater average goal difference than the other two regions. In addition, the “Southeast” box is wider, indicating more variance among its values. From running a statistical test, we learned that there is a significant difference in average goal difference between at least one pair of regions.
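The text does not name the test that was run. One common choice for comparing means across several groups is a one-way ANOVA, and the sketch below computes its F statistic by hand under that assumption. The region labels and values are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of samples."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical goal differences per region, for illustration only.
goal_diff_by_region = {
    "Region A": [1, 2, 3],
    "Region B": [2, 3, 4],
    "Region C": [5, 6, 7],
}
f_stat = one_way_anova_f(list(goal_diff_by_region.values()))
```

A large F suggests that at least one group mean differs, but it does not say which pair; that would need a follow-up pairwise comparison. Turning F into a p-value requires the F distribution, which a library routine such as `scipy.stats.f_oneway` handles.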
The statistic we are trying to gauge is a notion of scoring chance
conversion rate. One important factor in converting a scoring chance is
shooting on target. Thus, to begin, our notion of conversion rate will
compare total scoring attempts with on-target scoring attempts. The
factor we will analyze is home vs. away games.
We see a clear, strong positive correlation between total and on-target scoring attempts. This makes intuitive sense, as a team is expected to have more attempts on target if it has more attempts in total. The two regression lines represent the relationship between total and on-target scoring attempts for home and away games separately. Eyeballing the graph, the slopes of the two lines (and thus the two relationships) seem to be roughly equal.
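Rather than eyeballing, the two slopes can be computed directly. The following is a rough sketch with hypothetical rows of (venue, total attempts, on-target attempts); the `slope` helper and the numbers are assumptions for illustration.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical (venue, total attempts, on-target attempts) rows.
rows = [
    ("home", 10, 4), ("home", 14, 6), ("home", 18, 8),
    ("away", 8, 3), ("away", 12, 5), ("away", 16, 7),
]

slopes = {}
for venue in ("home", "away"):
    totals = [total for v, total, _ in rows if v == venue]
    on_target = [hits for v, _, hits in rows if v == venue]
    slopes[venue] = slope(totals, on_target)

# A gap near zero matches the "roughly equal slopes" reading of the plot.
slope_gap = slopes["home"] - slopes["away"]
```

A formal comparison would typically fit one model with a venue-by-attempts interaction term and test whether that term differs from zero.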
We also wondered whether there was a relationship between time and soccer performance. Like with location, including time in any model is challenging, so we instead looked at a time series plot that compared the average goals scored per day for home teams and visiting teams in the NWSL.
From this plot, we can see that the league plays during the summer. Additionally, the line corresponding to “home” teams is usually higher than that of “away” teams, which reinforces our earlier observations about home-field advantage. In the 2016 and 2017 seasons, “away” teams start averaging more goals per day as the season progresses, which might mean that the players are shaking off some of the rust from not having played in a while. The opposite appears to be true during the 2018 season, when “away” teams score less as the season progresses, and there is no visible change in “away” team averages during the 2019 season. For “home” teams, there is higher variance in goals-per-day averages at the beginning of the 2016 and 2019 seasons than at the beginning of the other two seasons. In all four seasons, “home” teams have a high variance in goals-per-day averages in the later days of the season, and there is no obvious trend.
We then wondered why visiting teams improved as the season progressed during 2016 and 2017, but stopped following this trend in later seasons. One possible reason is a change in league membership, with a new team joining or existing teams leaving.
From the data, we know that two teams had their last seasons in 2017 and a new team came into the league. This raises the question: within the dataset, how well do “new” teams (teams in their first season) fare compared to “old” teams (teams that have existed for at least 1 season in the dataset)? First, we look at some stats that matter for doing well in a soccer match. After some data pre-processing to find which teams were “new,” we used this binary variable to inspect goals, fouls committed, shots on goal, and wins.
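The pre-processing step is not spelled out, so the sketch below shows one possible way to derive the “new” flag from (team, season) pairs. It assumes that teams present in the dataset's earliest season are treated as “old”, since their earlier history is not observable in the data; that rule, the function name, and the team names are all hypothetical.

```python
def flag_new_teams(team_seasons):
    """Map each (team, season) pair to True if it is the team's debut season.

    Assumption: teams appearing in the dataset's first season are not flagged,
    because the data cannot show whether they existed before it.
    """
    first_dataset_season = min(season for _, season in team_seasons)
    first_seen = {}
    for team, season in team_seasons:
        first_seen[team] = min(season, first_seen.get(team, season))
    return {
        (team, season): season == first_seen[team] and season != first_dataset_season
        for team, season in team_seasons
    }

# Hypothetical team-season rows, for illustration only.
team_seasons = [
    ("Team A", 2016), ("Team A", 2017),
    ("Team B", 2016),
    ("Team C", 2018), ("Team C", 2019),
]
is_new = flag_new_teams(team_seasons)
```

The resulting boolean can be joined back onto the team stats and used as the grouping variable for the box plots.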
Looking at the box plots, we see that, overall, the values for expansion
teams tend to be lower than those for non-expansion teams. This is
especially apparent in the “Shots on Goal” and “Fouls Committed” box
plots, where the entire box for the expansion teams sits lower than the
box for the non-expansion teams. This general trend suggests there may
be a difference between expansion teams and non-expansion teams in how
they perform on the pitch.
To take a closer look at how well expansion teams actually perform at scoring goals compared to non-expansion teams, we look at how often teams score compared to how often they shoot the ball.
Based on the above scatter plot with regression lines for expansion and non-expansion teams, we see that there isn’t much difference between how well a new team can score compared to an existing team. The blue and orange lines have very similar slopes. The orange line (representing non-expansion teams) extends over a wider range of shots, but its slope is much the same. So, while some stats may differ between expansion and non-expansion teams, at the end of the day, any kind of team that can shoot the ball toward the goal will score at a very similar rate.
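A simple numeric companion to the scatter plot is a pooled conversion rate per group: total goals divided by total shots. The sketch below uses hypothetical (goals, shots) pairs; the group keys and values are illustrative assumptions, not results from the dataset.

```python
def conversion_rate(rows):
    """Pooled rate: total goals divided by total shots over (goals, shots) pairs."""
    goals = sum(g for g, _ in rows)
    shots = sum(s for _, s in rows)
    return goals / shots if shots else float("nan")

# Hypothetical (goals, shots) pairs per game, grouped by team type.
by_group = {
    "expansion": [(1, 10), (2, 10)],
    "non_expansion": [(3, 20), (3, 20)],
}
rates = {group: conversion_rate(rows) for group, rows in by_group.items()}
```

Similar rates across groups would agree with the similar slopes seen in the plot, even when one group takes far more shots overall.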
From our research questions, we were able to better understand what may contribute to a team’s success in a season. From our first research question, we found that if teams want to perform better in the league, they must have both a strong offense and a strong defense. This means creating a lot of offensive actions while limiting those of the opposing team. Teams that create a lot of scoring chances will earn more points per game, which helps league performance over the course of the season.
For our second research question, we compared the goal differential between the home and away teams for different locations, and saw a general trend of teams further west having a higher goal differential (meaning a stronger home-field advantage) than teams in the east. Differences in the goal differential across regions were confirmed by statistical tests.
For our third research question, we looked at how time influences goal scoring for home and away teams. Our analysis of the relationship between time and play in the NWSL found trends in the first two seasons of our dataset that were not visible in the following two seasons. We explored the idea that this change might be due to a redistribution of certain teams to other cities with the expansion after the 2017 season.
Finally, for the fourth research question, we inspected expansion teams to see if they differed noticeably from non-expansion teams. While expansion teams were lower than non-expansion teams on some stats, they seemed to have no more trouble converting shots to goals.
These results highlight factors related to scoring/converting goals and winning in the NWSL, which NWSL teams can potentially use to inform player recruitment, salary decisions, or tactical strategy.