Introduction

The world of soccer is constantly growing, with millions of fans across the globe, and with that growing fanbase comes a growing betting community: soccer accounts for an estimated 70% of all global sports bets placed. We focus specifically on the Premier League, as it is the most watched soccer league in the world. A number of Premier League clubs carry betting companies as sponsors on their kits, which further encourages supporters to place bets. Are the odds for each match’s outcome fair? Do the betting favorites tend to come out on top in the Premier League? Predicting Premier League match outcomes has real-world implications for the betting market, journalism, and coaching decisions.

Previous studies have used Bayesian network models to predict match results in Premier League seasons, with accuracies hovering around 40%; most of these models struggled to predict matches reliably. These modest performances indicate the complexity of forecasting football match outcomes. Gambling organizations also rely on match prediction probabilities when setting odds. By drawing on such models, oddsmakers can create fair and competitive betting markets for the general public by offering educated and balanced odds. This highlights the need for robust and innovative approaches in this area.

This paper aims to build a comprehensive predictive framework for forecasting match results in the 2018-2019 Premier League season. To build upon the existing foundation of betting odds, we enriched our predictive models with two key components: player evaluation metrics and each team's recent form based on its performances over the last five matches. The player evaluation metrics provide a better gauge of each team's strength and of where that strength lies relative to its opposition. We aim to contribute insights that can help sports analysts, coaches, and enthusiasts alike make more informed decisions and better understand the dynamics of competitive football matches in one of the most esteemed leagues in the world. This was done through a thorough evaluation of three different models: Generalized Additive Models, Random Forests, and Multinomial Regression Models.

Exploratory Data Analysis & Data Summary

The Data:

For our research, we utilized two publicly available data sets from the Premier League 2018-2019 season. The first data set was obtained from footstats.org, accessible under the “Football Stats Database to CSV and Excel” section on the website. This CSV file contained game-level data, with each entry representing a specific game from the 2018-2019 season. The data set comprised approximately 72 columns, providing detailed information about various events occurring in each match. These variables covered a wide range of aspects, including the names of the home and away teams, the full-time result of the match, expected goals, shots on target for the home team, as well as additional details like the referee’s name, the stadium where the match was played, and the timestamp of the game.

The second data set was betting data from football-data.co.uk, which provides free soccer betting odds and results, match statistics, odds comparisons, and more. As with the first source, we downloaded only data pertaining to the Premier League’s 2018-2019 season to ensure compatibility with our existing data set. The football-data.co.uk data set differed from the first in that it contained betting odds from a variety of betting sites; for the purpose of our research, we focused solely on the Bet365 odds, allowing us to maintain consistency throughout our analysis.

After obtaining the two data sets, we merged them by the home team, the away team, and the date of each match, resulting in a data set with 136 columns and over 300 observations. We then dropped all columns that were unnecessary for our analysis, renamed the remaining columns, and standardized the format of the match date.

After processing the data, we added several additional variables:

  • expect_win: the expected match winner implied by the betting odds
  • expect_point_h: expected home points from the betting odds
    • We computed the expected home points following the approach described on casino.betmgm.com. The betting data report three decimal odds for each game: home win, draw, and away win. We first converted each of these odds into a raw implied probability by taking 1/odds. Adding the three raw probabilities together gives the bookmaker's total margin for that game. Dividing each raw probability by this total yields normalized probabilities for the three outcomes that reflect the betting data. Finally, the expected home points are computed as \(percent_H \cdot 3 + percent_D \cdot 1\) (see the code sketch after this list).
  • Full_time_result_point: the actual match outcome converted into points for the home team: loss = 0, draw = 1, win = 3
  • expect_point_diff_bet: the difference between the expected points of the home team and the away team based on the betting data
  • streak: each team's form over its previous five matches, weighted by historical data
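
The conversion above can be expressed compactly in code. Below is a minimal sketch in R, assuming the Bet365 decimal-odds columns are named B365H, B365D, and B365A as in the football-data.co.uk files; the exact column names in our merged data set may differ.

```r
library(dplyr)

betting <- betting %>%
  mutate(
    # raw implied probabilities from the decimal odds
    raw_h = 1 / B365H,
    raw_d = 1 / B365D,
    raw_a = 1 / B365A,
    # bookmaker margin: the three raw probabilities sum to slightly more than 1
    margin = raw_h + raw_d + raw_a,
    # normalized outcome probabilities
    percent_h = raw_h / margin,
    percent_d = raw_d / margin,
    percent_a = raw_a / margin,
    # expected points: 3 for a win, 1 for a draw, 0 for a loss
    expect_point_h = percent_h * 3 + percent_d * 1,
    expect_point_a = percent_a * 3 + percent_d * 1,
    expect_point_diff_bet = expect_point_h - expect_point_a,
    # expected winner implied by the odds
    expect_win = case_when(
      percent_h >= pmax(percent_d, percent_a) ~ "H",
      percent_a >= pmax(percent_h, percent_d) ~ "A",
      TRUE ~ "D"
    )
  )
```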

Although the merged data set already provides ample information to run our models effectively, we took an additional step to enhance our predictions: incorporating a team-rating variable. We took player rating data from FIFA.com for the 2018-19 season, filtered to Premier League players only, and then computed a position-level rating within each team. The rating process involved several steps. First, we considered the number of minutes played by each player in the season; players with more minutes received a higher weight, since they are more likely to have a larger impact on the team. Second, we grouped the players into their respective positions (forwards, midfielders, defenders, and goalkeepers), allowing us to rate each position group on each team. From there, we computed an overall team rating, which serves as a valuable metric in our analysis.
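
A minimal sketch of this aggregation in R, assuming a players data frame with team, position, rating, and minutes columns (names are illustrative, and the exact way position ratings are combined into the final team rating may differ from our implementation):

```r
library(dplyr)

# Minutes-weighted average rating for each position group within each team.
position_ratings <- players %>%
  group_by(team, position) %>%   # forwards, midfielders, defenders, goalkeepers
  summarise(pos_rating = weighted.mean(rating, w = minutes), .groups = "drop")

# Overall team rating as the average of its position-group ratings.
team_ratings <- position_ratings %>%
  group_by(team) %>%
  summarise(team_rating = mean(pos_rating), .groups = "drop")
```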

Exploratory Data Analysis (EDA)

Before modeling, we examined the distributions of the expected goal differential in the basic data set and in the betting data set to better understand the data and the relationships between variables.

As the distributions show, the betting data set does a better job of capturing the expected goal differential than the basic data set. The curves from the basic data set are approximately normal, whereas the curves from the betting data are skewed left or right depending on whether the home or away team wins. In the betting data, bookmakers adjust the odds based on betting patterns to ensure a profit margin, which can lead to these skewed distributions; the basic data set, by contrast, presents a more idealized view that results in a roughly normal distribution. Thus, incorporating betting data into our models should yield more precise predictions of Premier League match outcomes, because the skewness in the betting data carries additional information.

Methods

Prior to applying all explanatory parameters to our models, we first tested the predictability of match outcomes from the betting odds alone. Because the dependent variable (match outcome) is an ordered categorical variable, the models we could utilize were limited to three main classes: Generalized Additive Models (GAMs), multinomial regressions, and random forest algorithms. After assembling the base models, we added further independent parameters to increase the predictive accuracy of our models. In total, we used four explanatory variables: the streak difference (streak_diff, the last five games’ results of each team leading up to that specific match), the expected points for the home team derived from the betting odds (expect_point_h), the expected goals (XG) difference from the basic data set (xg_diff), and an expected goal differential based on the offense and defense ratings of the home and away teams (pos_rating_regression).

One of the main objectives of our project is to incorporate a team rating based on player evaluation metrics to improve the accuracy of match predictions. Following the advice of our advisor Dr. Konstantinos Pelechrinis, we used a multiplicative regression of offense ratings, defense ratings, and XGs: \[\min \sum_{i=1}^{n}\left[(y_{h_i}-y_{a_i}) - (g_{h_i} \cdot o_h \cdot d_h - g_{a_i} \cdot o_a \cdot d_a)\right]^2\] The variables in this objective are as follows:

  • \(y_{h_i}\) is the actual number of goals scored by the home team in game \(i\), while \(y_{a_i}\) is the actual number of goals scored by the away team in game \(i\).
  • \(g_{h_i}\) is the expected goals for the home team in match \(i\), and \(g_{a_i}\) is the expected goals for the away team in match \(i\).
  • \(o_h\) is the offense rating of the home team and \(o_a\) is the offense rating of the away team.
  • \(d_h\) and \(d_a\) are the defense ratings of the home and away teams, respectively.

Because the multiplicative regression minimizes the squared difference between the actual and expected goal differentials, the fitted relationship can be written as an approximation: \[0 \approx (y_{h_i}-y_{a_i}) - (g_{h_i} \cdot o_h \cdot d_h - g_{a_i} \cdot o_a \cdot d_a)\] \[y_{h_i}-y_{a_i} \approx g_{h_i} \cdot o_h \cdot d_h - g_{a_i} \cdot o_a \cdot d_a\]
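
In practice, the right-hand side of this approximation is what enters the models as the rating-based feature. A minimal sketch in R, assuming the offense and defense ratings are already available per team (e.g., from the weighted player ratings described in the next paragraph) and using illustrative column names:

```r
library(dplyr)

# Attach home/away offense-defense ratings and compute the rating-based
# expected goal differential, y_h - y_a ≈ g_h * o_h * d_h - g_a * o_a * d_a.
add_pos_rating_feature <- function(matches, ratings) {
  matches %>%
    left_join(ratings, by = c("home_team" = "team")) %>%
    rename(o_h = offense, d_h = defense) %>%
    left_join(ratings, by = c("away_team" = "team")) %>%
    rename(o_a = offense, d_a = defense) %>%
    mutate(pos_rating_regression = xg_home * o_h * d_h - xg_away * o_a * d_a)
}
```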

Note that the offense and defense ratings were computed by taking the weighted average rating of the individual defensive and offensive players on each team. After setting up the variables, the models were trained on matches from the start of the 2018-19 season through February, which covers roughly \(2/3\) of the season and gives the models enough data to learn from; the remaining matches of the 2018-19 season were held out as the test set.

  • Multinomial regression: First, we fit multinomial models, since the response variable is categorical and these models use a linear combination of the observed variables to estimate the probability of each class of the dependent variable. A compact version of the prediction function is:

\(f(k, i) = \beta_k \cdot x_i\), where \(\beta_k\) is the vector of regression coefficients associated with outcome \(k\), and \(x_i\) is the (row) vector of explanatory variables associated with observation \(i\).

  • Generalized Additive Models (GAMs): We fit GAMs with various combinations of explanatory variables, since some parameters could cause the model to overfit the training data. As mentioned previously, because the response variable is an ordered, non-binary categorical variable, we used multinomial GAMs with K+1 categories modeled through K linear predictors. The probability of each response category can be written as \[\frac{e^{\eta_y}}{1 + \sum_{j=1}^{K} e^{\eta_j}}\] where \(\eta_y\) is the linear predictor for category \(y\) of the dependent variable.

  • Random forest algorithms: Lastly, we used random forests to generate predicted probabilities for each category of the dependent variable. For a fast implementation, we used the ranger package with its classification method. A combined sketch of the three model fits appears below.
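
The following is a minimal sketch of the three model families in R, assuming a train/test split as described above and illustrative variable names (ftr denotes the full-time result; the exact model formulas we used varied across model versions):

```r
library(mgcv)    # multinomial GAMs
library(ranger)  # random forests
# nnet is called with an explicit namespace to avoid clashing with mgcv::multinom

# Multinomial regression on the engineered features.
fit_multinom <- nnet::multinom(ftr ~ expect_point_h + streak_diff, data = train)

# Multinomial GAM: mgcv expects the outcome coded 0..K and one formula per
# non-reference category (K = 2 linear predictors for three outcomes).
train$ftr_num <- as.integer(factor(train$ftr)) - 1L
fit_gam <- gam(list(ftr_num ~ s(expect_point_h) + s(pos_rating_regression),
                    ~ s(expect_point_h) + s(pos_rating_regression)),
               family = multinom(K = 2), data = train)

# Random forest with class-probability output via the ranger package.
fit_rf <- ranger(ftr ~ expect_point_h + streak_diff + xg_diff,
                 data = train, probability = TRUE)

# Predicted outcome probabilities on the held-out matches.
p_multinom <- predict(fit_multinom, newdata = test, type = "probs")
p_gam      <- predict(fit_gam, newdata = test, type = "response")
p_rf       <- predict(fit_rf, data = test)$predictions
```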

Results

To assess the accuracy of the model predictions, we computed each model's Brier score, the equivalent of mean squared error applied to predicted probabilities. The Brier score operates on mutually exclusive discrete outcomes and measures the mean squared difference between the predicted probabilities and the actual outcomes, so the lower the Brier score, the better calibrated the predictions.
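
For reference, one common form of the multiclass Brier score can be computed as below. This is a sketch with illustrative names; the exact normalization used in our project (e.g., summing over classes per match versus averaging over all entries) may differ.

```r
# Brier score: mean squared difference between the matrix of predicted
# probabilities and the one-hot encoding of the observed outcomes.
brier_score <- function(prob, outcome) {
  # prob:    n x K matrix of predicted probabilities (columns in class order)
  # outcome: factor of observed classes, with levels in the same order
  obs <- model.matrix(~ outcome - 1)   # one-hot encoded outcomes
  mean(rowSums((prob - obs)^2))        # sum over classes, average over matches
}

brier_score(p_gam, test$ftr)
```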

As shown below, the best model (lowest Brier score) is the GAM with the rating-based expected goal differential and the betting-derived expected points as independent variables. Although most models' Brier scores range between 0.25 and 0.27, the random forest models have exceptionally high Brier scores, averaging around 0.46. We therefore conclude that GAMs are the most suitable method here, followed closely by the multinomial models. This result can be attributed to the GAMs' smoothing of the independent variables in the training data, which improves out-of-sample prediction.

Calibration Plots

Moreover, to further illustrate model accuracy, we use calibration plots, which visualize the disparity between the probabilities predicted by a model and the actual class frequencies in the data. The best model (init_logit_gam_pos) has its class lines lying predominantly close to the 45-degree dashed line of perfect calibration. The base multinomial model over-predicts match outcomes, causing type II errors. Lastly, the base random forest exhibits both type I and type II errors, which makes it unfavorable for predicting match outcomes.
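
Calibration curves of this kind can be constructed by binning the predicted probabilities for a given class and comparing each bin's average prediction with the observed frequency. A minimal sketch for a single class (variable names and the column index are illustrative):

```r
library(dplyr)
library(ggplot2)

calibration_data <- function(p, y, n_bins = 10) {
  # p: predicted probabilities for one class; y: 1 if that class occurred, else 0
  data.frame(p = p, y = y) %>%
    mutate(bin = cut(p, breaks = seq(0, 1, length.out = n_bins + 1),
                     include.lowest = TRUE)) %>%
    group_by(bin) %>%
    summarise(pred = mean(p), obs = mean(y), n = n(), .groups = "drop")
}

# Example: calibration curve for the home-win class of the GAM predictions,
# assuming column 3 of p_gam corresponds to a home win and ftr uses "H".
calibration_data(p_gam[, 3], as.integer(test$ftr == "H")) %>%
  ggplot(aes(x = pred, y = obs)) +
  geom_abline(linetype = "dashed") +   # 45-degree line of perfect calibration
  geom_line() +
  geom_point(aes(size = n)) +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "Predicted probability", y = "Observed frequency")
```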

Multinomial Base

GAMs (best)

Base Random Forest

Conclusion

In this project our goal was to enhance match outcome predictions for the 2018-2019 English Premier League season by using a variety of predictive models and a richer set of variables. We aimed to outperform the standard betting probabilities, which served as our baseline model. Our results showed that adding player and team evaluations to the baseline model considerably increased the accuracy of match outcome forecasts. Through the use of generalized additive models, random forests, and multinomial models, we gained valuable insights into the strengths and weaknesses of each approach. The integration of player-specific information, such as ratings and minutes played, allowed us to discern the impact of individual athletes on team performance. However, the integration of team streak data and expected goals had a negative impact on prediction accuracy, due to overfitting on the training data and redundancy with other variables.

Despite the promising results and contributions, this project has limitations, chief among them the inherent unpredictability of sports events, and of soccer matches in particular. While we incorporated recent form, player evaluations, and other factors captured in the pre-match betting odds (pre-match injuries, weather, etc.), unpredictable factors such as in-play injuries, lineup changes, and unexpected tactical adjustments during the game were not incorporated into our model. These dynamic elements may limit the accuracy of our predictions. Furthermore, the data are specific to one English Premier League season; future seasons may show different trends and dynamics.

Future development of this project could involve adding new variables, such as real-time data, to account for unpredictable factors like injuries. These predictions could also be used to simulate league outcomes, identifying which teams are most at risk of relegation and which are most likely to qualify for the Champions League. Finally, because betting odds are subject to bookmakers’ interests and market fluctuations, collecting data from more than one betting company and averaging the odds would lead to more reliable results.

Acknowledgments

For their direction, inspiration and constant support, we are grateful to Dr. Konstantinos Pelechrinis, Dr. Ronald Yurko and Meg Ellingwood. Their assistance throughout this project has been immensely appreciated. We also extend our thanks to the entire Carnegie Mellon Sports Analytics Camp of 2023, the guest speakers and the Teaching Assistants. Finally, we are thankful to Carnegie Mellon University for funding this research camp and project.

References

Ulmer, Ben, Matthew Fernandez, and Michael Peterson. “Predicting Soccer Match Results in the English Premier League.” Doctoral dissertation, Stanford University, 2013.

Razali, Nazim, et al. “Predicting Football Matches Results Using Bayesian Networks for English Premier League (EPL).” IOP Conference Series: Materials Science and Engineering, vol. 226, no. 1, IOP Publishing, 2017.

Holmes, Benjamin, and Ian G. McHale. “Forecasting Football Match Results Using a Player Rating Based Model.” International Journal of Forecasting, 2023.