Football is the most popular sport in the world, loved by millions of people in every country. Since the first official international match in 1872, tens of thousands of international matches have been played all over the world. Until recently, however, a lack of consolidated data made it difficult to analyze the results of these matches as a whole. A recently created dataset now covers the results of all of these matches, which lets us conduct some of the first meaningful analysis of every official international football match played up to the present. In this report, we analyze the International Football Results from 1872-2023 dataset. It includes 45,315 results of international football matches, from the very first official match in 1872 up to 2023. All types of matches are recorded, ranging from the FIFA World Cup to the FIFI Wild Cup to regular friendly matches. The matches are strictly men’s full internationals; the data does not include Olympic Games or matches where at least one of the teams was the nation’s B-team, U-23 team, or a league select team.
The International Football Results from 1872-2023 dataset can be found on Kaggle at this link: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017/?select=shootouts.csv. This dataset consists of 3 separate data files that include these variables of interest:
Categorical Variables: date: date of the match. home_team: the name of the home team. away_team: the name of the away team. tournament: the name of the tournament. city: the name of the city/town/administrative unit where the match was played. country: the name of the country where the match was played. neutral: TRUE/FALSE column indicating whether the match was played at a neutral venue. winner: winner of the penalty-shootout. first_shooter: the team that went first in the shootout. team: name of the team scoring the goal. scorer: name of the player scoring the goal. own_goal: whether the goal was an own-goal. penalty: whether the goal was a penalty.
Quantitative Variables: home_score: full-time home team score including extra time, not including penalty-shootouts. away_score: full-time away team score including extra time, not including penalty-shootouts. Spread: home_score minus away_score, a variable we derive for this analysis.
In this report, we will be addressing three research questions:
1. Is there a home team advantage? And if so, does this advantage change across time, tournaments, and location?
2. Is there a most dominant team of all time?
3. Are there teams that appear to be historically evenly matched?
We begin with our first research question: does a home team advantage exist in these matches, and if so, does it change with the location of the match? To answer it, we examine the variables home_score, away_score, and neutral. First, we look at the marginal distributions of home score and away score, and at how these distributions vary depending on whether the match was played at a neutral site, as seen below:
From the marginal distributions above, we can see that home score has slightly higher counts at higher score values than away score, so on average the home team scores slightly more than the away team. When the distributions are colored by whether the match was played at a neutral site (a site that is neither team’s home field), matches at non-neutral sites (FALSE) show the same pattern: the home score is slightly higher than the away score. For matches at neutral sites (TRUE), the pattern changes: home and away scores have more similar distributions, and the home score no longer has higher counts at higher values. In other words, relative to non-neutral matches, the home score is slightly lower and the away score slightly higher on average at neutral sites, so the gap between the two scores narrows.
Now to further investigate our question of interest, we decided to create and observe a new variable named Spread. This variable is the home score minus the away score. So positive values are matches in which the home team won (higher home score), negative values are the matches the away team won (higher away score), and the value of 0 are matches that ended in ties. To observe this variable, we plotted the distribution of Spread, and colored it by if the game was played at a neutral site or not, as seen below:
From the distribution of Spread above, we can see that for matches not played at a neutral site (FALSE), the counts of positive Spread values are larger than the counts of negative values, meaning the home team won more often than the away team. For matches played at a neutral site (TRUE), the counts of positive and negative values are much closer to equal, meaning that, relative to non-neutral matches, the designated away team won more often and the designated home team won less often. Spread therefore follows the same trend we observed above: home teams get better results at non-neutral sites, and this edge shrinks when the match is played at a neutral site.
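The Spread comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the report: the sample rows and the function name `spread_outcomes` are hypothetical, and only the column meanings (home_score, away_score, neutral) come from the dataset description.

```python
from collections import Counter

# Hypothetical sample rows: (home_score, away_score, neutral)
matches = [
    (3, 1, False),
    (0, 0, False),
    (2, 1, False),
    (1, 2, False),
    (1, 1, True),
    (0, 2, True),
    (2, 0, True),
]

def spread_outcomes(rows):
    """Count home wins, draws, and away wins, split by the neutral flag."""
    counts = {False: Counter(), True: Counter()}
    for home_score, away_score, neutral in rows:
        spread = home_score - away_score  # Spread as defined in the report
        if spread > 0:
            outcome = "home_win"
        elif spread < 0:
            outcome = "away_win"
        else:
            outcome = "draw"
        counts[neutral][outcome] += 1
    return counts

counts = spread_outcomes(matches)
for neutral, tally in counts.items():
    print("neutral" if neutral else "non-neutral", dict(tally))
```

Comparing the `home_win` and `away_win` counts within each group is the tabular equivalent of comparing the positive and negative sides of the Spread histogram.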
To extend this analysis, we conduct a Kolmogorov-Smirnov (K-S) test of whether Spread follows a normal distribution. As seen above, the distribution of Spread looks roughly bell-shaped. A normal distribution is symmetric about its mean, so a normally distributed Spread centered at 0 would have equal numbers of positive and negative values, implying no advantage for either team. If the distribution is not normal, that symmetry is no longer guaranteed, and the counts of positive and negative Spread values may differ. The results of the K-S test are seen below:
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: results_data$Spread
## D = 0.14317, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since the p-value is effectively 0 (well below 0.05), we have sufficient evidence to reject the null hypothesis that Spread follows a normal distribution. Rejecting normality does not by itself show which way the distribution leans, but combined with the previous analysis, in which positive Spread values outnumbered negative ones, we conclude that there is a home team advantage. This advantage appears to be lessened when the match is played at a neutral site.
Having seen that a home team advantage exists and is affected by the location of the match (neutral site), we now want to learn whether the advantage holds across different tournaments, in order to examine whether unfairness exists in major tournaments. To do this, we examine the variables home_score and away_score across tournaments, as seen below:
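To make the test statistic D concrete, here is a minimal Python sketch of a one-sample K-S distance against a normal distribution. The report's output appears to come from R, and it does not state which mean and standard deviation were supplied for the reference distribution, so this sketch assumes they are estimated from the sample. The function name and the sample values are hypothetical, and no p-value is computed.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """One-sample K-S distance between the sample's empirical CDF and a
    normal CDF whose mean and standard deviation are taken from the sample
    (an assumption; the report does not state the reference parameters)."""
    xs = sorted(sample)
    n = len(xs)
    reference = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = reference.cdf(x)
        # The empirical CDF jumps at x: compare just after and just before the jump.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical Spread values, not taken from the dataset
spreads = [-2, -1, 0, 1, 2]
d_stat = ks_statistic_normal(spreads)
print(round(d_stat, 3))
```

D measures only how far the sample's shape is from the reference normal curve. It carries no information about whether positive or negative values dominate, which is why the histogram comparison is still needed to interpret the test.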
The graph above compares average home and away scores across major tournaments, defined here as those with more than 200 occurrences in the dataset up to the present day. We restrict attention to major tournaments because minor tournaments, with far fewer matches, might introduce outliers. The blue and red bars represent the average home and away scores, respectively. In a large majority of these tournaments, the average home score exceeds the average away score. This recurring pattern implies a consistent benefit for teams playing on their home turf in the form of more goals scored.
The home advantage is often attributed to several factors beyond the numerical difference in scores. Playing at home gives teams familiar surroundings, support from home fans, less travel-related fatigue, and a general comfort level, all of which may support better performance. The advantage may also extend to referee decisions, crowd influence on player morale, and tactical familiarity. Our data cannot separate these factors, but together they could tilt the balance in favor of the home team. Quantifying the influence of the home advantage on match outcomes is useful for predicting results, shaping team strategies, and refining predictive models.
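The per-tournament comparison can be sketched as follows in Python. This is an illustration under assumptions rather than the report's own code: the rows are hypothetical, the function name `average_scores_by_tournament` is invented for the example, and the threshold is lowered so that the toy data shows the filtering (the report keeps tournaments with over 200 occurrences).

```python
from collections import defaultdict

# Hypothetical rows: (tournament, home_score, away_score)
rows = [
    ("Friendly", 2, 1),
    ("Friendly", 0, 0),
    ("Friendly", 3, 2),
    ("FIFA World Cup", 1, 1),
    ("FIFA World Cup", 2, 0),
    ("Minor Cup", 7, 0),
]

def average_scores_by_tournament(rows, min_matches):
    """Mean home and away score per tournament, keeping only tournaments
    with at least `min_matches` recorded matches."""
    totals = defaultdict(lambda: [0, 0, 0])  # home goals, away goals, matches
    for tournament, home_score, away_score in rows:
        entry = totals[tournament]
        entry[0] += home_score
        entry[1] += away_score
        entry[2] += 1
    return {
        name: (home / n, away / n)
        for name, (home, away, n) in totals.items()
        if n >= min_matches
    }

averages = average_scores_by_tournament(rows, min_matches=2)
for name, (home_avg, away_avg) in averages.items():
    print(f"{name}: home {home_avg:.2f}, away {away_avg:.2f}")
```

The single 7-0 result in the hypothetical "Minor Cup" shows why a minimum match count matters: with one match, the average is entirely determined by an extreme score.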
We also want to understand whether the home advantage has persisted over time. To do this, we plotted average home and away scores over time using a line chart.
The graph offers a longitudinal view of average home and away scores spanning over 150 years. A blue line represents the average away score, a red line the average home score, and a green line the difference between the two. Throughout the period, the average home score tends to exceed the average away score, as shown by the green line sitting predominantly above zero. For the majority of the time, then, teams playing on their home grounds have scored more than their away counterparts.
The graph also shows the score difference decreasing and stabilizing over time. From the 1800s through the 2000s, the variance of the green line shrinks noticeably and the line converges toward the zero baseline. This suggests a diminishing home advantage across eras and a tendency toward fairer, more balanced competition. As football evolves, the factors behind the home advantage may change: advances in training, technology, and tactics, or changes in match conditions, might all be contributing to a more level playing field. Tracking this trend matters for assessing how international competition is changing, since it points to greater equilibrium in match outcomes.
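The three lines in the chart (average home score, average away score, and their difference) can be computed per year along these lines. This Python sketch is illustrative only: the rows are hypothetical, the function name `yearly_averages` is invented, and it assumes the date column is formatted as "YYYY-MM-DD".

```python
from collections import defaultdict

# Hypothetical rows: (date as "YYYY-MM-DD", home_score, away_score)
rows = [
    ("1872-11-30", 0, 0),
    ("1873-03-08", 4, 2),
    ("2022-06-01", 1, 1),
    ("2022-06-05", 2, 1),
]

def yearly_averages(rows):
    """Return {year: (mean home score, mean away score, mean difference)}."""
    sums = defaultdict(lambda: [0, 0, 0])  # home goals, away goals, matches
    for date, home_score, away_score in rows:
        year = int(date[:4])  # assumes an ISO-style date string
        entry = sums[year]
        entry[0] += home_score
        entry[1] += away_score
        entry[2] += 1
    return {
        year: (h / n, a / n, (h - a) / n)
        for year, (h, a, n) in sorted(sums.items())
    }

by_year = yearly_averages(rows)
for year, (home_avg, away_avg, diff) in by_year.items():
    print(year, home_avg, away_avg, diff)
```

Early years contain very few matches, so their averages swing widely. That alone can produce the high early variance seen in the difference line, independent of any real change in home advantage.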
For our second research question, whether there is a most dominant team of all time, the most important element is how we define “dominant”. There are many interpretations of the term and just as many statistical procedures that could justify a given conclusion. We take a measured approach that combines statistics and heuristics. For our purposes, we divide the question into two sub-analyses: overall win-rate and FIFA World Cup performance. Beginning with the former, we apply our first heuristic and restrict our consideration to the top ~5% of teams in terms of games played. This is justified not just by sample size, but also by our belief that historical performance is a critical part of the question. Charted over time, these are the cumulative win-rates of the teams that meet this restriction:
As the graphic above shows, the teams have a variety of histories and win-rates, including the date of their first match. For example, England played their first official match in 1876, about twenty-five years before any of the other teams in consideration. Argentina, Hungary, Uruguay, and France started within two years of each other around 1903. Mexico began in about 1925, and Germany came a few years after. The youngest team in consideration is South Korea, which started playing in 1949.
Brazil, Germany, and Uruguay have similar results: each begins with a relatively sharp rise in win-rate, which then slows and levels off. Uruguay, however, has been slowly rising for the past forty years. Mexico and France started the 1900s with a steep decline in cumulative win-rate, then turned it around in about 1950 with a steady increase up to today. Sweden began with a small decline, then improved until the 1960s and has remained at about the same level since. Hungary and South Korea share a similar pattern, a rise in win-rate followed by a steady decline up to today, though Hungary’s history is about fifty years longer, so its line is stretched further than South Korea’s. In terms of trends, Argentina is the most varied, with a steady increase for about sixty years, then a small twenty-year decline, and finally a rise.
Every team shown has a current cumulative win-rate above 0.5, meaning that they have won more than half of their matches. Yet Brazil and Germany are the only teams whose win-rates have increased consistently year over year since their inception. Only in the past couple of years has Brazil seen a very slight decrease. Even so, Brazil currently holds a cumulative win-rate of ~0.8 and Germany ~0.75. For these reasons, we consider Brazil and Germany to be in very strong contention for the most dominant team.
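The cumulative win-rate plotted for each team is a running proportion, which can be sketched in Python as follows. The function name and the result sequence are hypothetical, and the sketch assumes that draws count as matches played but not as wins, which the report does not state explicitly.

```python
def cumulative_win_rate(results):
    """Given a team's results in chronological order ("W", "D", or "L"),
    return the running share of matches won after each match.
    Draws count as matches played but not as wins (an assumption)."""
    rates = []
    wins = 0
    for played, result in enumerate(results, start=1):
        if result == "W":
            wins += 1
        rates.append(wins / played)
    return rates

# Hypothetical sequence of results for one team
rates = cumulative_win_rate(["W", "L", "W", "W", "D"])
print([round(r, 2) for r in rates])
```

Because each new match is averaged into an ever-larger history, the line moves less and less as a team plays more games. This explains why long-established teams show curves that level off.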
We determined tournament performance to be a particularly important aspect of answering our question. We wanted to isolate matches in which there were accolades on the line, as opposed to friendly matches. Applying our second heuristic, we chose to analyze just the FIFA World Cup because it is widely regarded as the greatest achievement for a national team. This tournament also draws teams from around the world, so we are not restricted to regional tournaments in which the best countries may never face each other. For the teams mentioned previously, we counted the number of top 3 finishes each team had in the tournament, as shown here:
Summed across every country shown, there are forty-one total World Cup top 3 finishes. Brazil and Germany alone make up the majority of those finishes. Argentina and France take up similar (though smaller) proportions, and the remaining countries take up smaller, less significant slices. Germany has a slight margin over Brazil overall, but we can break these top 3 finishes down into first, second, and third place finishes.
From this graph we can determine that Brazil has the most first place finishes with five, followed by Germany with four, Argentina with three, France and Uruguay with two each, and England with one. Notably, two of the teams with top 3 finishes (Sweden and Hungary) have never won a World Cup, so they are excluded from the graphic. We can examine second and third place finishes with the following:
In the left pie chart, Germany (4) and Argentina (3) have an almost equal number of second place finishes. Brazil, France, and Hungary each have two, and Sweden has just one. Uruguay and England are excluded because they have none. In the right pie chart, Germany has exactly double the number of third place finishes of Brazil, France, or Sweden, each of which has two. Argentina, Hungary, England, and Uruguay are excluded because they have never finished third. With regard to our question, Germany holds the plurality of second and third place finishes, while Brazil holds the plurality of first place finishes. Though we have narrowed our selection down to two strong candidates, the discrepancy in placements makes answering the question difficult and subjective. Should Brazil’s one additional first place finish outweigh Germany’s two additional second place and two additional third place finishes?
To help resolve this question and, perhaps, reach a statistical middle ground, we use a weighted average. Each team’s average placement is determined by summing the weights of its placements (1st = 3, 2nd = 2, 3rd = 1) and dividing by that team’s number of placements. The results are shown in the following table:
## Weighted Average
## Uruguay 3.0
## England 3.0
## Argentina 2.5
## Brazil 2.3
## France 2.0
## Germany 2.0
## Hungary 2.0
## Sweden 1.3
The weighted average ranges from 3.0 to 1.3 across these countries. England and Uruguay are evaluated at 3.0, Argentina at 2.5, Brazil at 2.3, France, Germany, and Hungary at 2.0, and Sweden at 1.3. This doesn’t tell the full story, however, since many of these countries have reached first, second, or third place far fewer times than Germany and Brazil. We should therefore still hold Germany and Brazil as our possible answers, though Brazil holds the edge on this specific metric.
Taking the evidence collected so far together, not only can our question be narrowed down to two choices, but we believe the statistical edge goes to Brazil. We first examined the cumulative win-rates in a line plot. Though each country’s curve has its own shape, Brazil has always maintained a higher win-rate than Germany, and today Brazil leads Germany in cumulative win-rate by about five percentage points (~0.8 versus ~0.75). For FIFA World Cup placements, the pie charts reveal that Brazil has the most first place finishes, though Germany has more second and third place finishes. Using the weighted average, Brazil has a higher average placement than Germany. For these reasons, Brazil can reasonably be considered the most dominant.
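As a check on the table, the weighted average can be recomputed from the finish counts given in the text. The Python sketch below uses the report's weights (1st = 3, 2nd = 2, 3rd = 1) and the first, second, and third place counts read off the pie chart descriptions. The names `finishes`, `WEIGHTS`, and `weighted_average` are hypothetical.

```python
# Top 3 finish counts as reported in the text: (first, second, third)
finishes = {
    "Uruguay": (2, 0, 0),
    "England": (1, 0, 0),
    "Argentina": (3, 3, 0),
    "Brazil": (5, 2, 2),
    "France": (2, 2, 2),
    "Germany": (4, 4, 4),
    "Hungary": (0, 2, 0),
    "Sweden": (0, 1, 2),
}

WEIGHTS = (3, 2, 1)  # weights for 1st, 2nd, and 3rd place, as in the report

def weighted_average(counts):
    """Sum of placement weights divided by the number of placements."""
    total_placements = sum(counts)
    weighted_sum = sum(w * c for w, c in zip(WEIGHTS, counts))
    return weighted_sum / total_placements

scores = {team: round(weighted_average(c), 1) for team, c in finishes.items()}
for team, score in scores.items():
    print(f"{team:<10} {score}")
```

For Brazil this gives (5·3 + 2·2 + 2·1) / 9 = 21/9 ≈ 2.3, and for Germany (4·3 + 4·2 + 4·1) / 12 = 2.0, matching the table. Because the metric divides by the number of placements, a team with a single first place finish scores 3.0, which is why the raw counts still need to be read alongside it.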
For our third research question, identifying teams that appear to be historically evenly matched, we can start by examining patterns in the average number of goals per decade for each team, i.e. country. The dataset records hundreds of teams, so studying the average number of goals per decade country by country may not provide much insight into our research question. We therefore decided to investigate decade-by-decade trends in the average number of goals for each continent, as seen in the line chart below:
To visualize these trends, we graphed the change in average number of goals over time, colored by continent. Before 2000, the average number of goals generally increased for all continents. Among the continents, Europe consistently ranked first except for a slight fall during the 1940s. Oceania has ranked last ever since its first record in the 1960s, which is reasonable given that Oceania has only two major countries, New Zealand and Australia. We also noticed a fluctuation in the average number of goals in the Americas between the 1930s and 1970s, which overlaps with the period when Europe experienced its short fall in goals. We suspect this coincidence is related to the outbreak and course of World War II, when countries and teams could not devote much effort and time to football training and competitions.
In the line chart above, the average number of goals peaks at around 2000. Since then, it has decreased for all continents except Europe, which peaked around 2010. We consider players’ increased fitness levels and the adoption of more effective strategies to be possible reasons for this decrease. As goalkeeping techniques advance, goalkeepers may be better at making crucial saves, reducing the overall number of goals scored. Beyond these potential factors, the trend is intriguing and worthy of further exploration.
Having examined the historical trend in average number of goals for each continent, we would like to better understand how the share of goals changes within each continent. We focus on the change from the 2000s, the peak, to the 2020s, the most recent decade. These changes are shown in the choropleth maps below:
To answer this research question, we drew choropleth maps of each country’s share of its continent’s average goals in the 2000s and the 2020s. From the maps above, setting aside countries with NA values, we observe that Canada and the United States roughly exchanged their shares of the average number of goals within the Americas, with a clear increase in Canada’s share. Within Europe, there is also a clear decrease in Russia’s proportion of goals and an increase in Kazakhstan’s. Apart from these shifts, we believe that most countries preserved an approximately equal share within their continents between the 2000s and 2020s. Investigating the factors that influence changes in goal proportions could help in developing plans and ideas to improve team performance.
Our examination of historical trends in the average number of goals revealed clear patterns and shifts, particularly at the continental level. We concentrated mainly on continental trends, highlighting Europe’s consistently highest goal averages with occasional fluctuations. The peak in average goals around 2000, followed by a decline, raised questions about whether factors such as improved player fitness and goalkeeping techniques are influencing this trend. Further exploration of the distribution of goal proportions within continents from the 2000s to the 2020s uncovered interesting dynamics, such as Canada and the United States exchanging goal shares in the Americas and shifts in Russia’s and Kazakhstan’s proportions in Europe. Overall, most countries kept a stable share of goals within their continent, which we take as evidence that they are historically evenly matched and which, along with the analysis above, can provide insights into future football development.
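One way to read "continent-wise average goal proportion" is each country's average goals divided by the total of those averages across its continent, for a given decade. The report does not give the exact formula, so the Python sketch below is an assumption about the definition. The numbers are hypothetical and the function name `continent_shares` is invented. The continent labels follow the report's grouping.

```python
from collections import defaultdict

# Hypothetical average goals for one decade: (country, continent, avg_goals)
decade_averages = [
    ("Canada", "Americas", 1.0),
    ("United States", "Americas", 1.5),
    ("Mexico", "Americas", 1.5),
    ("Russia", "Europe", 1.2),
    ("Kazakhstan", "Europe", 0.8),
]

def continent_shares(rows):
    """Each country's average goals divided by the sum for its continent."""
    continent_totals = defaultdict(float)
    for _, continent, avg_goals in rows:
        continent_totals[continent] += avg_goals
    return {
        country: avg_goals / continent_totals[continent]
        for country, continent, avg_goals in rows
    }

shares = continent_shares(decade_averages)
for country, share in shares.items():
    print(f"{country}: {share:.2f}")
```

Computing these shares once for the 2000s and once for the 2020s, then differencing them per country, would give a numeric counterpart to the visual comparison of the two maps.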
In this report, we investigated three main research questions: 1. Is there a home team advantage? And if so, does this advantage change across time, tournaments, and location? 2. Is there a most dominant team of all time? 3. Are there teams that appear to be historically evenly matched? To answer these questions, we analyzed the results of international football matches from the very first official match in 1872 up to 2023. For the first research question, we found that there is a home team advantage, and used a K-S test to support this finding. The home team advantage has existed across both time and tournaments, and it can potentially be lessened when the game is played at a neutral site. For the second research question, we found the best team in terms of win-rate and tournament placement to be Brazil, with Germany a close second, implying that Brazil has likely been the most dominant team of all time. Finally, for the third research question, we found that most countries have kept stable shares of goals within their continents, which suggests that most countries are historically evenly matched.
The main limitation of our analysis is that, even though we have a lot of data on the results of international football matches, most of the observations are very old. There are not many observations of current results in the dataset, so many of our conclusions could be slightly less accurate when applied to the present day.
There are a couple of areas of our analysis that could be explored in more depth. The first is the degree to which hosting a match at a neutral site affects the home team advantage. We saw that a neutral site potentially lessens the home team advantage, but due to the lack of neutral-site-specific data, we don’t know whether it creates a truly equal playing field, whether it gives the away team an advantage, and so on. Further investigation could also look into what gives a team its home advantage. We found that an advantage exists, but the data does not tell us what creates it. It could be better team facilities, larger crowds, nicer weather, or many other factors. Finally, further analysis could examine which characteristics of a country’s team make it dominant (or not) over the years. We found which teams were the best over time, but not what it was about those teams, such as coaching, funding, or fans, that made them better than others. With more specific and complete football data, further research could be conducted into these topics.