• 36-315 Final Project: Analysis of an F1 Dataset
  • Research Question 1
  • Research Question 2
  • Research Question 3
  • Research Question 4
  • Conclusion

36-315 Final Project: Analysis of an F1 Dataset

Dataset Introduction & Research Questions:

There is a contentious debate amongst Formula 1 fans over how different factors impact race results. Within motorsport, this question is much more difficult to answer than in most sports: dozens upon dozens of factors make up a race, from the obvious, like the driver and their personal performance, to the constructor’s ability to develop a fast race car. While fans debate this every race weekend, teams also try to compile and understand this data; to them, understanding which factors impact race results can mean a difference of millions of dollars in prize money. Enter the Formula 1 dataset. This dataset encompasses modern and historical data, going back to the birth of the sport. With thousands of data entries and nearly 30 variables, there are a lot of performance metrics that can be derived.

The following is a list of the variables that are available and that will be used in our analysis:

  • “grid”
  • “position”
  • “points”
  • “laps”
  • “time”
  • “fastestLap”
  • “rank”
  • “fastestLapTime”
  • “fastestLapSpeed”
  • “year”
  • “round”
  • “trackName”
  • “date”
  • “raceStartTime”
  • “driverCode”
  • “forename”
  • “surname”
  • “dob”
  • “driverNationality”
  • “team”
  • “constructorNationality”
  • “qualifyingResult”
  • “q1”
  • “q2”
  • “q3”
  • “track”
  • “city”
  • “country”
  • “lat”
  • “lng”

Our dataset features different statistics drawn from Formula 1 racing. Each row in the dataset represents an individual’s performance in a particular race. Thus, there are multiple rows per race with each row representing a different individual. The research questions we will focus on are:

  1. Does the nationality of the driver predict success?
    • Word cloud of nationalities
    • Boxplot of fastestLapSpeed split by nationality
  2. Does driver performance impact race results more than vehicle performance?
    • Word cloud for which drivers have completed the most wins
    • Average point differential between drivers
  3. How has performance changed across the past 7 decades?
    • Time series plot of different performance statistics, with a moving average
    • Track vs lap time
    • Lap vs Fastest Lap Time
  4. How does race location impact driver performance?
    • Map graph of density of locations (how many races occur at each location)
    • Choropleth of average fastest lap time over different countries
    • Country EDA scatterplot


Research Question 1

Does the nationality of the driver predict success?

By assessing a word cloud of nationalities present in the dataset, we can see that the most prominent are British, French, Italian, and German. American and Japanese are also relatively frequent. The other nationalities are all of relatively similar sizes, so we can infer that they have frequencies that do not differ too much from each other.

We can also look at a box plot of fastestLapSpeed separated by nationality. It appears that British nationality is associated with the lowest fastestLapSpeed. However, the medians and interquartile ranges do not appear to differ much across nationalities, so we will hypothesize that nationality is not associated with fastest lap speed.

We can use Bartlett’s test to check whether the variances of fastest lap speed are equal across nationalities, and then a one-way analysis of means to test the null hypothesis that there is no difference in the mean fastest lap speed values across the different nationalities.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  fastestLapSpeed by driverNationality
## Bartlett's K-squared = 22.955, df = 29, p-value = 0.7786
## 
##  One-way analysis of means
## 
## data:  fastestLapSpeed and driverNationality
## F = 3.3225, num df = 29, denom df = 7176, p-value = 4.4e-09

Bartlett’s test of equal variances yields a p-value of 0.7786, meaning we do not have sufficient evidence to reject the null hypothesis and thus cannot conclude that there is a difference in the variances of fastest lap speed values over the different nationalities. However, the one-way analysis of means, assuming equal variances, yields a p-value of 4.4e-9. This means we do have sufficient evidence to reject the null hypothesis that the means are equal across all the groups and can conclude that at least one of the mean fastest lap speeds is different.
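To make the two test statistics above more concrete, here is a minimal sketch of how Bartlett’s K-squared and the classical (equal-variance) one-way F statistic can be computed by hand. It is written in Python with only the standard library, not in the R used for the output above, and the three groups of speeds are made-up toy values, not rows from the F1 dataset. The function names are our own, and p-values are left out because they need chi-squared and F distribution functions that the standard library does not provide.

```python
from math import log
from statistics import mean, variance


def bartlett_k_squared(groups):
    """Bartlett's K-squared statistic for equal variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    # Pooled variance, weighting each group by its degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * log(pooled) - sum(
        (n - 1) * log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction


def one_way_f(groups):
    """Classical one-way F statistic, assuming equal variances across groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy fastest-lap speeds for three hypothetical nationalities (illustrative only).
toy_groups = {
    "A": [200.0, 201.0, 202.0],
    "B": [201.0, 202.0, 203.0],
    "C": [204.0, 205.0, 206.0],
}

k2 = bartlett_k_squared(list(toy_groups.values()))
f_stat, df1, df2 = one_way_f(list(toy_groups.values()))
print(f"K-squared = {k2:.3f}, F = {f_stat:.3f}, num df = {df1}, denom df = {df2}")
```

In the toy data every group has the same spread, so K-squared is zero, while the shifted group means give a large F. The same pattern appears in the real output: similar variances (small K-squared relative to its 29 degrees of freedom) but different means.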

Research Question 2

Does driver performance impact race results more than vehicle performance?

We constructed a side-by-side word cloud of race winners by name and race winners by team. This is a good foundation, as it gives us a sense of who the major players are in Formula 1. Notice that basically all of the big-name winners in F1 drive for one of the big-name teams. This is to be expected to some degree, of course, since when the driver wins, the team also wins. However, the fact that there are fringe names and teams in this word cloud shows that drivers for smaller teams do sometimes win races.

Now let’s look within teams and see if there is a difference between the drivers. To do this, we looked at data from 2005 onwards, as before this time there were no standard driver codes to identify drivers, and teams did not always have 2 drivers per season. We took each driver’s average points scored per race over the course of each season, and then took the difference between those averages within each team. We are left with a time-series plot using running averages of 3 seasons, colored by team. We can see that these point differences can be quite stark. In the most recent seasons, we see differences of over 5 points. This may not seem like a lot, but it is averaged over every race in the season, which indicates that some drivers outperform their teammates by 5 points every race! This graph also shows us that the difference can be, and usually is, quite low, with the majority hovering around 1-2 points. This indicates that, most of the time, drivers on the same team perform (in terms of points) pretty much the same. Based on this graph, superb drivers can make a difference, but good drivers generally do not.
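The teammate comparison described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used to produce the plot: the rows are invented, "Team X", "AAA", and "BBB" are hypothetical names, and for teams with more than two drivers in a season the gap is taken (as an assumption) between the highest and lowest driver average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (year, team, driverCode, points scored in one race).
results = [
    (2019, "Team X", "AAA", 25), (2019, "Team X", "AAA", 18),
    (2019, "Team X", "BBB", 10), (2019, "Team X", "BBB", 8),
    (2020, "Team X", "AAA", 15), (2020, "Team X", "AAA", 15),
    (2020, "Team X", "BBB", 12), (2020, "Team X", "BBB", 12),
    (2021, "Team X", "AAA", 10), (2021, "Team X", "AAA", 10),
    (2021, "Team X", "BBB", 10), (2021, "Team X", "BBB", 4),
]


def teammate_gaps(rows):
    """Map (team, year) to the gap in average points per race between teammates."""
    per_driver = defaultdict(list)
    for year, team, driver, points in rows:
        per_driver[(team, year, driver)].append(points)
    averages = defaultdict(list)
    for (team, year, _driver), pts in per_driver.items():
        averages[(team, year)].append(mean(pts))
    # Assumption: with more than two drivers, compare the best and worst averages.
    return {key: max(a) - min(a) for key, a in averages.items() if len(a) >= 2}


def running_average(values, window=3):
    """Trailing moving average; early entries use as many seasons as exist."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


gaps = teammate_gaps(results)
season_gaps = [gaps[("Team X", year)] for year in (2019, 2020, 2021)]
print(season_gaps)                   # per-season gaps
print(running_average(season_gaps))  # 3-season running average of the gaps
```

A trailing window is used here so that the first seasons still get a value. A centered window would be an equally reasonable choice and would shift the smoothed line slightly.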

Research Question 3

How has performance changed across the past 7 decades?

In order to answer questions about how performance has changed over the modern era of F1, we can use a series of time plots. We can define the modern era of F1 as the shift to a more data-driven period. This coincides with a major regulation change: drivers have a set number of tires they can use each race weekend. Because of these new regulations, teams relied more on data about tracks, information about track conditions, and vehicle data in order to make better strategic decisions about their drivers out on the track. We can use this recorded data to form a basic understanding of how vehicle performance has improved. For our time series plots, we map the lap number to time on the x-axis. In F1, each race produces 20 fastest laps (one per driver), all set on the same day, so we use the lap number within each race to order them; this gives our time axis meaning. As such, our data covers the last 2308 fastest laps, across the past 16 seasons.

We can first look at vehicle performance through the lens of results on the racetrack. Looking at the time series plots for each racetrack, we see an interesting trend form. For nearly every track, vehicle speeds tend to be at a peak at the start of the series. This is followed by a steady decline until about the 1200th lap, where we see an increase in lap speed. From there, lap speeds gradually increase until the 2000th lap, where we see a small decline.

[Figure: fastest achieved speed (y-axis, roughly 150 to 250) against a time index based on lap number (x-axis, 0 to 2000), with one line per Grand Prix track.]

We also see a difference in the overall trend of lap speeds based on tracks. For higher-speed tracks, the fastest achievable speeds have not changed much over time. At the Italian Grand Prix in 2006, cars were able to hit 250.635 miles an hour. In 2021, their top speed was 250.135 miles per hour, and the fastest lap speed ever recorded at that track was 255.014 mph. The slower tracks tell a different story. The Singapore Grand Prix first appeared on the calendar in 2008, when the cars were able to hit 169 mph; at the most recent Grand Prix, that speed had increased to 178 mph. This trend is similar for other slower tracks. It seems that, while top speeds at high-speed tracks have not changed much, the average speed at lower-speed tracks has increased significantly.

Without context, this may be confusing, since we would expect vehicles to increase in performance year after year. With some context, however, it does make sense. In the 2006 season, new engine regulations were introduced, switching from V10 engines to more efficient V8 engines. The regulations were further modified in 2009 with the introduction of “KERS”, which allowed cars to regenerate and store energy in batteries to be deployed on track. In 2014, the regulations changed once more to introduce “turbo hybrid” engines, which were more efficient and much more powerful than the V8 engines. With each of these changes, the more efficient power units allowed the vehicles to perform better at lower speeds. Additionally, as aerodynamic performance has improved, cars are able to take corners faster and hold more speed through slower corners.
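To put the two examples quoted above on the same scale, the short Python sketch below computes the absolute and percentage change in speed between the first and latest seasons mentioned in the text. The four speed values come directly from the paragraph above. The helper name `speed_change` is our own, and treating the Singapore figures as exactly 169 and 178 is an assumption, since the text gives them only to the nearest whole number.

```python
def speed_change(first, latest):
    """Return (absolute change, percentage change) between two speeds."""
    delta = latest - first
    return delta, 100 * delta / first


# (first quoted season, latest quoted season), values as quoted in the text.
quoted_speeds = {
    "Italian Grand Prix": (250.635, 250.135),  # 2006 vs 2021
    "Singapore Grand Prix": (169.0, 178.0),    # 2008 vs most recent race
}

for track, (first, latest) in quoted_speeds.items():
    delta, pct = speed_change(first, latest)
    print(f"{track}: {delta:+.1f} ({pct:+.1f}%)")
# Italian Grand Prix: -0.5 (-0.2%)
# Singapore Grand Prix: +9.0 (+5.3%)
```

On these numbers the high-speed track is essentially flat, while the slower track gains about five percent, which is the contrast the paragraph describes.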

We can further understand the trend of vehicle performance by looking at a moving average time series plot. This is helpful because it allows us to contextualize the analysis of individual tracks. While looking at individual tracks allows us to see how performance increases are split among the different locations, it doesn’t tell us much about how F1 as a whole has changed.

Here, we see that the moving average line reflects a similar trend. Generally, vehicles are slowly getting faster. With continually changing regulations, the max speed of an F1 car seems to have been reached, peaking at just over 250 miles an hour. This aligns with the two other time series plots: for the faster tracks, speeds seem to have reached a maximum, whereas for the slower tracks, as modern regulations make cars faster in lower-speed corners, speeds have increased over the time series as a whole. Overall, we can see that vehicle performance has improved across the history of motorsport, just in different ways. While the top speed of the vehicles has stayed fairly constant, the average speed has come up regardless, with cars being able to go faster through lower-speed corners. We also saw that cars achieve their fastest speeds later in the race, when they are lighter from having burned fuel and the track conditions are typically better.

This is quite an interesting plot. Even over the long history of Formula One, there is a high concentration of fastest laps around lap 50, distributed evenly throughout the field. This is because, as the race goes on, cars get lighter as they burn off fuel, and the track surface becomes grippier as cars lay down tire rubber lap after lap. As the track evolves and the cars lighten, fastest laps start piling in. There are no data points beyond position 24 because there were only 24 cars on the grid at once, so a driver could not finish any lower. The data also shows that the point in the race at which cars set their fastest lap has not moved over the years: regulations and the competitive order have evolved, but tracks and race pace have stayed around the same.

Research Question 4

How does race location impact driver performance?

Here, we are looking at all of the races that have happened within Europe on a map, versus the globe. When focusing on Europe, we see that there are a lot of older races, centered around Italy and the United Kingdom (with Silverstone, Monaco, and Monza being some of the oldest tracks in Europe). This makes sense with our dataset, since F1 started primarily in Europe. When we expand the map to the rest of the world, we see quite a few newer venues, notably in the Americas, where several locations have hosted fewer than a dozen races. The newest of these is Miami, which was added to the F1 calendar early this year. This plot gives us an interesting look into the history of the sport, because we can see how it has spread globally and where it has been most popular. In Asia, for example, we see that while a number of races took place there, they were mostly short-lived.

Conclusion

Through the exploration of our research questions, using plots, graphics, and statistical analysis, we have found some interesting patterns within the F1 dataset. When looking at the nationality of drivers, we saw that British, Italian, French, and German drivers were the most common throughout F1’s history. It would also seem that a driver’s performance can be associated with their nationality, based on statistical significance testing. Race location also appears to influence a driver’s performance. When looking across tracks globally, we see a statistically significant difference in a driver’s performance depending on which track they are racing on. In countries like Saudi Arabia and Italy, drivers tend to be able to get more performance out of their cars, whereas at tracks like Monaco and Mexico City, drivers aren’t able to extract as much from their vehicles. When looking at driver performance, we saw that it did tend to have an impact on race results. While the most successful drivers often came from the best teams, we also saw that, when compared with their teammates, there was not as much of a difference as initially expected. While some superb drivers are able to produce significantly higher results, the difference between teammates’ performances is generally low, with the disparities being somewhere around 1-2 points a race. We also saw that, while vehicle performance has been increasing over the past two decades, most of the improvement has come in the cars’ lower-speed abilities, with much larger gains on lower-speed tracks than on higher-speed tracks. This was particularly apparent in the top speeds achieved at high-speed Grands Prix, which either improved by less than 3 miles an hour or didn’t improve at all.

Future exploration:

While this dataset was vast, there are a few key limiting factors that should be considered. The first is the gaps within the data. Because this dataset spans over 7 decades, a lot of the data comes from a time period when data analysis was not heavily used. As such, historical data was much more sparse than modern data. While some historical data does exist (as manufacturers did collect information about their vehicles, even in some of the earliest F1 races), it is not publicly available and could not be used in this exploration. Additionally, there were limitations in certain columns. For example, the total race time column had a lot of discrepancies. For any car that did not finish the race (either because it was a lap down or too far behind), the time was marked as “NA”. This meant that the column could provide only minimal information. If the data were cleaned and the missing race completion times were filled in (e.g. for competitors a lap down, adding the amount of time they were behind to the winner’s time would give an estimated completion time), much more in-depth analysis could be done on how different factors impact race results through the total time it takes a car to finish the race.
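The cleaning step suggested above could look roughly like the following Python sketch. It is purely illustrative: the rows are invented, `gap_to_winner` is a hypothetical field (the dataset described here does not list such a column, so the gap would have to come from another source), and missing times are represented as `None` rather than “NA”.

```python
def fill_race_times(rows):
    """Fill missing race times as winner's time + gap, where a gap is known.

    `rows` holds the results of a single race; times and gaps are in seconds.
    """
    winner_time = min(r["time"] for r in rows if r["time"] is not None)
    filled = []
    for r in rows:
        t = r["time"]
        if t is None and r.get("gap_to_winner") is not None:
            t = winner_time + r["gap_to_winner"]  # estimated completion time
        filled.append({**r, "time": t})
    return filled


# Invented single-race example (driver codes and values are hypothetical).
race = [
    {"driverCode": "AAA", "time": 5400.0, "gap_to_winner": 0.0},
    {"driverCode": "BBB", "time": 5412.5, "gap_to_winner": 12.5},
    {"driverCode": "CCC", "time": None, "gap_to_winner": 95.0},  # a lap down
    {"driverCode": "DDD", "time": None, "gap_to_winner": None},  # no usable gap
]

for row in fill_race_times(race):
    print(row["driverCode"], row["time"])
```

Rows with no known gap are deliberately left missing rather than guessed, so any later analysis can still tell estimated times apart from unknown ones.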

And, while we were able to identify a few different predictors of driver and vehicle performance, there are many factors that were not considered: aerodynamic components on the vehicle, horsepower, weight, drivers’ past results (e.g. F2 champion vs F3 champion), whether practice at a certain track improves driver performance in a meaningful way, and many other potential considerations. All of these could be used for future research, some of which could be explored with further analysis of this dataset, and many with additional data that was not available within it.