Serving Up Visualizations: Examining Tennis Data

ATP Dataset

Our dataset, which comes from the ATP, gives us information about men’s singles professional tennis matches from 2016. An abundance of variables are recorded, including characteristics about the players (height, age, handedness) and about the match itself (who won and who lost, how long it was, how many aces were served). Each row in the data represents one match, with each column representing a variable. In total there are 3,004 rows (or matches) and 49 columns (or variables).

Questions to Explore

Our goal in this presentation is to ascertain how the different components of tennis affect the competitiveness and outcomes of matches. To do this, we pose three main questions:

What factors (that we typically don’t think about) affect serves?
How does the court surface, something completely out of a player’s control, affect the outcomes of matches?
What key metrics are important for analyzing the competitiveness of tennis matches?

## # A tibble: 6 x 49
##   tourney_id tourney_name surface draw_size tourney_level tourney_date match_num
##   <chr>      <chr>        <chr>       <dbl> <chr>                <dbl>     <dbl>
## 1 2016-M020  Brisbane     Hard           32 A                 20160104       300
## 2 2016-M020  Brisbane     Hard           32 A                 20160104       299
## 3 2016-M020  Brisbane     Hard           32 A                 20160104       298
## 4 2016-M020  Brisbane     Hard           32 A                 20160104       297
## 5 2016-M020  Brisbane     Hard           32 A                 20160104       296
## 6 2016-M020  Brisbane     Hard           32 A                 20160104       295
## # … with 42 more variables: winner_id <dbl>, winner_seed <dbl>,
## #   winner_entry <chr>, winner_name <chr>, winner_hand <chr>, winner_ht <dbl>,
## #   winner_ioc <chr>, winner_age <dbl>, winner_rank <dbl>,
## #   winner_rank_points <dbl>, loser_id <dbl>, loser_seed <dbl>,
## #   loser_entry <chr>, loser_name <chr>, loser_hand <chr>, loser_ht <dbl>,
## #   loser_ioc <chr>, loser_age <dbl>, loser_rank <dbl>,
## #   loser_rank_points <dbl>, score <chr>, best_of <dbl>, round <chr>,
## #   minutes <dbl>, w_ace <dbl>, w_df <dbl>, w_svpt <dbl>, w_1stIn <dbl>,
## #   w_1stWon <dbl>, w_2ndWon <dbl>, w_SvGms <dbl>, w_bpSaved <dbl>,
## #   w_bpFaced <dbl>, l_ace <dbl>, l_df <dbl>, l_svpt <dbl>, l_1stIn <dbl>,
## #   l_1stWon <dbl>, l_2ndWon <dbl>, l_SvGms <dbl>, l_bpSaved <dbl>,
## #   l_bpFaced <dbl>

The graph above shows a clear positive correlation between a player’s height and the number of aces they served, regardless of surface. This makes sense because taller players serve from a higher starting point and can put more power and speed into their serves, which makes them harder to return and results in more aces. Additionally, when we facet by surface we see that, in general, grass courts had more aces than hard courts, which in turn had more aces than clay courts. This suggests it is easier to serve an ace on grass and harder to serve an ace on clay, regardless of height. This makes intuitive sense as well because balls do not bounce as high on grass, which makes returning a serve harder and therefore serving an ace easier. The opposite is true for clay courts, where balls tend to bounce higher.
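The strength of the height and ace relationship described above could be quantified per surface with a Pearson correlation. The following is a minimal sketch in plain Python, not the code behind the original graph; it assumes the winner columns `winner_ht` and `w_ace` from the table above are the ones plotted, and the helper names and the toy rows are hypothetical, made up purely for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def height_ace_correlation_by_surface(matches):
    """Group (winner_ht, w_ace) pairs by surface and correlate each group."""
    groups = defaultdict(lambda: ([], []))
    for m in matches:
        # Skip rows with a missing height or ace count.
        if m["winner_ht"] is None or m["w_ace"] is None:
            continue
        heights, aces = groups[m["surface"]]
        heights.append(m["winner_ht"])
        aces.append(m["w_ace"])
    return {s: pearson(h, a) for s, (h, a) in groups.items()}


# Made-up rows for illustration only; these are NOT values from the ATP data.
toy_matches = [
    {"surface": "Grass", "winner_ht": 180, "w_ace": 5},
    {"surface": "Grass", "winner_ht": 190, "w_ace": 10},
    {"surface": "Grass", "winner_ht": 200, "w_ace": 15},
    {"surface": "Clay", "winner_ht": 180, "w_ace": 2},
    {"surface": "Clay", "winner_ht": 190, "w_ace": 4},
    {"surface": "Clay", "winner_ht": 200, "w_ace": None},
]

correlations = height_ace_correlation_by_surface(toy_matches)
```

A coefficient near 1 for every surface would back up the visual impression that the relationship holds regardless of surface, while a coefficient near 0 would indicate no linear relationship.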

The graph above shows the number of double faults by player height, faceted by court surface. There seems to be no correlation between height and double faults, meaning taller and shorter players make mistakes while serving at roughly the same rate, regardless of surface. This suggests taller players serve more aces simply because they are able to put more power or speed into their serves, not because they are more accurate or make fewer mistakes. Additionally, all surfaces result in roughly equal rates of double faults. Since a double fault is decided by where the serve lands (in the net or outside the service box) rather than by how the ball bounces afterward, this supports our belief that the court itself, specifically how the ball bounces on that surface, is directly affecting how often aces occur.

Based on the graph above, we can see a roughly linear, positive trend between the number of first serve points won and the number of aces for match winners in 2016. It is also interesting to note that in many matches the winner did not hit any aces, while in some matches the winner hit over 40 aces, which is a large number.

Additionally, we can see that, generally speaking, clay court matches have fewer aces and fewer first serve points won than grass matches. This makes sense since the clay court makes the ball travel slower, which leads to more break points won and fewer aces overall. On the flip side, the grass court makes the ball travel the fastest out of all the courts, and we see more aces and more first serve points won as a result. The hard court falls somewhere in the middle, and we can see a wide variety of winners. It is also important to note that this variability in the data shows that there is not “one way” to win a match. Some players have large numbers of aces and first serve points won and others have very few. In general, most players that won ATP matches in 2016 hit around 10 aces and won around 40 first serve points.

Based on the graph above, we can see that winners of matches tend to win a higher percentage of 1st serve points than losers. This makes a lot of sense because winning a higher percentage of points will lead to a higher chance of victory. Every winner won at least 50% of 1st serve points and some even managed to win 100% of their 1st serve points. For losers, the lowest 1st serve win percentage was around 20% and the highest win percentage was 92.5%.

This graph also supports many of the same conclusions as previous graphs about different court surfaces. For both winners and losers, grass courts led to winning more 1st serve points, clay courts led to winning fewer 1st serve points, and hard courts were somewhere in between. As stated before, this is likely because grass courts make returning serves more difficult while clay courts make returning serves easier. It is also important to note that the minimum and maximum for both winners and losers occur on hard courts. This is probably because hard courts are the most common surface in tennis, so there are more matches played on hard courts.
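The 1st serve win percentage discussed above can be derived from the serve columns in the table. The sketch below assumes the percentage is first serve points won divided by first serves in (`w_1stWon / w_1stIn` for winners, `l_1stWon / l_1stIn` for losers); that formula, the helper names, and the toy match are illustrative assumptions rather than the original computation.

```python
def first_serve_win_pct(first_won, first_in):
    """Percentage of first-serve points won, or None if it is undefined."""
    # Undefined when a count is missing or no first serves went in.
    if first_won is None or not first_in:
        return None
    return 100.0 * first_won / first_in


def serve_pcts(match):
    """First-serve win percentage for the winner and loser of one match."""
    return {
        "winner": first_serve_win_pct(match["w_1stWon"], match["w_1stIn"]),
        "loser": first_serve_win_pct(match["l_1stWon"], match["l_1stIn"]),
    }


# Made-up match for illustration only; NOT a row from the ATP data.
toy_match = {"w_1stIn": 50, "w_1stWon": 40, "l_1stIn": 40, "l_1stWon": 24}
pcts = serve_pcts(toy_match)
```

Returning `None` for rows with missing serve statistics keeps them out of the minimum and maximum instead of silently treating them as 0%.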

These plots show us the different key metrics in a tennis match and the average of each metric for losers and winners. We decided to divide these metrics by the number of minutes, since the length of a match can strongly affect the number of aces, break points faced, first serve points won, and so on. We can see that one of the main metrics on which losers and winners differ is break points faced. This makes sense, as breaking someone’s serve in tennis is extremely critical and even one break point won can change the course of the match. Other metrics such as first serve win percentage and second serve win percentage also seem to be very important for distinguishing between winners and losers. Additionally, winners appear to have noticeably more aces than losers, and losers have more double faults than winners. Interestingly, first serve in percentage seems to be about the same for both winners and losers.
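The per-minute normalization described above can be sketched as follows. This is an illustrative reconstruction, not the original code: it handles only count metrics (such as aces, double faults, or break points faced), assumes the `w_`/`l_` column prefixes shown in the table, and uses a hypothetical helper name and made-up rows.

```python
from statistics import mean


def per_minute_averages(matches, metric):
    """Average per-minute rate of one count metric for winners and losers.

    `metric` is a column suffix such as "ace", "df" or "bpFaced"; the
    winner column is "w_" + metric and the loser column is "l_" + metric.
    """
    winner_rates, loser_rates = [], []
    for m in matches:
        minutes = m.get("minutes")
        w, l = m.get("w_" + metric), m.get("l_" + metric)
        # Skip matches with a missing or zero duration, or a missing count.
        if not minutes or w is None or l is None:
            continue
        winner_rates.append(w / minutes)
        loser_rates.append(l / minutes)
    # mean() raises StatisticsError if no usable matches remain.
    return {"winner": mean(winner_rates), "loser": mean(loser_rates)}


# Made-up rows for illustration only; NOT values from the ATP data.
toy_matches = [
    {"minutes": 100, "w_bpFaced": 2, "l_bpFaced": 8},
    {"minutes": 50, "w_bpFaced": 1, "l_bpFaced": 3},
    {"minutes": None, "w_bpFaced": 4, "l_bpFaced": 9},  # skipped: no duration
]

bp_rates = per_minute_averages(toy_matches, "bpFaced")
```

Dividing by duration before averaging keeps long five-set matches from dominating the comparison simply because more points were played.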

The number of aces appears to differ between winners and losers, and also across court surfaces. For example, on clay the number of aces is about the same for winners and losers, but on grass winners seem to hit more aces than losers. This is probably because the grass surface plays much faster and offers an advantage to players with faster serves. Winners also consistently win more first serve points across all of the court surfaces. For second serve points won, however, winners have a higher second serve win percentage on clay, while on grass and hard courts the second serve win percentage is about the same for winners and losers. Double faults and break points faced both seem to be consistent across the court surfaces, in that losers consistently face more break points and have more double faults than winners.

The above graph is intended to explore the general idea of competition. Our initial hypothesis when just thinking about tennis was that as you progress further into the tournament the matches should get more competitive, and if they get more competitive then it follows that the duration of matches should be longer.

This graph breaks down the distribution of match duration as rounds progress, and it is colored by surface to see whether certain surfaces tend to have longer matches. What we actually find from this graph, though, is that a later round doesn’t necessarily lead to longer matches per se. Almost every round has an average duration of about 100 minutes, which implies that the competitiveness of the average match doesn’t change as the rounds go on. What is interesting, however, is that almost every graph has a somewhat different shape. R128 follows a roughly normal distribution, whereas R64, R32, R16, and QF are all skewed to the right. It also appears that every graph is unimodal, and the surface doesn’t appear to influence the duration of the match at all.

This graph was made with research question 3 in mind, which concerns the metrics that could determine competitiveness.
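The comparison of match duration across rounds can also be checked numerically. The sketch below groups the `minutes` column by `round` and reports the mean and median for each round; the function name and the toy rows are hypothetical, and this is only one possible way to summarize what the graph shows.

```python
from collections import defaultdict
from statistics import mean, median


def duration_by_round(matches):
    """Count, mean and median of match duration (minutes) for each round."""
    by_round = defaultdict(list)
    for m in matches:
        # Skip matches whose duration was not recorded.
        if m["minutes"] is not None:
            by_round[m["round"]].append(m["minutes"])
    return {
        r: {"n": len(v), "mean": mean(v), "median": median(v)}
        for r, v in by_round.items()
    }


# Made-up rows for illustration only; NOT values from the ATP data.
toy_matches = [
    {"round": "R32", "minutes": 90},
    {"round": "R32", "minutes": 110},
    {"round": "R32", "minutes": 130},
    {"round": "F", "minutes": 100},
    {"round": "F", "minutes": 140},
    {"round": "F", "minutes": None},
]

summary = duration_by_round(toy_matches)
```

A mean noticeably above the median within a round is one simple numerical hint of the right skew described for several rounds.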

The idea behind this graph is simple, but it helps to answer the fascinating question of whether age difference is a major factor in the outcome of matches. Nearly every graph is bimodal and roughly symmetric; the one exception is R16, which isn’t bimodal. The symmetry tells us that younger players don’t necessarily have an advantage over older players, even when there is an age difference of ±10 years. This is surprising given the physical nature of tennis.

This answers the research question about factors that influence competitiveness.
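One way to put a number on the symmetry of the age-difference distributions is to compute the share of matches won by the younger player. The sketch below assumes the age difference is `winner_age - loser_age` (so negative values mean the younger player won); that sign convention, the function names, and the toy rows are assumptions made for illustration.

```python
def age_differences(matches):
    """winner_age - loser_age for every match where both ages are known."""
    return [
        m["winner_age"] - m["loser_age"]
        for m in matches
        if m["winner_age"] is not None and m["loser_age"] is not None
    ]


def share_won_by_younger(diffs):
    """Fraction of matches won by the younger player, ignoring equal ages."""
    decided = [d for d in diffs if d != 0]
    if not decided:
        return None
    return sum(1 for d in decided if d < 0) / len(decided)


# Made-up rows for illustration only; NOT values from the ATP data.
toy_matches = [
    {"winner_age": 25.0, "loser_age": 30.0},
    {"winner_age": 31.5, "loser_age": 24.5},
    {"winner_age": 28.0, "loser_age": 22.0},
    {"winner_age": 21.0, "loser_age": 29.0},
    {"winner_age": None, "loser_age": 27.0},  # skipped: missing age
]

diffs = age_differences(toy_matches)
younger_share = share_won_by_younger(diffs)
```

A share close to 0.5 would be consistent with a roughly symmetric distribution, that is, with neither younger nor older players having a clear edge.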

Conclusions

After doing extensive data analysis and plotting various graphs, there are a few conclusions that we can draw. The first is that court surface clearly has a tangible impact on certain facets of the game, such as aces and first serve win percentage. The second is that physical characteristics give a more limited benefit than one might expect: although taller players serve more aces, height shows no relationship with double faults, and younger players do not win noticeably more often than older ones. The final conclusion is that there are certain things that winners do better than losers. They win more first serve points, hit more aces, commit fewer double faults, and face fewer break points, which points to tangible things players can work on to improve their games. All in all, our project was very interesting, and we feel we found several useful insights.

36-315 Final Project

Abhinav Maddineni, Samarth Gowda, Eric Liu, Daniel Liang
