Name: Roberto Clemente


Into the fire with data visualization and R

Goals

The two purposes of this lab are to 1) think about what should be included in visualizations and 2) use R to visualize categorical and discrete 1D data.

Preliminaries

  1. Make sure you have the latest versions of RStudio and R downloaded and working. Please ask any of the instructors if you have any issues with this.

  2. Install the tidyverse set of packages using the following command in the RStudio console.

install.packages("tidyverse")

If prompted to install into your personal library, select ‘yes’. If asked to select a mirror, any should suffice, but the closest location will usually be the quickest.
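If you are not sure whether the tidyverse is already installed, you can check before installing. The following is a minimal sketch; `is_installed` is a hypothetical helper name made up for this example, not a function from any package.

```r
# Return TRUE if a package can be loaded, FALSE otherwise.
# `is_installed` is a made-up helper name for this sketch.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Only install when working interactively (e.g. in the RStudio console)
# and only when the package is missing.
if (interactive() && !is_installed("tidyverse")) {
  install.packages("tidyverse")
}
```

Running this a second time should do nothing, since the package is found on the first check.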

  3. Download this lab and open it in RStudio.

Lab Questions

Vive la France

Were it not for the COVID-19 pandemic, this week would have been the final week of the 2020 French Open, one of the four Grand Slam tournaments in tennis. Common tennis questions people ask are:

  1. Who is the GOAT (greatest of all time)?

  2. Should player \(X\) play a baseline game or approach the net?

  3. How important is holding serve?

  4. Is momentum a thing in tennis?

Question 1. Pick one of the above four questions. Describe, in words, some ideas of how you would answer this question. (E.g. I would look at who has the highest percentage of wins, taking into account the total number of games played to determine who is GOAT).

Answer 1. Discuss your thoughts with the team.

Question 2. The picture below is one of the visualizations displayed on the 2019 French Open website for the men’s final match (D. Thiem vs R. Nadal).

  1. What sort of questions do you think the makers of this visualization wanted you to be able to answer?

  2. Do you think this is an effective visualization?

  3. Who do you think is winning this match? Why?

  4. Name two pros and cons of this visualization.

Answer the same questions on another visualization of the same match:

Answer 2. Write your answer here.

Question 3. You are perhaps wondering why we are starting with a fairly niche sport. It is a good introduction for a variety of reasons.

  1. Although the topic of this summer research program is sports, we want you to be able to connect the ideas learned here to anything dealing with data: whether it be niche sports or astro-statistics.

  2. Tennis has a lot going on and a lot to visualize!

    1. Turn-based play (discrete data). Discrete data can include winners and unforced errors. Also seen in baseball, football, arguably basketball, and curling, for instance.

    2. Continuous play (continuous data). Examples include service speed, length of points (in minutes), ball spin, and shot location. Also seen in soccer, baseball, football, etc.

    3. Influential points/‘Weird’ scoring, e.g. the fact that winning a set 6-0 makes no more difference to the final result than winning 7-6 (11-9) in a tiebreaker. This makes us ask: are some points more important than others? Also seen in football (where different plays can result in different numbers of points), archery, and the decathlon.

    4. Clustering. Which players are more similar to one another and why? Are there features (latent or otherwise) of players that make them more or less similar to one another?

The question is: what sort of questions are you interested in exploring? Give two examples. They do not have to be about sports.

Answer 3. Write your answer here.

America’s pastime: baseball

Barplots

Ron Yurko, co-creator of the acclaimed nflscrapR package, has provided us with the following set of baseball data. Check out the description here.

We will now do some visualization of our own! We first need to learn about the data. Run the following commands in RStudio.

mlb <- read.csv("https://raw.githubusercontent.com/ryurko/CMSACamp/master/data/intro_r/mlb_teams_data.csv?token=ADLVDGC6U6R4BK725WSOENC472ECQ")
str(mlb)
## 'data.frame':    2895 obs. of  22 variables:
##  $ year              : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ league            : chr  NA NA NA NA ...
##  $ team_id           : chr  "BS1" "CH1" "CL1" "FW1" ...
##  $ team_name         : chr  "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
##  $ win_world_series  : chr  NA NA NA NA ...
##  $ final_rank        : int  3 2 8 7 5 1 9 6 4 2 ...
##  $ games_played      : int  31 28 29 19 33 28 25 29 32 58 ...
##  $ wins              : int  20 19 10 7 16 21 4 13 15 35 ...
##  $ losses            : int  10 9 19 12 17 7 21 15 15 19 ...
##  $ runs_scored       : int  401 302 249 137 302 376 231 351 310 617 ...
##  $ hits              : int  426 323 328 178 403 410 274 384 375 753 ...
##  $ at_bats           : int  1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
##  $ walks             : int  60 60 26 33 33 46 38 49 48 29 ...
##  $ strikeouts        : int  19 22 25 9 15 23 30 19 13 28 ...
##  $ homeruns          : int  3 10 7 2 1 9 3 6 6 14 ...
##  $ hit_by_pitch      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ sacrifice_flies   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ runs_allowed      : int  303 241 341 243 313 266 287 362 303 434 ...
##  $ hits_allowed      : int  367 308 346 261 373 329 315 431 371 573 ...
##  $ walks_allowed     : int  42 28 53 21 42 53 34 75 45 63 ...
##  $ strikeouts_against: int  23 22 34 17 22 16 16 12 13 77 ...
##  $ homeruns_allowed  : int  2 6 13 5 7 3 3 4 4 3 ...

Question 4.

  1. How many rows are in this data set? How many columns? (Hint: Look at the help text for the command ?dim.)

  2. What do you think the command head() does? What does tail() do? What does the number argument do?

head(mlb, 2)
##   year league team_id               team_name win_world_series final_rank
## 1 1871   <NA>     BS1    Boston Red Stockings             <NA>          3
## 2 1871   <NA>     CH1 Chicago White Stockings             <NA>          2
##   games_played wins losses runs_scored hits at_bats walks strikeouts homeruns
## 1           31   20     10         401  426    1372    60         19        3
## 2           28   19      9         302  323    1196    60         22       10
##   hit_by_pitch sacrifice_flies runs_allowed hits_allowed walks_allowed
## 1           NA              NA          303          367            42
## 2           NA              NA          241          308            28
##   strikeouts_against homeruns_allowed
## 1                 23                2
## 2                 22                6
tail(mlb, 5)
##      year league team_id            team_name win_world_series final_rank
## 2891 2018     NL     SLN  St. Louis Cardinals                N          3
## 2892 2018     AL     TBA       Tampa Bay Rays                N          3
## 2893 2018     AL     TEX        Texas Rangers                N          5
## 2894 2018     AL     TOR    Toronto Blue Jays                N          4
## 2895 2018     NL     WAS Washington Nationals                N          2
##      games_played wins losses runs_scored hits at_bats walks strikeouts
## 2891          162   88     74         759 1369    5498   525       1380
## 2892          162   90     72         716 1415    5475   540       1388
## 2893          162   67     95         737 1308    5453   555       1484
## 2894          162   73     89         709 1336    5477   499       1387
## 2895          162   82     80         771 1402    5517   631       1289
##      homeruns hit_by_pitch sacrifice_flies runs_allowed hits_allowed
## 2891      205           80              48          691         1354
## 2892      150          101              50          646         1236
## 2893      194           88              34          848         1516
## 2894      217           58              37          832         1476
## 2895      191           59              40          682         1320
##      walks_allowed strikeouts_against homeruns_allowed
## 2891           593               1337              144
## 2892           501               1421              164
## 2893           491               1121              222
## 2894           551               1298              208
## 2895           487               1417              198
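If you want to experiment with these commands before applying them to the MLB data, a small made-up data frame is enough. The sketch below uses only base R; the `toy` data frame and its values are invented for illustration.

```r
# A small made-up data frame (not the MLB data) to practice on.
toy <- data.frame(
  team = c("A", "B", "C", "D", "E", "F"),
  wins = c(10, 7, 12, 3, 9, 5)
)

dim(toy)      # number of rows, then number of columns
nrow(toy)     # number of rows only
ncol(toy)     # number of columns only

head(toy, 2)  # first two rows
tail(toy, 2)  # last two rows
```

Try changing the second argument of `head()` and `tail()`, or leaving it out, and compare the results.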
  3. Let’s do some plotting! Don’t worry too much about the commands now, just run them and see what happens!
library(tidyverse)
ggplot(data = mlb, aes(x = final_rank)) + geom_bar()

Describe the above plot in words. How does it compare to the output of the command below?

tab <- mlb %>% select(final_rank, win_world_series) %>% table
class(tab)
## [1] "table"
as.data.frame(tab)
##    final_rank win_world_series Freq
## 1           1                N  276
## 2           2                N  392
## 3           3                N  385
## 4           4                N  392
## 5           5                N  365
## 6           6                N  258
## 7           7                N  175
## 8           8                N  141
## 9           9                N   17
## 10         10                N   15
## 11         11                N    1
## 12         12                N    1
## 13         13                N    1
## 14          1                Y  112
## 15          2                Y    7
## 16          3                Y    0
## 17          4                Y    0
## 18          5                Y    0
## 19          6                Y    0
## 20          7                Y    0
## 21          8                Y    0
## 22          9                Y    0
## 23         10                Y    0
## 24         11                Y    0
## 25         12                Y    0
## 26         13                Y    0
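To see how a one-way count and a two-way table relate, it can help to work through a tiny example by hand. The sketch below uses made-up vectors (`ranks` and `won` are invented here, not columns of `mlb`) and base R only.

```r
# Made-up final ranks and World Series outcomes for eight team-seasons.
ranks <- c(1, 2, 2, 3, 3, 3, 1, 2)
won   <- c("Y", "N", "N", "N", "N", "N", "N", "N")

one_way <- table(ranks)        # one count per rank
two_way <- table(ranks, won)   # each rank's count split by outcome

one_way
as.data.frame(two_way)

# Adding the N and Y counts within each rank recovers the one-way counts.
rowSums(two_way)
```

The one-way counts are the quantities a bar plot of `ranks` would draw as bar heights.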

Question 5. Let’s look at our home team, the Pittsburgh Pirates.

pirates <- mlb %>% filter(team_id == "PIT")
dim(pirates)
## [1] 132  22
ggplot(data = pirates, aes(x = final_rank)) + geom_bar()

  1. Would you say the Pirates are a successful franchise?

  2. Maybe they were in the past. Let us subset the data to the years 1960 through 1990.

pirates %>% filter(year >= 1960 & year <= 1990) %>% 
  ggplot(aes(x = final_rank )) + geom_bar()

  3. We can also look at percents rather than raw numbers.
pirates %>% filter(year >= 1990 & year <= 2010) %>% 
  ggplot(aes(x = final_rank, y = after_stat(count / sum(count) * 100))) + geom_bar() + ylab("Percentage (%)")

  4. Can you make a bar plot of percents of the final rank for the Pirates between 1991 and 2018? How do the two compare?
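The percentages drawn in a percent bar plot can be cross-checked outside of ggplot. Here is a minimal sketch on made-up ranks (not the Pirates data), using base R's `table()` and `prop.table()`.

```r
# Made-up final ranks for ten seasons.
ranks <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)

counts  <- table(ranks)                 # raw frequency of each rank
percent <- prop.table(counts) * 100     # share of each rank, in percent

percent
sum(percent)  # the percentages add up to 100
```

Each percentage is the count for that rank divided by the total number of seasons, which is the same arithmetic the plotting code performs on the bar counts.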

Radial charts

Question 6. We can also use radial graphs to display 1D categorical data. You are perhaps most familiar with pie charts. Did you know that a pie chart can be made from a bar plot?

## Bar plot
bar <- ggplot(data = mlb, aes(x =  factor(1), fill = factor(final_rank))) + geom_bar() +
  scale_fill_discrete("Final rank") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
bar

## Transform to pie chart
bar + coord_polar(theta = "y")  +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
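With `coord_polar(theta = "y")`, the height of each stacked segment is mapped to an angle, so each slice's angle is proportional to its count. A minimal sketch of that arithmetic, using made-up counts rather than the MLB data:

```r
# Made-up counts for three categories.
counts <- c(first = 5, second = 3, third = 2)

# Each category's share of the full circle, in degrees.
angles <- counts / sum(counts) * 360

angles
sum(angles)  # the slices together make up the full 360 degrees
```

Keep this in mind when comparing the two plots: the bar encodes counts as lengths, while the pie encodes the same counts as angles.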

  1. Compare and contrast the two plots. What is good (or bad) about using the bar plot to compare category sizes? How about the pie chart?

  2. Make a rose diagram. How does this plot compare to the pie chart? To the bar plot?
ggplot(data = mlb, aes(x = factor(final_rank))) + geom_bar() + 
  coord_polar() + xlab("")

  3. Make this plot.
ggplot(data = mlb, aes(x = final_rank)) + geom_bar() + 
  coord_polar(theta = "y")

Can you think of an instance where you would use such a plot?

  4. Consider the quote below. What do you think of radial charts?

“Death to pie charts” - Bill Eddy

Making graphs you want to look at

You may think the above graphs are “ugly.” You may be correct. As statisticians and data scientists, our main focus should be presenting the data, i.e. substance over “flash.” That said, bad aesthetic choices (e.g. titles, text size, data ink, background lines, colors, gradients, stripes, etc.) can certainly hinder the presentation of data, and good aesthetic choices can help.

Question 8.

  1. Describe three ways good aesthetic choices can help increase one’s comprehension of a plot.

  2. Describe two ways bad aesthetic choices can make a plot incomprehensible.

At the very minimum, all your visualizations should contain the following features:

  1. A meaningful title

  2. Meaningful axis titles and legend titles

  3. One or two clear concepts (as opposed to trying to show everything about your data in one plot or being very repetitive in the features shown)

  4. Consideration for data ink (the concept that if something does not meaningfully contribute to a figure, is it really necessary?)
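The checklist above can be sketched on a toy plot. The data and label text below are made up for illustration, and using `theme_minimal()` while dropping the vertical grid lines is only one possible way to trim non-data ink, not the only reasonable choice.

```r
library(ggplot2)

# Made-up data: final ranks for a hypothetical team.
toy <- data.frame(final_rank = c(1, 2, 2, 3, 3, 3, 4, 5, 5))

p <- ggplot(toy, aes(x = factor(final_rank))) +
  geom_bar() +
  labs(
    title = "Final rank of a hypothetical team",   # meaningful title
    subtitle = "Made-up data for illustration",
    x = "Final rank at end of season",             # meaningful axis titles
    y = "Number of seasons"
  ) +
  theme_minimal() +                                # lighter background
  theme(panel.grid.major.x = element_blank())      # vertical grid lines add little to a bar plot
p
```

The plot shows a single concept (how often each rank occurred) and nothing else.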

  3. Let’s fix some of our above plots by first adding better labels.
g <- ggplot(data = pirates, aes(x = final_rank)) + geom_bar() +
  labs(x = "Final rank at end of season",
       y = "Frequency",
       title = "Final rank of the Pittsburgh Pirates",
       subtitle = "1887 - 2018")
g

  4. We can also better distinguish between the ranks if we choose different colors.
g <- ggplot(data = pirates, aes(x = factor(final_rank), fill = factor(final_rank))) + geom_bar() +
  labs(x = "Final rank at end of season",
       y = "Frequency",
       title = "Final rank of the Pittsburgh Pirates",
       subtitle = "1887 - 2018")
g + scale_fill_brewer(type = "qual", guide = "none")

  5. It may make more sense to view the rankings as sequential values (order matters).
g <- ggplot(data = pirates, aes(x = factor(final_rank), fill = factor(final_rank),
                                col = final_rank)) + geom_bar() +
  labs(x = "Final rank at end of season",
       y = "Frequency",
       title = "Final rank of the Pittsburgh Pirates",
       subtitle = "1887 - 2018")
g + scale_fill_brewer(type = "seq", guide = "none")
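To see what the two palette types look like on their own, you can generate the colors directly. This is a rough sketch that assumes the scales package is available (ggplot2 depends on it); the choice of five colors is arbitrary.

```r
library(scales)

# Five colors from the default qualitative and sequential Brewer palettes.
qual_cols <- brewer_pal(type = "qual")(5)
seq_cols  <- brewer_pal(type = "seq")(5)

qual_cols
seq_cols

# Draw the sequential colors as swatches to see the light-to-dark ordering.
show_col(seq_cols)
```

Qualitative palettes use distinct hues with no implied order, while sequential palettes vary in lightness, which suits ordered values such as ranks. Note that Brewer palettes contain a limited number of colors, so a variable with many levels may not get a color for every level.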

We will keep learning about colors over the next few weeks!

Next time

  • Chi square test visualizations

  • 2D data

Bonus exercises.

Where the bonus is your self-enlightenment.

  1. Repeat the above exercises with your favorite team instead of the Pittsburgh Pirates. Or failing that, the New York Yankees.

  2. Add meaningful titles and axis labels to the graphs you made in this lab.

  3. Change the colors in the radial graphs.