Name: Roberto Clemente
The two purposes of this lab are to 1) think about what should be included in visualizations and 2) use R
to visualize categorical and discrete 1D data.
Make sure you have the latest versions of RStudio and R downloaded and working. Please ask one of the instructors if you have any issues with this.
Install the tidyverse set of packages using the following command in the RStudio console.
install.packages("tidyverse")
If asked whether to install into your personal library, select ‘yes’. If asked to select a mirror, any should suffice, but the closest location will usually be the quickest.
Had it not been for the COVID-19 pandemic, this week would have been the final week of the 2020 French Open, one of the four Grand Slam tournaments in tennis. Common tennis questions people ask are:
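Before installing, it can help to check whether the package is already available. The following is a minimal sketch using only base R; the variable name `has_tidyverse` is just for illustration.

```r
# TRUE if the tidyverse meta-package is installed, FALSE otherwise
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("tidyverse is installed; load it with library(tidyverse)")
} else {
  message("tidyverse not found; run install.packages(\"tidyverse\")")
}
```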
Who is the GOAT (greatest of all time)?
Should player \(X\) play a baseline game or approach the net?
How important is holding serve?
Is momentum a thing in tennis?
Question 1. Pick one of the above four questions. Describe, in words, some ideas for how you would answer it. (E.g. I would look at who has the highest percentage of wins, taking into account the total number of games played, to determine who is the GOAT.)
Answer 1. Discuss your thoughts with the team.
Question 2. The picture below is one of the visualizations displayed on the 2019 French Open website for the men’s final match (D. Thiem vs R. Nadal).
What sort of questions do you think the makers of this visualization wanted you to be able to answer?
Do you think this is an effective visualization?
Who do you think is winning this match? Why?
Name two pros and cons of this visualization.
Answer the same questions on another visualization of the same match:
Answer 2. Write your answer here.
Question 3. You are perhaps wondering why we are starting with a fairly niche sport. It is, however, a good introduction for a variety of reasons.
Although the topic of this summer research program is sports, we want you to be able to connect the ideas learned here to anything dealing with data: whether it be niche sports or astro-statistics.
Tennis has a lot going on and a lot to visualize!
Turn-based play (discrete data). Discrete data can include winners and unforced errors. Also seen in baseball, football, arguably basketball, and curling, for instance.
Continuous play (continuous data). Examples include service speed, length of points (in minutes), ball spin, and shot location. Also seen in soccer, baseball, football, etc.
Influential points and ‘weird’ scoring, e.g. the fact that winning a set 6-0 counts the same toward the final result as winning it 7-6 (11-9) in a tiebreaker. This makes us ask: are some points more important than others? Also seen in football (where different plays can result in different numbers of points), archery, and the decathlon.
Clustering. Which players are more similar to one another and why? Are there features (latent or otherwise) of players that make them more or less similar to one another?
So, what sort of questions are you interested in exploring? Give two examples. They do not have to be about sports.
Answer 3. Write your answer here.
Ron Yurko, co-creator of the acclaimed nflscrapR package, has provided us with the following set of baseball data. Check out the description here.
We will now do some visualization of our own! We first need to learn about the data. Run the following commands in RStudio.
mlb <- read.csv("https://raw.githubusercontent.com/ryurko/CMSACamp/master/data/intro_r/mlb_teams_data.csv?token=ADLVDGC6U6R4BK725WSOENC472ECQ")
str(mlb)
## 'data.frame': 2895 obs. of 22 variables:
## $ year : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ league : chr NA NA NA NA ...
## $ team_id : chr "BS1" "CH1" "CL1" "FW1" ...
## $ team_name : chr "Boston Red Stockings" "Chicago White Stockings" "Cleveland Forest Citys" "Fort Wayne Kekiongas" ...
## $ win_world_series : chr NA NA NA NA ...
## $ final_rank : int 3 2 8 7 5 1 9 6 4 2 ...
## $ games_played : int 31 28 29 19 33 28 25 29 32 58 ...
## $ wins : int 20 19 10 7 16 21 4 13 15 35 ...
## $ losses : int 10 9 19 12 17 7 21 15 15 19 ...
## $ runs_scored : int 401 302 249 137 302 376 231 351 310 617 ...
## $ hits : int 426 323 328 178 403 410 274 384 375 753 ...
## $ at_bats : int 1372 1196 1186 746 1404 1281 1036 1248 1353 2571 ...
## $ walks : int 60 60 26 33 33 46 38 49 48 29 ...
## $ strikeouts : int 19 22 25 9 15 23 30 19 13 28 ...
## $ homeruns : int 3 10 7 2 1 9 3 6 6 14 ...
## $ hit_by_pitch : int NA NA NA NA NA NA NA NA NA NA ...
## $ sacrifice_flies : int NA NA NA NA NA NA NA NA NA NA ...
## $ runs_allowed : int 303 241 341 243 313 266 287 362 303 434 ...
## $ hits_allowed : int 367 308 346 261 373 329 315 431 371 573 ...
## $ walks_allowed : int 42 28 53 21 42 53 34 75 45 63 ...
## $ strikeouts_against: int 23 22 34 17 22 16 16 12 13 77 ...
## $ homeruns_allowed : int 2 6 13 5 7 3 3 4 4 3 ...
Question 4.
How many rows are in this data set? How many columns? (Hint: Look at the help text for the command ?dim.)
What do you think the command head() does? What does tail() do? What does the number argument do?
head(mlb, 2)
## year league team_id team_name win_world_series final_rank
## 1 1871 <NA> BS1 Boston Red Stockings <NA> 3
## 2 1871 <NA> CH1 Chicago White Stockings <NA> 2
## games_played wins losses runs_scored hits at_bats walks strikeouts homeruns
## 1 31 20 10 401 426 1372 60 19 3
## 2 28 19 9 302 323 1196 60 22 10
## hit_by_pitch sacrifice_flies runs_allowed hits_allowed walks_allowed
## 1 NA NA 303 367 42
## 2 NA NA 241 308 28
## strikeouts_against homeruns_allowed
## 1 23 2
## 2 22 6
tail(mlb, 5)
## year league team_id team_name win_world_series final_rank
## 2891 2018 NL SLN St. Louis Cardinals N 3
## 2892 2018 AL TBA Tampa Bay Rays N 3
## 2893 2018 AL TEX Texas Rangers N 5
## 2894 2018 AL TOR Toronto Blue Jays N 4
## 2895 2018 NL WAS Washington Nationals N 2
## games_played wins losses runs_scored hits at_bats walks strikeouts
## 2891 162 88 74 759 1369 5498 525 1380
## 2892 162 90 72 716 1415 5475 540 1388
## 2893 162 67 95 737 1308 5453 555 1484
## 2894 162 73 89 709 1336 5477 499 1387
## 2895 162 82 80 771 1402 5517 631 1289
## homeruns hit_by_pitch sacrifice_flies runs_allowed hits_allowed
## 2891 205 80 48 691 1354
## 2892 150 101 50 646 1236
## 2893 194 88 34 848 1516
## 2894 217 58 37 832 1476
## 2895 191 59 40 682 1320
## walks_allowed strikeouts_against homeruns_allowed
## 2891 593 1337 144
## 2892 501 1421 164
## 2893 491 1121 222
## 2894 551 1298 208
## 2895 487 1417 198
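As a hint for Question 4, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical and are not taken from the MLB data. Try it, then run the same commands on `mlb` and compare.

```r
# Hypothetical toy data: 6 rows, 2 columns (not from the MLB data set)
toy <- data.frame(team = c("A", "B", "C", "D", "E", "F"),
                  wins = c(10, 12, 9, 15, 7, 11))

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only

first_two  <- head(toy, 2)  # the first 2 rows
last_three <- tail(toy, 3)  # the last 3 rows
first_two
last_three
```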
library(tidyverse)
ggplot(data = mlb, aes(x = final_rank)) + geom_bar()
Describe the above plot in words. How does it compare to the output of the commands below?
tab <- mlb %>% select(final_rank, win_world_series) %>% table
class(tab)
## [1] "table"
as.data.frame(tab)
## final_rank win_world_series Freq
## 1 1 N 276
## 2 2 N 392
## 3 3 N 385
## 4 4 N 392
## 5 5 N 365
## 6 6 N 258
## 7 7 N 175
## 8 8 N 141
## 9 9 N 17
## 10 10 N 15
## 11 11 N 1
## 12 12 N 1
## 13 13 N 1
## 14 1 Y 112
## 15 2 Y 7
## 16 3 Y 0
## 17 4 Y 0
## 18 5 Y 0
## 19 6 Y 0
## 20 7 Y 0
## 21 8 Y 0
## 22 9 Y 0
## 23 10 Y 0
## 24 11 Y 0
## 25 12 Y 0
## 26 13 Y 0
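The heights drawn by `geom_bar()` are counts per category, which is the same quantity `table()` tabulates. A minimal base R sketch, assuming a made-up vector of ranks (the values in `ranks` are hypothetical):

```r
# Hypothetical final ranks for seven seasons (not from the MLB data set)
ranks <- c(1, 2, 2, 3, 3, 3, 5)

# One count per distinct value; these would be the bar heights
counts <- table(ranks)
counts

# The same counts as a data frame with one row per category
counts_df <- as.data.frame(counts)
counts_df
```

Note that rank 4 never occurs in `ranks`, so it does not appear in this one-way table at all, whereas a two-way table reports zero counts for combinations that never occur.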
Question 5. Let’s look at our home team, the Pittsburgh Pirates.
pirates <- mlb %>% filter(team_id == "PIT")
dim(pirates)
## [1] 132 22
ggplot(data = pirates, aes(x = final_rank)) + geom_bar()
Would you say the Pirates are a successful franchise?
Maybe they were in the past. Let us subset the data to look at the years between 1960 and 1990.
pirates %>% filter(year >= 1960 & year <= 1990) %>%
  ggplot(aes(x = final_rank)) + geom_bar()
pirates %>% filter(year >= 1990 & year <= 2010) %>%
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  ggplot(aes(x = final_rank, y = after_stat(count / sum(count) * 100))) + geom_bar() + ylab("Percentage (%)")
d. Can you make a bar plot of the percentage of seasons at each final rank for the Pirates between 1991 and 2018? How do the two compare?
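The percentage on the y axis above is each count divided by the total count, times 100. The same arithmetic can be sketched in base R with `prop.table()`; the vector `ranks` below is made up for illustration.

```r
# Hypothetical final ranks for eight seasons (not from the MLB data set)
ranks <- c(1, 2, 2, 3, 3, 3, 5, 5)

# Share of seasons at each rank, expressed as a percentage
pct <- prop.table(table(ranks)) * 100
pct

# The percentages should add up to 100
sum(pct)
```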
Question 6. We can also use radial graphs to display 1D categorical data. You are perhaps most familiar with pie charts. Did you know that a pie chart can be made from a bar plot?
## Bar plot
bar <- ggplot(data = mlb, aes(x = factor(1), fill = factor(final_rank))) + geom_bar() +
scale_fill_discrete("Final rank") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
bar
## Transform to pie chart
bar + coord_polar(theta = "y") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
a. Compare and contrast the two plots. What is good (bad) about using the bar to compare category sizes? How about for the pie chart?
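One way to see the link between the two plots: `coord_polar(theta = "y")` maps the height of the stacked bar onto the angle, so each category's slice takes up a share of the 360 degrees equal to its share of the total count. A minimal base R sketch of that arithmetic, using made-up ranks:

```r
# Hypothetical final ranks for eight seasons (not from the MLB data set)
ranks <- c(1, 2, 2, 3, 3, 3, 4, 4)
counts <- table(ranks)

# Each slice's angle in degrees is proportional to its count
angles <- as.vector(counts) / sum(counts) * 360
names(angles) <- names(counts)
angles
```

Because the eye must compare angles rather than lengths along a common axis, slices of similar size can be hard to rank by sight.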
ggplot(data = mlb, aes(x = factor(final_rank))) + geom_bar() +
coord_polar() + xlab("")
ggplot(data = mlb, aes(x = final_rank)) + geom_bar() +
coord_polar(theta = "y")
Can you think of an instance where you would use such a plot?
“Death to pie charts” - Bill Eddy
You may think the above graphs are “ugly.” You may be correct. As statisticians and data scientists, our main focus should be presenting the data, i.e. substance over “flash.” That said, bad aesthetic choices (e.g. titles, text size, data ink, background lines, colors, gradients, stripes, etc.) can certainly hinder the presentation of data, and good aesthetic choices can help.
Question 8.
Describe three ways good aesthetic choices can help increase one’s comprehension of a plot.
Describe two ways bad aesthetic choices can make a plot incomprehensible.
At the very minimum, all your visualizations should contain the following features:
A meaningful title
Meaningful axis titles and legend titles
One or two clear concepts (as opposed to trying to show everything about your data in one plot or being very repetitive in the features shown)
Consideration for data ink (the concept that if something does not meaningfully contribute to a figure, is it really necessary?)
g <- ggplot(data = pirates, aes(x = final_rank)) + geom_bar() +
labs(x = "Final rank at end of season",
y = "Frequency",
title = "Final rank of MLB teams",
subtitle = "1871 - 2018")
g
g <- ggplot(data = pirates, aes(x = factor(final_rank), fill = factor(final_rank))) + geom_bar() +
labs(x = "Final rank at end of season",
y = "Frequency",
title = "Final rank of MLB teams",
subtitle = "1871 - 2018")
g + scale_fill_brewer(type = "qual", guide = "none")  # guide = FALSE is deprecated
g <- ggplot(data = pirates, aes(x = factor(final_rank), fill = factor(final_rank),
col = final_rank)) + geom_bar() +
labs(x = "Final rank at end of season",
y = "Frequency",
title = "Final rank of MLB teams",
subtitle = "1871 - 2018")
g + scale_fill_brewer(type = "seq", guide = "none")  # guide = FALSE is deprecated
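A Brewer palette can also be picked by name rather than by type. The sketch below assumes ggplot2 is installed and uses a made-up data frame `toy`; the palette "Set2" is just one possible choice.

```r
library(ggplot2)

# Hypothetical toy data (not from the MLB data set)
toy <- data.frame(final_rank = c(1, 2, 2, 3, 3, 3))

p <- ggplot(toy, aes(x = factor(final_rank), fill = factor(final_rank))) +
  geom_bar() +
  # Named qualitative palette; the legend is dropped because the
  # x axis already identifies each rank
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(x = "Final rank at end of season",
       y = "Frequency",
       title = "Final rank in a made-up example")
p
```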
We will keep learning about colors over the next few weeks!
Chi-square test visualizations
2D data
Where the bonus is your self-enlightenment.
Repeat the above exercises with your favorite team instead of the Pittsburgh Pirates. Or failing that, the New York Yankees.
Add meaningful titles and axis labels to the graphs you made in this lab.
Change the colors in the radial graphs.