class: center, middle, inverse, title-slide

# Data Visualization
## Visualizing 2D categorical data and continuous-by-categorical data
### June 8th, 2021

---

## Revisiting MVP Jose Abreu's batted balls in 2020

We created a dataset of batted balls hit by the American League MVP Jose Abreu in the 2020 season using [`baseballr`](http://billpetti.github.io/baseballr/)

```r
library(tidyverse)
abreu_batted_balls <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/xy_examples/abreu_2020_batted_balls.csv")
head(abreu_batted_balls)
```

```
## # A tibble: 6 x 7
##   pitch_type batted_ball_type  hit_x hit_y exit_velocity launch_angle outcome                  
##   <chr>      <chr>             <dbl> <dbl>         <dbl>        <dbl> <chr>                    
## 1 SL         ground_ball       -8.93  56.2          88.1          -17 grounded_into_double_play
## 2 SL         line_drive       -83.8  103.          116.            16 double                   
## 3 FC         ground_ball       -5     57.6          72.1           -3 field_error              
## 4 FC         ground_ball      -22.4   29.7          85.2          -17 field_out                
## 5 SI         ground_ball       -9.97  72.6          97.1          -19 field_out                
## 6 FC         fly_ball          30.9  151.           97.9           33 field_out
```

- each row / observation is a batted ball from Abreu's 2020 season
- __Categorical__ / qualitative variables: `pitch_type`, `batted_ball_type`, `outcome`
- __Continuous__ / quantitative variables: `hit_x`, `hit_y`, `exit_velocity`, `launch_angle`

---

## First - more fun with [`forcats`](https://forcats.tidyverse.org/)

Variables of interest: [`pitch_type`](https://library.fangraphs.com/pitch-type-abbreviations-classifications/) and `batted_ball_type` - but how many levels does `pitch_type` have?

```r
table(abreu_batted_balls$pitch_type)
```

```
## 
## CH CU FC FF FS KC SI SL 
## 20 15 14 51  2  2 47 44
```

We can manually [`fct_recode`](https://forcats.tidyverse.org/reference/fct_recode.html) `pitch_type` (see [Chapter 15 of `R` for Data Science](https://r4ds.had.co.nz/factors.html) for more on factors)

```r
abreu_batted_balls <- abreu_batted_balls %>%
  filter(pitch_type != "null") %>% 
* mutate(pitch_type = fct_recode(pitch_type, "Changeup" = "CH", "Breaking ball" = "CU",
*                     "Fastball" = "FC", "Fastball" = "FF", "Fastball" = "FS",
*                     "Breaking ball" = "KC",  "Fastball" = "SI",  "Breaking ball" = "SL"))
table(abreu_batted_balls$pitch_type)
```

```
## 
##      Changeup Breaking ball      Fastball 
##            20            61           114
```
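
The same grouping can also be written with [`fct_collapse`](https://forcats.tidyverse.org/reference/fct_collapse.html), which takes one vector of old levels per new level. This is only a minimal sketch and not part of the original analysis: it rebuilds a stand-in `pitch_type` vector from the `table()` counts above instead of reading the CSV, and `pitch_group` is a hypothetical name.

```r
library(forcats)

# Stand-in for abreu_batted_balls$pitch_type, rebuilt from the table() counts above
pitch_type <- rep(c("CH", "CU", "FC", "FF", "FS", "KC", "SI", "SL"),
                  times = c(20, 15, 14, 51, 2, 2, 47, 44))

# Collapse the eight pitch codes into the same three groups as the fct_recode call
pitch_group <- fct_collapse(pitch_type,
                            "Changeup" = "CH",
                            "Breaking ball" = c("CU", "KC", "SL"),
                            "Fastball" = c("FC", "FF", "FS", "SI"))
table(pitch_group)
```

Listing each group once may be easier to audit than repeating `"Fastball" = ...` for every code, but either form should give the same counts.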

---

## 2D Categorical visualization (== more bar charts!)

.pull-left[

__Stacked__: a bar chart of _spine_ charts

```r
abreu_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
*            fill = pitch_type)) +
  geom_bar() + theme_bw()
```

]
.pull-right[

__Side-by-Side__: a bar chart _of bar charts_

```r
abreu_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
             fill = pitch_type)) + 
* geom_bar(position = "dodge") + theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/side-by-side-bars-1.png" width="504" />
]

---

## Which do you prefer?

.pull-left[

]
.pull-right[

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

- Stacked bar charts emphasize the __marginal__ distribution of the `x` variable
  - e.g. `$P$` (`batted_ball_type` = fly_ball)

- Side-by-side bar charts make it easier to compare the __conditional__ distribution of the `fill` variable given `x`
  - e.g. `$P$` (`pitch_type` = Fastball | `batted_ball_type` = fly_ball)
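
If the conditional distribution is the main interest, one option is `position = "fill"`, which rescales every bar to height 1. The sketch below is an illustration rather than part of the original slides: `abreu_counts` and `fill_plot` are hypothetical names, and the counts are recovered from the joint probability table shown later in these slides (each probability times 195 batted balls) rather than read from the CSV.

```r
library(ggplot2)

# Counts implied by the joint probability table (probability * 195)
abreu_counts <- data.frame(
  pitch_type = rep(c("Changeup", "Breaking ball", "Fastball"), each = 4),
  batted_ball_type = rep(c("fly_ball", "ground_ball", "line_drive", "popup"),
                         times = 3),
  count = c(6, 8, 6, 0,
            10, 32, 17, 2,
            22, 50, 36, 6)
)

# position = "fill" stacks the segments and rescales each bar to sum to 1,
# so segment heights are P(pitch_type | batted_ball_type)
fill_plot <- ggplot(abreu_counts,
                    aes(x = batted_ball_type, y = count, fill = pitch_type)) +
  geom_col(position = "fill") +
  labs(y = "proportion") +
  theme_bw()
fill_plot
```

With the row-level data, `geom_bar(position = "fill")` in place of `geom_col` should produce the same picture, at the cost of hiding how few popups there are.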

---

### Brief review of joint, marginal, and conditional probabilities

__Joint distribution__: frequency of intersection, `$P(X = x, Y = y)$`

```r
library(gt)
abreu_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  summarize(joint_prob = n() / nrow(abreu_batted_balls)) %>%
  pivot_wider(names_from = batted_ball_type, values_from = joint_prob,
              values_fill = 0) %>%
  gt()
```

<div id="xzspjjanme" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;"><table class="gt_table">
  
  <thead class="gt_col_headings">
    <tr>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1">pitch_type</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">fly_ball</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">ground_ball</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">line_drive</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">popup</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr>
      <td class="gt_row gt_center">Changeup</td>
      <td class="gt_row gt_right">0.03076923</td>
      <td class="gt_row gt_right">0.04102564</td>
      <td class="gt_row gt_right">0.03076923</td>
      <td class="gt_row gt_right">0.00000000</td>
    </tr>
    <tr>
      <td class="gt_row gt_center">Breaking ball</td>
      <td class="gt_row gt_right">0.05128205</td>
      <td class="gt_row gt_right">0.16410256</td>
      <td class="gt_row gt_right">0.08717949</td>
      <td class="gt_row gt_right">0.01025641</td>
    </tr>
    <tr>
      <td class="gt_row gt_center">Fastball</td>
      <td class="gt_row gt_right">0.11282051</td>
      <td class="gt_row gt_right">0.25641026</td>
      <td class="gt_row gt_right">0.18461538</td>
      <td class="gt_row gt_right">0.03076923</td>
    </tr>
  </tbody>
  
  
</table></div>

--
__Marginal distribution__: row / column sums, e.g. `$P(X = \text{popup}) = \sum_{y \in \text{pitch types}} P(X = \text{popup}, Y = y)$`

--
__Conditional distribution__: probability of an event `$X$` __given__ a second event `$Y$`
- e.g. `$P(X = \text{popup} | Y = \text{Fastball}) = \frac{P(X = \text{popup}, Y = \text{Fastball})}{P(Y = \text{Fastball})}$`
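
As a worked check of these three definitions, the sketch below recomputes them in base R. It assumes the counts implied by the joint table above (each probability times `n = 195`); the object names are hypothetical, and `$X$` is the batted ball type while `$Y$` is the pitch type, as in the formulas above.

```r
# Counts implied by the joint table above (probability * 195)
counts <- matrix(c( 6,  8,  6, 0,
                   10, 32, 17, 2,
                   22, 50, 36, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(
                   pitch_type = c("Changeup", "Breaking ball", "Fastball"),
                   batted_ball_type = c("fly_ball", "ground_ball",
                                        "line_drive", "popup")))

joint <- prop.table(counts)     # P(X = x, Y = y); all cells sum to 1
marginal_x <- colSums(joint)    # P(X = x): batted ball type
marginal_y <- rowSums(joint)    # P(Y = y): pitch type

# P(X = popup | Y = Fastball) = P(X = popup, Y = Fastball) / P(Y = Fastball)
cond_popup_fastball <- joint["Fastball", "popup"] / marginal_y[["Fastball"]]

marginal_x[["popup"]]   # equals 8 / 195
cond_popup_fastball     # equals 6 / 114
```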

---

## Categorical heatmaps

.pull-left[

```r
abreu_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  summarize(count = n(),
            joint_prob = count / nrow(abreu_batted_balls)) %>%
  ggplot(aes(x = batted_ball_type, y = pitch_type)) +
* geom_tile(aes(fill = count), color = "white") +
* geom_text(aes(label = round(joint_prob, digits = 2)),
*           color = "white") +
* scale_fill_viridis_b() +
  theme_bw() +
  theme(legend.position = "bottom")
```

- Use [`geom_tile`](https://ggplot2.tidyverse.org/reference/geom_tile.html) to display the joint distribution of two categorical variables

- Annotate tiles with the joint proportions (rounded to two digits) using [`geom_text`](https://ggplot2.tidyverse.org/reference/geom_text.html)
]

.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---

## What about independence? Can we visualize it?

--
Two variables are __independent__ if knowing the level of one tells us nothing about the other
- i.e. `$P(X = x | Y = y) = P(X = x)$` for all `$x, y$`, or equivalently `$P(X = x, Y = y) = P(X = x) \times P(Y = y)$`
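
One rough numerical check of this definition is to compare the observed joint proportions with the product of the marginals. The sketch below is an illustration under stated assumptions, not a formal test: it uses the counts implied by the joint table earlier in these slides (probability times 195), and `observed` and `expected` are hypothetical names.

```r
# Counts implied by the joint probability table (probability * 195)
counts <- matrix(c( 6,  8,  6, 0,
                   10, 32, 17, 2,
                   22, 50, 36, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(
                   pitch_type = c("Changeup", "Breaking ball", "Fastball"),
                   batted_ball_type = c("fly_ball", "ground_ball",
                                        "line_drive", "popup")))
n <- sum(counts)

observed <- counts / n                                      # P(X = x, Y = y)
expected <- outer(rowSums(counts), colSums(counts)) / n^2   # P(Y = y) * P(X = x)

# Cells far from zero are where the data depart from independence
round(observed - expected, 3)
```

With only 195 batted balls and some very small cells, modest differences here may just reflect sampling noise.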

.pull-left[

Create a __mosaic__ plot using the [`vcd`](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) package

```r
library(vcd)
mosaic(~ pitch_type + batted_ball_type, 
       data = abreu_batted_balls)
```

- spine chart _of spine charts_

- height = marginal distribution of `pitch_type`

- width = conditional distribution of `batted_ball_type` | `pitch_type`

- area = joint distribution

__[`ggmosaic`](https://github.com/haleyjeppson/ggmosaic) has issues...__
]
.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Continuous by categorical: side-by-side and color

.pull-left[

```r
abreu_batted_balls %>%
* ggplot(aes(x = pitch_type,
             y = exit_velocity)) +
  geom_violin() +
  geom_boxplot(width = .2) +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-5-1.png" width="504" />
  
]
.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            color = pitch_type)) +
  stat_ecdf() + 
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---

## What about for histograms?

.pull-left[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            fill = pitch_type)) +
  geom_histogram() +
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-7-1.png" width="504" />
  
]
.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            color = pitch_type)) +
  geom_freqpoly() +
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-8-1.png" width="504" />
]

---

## We can always facet instead...

.pull-left[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity)) + 
  geom_histogram() +
  theme_bw() +
* facet_wrap(~ pitch_type, ncol = 2)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-9-1.png" width="504" />
  
]
.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity)) + 
  geom_histogram() +
  theme_bw() +
* facet_grid(pitch_type ~., margins = TRUE)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

---

## Facets make it easy to move beyond 2D

```r
abreu_batted_balls %>%
  ggplot(aes(x = pitch_type,
             fill = batted_ball_type)) + 
  geom_bar() + theme_bw() +
  facet_wrap(~ outcome, ncol = 5) +
  theme(legend.position = "bottom")
```