Data Visualization

class: center, middle, inverse, title-slide

# Data Visualization
## Visualizing 2D categorical and continuous by categorical
### June 10th, 2022

---

## Revisiting MVP Shohei Ohtani's batted balls in 2021

This dataset of batted balls by the American League MVP Shohei Ohtani in the 2021 season was created using [`baseballr`](http://billpetti.github.io/baseballr/)

```r
library(tidyverse)
ohtani_batted_balls <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/xy_examples/ohtani_2021_batted_balls.csv")
head(ohtani_batted_balls)
```

```
## # A tibble: 6 x 7
##   pitch_type batted_ball_type  hit_x hit_y exit_velocity launch_angle outcome  
##   <chr>      <chr>             <dbl> <dbl>         <dbl>        <dbl> <chr>    
## 1 FC         line_drive        89.7  144.          113.            20 home_run 
## 2 CH         fly_ball           3.35  83.9          83.9           55 field_out
## 3 CH         fly_ball         -65.6  126.          102.            38 field_out
## 4 CU         ground_ball       39.2   50.4          82.5            8 field_out
## 5 FC         fly_ball         -37.6  138.          101.            23 field_out
## 6 KC         popup            -51.9   41.6          84             65 field_out
```

- each row / observation is a batted ball from Ohtani's 2021 season
- __Categorical__ / qualitative variables: `pitch_type`, `batted_ball_type`, `outcome`
- __Continuous__ / quantitative variables: `hit_x`, `hit_y`, `exit_velocity`, `launch_angle`

---

## First - more fun with [`forcats`](https://forcats.tidyverse.org/)

Variables of interest: [`pitch_type`](https://library.fangraphs.com/pitch-type-abbreviations-classifications/) and `batted_ball_type` - but how many levels does `pitch_type` have?

```r
table(ohtani_batted_balls$pitch_type)
```

```
## 
## CH CU FC FF FS KC SI SL 
## 62 37 30 87  8 11 57 62
```

We can manually [`fct_recode`](https://forcats.tidyverse.org/reference/fct_recode.html) `pitch_type` (see [Chapter 15 of `R` for Data Science](https://r4ds.had.co.nz/factors.html) for more on factors)

```r
ohtani_batted_balls <- ohtani_batted_balls %>%
  filter(pitch_type != "null") %>% 
* mutate(pitch_type = fct_recode(pitch_type, "Changeup" = "CH", "Breaking ball" = "CU",
*                     "Fastball" = "FC", "Fastball" = "FF", "Fastball" = "FS",
*                     "Breaking ball" = "KC",  "Fastball" = "SI",  "Breaking ball" = "SL"))
table(ohtani_batted_balls$pitch_type)
```

```
## 
##      Changeup Breaking ball      Fastball 
##            62           110           182
```

---

## Inference for categorical data

The main test used for categorical data is the __chi-square test__:

- __Null hypothesis__: `$H_0: p_1 = p_2 = \cdots = p_K$` and we compute the __test statistic__:

$$
\chi^2 = \sum_{j=1}^K \frac{(O_j - E_j)^2}{E_j}
$$

- `$O_j$`: observed counts in category `$j$`

- `$E_j$`: expected counts under `$H_0$` (i.e., `$\frac{n}{K}$`, since under `$H_0$` each category is equally likely to occur)

```r
*chisq.test(table(ohtani_batted_balls$pitch_type))
```

```
## 
## 	Chi-squared test for given probabilities
## 
## data:  table(ohtani_batted_balls$pitch_type)
## X-squared = 61.831, df = 2, p-value = 3.747e-14
```
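
To see where these numbers come from, the statistic can be reproduced by hand from the recoded counts. This is only a minimal sketch, not a replacement for `chisq.test()`; the object names (`observed`, `expected`, `chi_sq`, `p_value`) are illustrative and the counts are typed in from the table output above.

```r
# Observed counts copied from the recoded pitch_type table above
observed <- c(Changeup = 62, `Breaking ball` = 110, Fastball = 182)

# Under H0 every category is equally likely, so each expected count is n / K
n <- sum(observed)
K <- length(observed)
expected <- rep(n / K, K)

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value from the chi-square distribution with K - 1 degrees of freedom
p_value <- pchisq(chi_sq, df = K - 1, lower.tail = FALSE)

chi_sq
p_value
```

The two printed values should agree with the `X-squared` and `p-value` entries reported by `chisq.test()`.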

---

## Statistical inference in general

.pull-left[
Computing `$p$`-values works like this:

- Choose a test statistic.

- Compute the test statistic in your dataset.

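- Is the test statistic "unusual" compared to what we would expect under `$H_0$`? The `$p$`-value measures this: it is the probability, under `$H_0$`, of a test statistic at least as extreme as the one observed.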

- Compare `$p$`-value to __target error rate__ `$\alpha$` (typically referred to as target level `$\alpha$` )

- Typically choose `$\alpha = 0.05$` 
]
.pull-right[

]
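
The steps above can be illustrated with a small simulation. This is a hedged sketch rather than how `chisq.test()` works internally: it draws counts under `$H_0$` with `rmultinom()` and checks how often the simulated statistic is at least as large as the observed one. The seed, the number of draws, and all object names are arbitrary choices made for this example.

```r
set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Observed counts copied from the recoded pitch_type table
observed <- c(Changeup = 62, `Breaking ball` = 110, Fastball = 182)
n <- sum(observed)
K <- length(observed)
expected <- rep(n / K, K)

# Choose a test statistic and compute it on the observed data
chi_sq_stat <- function(counts) sum((counts - expected)^2 / expected)
observed_stat <- chi_sq_stat(observed)

# Simulate what the statistic looks like under H0 (equal probabilities);
# rmultinom() returns one column of K counts per simulated dataset
null_counts <- rmultinom(10000, size = n, prob = rep(1 / K, K))
null_stats <- apply(null_counts, 2, chi_sq_stat)

# Simulated p-value: share of null statistics at least as extreme as observed
p_value <- mean(null_stats >= observed_stat)

# Compare the p-value to the target level alpha
alpha <- 0.05
p_value < alpha
```

With 10,000 draws the simulated `$p$`-value can only resolve down to about `1e-4`, so very small `$p$`-values are better read off the chi-square distribution itself.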

---

## 2D Categorical visualization (== more bar charts!)

.pull-left[

__Stacked__: a bar chart of _spine_ charts

```r
ohtani_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
*            fill = pitch_type)) +
  geom_bar() + theme_bw()
```

]
.pull-right[

__Side-by-Side__: a bar chart _of bar charts_

```r
ohtani_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
             fill = pitch_type)) + 
* geom_bar(position = "dodge") + theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/side-by-side-bars-1.png" width="504" />
]

---

## Which do you prefer?

.pull-left[

]
.pull-right[

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

- Stacked bar charts emphasize __marginal__ distribution of `x` variable, 
  - e.g. `$P$` (`batted_ball_type` = fly_ball)

- Side-by-side bar charts are useful to show the __conditional__ distribution of `fill` variable given `x`,
  - e.g. `$P$` (`pitch_type` = Fastball | `batted_ball_type` = fly_ball)

---

## Contingency tables

Can provide `table()` with more than one variable

```r
table("Pitch type" = ohtani_batted_balls$pitch_type, 
      "Batted ball type" = ohtani_batted_balls$batted_ball_type)
```

```
##                Batted ball type
## Pitch type      fly_ball ground_ball line_drive popup
##   Changeup            27          19         11     5
##   Breaking ball       44          38         22     6
##   Fastball            52          84         40     6
```

Easily compute `proportions()`:

```r
*proportions(table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type))
```

```
##                
##                   fly_ball ground_ball line_drive      popup
##   Changeup      0.07627119  0.05367232 0.03107345 0.01412429
##   Breaking ball 0.12429379  0.10734463 0.06214689 0.01694915
##   Fastball      0.14689266  0.23728814 0.11299435 0.01694915
```

---

## Review of joint, marginal, and conditional probabilities

__Joint distribution__: frequency of intersection, `$P(X = x, Y = y)$`

```r
proportions(table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type))
```

__Marginal distribution__: row / column sums, e.g. `$P(X = \text{popup}) = \sum_{y \in \text{pitch types}} P(X = \text{popup}, Y = y)$`

__Conditional distribution__: probability of an event `$X$` __given__ a second event `$Y$`, 
- e.g. `$P(X = \text{popup} | Y = \text{Fastball}) = \frac{P(X = \text{popup}, Y = \text{Fastball})}{P(Y = \text{Fastball})}$`
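
As a worked example of the three definitions, the sketch below rebuilds the contingency table from the counts printed earlier and computes one joint, one marginal, and one conditional probability. The matrix is typed in by hand so the block runs without the CSV; object names such as `counts` and `joint` are illustrative only.

```r
# Contingency table copied from the table() output (rows: pitch type)
counts <- matrix(
  c(27, 19, 11, 5,
    44, 38, 22, 6,
    52, 84, 40, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)

# Joint distribution: each cell divided by the grand total
joint <- proportions(counts)

# Marginal distributions: row sums and column sums of the joint distribution
marginal_pitch <- rowSums(joint)
marginal_batted <- colSums(joint)

# Conditional probability P(popup | Fastball) = joint / marginal
cond_popup_given_fastball <-
  joint["Fastball", "popup"] / marginal_pitch[["Fastball"]]

marginal_batted[["popup"]]
cond_popup_given_fastball
```

The same conditional probabilities for every row are available in one call with `proportions(counts, margin = 1)`.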

---

### BONUS: `pivot_wider` example

Manually construct this table for practice...

```r
library(gt)
ohtani_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  summarize(joint_prob = n() / nrow(ohtani_batted_balls), .groups = "drop") %>%
  pivot_wider(names_from = batted_ball_type, values_from = joint_prob,
              values_fill = 0) %>%
  gt()
```

<div id="fosskfjmol" style="overflow-x:auto;overflow-y:auto;width:auto;height:auto;"><table class="gt_table">
  
  <thead class="gt_col_headings">
    <tr>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="1" colspan="1">pitch_type</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">fly_ball</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">ground_ball</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">line_drive</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1">popup</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr>
      <td class="gt_row gt_center">Changeup</td>
      <td class="gt_row gt_right">0.07627119</td>
      <td class="gt_row gt_right">0.05367232</td>
      <td class="gt_row gt_right">0.03107345</td>
      <td class="gt_row gt_right">0.01412429</td>
    </tr>
    <tr>
      <td class="gt_row gt_center">Breaking ball</td>
      <td class="gt_row gt_right">0.12429379</td>
      <td class="gt_row gt_right">0.10734463</td>
      <td class="gt_row gt_right">0.06214689</td>
      <td class="gt_row gt_right">0.01694915</td>
    </tr>
    <tr>
      <td class="gt_row gt_center">Fastball</td>
      <td class="gt_row gt_right">0.14689266</td>
      <td class="gt_row gt_right">0.23728814</td>
      <td class="gt_row gt_right">0.11299435</td>
      <td class="gt_row gt_right">0.01694915</td>
    </tr>
  </tbody>
  
  
</table></div>

---

## Inference for 2D categorical data

We AGAIN use the __chi-square test__:

- __Null hypothesis__: `$H_0$`: Variables `$A$` and `$B$` are independent,

- e.g., `batted_ball_type` and `pitch_type` are independent of each other, no relationship

- And now we compute the __test statistic__ as:

`$$\chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$`

- `$O_{ij}$`: observed counts in cell `$(i, j)$` of the contingency table

- `$E_{ij}$`: expected counts under `$H_0$` where __under the null__:

$$
`\begin{aligned}
E_{ij} &= n \cdot P(A = a_i, B = b_j) \\
&= n \cdot P(A = a_i) P(B = b_j) \\
&= n \cdot \left( \frac{n_{i \cdot}}{n} \right) \left( \frac{ n_{\cdot j}}{n} \right)
\end{aligned}`
$$
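
The expected counts in the last line of the derivation can be computed directly from the row and column totals. The sketch below does this for the pitch type by batted ball type table, with the counts typed in by hand from the earlier `table()` output; the object names are illustrative, and the result is meant to be compared against `chisq.test()` on the same table rather than to replace it.

```r
# Contingency table copied from the table() output (rows: pitch type)
counts <- matrix(
  c(27, 19, 11, 5,
    44, 38, 22, 6,
    52, 84, 40, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)
n <- sum(counts)

# E_ij = n * (n_i. / n) * (n_.j / n) = n_i. * n_.j / n
expected <- outer(rowSums(counts), colSums(counts)) / n

# Chi-square statistic summed over all cells of the table
chi_sq <- sum((counts - expected)^2 / expected)

# Degrees of freedom for a test of independence: (k_1 - 1) * (k_2 - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
chi_sq
p_value
```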
  
---

## Inference for 2D categorical data

We AGAIN use the __chi-square test__:

- __Null hypothesis__: `$H_0$`: Variables `$A$` and `$B$` are independent,

- e.g., `batted_ball_type` and `pitch_type` are independent of each other, no relationship

- And now we compute the __test statistic__ as:

`$$\chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$`

```r
*chisq.test(table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type))
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type)
## X-squared = 10.928, df = 6, p-value = 0.09062
```

---

## Can we visualize independence?

--
Two variables are __independent__ if knowing the level of one tells us nothing about the other
- i.e.  `$P(X = x | Y = y) = P(X = x)$`, and that `$P(X = x, Y = y) = P(X = x) \times P(Y = y)$`

.pull-left[

Create a __mosaic__ plot using __base `R`__

```r
mosaicplot(table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type),
           main = "Relationship between batted ball and pitch type?")
```

- spine chart _of spine charts_

- width `$\propto$` marginal distribution of `pitch_type`

- height `$\propto$` conditional distribution of `batted_ball_type` | `pitch_type`

- area `$\propto$` joint distribution

__[`ggmosaic`](https://github.com/haleyjeppson/ggmosaic) has issues...__
]
.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-9-1.png" width="504" />
]
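
The geometry of the mosaic plot can be checked numerically: multiplying each tile's width (marginal) by its height (conditional) should recover its area (joint). This is a small sketch with the table typed in by hand; `widths`, `heights`, and `areas` are names invented for the example and are not arguments of `mosaicplot()`.

```r
# Contingency table copied from the table() output (rows: pitch type)
counts <- matrix(
  c(27, 19, 11, 5,
    44, 38, 22, 6,
    52, 84, 40, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)

# Width of each pitch_type column: marginal distribution of pitch_type
widths <- rowSums(counts) / sum(counts)

# Height of each tile: conditional distribution of batted_ball_type | pitch_type
heights <- proportions(counts, margin = 1)

# Area = width * height; the length-3 vector is recycled down each column,
# so cell (i, j) is multiplied by widths[i]
areas <- heights * widths

# The areas should match the joint distribution
all.equal(areas, proportions(counts))
```

Under independence every row of `heights` would be the same, so the horizontal cuts in the mosaic plot would line up across all pitch types.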

---

## Shade by _Pearson residuals_

- The __test statistic__ is:

`$$\chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$`

- Define the _Pearson residuals_ as:

`$$r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$`

- Sidenote: In general, Pearson residuals are `$\frac{\text{residuals}}{\sqrt{\text{variance}}}$`

- `$r_{ij} \approx 0 \rightarrow$` observed counts are close to expected counts

- `$|r_{ij}| > 2 \rightarrow$` "significant" at level `$\alpha = 0.05$`.

- Very positive `$r_{ij} \rightarrow$` more than expected, while very negative `$r_{ij} \rightarrow$` fewer than expected

- Mosaic plots: Color by Pearson residuals to tell us which combos are much bigger/smaller than expected.
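
The residuals defined above can be computed by hand for the pitch type by batted ball type table. This is a minimal sketch with the counts typed in from the earlier `table()` output; the object names are illustrative, and the `|r| > 2` cutoff is the rule of thumb from the bullets above rather than an exact test.

```r
# Contingency table copied from the table() output (rows: pitch type)
counts <- matrix(
  c(27, 19, 11, 5,
    44, 38, 22, 6,
    52, 84, 40, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residuals: (O - E) / sqrt(E), one per cell
pearson_resid <- (counts - expected) / sqrt(expected)
round(pearson_resid, 2)

# The squared residuals add up to the chi-square statistic
sum(pearson_resid^2)

# Cells flagged by the |r| > 2 rule of thumb (an empty result means none)
which(abs(pearson_resid) > 2, arr.ind = TRUE)
```

For this table every residual is smaller than 2 in absolute value, which is consistent with the test of independence on the same table not rejecting at `$\alpha = 0.05$`.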

---

## Shade by _Pearson residuals_

```r
mosaicplot(table(ohtani_batted_balls$pitch_type, ohtani_batted_balls$batted_ball_type),
*          shade = TRUE, main = "Relationship between batted ball and pitch type?")
```

---

## Continuous by categorical: side-by-side and color

.pull-left[

```r
ohtani_batted_balls %>%
* ggplot(aes(x = pitch_type,
             y = exit_velocity)) +
  geom_violin() +
  geom_boxplot(width = .2) +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-10-1.png" width="504" />
  
]
.pull-right[

```r
ohtani_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            color = pitch_type)) +
  stat_ecdf() + 
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-11-1.png" width="504" />
]

---

## What about for histograms?

.pull-left[

```r
ohtani_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            fill = pitch_type)) +
  geom_histogram() +
  theme_bw() + theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-12-1.png" width="504" />
  
]
.pull-right[

```r
ohtani_batted_balls %>%
  ggplot(aes(x = exit_velocity,
             fill = pitch_type)) + 
* geom_histogram(alpha = .25, position = "identity") +
  theme_bw() + theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-13-1.png" width="504" />
]

---

## We can always facet instead...

.pull-left[

```r
ohtani_batted_balls %>%
  ggplot(aes(x = exit_velocity)) + 
  geom_histogram() +
  theme_bw() +
* facet_wrap(~ pitch_type, ncol = 2)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-14-1.png" width="504" />
  
]
.pull-right[

```r
ohtani_batted_balls %>%
  ggplot(aes(x = exit_velocity)) + 
  geom_histogram() +
  theme_bw() +
* facet_grid(pitch_type ~., margins = TRUE)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-15-1.png" width="504" />
]

---

## Facets make it easy to move beyond 2D

```r
ohtani_batted_balls %>%
  ggplot(aes(x = pitch_type,
             fill = batted_ball_type)) + 
  geom_bar() + theme_bw() +
  facet_wrap(~ outcome, ncol = 5) +
  theme(legend.position = "bottom")
```