class: center, middle, inverse, title-slide

# Data Visualization
## Visualizing 2D categorical and continuous by categorical
### June 8th, 2021

---

## Revisiting MVP Jose Abreu's batted balls in 2020

Created a dataset of batted balls by American League MVP Jose Abreu in the 2020 season using [`baseballr`](http://billpetti.github.io/baseballr/)

```r
library(tidyverse)
abreu_batted_balls <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/xy_examples/abreu_2020_batted_balls.csv")
head(abreu_batted_balls)
```

```
## # A tibble: 6 x 7
##   pitch_type batted_ball_type  hit_x hit_y exit_velocity launch_angle outcome
##   <chr>      <chr>             <dbl> <dbl>         <dbl>        <dbl> <chr>
## 1 SL         ground_ball       -8.93  56.2          88.1          -17 grounded_into_double_play
## 2 SL         line_drive       -83.8  103.          116.            16 double
## 3 FC         ground_ball       -5     57.6          72.1           -3 field_error
## 4 FC         ground_ball      -22.4   29.7          85.2          -17 field_out
## 5 SI         ground_ball       -9.97  72.6          97.1          -19 field_out
## 6 FC         fly_ball          30.9  151.           97.9           33 field_out
```

- each row / observation is a batted ball from Abreu's 2020 season

- __Categorical__ / qualitative variables: `pitch_type`, `batted_ball_type`, `outcome`

- __Continuous__ / quantitative variables: `hit_x`, `hit_y`, `exit_velocity`, `launch_angle`

---

## First - more fun with [`forcats`](https://forcats.tidyverse.org/)

Variables of interest: [`pitch_type`](https://library.fangraphs.com/pitch-type-abbreviations-classifications/) and `batted_ball_type`

- but how many levels does `pitch_type` have?
```r
table(abreu_batted_balls$pitch_type)
```

```
## 
## CH CU FC FF FS KC SI SL 
## 20 15 14 51  2  2 47 44
```

We can manually [`fct_recode`](https://forcats.tidyverse.org/reference/fct_recode.html) `pitch_type` (see [Chapter 15 of `R` for Data Science](https://r4ds.had.co.nz/factors.html) for more on factors)

```r
# Collapse the eight pitch codes into three broader pitch groups
abreu_batted_balls <- abreu_batted_balls %>%
  filter(pitch_type != "null") %>%
  mutate(pitch_type = fct_recode(pitch_type, "Changeup" = "CH", "Breaking ball" = "CU",
                                 "Fastball" = "FC", "Fastball" = "FF", "Fastball" = "FS",
                                 "Breaking ball" = "KC", "Fastball" = "SI", "Breaking ball" = "SL"))
table(abreu_batted_balls$pitch_type)
```

```
## 
##      Changeup Breaking ball      Fastball 
##            20            61           114
```

---

## 2D Categorical visualization (== more bar charts!)

.pull-left[

__Stacked__: a bar chart of _spine_ charts

```r
abreu_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
             fill = pitch_type)) +
  geom_bar() + theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/stacked-bars-1.png" width="504" />

]

.pull-right[

__Side-by-Side__: a bar chart _of bar charts_

```r
abreu_batted_balls %>%
  ggplot(aes(x = batted_ball_type, fill = pitch_type)) +
  geom_bar(position = "dodge") + theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/side-by-side-bars-1.png" width="504" />

]

---

## Which do you prefer?

.pull-left[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-1-1.png" width="504" />
]

.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

--

- Stacked bar charts emphasize the __marginal__ distribution of the `x` variable,
  - e.g. `\(P\)` (`batted_ball_type` = fly_ball)

- Side-by-side bar charts are useful to show the __conditional__ distribution of the `fill` variable given `x`,
  - e.g.
    `\(P\)` (`pitch_type` = Fastball | `batted_ball_type` = fly_ball)

---

### Brief review of joint, marginal, and conditional probabilities

__Joint distribution__: relative frequency of the intersection, `\(P(X = x, Y = y)\)`

```r
library(gt)
abreu_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  # .groups = "drop" returns an ungrouped table before reshaping
  summarize(joint_prob = n() / nrow(abreu_batted_balls), .groups = "drop") %>%
  pivot_wider(names_from = batted_ball_type, values_from = joint_prob,
              values_fill = 0) %>%
  gt()
```
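
| pitch_type    | fly_ball   | ground_ball | line_drive | popup      |
|:--------------|-----------:|------------:|-----------:|-----------:|
| Changeup      | 0.03076923 | 0.04102564  | 0.03076923 | 0.00000000 |
| Breaking ball | 0.05128205 | 0.16410256  | 0.08717949 | 0.01025641 |
| Fastball      | 0.11282051 | 0.25641026  | 0.18461538 | 0.03076923 |
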
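
As a quick check on the table above, here is a minimal base-`R` sketch, assuming the printed joint probabilities are re-entered by hand as a matrix. The object names (`joint`, `pitch_marginal`, `batted_marginal`, `popup_given_fastball`) are illustrative only and are not part of the original analysis.

```r
# Joint probabilities copied from the table above
# (rows = pitch_type, columns = batted_ball_type)
joint <- matrix(
  c(0.03076923, 0.04102564, 0.03076923, 0.00000000,
    0.05128205, 0.16410256, 0.08717949, 0.01025641,
    0.11282051, 0.25641026, 0.18461538, 0.03076923),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)

# A joint distribution sums to 1 (up to rounding in the printed values)
total_prob <- sum(joint)

# Marginal distributions are the row sums and column sums of the joint table
pitch_marginal <- rowSums(joint)
batted_marginal <- colSums(joint)

# Conditional probability: P(popup | Fastball) = P(popup, Fastball) / P(Fastball)
popup_given_fastball <- joint["Fastball", "popup"] / pitch_marginal["Fastball"]

total_prob
batted_marginal
popup_given_fastball
```

Dividing an entire row of `joint` by its row sum gives the full conditional distribution of `batted_ball_type` for that pitch type.
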
--

__Marginal distribution__: row / column sums, e.g. `\(P(X = \text{popup}) = \sum_{y \in \text{pitch types}} P(X = \text{popup}, Y = y)\)`

--

__Conditional distribution__: probability of event `\(X\)` __given__ a second event `\(Y\)`,

- e.g. `\(P(X = \text{popup} | Y = \text{Fastball}) = \frac{P(X = \text{popup}, Y = \text{Fastball})}{P(Y = \text{Fastball})}\)`

---

## Categorical heatmaps

.pull-left[

```r
abreu_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  summarize(count = n(),
            joint_prob = count / nrow(abreu_batted_balls)) %>%
  ggplot(aes(x = batted_ball_type, y = pitch_type)) +
  geom_tile(aes(fill = count), color = "white") +
  geom_text(aes(label = round(joint_prob, digits = 2)),
            color = "white") +
  scale_fill_viridis_b() +
  theme_bw() + theme(legend.position = "bottom")
```

- Use [`geom_tile`](https://ggplot2.tidyverse.org/reference/geom_tile.html) to display the joint distribution of two categorical variables

- Annotate tiles with the joint proportions (rounded to two digits) using [`geom_text`](https://ggplot2.tidyverse.org/reference/geom_text.html)

]

.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---

## What about independence? Can we visualize it?

--

Two variables are __independent__ if knowing the level of one tells us nothing about the other

- i.e.
  `\(P(X = x | Y = y) = P(X = x)\)`, or equivalently `\(P(X = x, Y = y) = P(X = x) \times P(Y = y)\)`

--

.pull-left[

Create a __mosaic__ plot using the [`vcd`](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) package

```r
library(vcd)
mosaic(~ pitch_type + batted_ball_type, data = abreu_batted_balls)
```

- spine chart _of spine charts_

- height = marginal distribution of `pitch_type`

- width = conditional distribution of `batted_ball_type` | `pitch_type`

- area = joint distribution

__[`ggmosaic`](https://github.com/haleyjeppson/ggmosaic) has issues...__

]

.pull-right[
<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Continuous by categorical: side-by-side and color

.pull-left[

```r
abreu_batted_balls %>%
  ggplot(aes(x = pitch_type, y = exit_velocity)) +
  geom_violin() +
  geom_boxplot(width = .2) +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-5-1.png" width="504" />

]

.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
             color = pitch_type)) +
  stat_ecdf() +
  theme_bw() + theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-6-1.png" width="504" />

]

---

## What about for histograms?

.pull-left[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
             fill = pitch_type)) +
  geom_histogram() +
  theme_bw() + theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity,
             color = pitch_type)) +
  geom_freqpoly() +
  theme_bw() + theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-8-1.png" width="504" />

]

---

## We can always facet instead...
.pull-left[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity)) +
  geom_histogram() +
  theme_bw() +
  facet_wrap(~ pitch_type, ncol = 2)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-9-1.png" width="504" />

]

.pull-right[

```r
abreu_batted_balls %>%
  ggplot(aes(x = exit_velocity)) +
  geom_histogram() +
  theme_bw() +
  facet_grid(pitch_type ~ ., margins = TRUE)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-10-1.png" width="504" />

]

---

## Facets make it easy to move beyond 2D

```r
abreu_batted_balls %>%
  ggplot(aes(x = pitch_type, fill = batted_ball_type)) +
  geom_bar() + theme_bw() +
  facet_wrap(~ outcome, ncol = 5) +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/stacked-bars-facet-1.png" width="864" style="display: block; margin: auto;" />