class: center, middle, inverse, title-slide

# Data Visualization
## Visualizing 2D categorical and continuous by categorical, plus facets
### Ron Yurko
### 06/04/2020

---

## Crimes against bar charts

.center[![](https://pbs.twimg.com/media/DiaDinUX0AQ2Hel?format=jpg&name=large)]

_Please do NOT ever make something like this_...

---

## Revisiting MVP Mike Trout's batted balls in 2019

A dataset of batted balls by American League MVP Mike Trout in the 2019 season, created using [`baseballr`](http://billpetti.github.io/baseballr/)

```r
library(tidyverse)
trout_batted_balls <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/xy_examples/trout_2019_batted_balls.csv")
head(trout_batted_balls)
```

```
## # A tibble: 6 x 7
##   pitch_type batted_ball_type hit_x hit_y exit_velocity launch_angle
##   <chr>      <chr>            <dbl> <dbl>         <dbl>        <dbl>
## 1 FF         fly_ball          10.2 150.           91           30.2
## 2 FF         fly_ball         -70.6  82.7          82.2         49.7
## 3 FF         ground_ball      -21.6  18.5          49.1         13.8
## 4 SI         fly_ball         -23.5 191.          111.          25.5
## 5 FF         ground_ball      -14.5  41.8          82.1         -6
## 6 FF         fly_ball          27.1 114.           96.8         52
## # … with 1 more variable: outcome <chr>
```

- each row / observation is a batted ball from Trout's 2019 season

- __Categorical__ / qualitative variables: `pitch_type`, `batted_ball_type`, `outcome`

- __Continuous__ / quantitative variables: `hit_x`, `hit_y`, `exit_velocity`, `launch_angle`

---

## First - more fun with [`forcats`](https://forcats.tidyverse.org/)

Variables of interest: [`pitch_type`](https://library.fangraphs.com/pitch-type-abbreviations-classifications/) and `batted_ball_type`

- but how many levels does `pitch_type` have?

```r
table(trout_batted_balls$pitch_type)
```

```
## 
##   CH   CU   FC   FF   FS   FT   KC null   SI   SL 
##   38   23   21  139    1   33    1    6   29   63 
```

We can manually [`fct_recode`](https://forcats.tidyverse.org/reference/fct_recode.html) `pitch_type` (see [Chapter 15 of `R` for Data Science](https://r4ds.had.co.nz/factors.html) for more on factors)

```r
trout_batted_balls <- trout_batted_balls %>%
  filter(pitch_type != "null") %>%
* mutate(pitch_type = fct_recode(pitch_type, "Changeup" = "CH", "Breaking ball" = "CU",
*                                "Fastball" = "FC", "Fastball" = "FF", "Fastball" = "FS", "Fastball" = "FT",
*                                "Breaking ball" = "KC", "Fastball" = "SI", "Breaking ball" = "SL"))
table(trout_batted_balls$pitch_type)
```

```
## 
##      Changeup Breaking ball      Fastball 
##            38            87           223 
```

---

## 2D Categorical visualization (== more bar charts!)

.pull-left[

__Stacked__: a bar chart _of spine charts_

```r
trout_batted_balls %>%
  ggplot(aes(x = batted_ball_type,
*            fill = pitch_type)) +
  geom_bar() +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/stacked-bars-1.png" width="504" />

]

.pull-right[

__Side-by-Side__: a bar chart _of bar charts_

```r
trout_batted_balls %>%
  ggplot(aes(x = batted_ball_type, fill = pitch_type)) +
* geom_bar(position = "dodge") +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/side-by-side-bars-1.png" width="504" />

]

---

## Which do you prefer?

.pull-left[

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

.pull-right[

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-2-1.png" width="504" />

]

--

- Stacked bar charts emphasize the __marginal__ distribution of the `x` variable,
  - e.g. `\(P\)` (`batted_ball_type` = fly_ball)

- Side-by-side bar charts are useful to show the __conditional__ distribution of the `fill` variable given `x`,
  - e.g. `\(P\)` (`pitch_type` = Fastball | `batted_ball_type` = fly_ball)

---

## Brief review of joint, marginal, and conditional probabilities

__Joint distribution__: frequency of intersection, `\(P(X = x, Y = y)\)`

```r
library(gt)
trout_batted_balls %>%
  group_by(batted_ball_type, pitch_type) %>%
  summarize(joint_prob = n() / nrow(trout_batted_balls)) %>%
  pivot_wider(names_from = batted_ball_type, values_from = joint_prob) %>%
  gt()
```

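| pitch_type    |   fly_ball | ground_ball | line_drive |      popup |
|:--------------|-----------:|------------:|-----------:|-----------:|
| Changeup      | 0.04022989 |  0.03448276 | 0.02586207 | 0.00862069 |
| Breaking ball | 0.10344828 |  0.04597701 | 0.06896552 | 0.03160920 |
| Fastball      | 0.22126437 |  0.17241379 | 0.19827586 | 0.04885057 |
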
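As a minimal sketch (not part of the original analysis), the joint table above can be rebuilt in base `R` and turned into marginal and conditional distributions. The counts are an assumption: they are reconstructed by multiplying each joint probability by the 348 batted balls left after dropping the `null` pitches, and the object names are made up for illustration.

```r
# Counts reconstructed from the joint probabilities above times 348
# batted balls (an assumption: each product rounds to a whole number)
counts <- matrix(
  c(14, 12,  9,  3,
    36, 16, 24, 11,
    77, 60, 69, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    pitch_type = c("Changeup", "Breaking ball", "Fastball"),
    batted_ball_type = c("fly_ball", "ground_ball", "line_drive", "popup")
  )
)

# Joint distribution: each cell divided by the grand total
joint <- prop.table(counts)

# Marginal distributions: row / column sums of the joint table
pitch_marginal <- rowSums(joint)
batted_marginal <- colSums(joint)

# Conditional distribution of batted_ball_type given pitch_type:
# each row divided by its row total, so every row sums to 1
batted_given_pitch <- prop.table(counts, margin = 1)

# P(popup | Fastball), i.e. the joint cell over the Fastball marginal
batted_given_pitch["Fastball", "popup"]
```

Using `margin = 2` instead would condition on `batted_ball_type`, which is the distribution the side-by-side bar chart displays.
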
--

__Marginal distribution__: row / column sums, e.g. `\(P(X = \text{popup}) = \sum_{y \in \text{pitch types}} P(X = \text{popup}, Y = y)\)`

--

__Conditional distribution__: probability of event `\(X\)` __given__ a second event `\(Y\)`,

- e.g. `\(P(X = \text{popup} | Y = \text{Fastball}) = \frac{P(X = \text{popup}, Y = \text{Fastball})}{P(Y = \text{Fastball})}\)`

---

## What about independence? Can we visualize it?

--

Two variables are __independent__ if knowing the level of one tells us nothing about the other

- i.e. `\(P(X = x | Y = y) = P(X = x)\)`, and equivalently `\(P(X = x, Y = y) = P(X = x) \times P(Y = y)\)`

--

.pull-left[

Create a __mosaic__ plot using the [`vcd`](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) package

```r
library(vcd)
mosaic(~ pitch_type + batted_ball_type, data = trout_batted_balls)
```

- spine chart _of spine charts_

- height = marginal distribution of `pitch_type`

- width = conditional distribution of `batted_ball_type` | `pitch_type`

- area = joint distribution

__[`ggmosaic`](https://github.com/haleyjeppson/ggmosaic) has issues...__

]

.pull-right[

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-3-1.png" width="504" />

]

---

## Continuous by categorical: side-by-side and color

.pull-left[

```r
trout_batted_balls %>%
* ggplot(aes(x = pitch_type, y = exit_velocity)) +
  geom_violin() +
  geom_boxplot(width = .2) +
  theme_bw()
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-4-1.png" width="504" />

]

.pull-right[

```r
trout_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            color = pitch_type)) +
  stat_ecdf() +
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-5-1.png" width="504" />

]

---

## What about for histograms?

.pull-left[

```r
trout_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            fill = pitch_type)) +
  geom_histogram() +
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-6-1.png" width="504" />

]

.pull-right[

```r
trout_batted_balls %>%
  ggplot(aes(x = exit_velocity,
*            color = pitch_type)) +
  geom_freqpoly() +
  theme_bw() +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

---

## We can always facet instead...

.pull-left[

```r
trout_batted_balls %>%
  ggplot(aes(x = exit_velocity)) +
  geom_histogram() +
  theme_bw() +
* facet_wrap(~ pitch_type, ncol = 2)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-8-1.png" width="504" />

]

.pull-right[

```r
trout_batted_balls %>%
  ggplot(aes(x = exit_velocity)) +
  geom_histogram() +
  theme_bw() +
* facet_grid(pitch_type ~., margins = TRUE)
```

<img src="04-2dcatcolorfacet_files/figure-html/unnamed-chunk-9-1.png" width="504" />

]

---

## Facets make it easy to move beyond 2D

```r
trout_batted_balls %>%
  ggplot(aes(x = pitch_type, fill = batted_ball_type)) +
  geom_bar() +
  theme_bw() +
  facet_wrap(~ outcome, ncol = 5) +
  theme(legend.position = "bottom")
```

<img src="04-2dcatcolorfacet_files/figure-html/stacked-bars-facet-1.png" width="864" style="display: block; margin: auto;" />
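
---

## Facets on two variables with `facet_grid()`

A hedged sketch of the same idea for a continuous variable: putting one categorical variable on the rows and another on the columns of `facet_grid()` shows `exit_velocity` against both at once. The toy data frame below is simulated for illustration (the name `toy_batted_balls` and its values are assumptions, NOT the Trout dataset), so only the structure of the call carries over.

```r
library(ggplot2)

# Hypothetical toy data, NOT the Trout dataset: exit velocities are simulated
set.seed(2020)
toy_batted_balls <- data.frame(
  exit_velocity = rnorm(240, mean = 90, sd = 12),
  pitch_type = rep(c("Changeup", "Breaking ball", "Fastball"), times = 80),
  batted_ball_type = rep(c("fly_ball", "ground_ball"), each = 120)
)

# Rows = pitch_type, columns = batted_ball_type, one histogram per panel
facet_plot <- ggplot(toy_batted_balls, aes(x = exit_velocity)) +
  geom_histogram(bins = 20) +
  theme_bw() +
  facet_grid(pitch_type ~ batted_ball_type)

facet_plot
```

With the real data, swapping `toy_batted_balls` for `trout_batted_balls` should give the analogous grid, though sparse combinations (e.g. popups on changeups) would leave some panels nearly empty.
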