class: center, middle, inverse, title-slide # Data Visualization ## Density estimation ### June 13th, 2022 --- ## New dataset - Stephen Curry's shots Created dataset of shot attempts by the Stephen Curry in 2021-2022 season using [`nbastatR`](http://asbcllc.com/nbastatR/) ```r library(tidyverse) curry_shots <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/xy_examples/curry_2022_shots.csv") head(curry_shots) ``` ``` ## # A tibble: 6 x 8 ## shot_x shot_y shot_distance is_shot_made period fg_type shot_zone shot_type ## <dbl> <dbl> <dbl> <lgl> <dbl> <chr> <chr> <chr> ## 1 -109 260 28 FALSE 1 3PT Fiel… Above the… Pullup Jump … ## 2 48 257 26 FALSE 1 3PT Fiel… Above the… Running Pull… ## 3 -165 189 25 TRUE 1 3PT Fiel… Above the… Jump Shot ## 4 -13 12 1 FALSE 1 2PT Fiel… Restricte… Driving Fing… ## 5 -15 22 2 FALSE 1 2PT Fiel… Restricte… Layup Shot ## 6 18 16 2 FALSE 1 2PT Fiel… Restricte… Driving Layu… ``` - each row / observation is a shot attempt by Curry in the 2021 season - __Categorical__ / qualitative variables: `is_shot_made`, `fg_type`, `shot_zone`, `shot_type` - __Continuous__ / quantitative variables: `shot_x`, `shot_y`, `shot_distance` --- ## Back to [histograms](https://ggplot2.tidyverse.org/reference/geom_histogram.html)... -- .pull-left[ ```r fd_bw <- 2 * IQR(curry_shots$shot_distance) / length(curry_shots$shot_distance)^(1/3) curry_shots %>% ggplot(aes(x = shot_distance)) + geom_histogram(binwidth = fd_bw) + theme_bw() ``` - Split observed data into __bins__ - __Count__ number of observations in each bin __Need to choose the number of bins__, adjust with: - `bins` - number of bins (default is 30) - `binwidth` - literally the width of bins (overrides `bins`), various [rules of thumb](https://en.wikipedia.org/wiki/Histogram) - e.g., see `fd_bw` for [Freedman–Diaconis rule](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule) - `breaks` - vector of bin boundaries (overrides both `bins` and `binwidth`) ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-1-1.png" width="504" /> ] --- ## Adjusting the bin width .pull-left[ __Small__ `binwidth` `\(\rightarrow\)` _"undersmooth"_ / spiky ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_histogram(binwidth = 1) + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/shot-dist-hist-small-1.png" width="504" /> ] .pull-right[ __Large__ `binwidth` `\(\rightarrow\)` _"oversmooth"_ / flat ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_histogram(binwidth = 25) + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/shot-dist-hist-large-1.png" width="504" /> ] __Try several approaches, the `R` / `ggplot2` default is NOT guaranteed to be an optimal choice__ --- ### A subtle point about the histogram code... .pull-left[ By default the bins are centered on the integers... - left-closed, right-open intervals - starting at -0.5 to 0.5, 0.5 to 1.5, ... ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_histogram(binwidth = 1) + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/unnamed-chunk-2-1.png" width="504" /> ] .pull-right[ __Specify center of one bin__ (e.g. 0.5) - Reminder to use `closed = "left"`... ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_histogram(binwidth = 1, center = 0.5, * closed = "left") + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/shot-dist-hist-shift-1.png" width="504" /> ] --- ### How do histograms relate to the PDF and CDF? __Remember__: we use the __probability density function (PDF)__ to provide a __relative likelihood__ -- - PDF is the __derivative__ of the cumulative distribution function (CDF) -- - Histograms approximate the PDF with bins, and __points are equally likely within a bin__ .pull-left[ <img src="05-Dens-estim_files/figure-html/shot-dist-hist-left-1.png" width="504" /> ] .pull-right[ <img src="05-Dens-estim_files/figure-html/shot-dist-ecdf-right-1.png" width="504" /> ] -- __What can say about the relative likelihood of data we have not observed?__ - we want __non-zero density__ between our observations, e.g., just beyond 20 feet --- ## Kernel density estimation __Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth) -- $$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$ -- - `\(n =\)` sample size, `\(x =\)` new point to estimate `\(f(x)\)` (does NOT have to be in dataset!) -- - `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1 - `\(x_i =\)` `\(i\)`th observation in dataset -- - `\(K_h(x - x_i)\)` is the __Kernel__ function, creates __weight__ given distance of `\(i\)`th observation from new point - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. further apart `\(i\)`th row is from `\(x\)`, smaller the weight - as __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` more concentrated around `\(x\)`) - typically use [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)` - `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)` --- ## [Wikipedia example](https://en.wikipedia.org/wiki/Kernel_density_estimation) .center[![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Comparison_of_1D_histogram_and_KDE.png/1000px-Comparison_of_1D_histogram_and_KDE.png)] --- ## How do we compute and display the density estimate? .pull-left[ - We make __kernel density estimates__ with [`geom_density()`](https://ggplot2.tidyverse.org/reference/geom_density.html) ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_density() + geom_rug(alpha = 0.3) + theme_bw() ``` - __Pros__: - Displays full shape of distribution - Can easily layer - Add categorical variable with color - __Cons__: - Need to pick bandwidth and kernel... ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-3-1.png" width="504" /> ] --- ## What about the bandwidth? See [Chapter 14 for more...](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/) Use __Gaussian reference rule__ (_rule-of-thumb_) `\(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\)`, where `\(\sigma\)` is the observed standard deviation Modify the bandwidth using the `adjust` argument - __value to multiply default bandwidth by__ .pull-left[ ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_density(adjust = 0.5) + geom_rug(alpha = 0.3) + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/curve-noisy-1.png" width="504" /> ] .pull-right[ ```r curry_shots %>% ggplot(aes(x = shot_distance)) + * geom_density(adjust = 2) + geom_rug(alpha = 0.3) + theme_bw() ``` <img src="05-Dens-estim_files/figure-html/curve-smooth-1.png" width="504" /> ] --- ## Use density curves and ECDFs together <img src="05-Dens-estim_files/figure-html/shot-dist-curve-ecdf-1.png" width="1152" style="display: block; margin: auto;" /> --- ## Code interlude: easy way to arrange multiple figures Use the new [`patchwork`](https://patchwork.data-imaginist.com/index.html) package to easily arrange your plots (see also [`cowplot`](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html)) ```r *library(patchwork) curry_shot_dens <- curry_shots %>% ggplot(aes(x = shot_distance)) + geom_density() + geom_rug(alpha = 0.3) + theme_bw() + labs(x = "Shot distance (in feet)", y = "Number of shot attempts") curry_shot_ecdf <- curry_shots %>% ggplot(aes(x = shot_distance)) + stat_ecdf() + geom_rug(alpha = 0.3) + theme_bw() + labs(x = "Shot distance (in feet)", y = "Proportion of Curry shot attempts") *curry_shot_dens + curry_shot_ecdf ``` --- ## Use density curves and ECDFs together <img src="05-Dens-estim_files/figure-html/shot-dist-curve-ecdf-color-1.png" width="1152" style="display: block; margin: auto;" /> --- ## Another code interlude: [collect the legends](https://patchwork.data-imaginist.com/articles/guides/layout.html#controlling-guides) ```r curry_shot_dens_made <- curry_shots %>% ggplot(aes(x = shot_distance, color = is_shot_made)) + geom_density() + geom_rug(alpha = 0.3) + theme_bw() + labs(x = "Shot distance (in feet)", y = "Number of shot attempts") curry_shot_ecdf_made <- curry_shots %>% ggplot(aes(x = shot_distance, color = is_shot_made)) + stat_ecdf() + geom_rug(alpha = 0.3) + theme_bw() + labs(x = "Shot distance (in feet)", y = "Proportion of Curry shot attempts") *curry_shot_dens_made + curry_shot_ecdf_made + plot_layout(guides = 'collect') ``` --- ## Alternative to violins - ridge plots .pull-left[ - Check out the [`ggridges`](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html) package for a variety of customization options ```r library(ggridges) curry_shots %>% ggplot(aes(x = shot_distance, * y = shot_type)) + * geom_density_ridges(rel_min_height = 0.01) + theme_bw() ``` - Useful to display conditional distributions across many levels ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-6-1.png" width="504" /> ] --- ## What about for 2D? (two continuous variables) We can visualize all of the shot locations: (`shot_x`, `shot_y`) .pull-left[ ```r curry_shots %>% # Modify the shot coordinates mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% * ggplot(aes(x = shot_x, y = shot_y)) + * geom_point(alpha = 0.3) + theme_bw() ``` - Adjust transparency with `alpha` for overlapping points ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-7-1.png" width="504" /> ] --- ## Create contours of 2D kernel density estimate (KDE) .pull-left[ - We make 2D KDE __contour__ plots using [`geom_density2d()`](https://ggplot2.tidyverse.org/reference/geom_density_2d.html) ```r curry_shots %>% # Modify the shot coordinates mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y)) + geom_point(alpha = 0.3) + * geom_density2d() + theme_bw() + theme(legend.position = "bottom") + * coord_fixed() ``` - Extend KDE for joint density estimates in 2D (see [section 14.4.2 for details](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/)) - `coord_fixed()` forced a fixed ratio ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ] --- ## Create contours of 2D kernel density estimate (KDE) .pull-left[ - We make 2D KDE __contour__ plots using [`geom_density2d()`](https://ggplot2.tidyverse.org/reference/geom_density_2d.html) ```r curry_shots %>% # Modify the shot coordinates mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% # Remove the outlier shots: filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y)) + geom_point(alpha = 0.3) + * geom_density2d(adjust = 0.1) + theme_bw() + theme(legend.position = "bottom") + coord_fixed() ``` - Can use `adjust` to modify the multivariate bandwidth ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] --- ## Contours are difficult... let's make a heatmap instead .pull-left[ - We make 2D KDE __heatmap__ plots using [`stat_density_2d()`](https://ggplot2.tidyverse.org/reference/geom_density_2d.html) and the `..` or [`after_stat()`](https://ggplot2.tidyverse.org/reference/aes_eval.html) function ```r curry_shots %>% mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y)) + * stat_density2d(h = 0.5, bins = 60, * aes(fill = after_stat(level)), * geom = "polygon") + * scale_fill_gradient(low = "darkblue", * high = "darkorange") + theme_bw() + theme(legend.position = "bottom") + coord_fixed() ``` __Multivariate density estimation can be difficult__ ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] --- ## Turn off contours and use tiles instead .pull-left[ - We make 2D KDE __heatmap__ plots using [`stat_density_2d()`](https://ggplot2.tidyverse.org/reference/geom_density_2d.html) and the `..` or [`after_stat()`](https://ggplot2.tidyverse.org/reference/aes_eval.html) function ```r curry_shots %>% mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y)) + stat_density2d(h = 0.5, bins = 60, * contour = FALSE, * aes(fill = after_stat(density)), * geom = "raster") + * scale_fill_gradient(low = "darkblue", * high = "darkorange") + theme_bw() + theme(legend.position = "bottom") + coord_fixed() ``` ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-11-1.png" width="504" /> ] --- ## Best alternative? Hexagonal binning .pull-left[ - We make __hexagonal heatmap__ plots using [`geom_hex()`](https://ggplot2.tidyverse.org/reference/geom_hex.html) - Need to have the [`hexbin`](https://cran.r-project.org/web/packages/hexbin/index.html) package installed ```r curry_shots %>% mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y)) + * geom_hex(binwidth = c(1, 1)) + scale_fill_gradient(low = "darkblue", high = "darkorange") + theme_bw() + theme(legend.position = "bottom") + coord_fixed() ``` - Can specify `binwidth` in both directions - Avoids limitations from smoothing ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] --- ## What about his shooting efficiency? - Can compute a function of another variable inside hexagons with [`stat_summary_hex()`](https://ggplot2.tidyverse.org/reference/stat_summary_2d.html) - Check out [BallR](https://github.com/toddwschneider/ballr) for code examples to make shot charts and drawing courts .pull-left[ ```r curry_shots %>% mutate(shot_x = -shot_x / 10, shot_y = shot_y / 10) %>% filter(shot_y <= 30) %>% ggplot(aes(x = shot_x, y = shot_y, * z = is_shot_made, * group = -1)) + * stat_summary_hex(binwidth = c(2, 2), * color = "black", * fun = mean) + scale_fill_gradient(low = "darkblue", high = "darkorange") + theme_bw() + theme(legend.position = "bottom") + coord_fixed() ``` ] .pull-right[ <img src="05-Dens-estim_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ]