class: center, middle, inverse, title-slide # Data Visualization ## The grammar of graphics and ggplot2 ### June 8th, 2022 --- ## Do these [datasets](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) have anything in common? .center[![](https://d2f99xq7vri1nk.cloudfront.net/DataDino-600x455.gif)] --- ## __"Graphics reveal data"__- [Edward Tufte](https://www.edwardtufte.com/tufte/) .center[![](https://www.researchgate.net/profile/Arch_Woodside2/publication/285672900/figure/fig4/AS:305089983074309@1449750528742/Anscombes-quartet-of-different-XY-plots-of-four-data-sets-having-identical-averages.png)] __Always visualize your data__ before analyzing / modeling it --- ## Florence Nightingale's rose diagram <img src="https://daily.jstor.org/wp-content/uploads/2020/08/florence_nightingagle_data_visualization_visionary_1050x700.jpg" width="75%" style="display: block; margin: auto;" /> --- <img src="https://github.com/ryurko/SURE22-examples/blob/main/figures/lecture_examples/nyt_ex.png?raw=true" width="110%" style="display: block; margin: auto;" /> --- ## Quick review of yesterday's material .pull-left[ We are exploring MLB hitting data from [`Lahman`](https://cran.r-project.org/web/packages/Lahman/index.html) using the [`tidyverse`](https://www.tidyverse.org/) ```r library(tidyverse) library(Lahman) Batting <- as_tibble(Batting) year_batting_summary <- Batting %>% filter(lgID %in% c("AL", "NL")) %>% group_by(yearID) %>% summarize_at(vars(H, HR, SO, BB, AB), sum, na.rm = TRUE) %>% mutate(batting_avg = H / AB) ``` __But how do we make data visualizations in `R`?__ What steps do we take to make this figure? ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/year-batting-plot-1.png" width="504" /> ] --- ### [The Grammar of Graphics](https://www.amazon.com/gp/product/0387245448/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0387245448&linkCode=as2&tag=civilstatis-20) Principled data visualization framework introduced by [Leland Wilkinson](https://www.cs.uic.edu/~wilkinson/) - __start with the raw data__ -> visualize transformations, summaries, etc. of the data -- [Hadley Wickham](http://hadley.nz/) expanded upon this foundation with a [layered](http://vita.had.co.nz/papers/layered-grammar.pdf) `R` implementation via [`ggplot2`](https://ggplot2.tidyverse.org/) -- 1. __`data`__ - one or more datasets (in tidy tabular format) -- 2. __`geom`__ - one or more geometric objects to visually represent the data (e.g. points, lines, bars, etc.) -- 3. __`aes`__ - mappings of columns / variables to visual properties (i.e. _aesthetics_) of the geometric objects -- 4. __`scale`__ - one scale for each variable displayed (e.g. axis limits, log scale, colors, etc.) -- 5. __`facet`__ - similar subplots (i.e. _facets_) for subsets of the same data using a categorical variable -- 6. __`stat`__ - statistical transformations and summaries (e.g. identity, count, smooth, quantile, etc.) -- 7. __`coord`__ - one or more coordinate systems (e.g. __cartesian__, polar, map projection) -- 8. __`labs`__ - labels/guides for each variable and other parts of the plot (e.g. title, subtitle, caption, etc.) -- 9. __`theme`__ - customization of plot layout (e.g. text size, alignment, legend position, etc.) --- ## Start with the data... .pull-left[ ```r *ggplot(data = year_batting_summary) ``` __or equivalently using the `%>%`__ ```r year_batting_summary %>% ggplot() ``` __but nothing is displayed!__ ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-3-1.png" width="504" /> ] --- ## What variables and geometric object? .pull-left[ ```r year_batting_summary %>% ggplot() + * geom_point(aes(x = yearID, y = HR)) ``` - adding (`+`) a geometric layer of points to the plot - map `yearID` to the x-axis and `HR` to the y-axis via `aes()` - implicitly using `coord_cartesian()` ```r year_batting_summary %>% ggplot() + geom_point(aes(x = yearID, y = HR)) + * coord_cartesian() ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-4-1.png" width="504" /> ] --- ## Can we add another geometric layer? .pull-left[ ```r year_batting_summary %>% ggplot() + geom_point(aes(x = yearID, y = HR)) + * geom_line(aes(x = yearID, y = HR)) ``` - adding (`+`) a line geometric layer - Include mappings shared across geometric layers inside `ggplot()` ```r year_batting_summary %>% * ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-5-1.png" width="504" /> ] --- ## What about the scales? .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * scale_x_continuous() + * scale_y_continuous() ``` - `yearID` and `HR` are continuous variables, resulting in continuous scales by default ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-6-1.png" width="504" /> ] --- ## We can customize the scale limits .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * scale_x_continuous(limits = c(2000, 2018)) ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-7-1.png" width="504" /> ] --- ## We can customize the label breaks .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * scale_y_continuous(breaks = seq(0, 6000, * by = 1000)) ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ] --- ## We can use different scales! .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * scale_x_reverse() + * scale_y_log10() ``` __You can easily adjust variable scales without modifying the columns directly__ ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] --- ## Add a statistical summary? .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * stat_smooth() ``` - Smoothing regression summary (will cover later) using `yearID` and `HR` - Geometric layers implicitly use a default statistical summary - Technically we are already using `geom_point(stat = "identity")` ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point() + geom_line() + * geom_smooth() ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] --- ## Map additional variables? .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR, * color = SO, * size = BB)) + geom_point() + geom_line() ``` - `HR`, `SO`, and `BB` are all displayed! - `color` and `size` are being shared across layers - This is a bit odd to look at... ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-11-1.png" width="504" /> ] --- ## Customize mappings by layer .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + * geom_point(aes(color = SO, * size = BB)) + geom_line() ``` - Now mapping `SO` and `BB` to `color` and `size` of __only__ the point layer ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] --- ## Can change aesthetics without mapping variables .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point(aes(color = SO, size = BB)) + * geom_line(color = "darkred", * linetype = "dashed") ``` - Set manual values to the `color` and `linetype` of the line layer ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] --- ## Remember: one scale for each mapped variable .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point(aes(color = SO, size = BB)) + geom_line(color = "darkred", linetype = "dashed") + * scale_color_gradient(low = "darkblue", * high = "darkorange") + * scale_size_continuous(breaks = * seq(0, 20000, * 2500)) ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-14-1.png" width="504" /> ] --- ## You MUST label your plots! .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point(aes(color = SO, size = BB)) + geom_line(color = "darkred", linetype = "dashed") + scale_color_gradient(low = "darkblue", high = "darkorange") + * labs(x = "Year", y = "Homeruns", * color = "Strikeouts", * size = "Walks", * title = "The rise of MLB's three true outcomes", * caption = "Data courtesy of Lahman") ``` - Each mapped aesthetic can be labelled (_what happened to the legend order?_) ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] --- ## Custom theme .pull-left[ ```r year_batting_summary %>% ggplot(aes(x = yearID, y = HR)) + geom_point(aes(color = SO, size = BB)) + geom_line(color = "darkred", linetype = "dashed") + scale_color_gradient(low = "darkblue", high = "darkorange") + labs(x = "Year", y = "Homeruns", color = "Strikeouts", size = "Walks", title = "The rise of MLB's three true outcomes", caption = "Data courtesy of Lahman") + * theme_bw() + * theme(legend.position = "bottom", * plot.title = element_text(size = 15), * axis.title = element_text(size = 10)) ``` - `theme_bw()` is a popular default, but check out [`ggthemes`](https://jrnold.github.io/ggthemes/index.html) for more, ex: `theme_fivethirtyeight()` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] --- ## A lesson about data visualization... .center[![](https://media1.tenor.com/images/02e615bbff46a6f4ed05735360532142/tenor.gif?itemid=4683433)] -- __Simpler is better__ - instead create three separate plots for `HR`, `SO`, and `BB` with each mapped to `y` -- _How do we do this without repeating the same code three times?_ --- ## Pivot the data! __Remember__: we need the data in tidy format -- Within the `tidyverse`, the [`tidyr`](https://tidyr.tidyverse.org/index.html) package has functions that allow us to easily _realign_ our data - [`pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html) - gathers information spread out across variables, increase `nrow()` but decrease `ncol()` - [`pivot_wider`](https://tidyr.tidyverse.org/reference/pivot_wider.html) - spreads information out from observations, decrease `nrow()` but increase `ncol()` -- .pull-left[ ```r year_batting_summary %>% select(yearID, HR, SO, BB) %>% rename(HRs = HR, Strikeouts = SO, Walks = BB) %>% * pivot_longer(HRs:Walks, * names_to = "stat", * values_to = "value") ``` We have now created a new variable `stat` ] .pull-right[ ``` ## # A tibble: 6 x 3 ## yearID stat value ## <int> <chr> <int> ## 1 1876 HRs 40 ## 2 1876 Strikeouts 589 ## 3 1876 Walks 336 ## 4 1877 HRs 24 ## 5 1877 Strikeouts 726 ## 6 1877 Walks 345 ``` ] --- ## Use facets to create subplots with categorical variables .pull-left[ ```r year_batting_summary %>% select(yearID, HR, SO, BB) %>% rename(HRs = HR, Strikeouts = SO, Walks = BB) %>% pivot_longer(HRs:Walks, names_to = "stat", values_to = "value") %>% ggplot(aes(x = yearID, y = value)) + geom_line(color = "darkblue") + geom_point(color = "darkblue") + * facet_wrap(~ stat, * scales = "free_y", ncol = 1) + labs(x = "Year", y = "Total of statistic", title = "The rise of MLB's three true outcomes", caption = "Data courtesy of Lahman") + theme_bw() + theme(strip.background = element_blank()) ``` ] .pull-right[ <img src="02-Ggplot-intro_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ]