class: center, middle, inverse, title-slide # Exploring data ## Into the tidyverse ### June 7th, 2022 --- ## Data Science workflow According to [Hadley Wickham](http://hadley.nz/) in [`R` for Data Science](https://r4ds.had.co.nz/): .center[![](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png)] -- __First two weeks__: data wrangling and visualization -- Aspects of data __wrangling__: -- - __import__: reading in data (e.g. `read_csv()`) -- - __tidy__: rows = observations, columns = variables (i.e. __tabular__ data) -- - __transform__: filter observations, create new variables, summarize, etc. --- ## What is Exploratory Data Analysis (EDA)? -- _(broadly speaking)_ EDA = questions about data + wrangling + visualization -- [`R` for Data Science](https://r4ds.had.co.nz/): _"EDA is a state of mind"_, an iterative cycle: - generate questions - answer via transformations and visualizations -- Example of questions? -- - What type of __variation__ do the variables display? - What type of __relationships__ exist between variables? -- __Goal__: develop understanding and become familiar with your data -- EDA is __NOT__ a replacement for statistical inference and learning EDA is an __important__ and __necessary__ step to build intuition -- Now for an example... --- ## Exploring MLB batting statistics __Import__ `Batting` table of historical MLB statistics from the [`Lahman` package](https://cran.r-project.org/web/packages/Lahman/index.html), explore using the [`tidyverse`](https://www.tidyverse.org/) ```r library(tidyverse) # Load the tidyverse suite of packages library(Lahman) # Load the Lahman package to access its datasets Batting <- as_tibble(Batting) # Initialize the Batting dataset ``` -- Basic info about the `Batting` dataset: ```r dim(Batting) # displays same info as c(nrow(Batting), ncol(Batting)) ``` ``` ## [1] 110495 22 ``` ```r class(Batting) ``` ``` ## [1] "tbl_df" "tbl" "data.frame" ``` -- `tbl` (pronounced `tibble`) is the `tidyverse` way of storing tabular data, like a spreadsheet or `data.frame` --- __Always look at your data__: view the first 6 (by default) rows with `head()` ```r head(Batting) # Try just typing Batting into your console, what happens? ``` ``` ## # A tibble: 6 x 22 ## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI ## <chr> <int> <int> <fct> <fct> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 ## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 ## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 ## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 ## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 ## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 ## # … with 9 more variables: SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>, ## # HBP <int>, SH <int>, SF <int>, GIDP <int> ``` Is our `Batting` dataset __tidy__? -- - Each row = a player's season stint with a team (i.e. players can play for multiple teams in year) - Each column = different measurement or recording about the player-team-season observation (can print out column names directly with `colnames(Batting)` or `names(Batting)`) -- __Can we explore how baseball has changed over time with `Batting`?__ --- ## Let the data wrangling begin... __Summarize__ _continuous_ (e.g. `yearID`, `AB`) and _categorical_ (e.g. `teamID`, `lgID`) variables in different ways -- Compute __summary statistics__ for _continuous_ variables with the `summary()` function: ```r summary(Batting$yearID) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1871 1938 1977 1968 2002 2021 ``` -- Compute __counts__ of _categorical_ variables with `table()` function: ```r table("Leagues" = Batting$lgID) # be careful it ignores NA values! ``` ``` ## Leagues ## AA AL FL NA NL PL UA ## 1893 50965 472 737 55945 149 334 ``` -- _How do we remove the other leagues?_ --- [`dplyr`](https://dplyr.tidyverse.org/) is a package within the `tidyverse` with functions for data wrangling _"Grammar of data manipulation"_: `dplyr` functions are __verbs__, datasets are __nouns__ -- - __We can [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) our dataset to choose observations meeting conditions__ ```r mlb_batting <- filter(Batting, lgID %in% c("AL", "NL")) nrow(Batting) - nrow(mlb_batting) # Difference in rows ``` ``` ## [1] 3585 ``` -- - __We can [`select()`](https://dplyr.tidyverse.org/reference/select.html) variables of interest__ ```r sel_batting <- select(Batting, yearID, lgID, G, AB, R, H, HR, BB, SO) head(sel_batting, n = 3) ``` ``` ## # A tibble: 3 x 9 ## yearID lgID G AB R H HR BB SO ## <int> <fct> <int> <int> <int> <int> <int> <int> <int> ## 1 1871 NA 1 4 0 0 0 0 0 ## 2 1871 NA 25 118 30 32 0 4 0 ## 3 1871 NA 29 137 28 40 0 2 5 ``` --- - __We can [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) our dataset to sort observations by variables__ ```r hr_batting <- arrange(Batting, desc(HR)) # use desc() for descending order head(hr_batting, n = 3) ``` ``` ## # A tibble: 3 x 22 ## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI ## <chr> <int> <int> <fct> <fct> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 bondsba01 2001 1 SFN NL 153 476 129 156 32 2 73 137 ## 2 mcgwima01 1998 1 SLN NL 155 509 130 152 21 0 70 147 ## 3 sosasa01 1998 1 CHN NL 159 643 134 198 20 0 66 158 ## # … with 9 more variables: SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>, ## # HBP <int>, SH <int>, SF <int>, GIDP <int> ``` -- - __We can [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html) our dataset to one row based on functions of variables__ ```r summarize(Batting, max(stint), median(AB)) ``` ``` ## # A tibble: 1 x 2 ## `max(stint)` `median(AB)` ## <int> <int> ## 1 5 45 ``` --- - __We can [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) our dataset to create new variables__ (mutate is a weird name...) ```r new_batting <- mutate(Batting, batting_avg = H / AB, so_to_bb = SO / BB) head(new_batting, n = 1) ``` ``` ## # A tibble: 1 x 24 ## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI ## <chr> <int> <int> <fct> <fct> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 ## # … with 11 more variables: SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>, ## # HBP <int>, SH <int>, SF <int>, GIDP <int>, batting_avg <dbl>, so_to_bb <dbl> ``` -- How do we perform several of these actions? ```r head(arrange(select(mutate(Batting, BA = H / AB), playerID, BA), desc(BA)), n = 1) ``` ``` ## # A tibble: 1 x 2 ## playerID BA ## <chr> <dbl> ## 1 snowch01 1 ``` That's awfully annoying to do, and also difficult to read... --- ## Enter the pipeline The `%>%` (_pipe_) operator is used in the `tidyverse` (from [`magrittr`](https://magrittr.tidyverse.org/articles/magrittr.html)) to chain commands together `%>%` directs the __data analyis pipeline__: output of one function pipes into input of the next function ```r Batting %>% filter(lgID %in% c("AL", "NL"), AB > 300) %>% mutate(batting_avg = H / AB) %>% arrange(desc(batting_avg)) %>% select(playerID, yearID, batting_avg) %>% head(n = 5) ``` ``` ## # A tibble: 5 x 3 ## playerID yearID batting_avg ## <chr> <int> <dbl> ## 1 duffyhu01 1894 0.440 ## 2 barnero01 1876 0.429 ## 3 lajoina01 1901 0.426 ## 4 keelewi01 1897 0.424 ## 5 hornsro01 1924 0.424 ``` --- ## More pipeline actions! Instead of `head()`, __we can [`slice()`](https://dplyr.tidyverse.org/reference/slice.html) our dataset to choose the observations based on the position__ ```r Batting %>% filter(lgID %in% c("AL", "NL"), AB > 300) %>% mutate(so_to_bb = SO / BB) %>% arrange(so_to_bb) %>% select(playerID, yearID, so_to_bb) %>% slice(c(1, 2, 10, 100)) ``` ``` ## # A tibble: 4 x 3 ## playerID yearID so_to_bb ## <chr> <int> <dbl> ## 1 roweja01 1882 0 ## 2 seweljo01 1932 0.0536 ## 3 holloch01 1922 0.0862 ## 4 collied01 1918 0.178 ``` --- ## Grouped operations __We [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) to split our dataset into groups based on a variable's values__ ```r Batting %>% filter(lgID %in% c("AL", "NL")) %>% group_by(yearID) %>% summarize(hr = sum(HR), so = sum(SO), bb = sum(BB)) %>% arrange(desc(hr)) %>% slice(1:5) ``` ``` ## # A tibble: 5 x 4 ## yearID hr so bb ## <int> <int> <int> <int> ## 1 2019 6776 42823 15895 ## 2 2017 6105 40104 15829 ## 3 2021 5944 42145 15794 ## 4 2000 5693 31356 18237 ## 5 2016 5610 38982 15088 ``` `group_by()` is only useful in a pipeline (e.g. with `summarize()`), and pay attention to its behavior `ungroup()` can solve your problems afterwards --- ## Putting it all together... We'll create a __tidy__ dataset where each row = a year with the following variables: - total HRs (homeruns), SOs (strikeouts), and BBs (walks) - year's BA = total H / total AB - only want AL and NL leagues ```r year_batting_summary <- Batting %>% filter(lgID %in% c("AL", "NL")) %>% group_by(yearID) %>% summarize(total_hits = sum(H, na.rm = TRUE), total_hrs = sum(HR, na.rm = TRUE), total_sos = sum(SO, na.rm = TRUE), total_walks = sum(BB, na.rm = TRUE), total_atbats = sum(AB, na.rm = TRUE)) %>% mutate(batting_avg = total_hits / total_atbats) head(year_batting_summary, n = 2) ``` ``` ## # A tibble: 2 x 7 ## yearID total_hits total_hrs total_sos total_walks total_atbats batting_avg ## <int> <int> <int> <int> <int> <int> <dbl> ## 1 1876 5338 40 589 336 20121 0.265 ## 2 1877 3705 24 726 345 13667 0.271 ``` --- Top three years with the most HRs? -- ```r year_batting_summary %>% arrange(desc(total_hrs)) %>% slice(1:3) ``` ``` ## # A tibble: 3 x 7 ## yearID total_hits total_hrs total_sos total_walks total_atbats batting_avg ## <int> <int> <int> <int> <int> <int> <dbl> ## 1 2019 42039 6776 42823 15895 166651 0.252 ## 2 2017 42215 6105 40104 15829 165567 0.255 ## 3 2021 39484 5944 42145 15794 161941 0.244 ``` -- Top three years with highest batting average? -- ```r year_batting_summary %>% arrange(desc(batting_avg)) %>% slice(1:3) ``` ``` ## # A tibble: 3 x 7 ## yearID total_hits total_hrs total_sos total_walks total_atbats batting_avg ## <int> <int> <int> <int> <int> <int> <dbl> ## 1 1894 17809 629 3333 5870 57577 0.309 ## 2 1895 16827 488 3621 5120 56788 0.296 ## 3 1930 25597 1565 7934 7654 86571 0.296 ``` --- Best and worst strikeout to walk ratios? -- ```r year_batting_summary %>% mutate(so_to_bb = total_sos / total_walks) %>% arrange(so_to_bb) %>% slice(c(1, n())) ``` ``` ## # A tibble: 2 x 8 ## yearID total_hits total_hrs total_sos total_walks total_atbats batting_avg so_to_bb ## <int> <int> <int> <int> <int> <int> <dbl> <dbl> ## 1 1893 15913 460 3341 6143 56898 0.280 0.544 ## 2 1879 6171 58 1843 508 24155 0.255 3.63 ``` -- _We can make better looking tables..._ -- __[`rename()`](https://dplyr.tidyverse.org/reference/select.html) variables in our dataset__ ```r year_batting_summary %>% select(yearID, batting_avg) %>% rename(Year = yearID, `Batting AVG` = batting_avg) %>% slice(c(1, n())) ``` ``` ## # A tibble: 2 x 2 ## Year `Batting AVG` ## <int> <dbl> ## 1 1876 0.265 ## 2 2021 0.244 ``` --- ## Grammar of tables preview We can go one step further - __and use the new [`gt`](https://gt.rstudio.com/articles/intro-creating-gt-tables.html) package__ to create a nice-looking table for presentation .pull-left[ ```r library(gt) year_batting_summary %>% select(yearID, batting_avg) %>% rename(Year = yearID, `Batting AVG` = batting_avg) %>% arrange(desc(`Batting AVG`)) %>% slice(c(1:3, (n()-2):n())) %>% gt() %>% tab_header( title = "Best / worst MLB Seasons by AVG", subtitle = "Top / bottom three are presented" ) ``` ] .pull-right[
Best / worst MLB Seasons by AVG
Top / bottom three are presented
Year
Batting AVG
1894
0.3093075
1895
0.2963126
1930
0.2956764
1908
0.2389593
1888
0.2387601
1968
0.2366924
] _Note the `gt` display is different in these slides due to the [`xaringan`](https://github.com/yihui/xaringan) package formatting_ -- __Enough with tables!__ --- ## Data visualization .pull-left[ _"The simple graph has brought more information to the data analyst’s mind than any other device."_ — Tukey - __TOMORROW__: the __grammar of graphics__ - Use [`ggplot2`](https://ggplot2.tidyverse.org/) to visually explore our data - More intuitive than base `R` plotting! - Will walkthrough different types of visualizations for 1D, 2D, continuous, categorical, facetting, etc. - `tidyverse` verbs and `%>%` leads to natural pipeline for EDA ] .pull-right[ <img src="01-Exploring-data_files/figure-html/year-batting-plot-1.png" width="504" /> ]