Statistical Computing, 36-350
Monday November 4, 2019
%>% operator)dplyr is a package for data wrangling, with several key verbs (functions)slice() and filter(): subset rows based on numbers or conditionsselect() and pull(): select columns or a single column as a vectorarrange(): order rows by one or multiple columnsrename() and mutate(): rename or create columnsmutate_at(): apply a function to given columns%>%)Tidyverse functions are at their best when composed together using the pipe (%>%) operator!
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2Shortcut: use ctrl + shift + m in RStudio as a shortcut for typing %>%
Mastering the tidyr verbs
tidyr verbsOur tidyr journey starts of with learning the following verbs (functions):
pivot_longer(): make “wide” data longerpivot_wider(): make “long” data widerseparate(): split a single column into multiple columnsunite(): combine multiple columns into a single columnKey takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them—especially natural when using pipes
pivot_longer()Use pivot_longer() to make “wide” data longer:
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases %>% 
  head(3)##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000## # A tibble: 5 x 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000year columnntidyr did all the heavy lifting of the transposing work# Different approach to do the same thing
EDAWR::cases %>% 
  pivot_longer(names_to = "year", values_to = "n", -country) %>% 
  head(5)## # A tibble: 5 x 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000pivot_wider()Use pivot_wider() to make “long” data wider:
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121## # A tibble: 3 x 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121    56tidyr did all the heavy lifting againpivot_wider() and pivot_longer() are inversesggplot (pivot_longer)pivot_wider)Additionally, if you find yourself getting stuck (with nuance situations) they are more complicated functions like pivot_wider_spec, etc. for these cases (see Manual Specs)
separate()Use separate() to split a single column into multiple ones:
##     storm wind pressure       date
## 1 Alberto  110     1007 2000-08-03
## 2    Alex   45     1009 1998-07-27
## 3 Allison   65     1005 1995-06-03unite()Use unite() to combine multiple columns into a single column:
## # A tibble: 6 x 4
##   storm    wind pressure date      
##   <chr>   <int>    <int> <chr>     
## 1 Alberto   110     1007 2000-08-03
## 2 Alex       45     1009 1998-07-27
## 3 Allison    65     1005 1995-06-03
## 4 Ana        40     1013 1997-06-30
## 5 Arlene     50     1010 1999-06-11
## 6 Arthur     45     1010 1996-06-17Note that unite() and separate() are inverse operations
dplyr: group_by() and summarize()
dplyr and SQLdplyr you should find SQL very natural, and vice versa!group_by() and summarize(), which are used to aggregate data (now)left_join() and inner_join() verbs (shortly)group_by()Use group_by() to define a grouping of rows based on a column:
## # A tibble: 4 x 11
## # Groups:   cyl [2]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"dplyr verbs actsummarize()Use summarise() (or summarize() for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:
##        mpg       hp
## 1 20.09062 146.6875# Grouped by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg),
            hp = mean(hp))## # A tibble: 3 x 3
##     cyl   mpg    hp
##   <dbl> <dbl> <dbl>
## 1     4  26.7  82.6
## 2     6  19.7 122. 
## 3     8  15.1 209.mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_mean = mean(mpg),
            mpg_max = max(mpg),
            hp_mean = mean(hp),
            hp_max = max(hp))## # A tibble: 3 x 5
##     cyl mpg_mean mpg_max hp_mean hp_max
##   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
## 1     4     26.7    33.9    82.6    113
## 2     6     19.7    21.4   122.     175
## 3     8     15.1    19.2   209.     335ungroup()Use ungroup() to remove groupings structure from a data frame:
## # A tibble: 1 x 2
##      hp   mpg
##   <dbl> <dbl>
## 1  147.  20.1summarize() (bonus material)summarize() in dplyr only allows you to calculate single numerical summaries, but you could image that you want more than that.
For example:
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52.00   65.50   91.00   82.64   96.00  113.00 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   105.0   110.0   110.0   122.3   123.0   175.0 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   150.0   176.2   192.5   209.2   241.2   335.0There are tools to do this in tidyverse, specifically nest/unnesting data into columns and purrr::map function. (please see the first section of this lecture (Rmd, html) for some notes). These are pretty useful if you are using a lot of the tidyverse paradigm
mtcars %>%
  group_by(cyl) %>%
  nest() %>% # creates a column with the data conditional of subset
  mutate(sum = purrr::map(data, function(df) summary(df$hp)),
         sum_df = purrr::map(sum, broom::tidy)) %>% # unravel things to be data.frames
  select(cyl, sum_df) %>%
  unnest(cols = sum_df) ## # A tibble: 3 x 7
## # Groups:   cyl [3]
##     cyl minimum    q1 median  mean    q3 maximum
##   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     6     105 110     110  122.   123      175
## 2     4      52  65.5    91   82.6   96      113
## 3     8     150 176.    192. 209.   241.     335dplyr: left_join() and inner_join()
A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:
Column values that cannot be filled in are assigned NA values
It helps to visualize the join types:
 
 
 
has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"), 
                  children = c(3, 1, 3),
                  stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
                  age = c(38, 54, 34),
                  stringsAsFactors = FALSE)
has_kids_tab1##                 name children
## 1  Robert Downey, Jr        3
## 2 Scarlett Johansson        1
## 3    Chris Hemsworth        3##                 name age
## 1        Chris Evans  38
## 2  Robert Downey, Jr  54
## 3 Scarlett Johansson  34inner_join()Suppose we want to join tab1 and tab2 by name, but keep only actors in intersection (aka in both tables):
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34left_join()Suppose we want to join tab1 and tab2 by name, but keep all actors from tab1:
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NAright_join()Suppose we want to join tab1 and tab2 by name, but keep all actors from tab2:
##                 name children age
## 1        Chris Evans       NA  38
## 2  Robert Downey, Jr        3  54
## 3 Scarlett Johansson        1  34full_join()Finally, suppose we want to join tab1 and tab2 by name, and keep all actors from both:
##                 name children age
## 1  Robert Downey, Jr        3  54
## 2 Scarlett Johansson        1  34
## 3    Chris Hemsworth        3  NA
## 4        Chris Evans       NA  38##    pol gender state  IQ
## 1    D      F    PA  85
## 2    D      F    PA  94
## 3    R      F    PA 100
## 4    D      F    AZ 103
## 5    R      M    AZ  74
## 6    D      M    PA 110
## 7    D      M    PA  98
## 8    D      F    PA 107
## 9    D      F    PA  88
## 10   R      M    PA  97## [1] 10  4##          senator pol gender STATE
## 1 Kyrsten Sinema   D      F    AZ
## 2 Martha McSally   R      F    AZ
## 3     Pat Toomey   R      M    PA
## 4  Boy Casey Jr.   D      M    PA## [1] 4 4%T>% is a special pipe that passes my_peeps into print() as a “side-effect” and then also passes my_peeps onto the rest of the chain (which in this case is just dim())##   pol gender.x state  IQ        senator gender.y
## 1   D        F    PA  85  Boy Casey Jr.        M
## 2   D        F    PA  94  Boy Casey Jr.        M
## 3   R        F    PA 100     Pat Toomey        M
## 4   D        F    AZ 103 Kyrsten Sinema        F
## 5   R        M    AZ  74 Martha McSally        F
## 6   D        M    PA 110  Boy Casey Jr.        MNotice:
gender, not used in “by”, becomes gender.x and gender.y indicating columns for the first (x) and second (y) data framestidyr is a package for manipulating the structure of data framespivot_longer(): make wide data longerpivot_wider(): make long data widerunite() and separate(): combine or split columnsdplyr has advanced functionality that mirrors SQLgroup_by(): create groups of rows according to a conditionsummarize(): apply computations across groups of rows*_join() where * = inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keepnest(), map(), …