Last week: Purrr and a bit of dplyr

The tidyverse is a collection of packages for common data science tasks
purrr is one such package that provides a consistent family of iteration functions
map(): list in, list out
map_dbl(), map_lgl(), map_chr(): list in, vector out (of a particular data type)
map_dfr(), map_dfc(): list in, data frame out (row-binded or column-binded)
dplyr is another such package that provides functions for data frame computations
filter(): subset rows based on a condition
group_by(): define groups of rows according to a condition
summarize(): apply computations across groups of rows

Part I

Motivation: tidyverse, revisited

What is the tidyverse?

The tidyverse is a coherent collection of packages in R for data science (and tidyverse is itself a actually package that loads all its constituent packages). Packages include:

Data wrangling: dplyr, tidyr, readr
Iteration: purrr
Visualization: ggplot2

Last week we covered purrr and a bit of dplyr. This week we’ll do more dplyr, and some tidyr. (Many of you will learn ggplot2 in Statistical Graphics 36-315)

Loading the tidyverse so that we can get all this functionality (plus more):

library(tidyverse)

Why the tidyverse?

Packages have a very consistent API
Very active developer and user community
Function names and commands follow a focused grammar
Powerful and fast when working with data frames and lists (matrices, not so much, yet!)
Pipes (%>% operator) allows us to fluidly glue functionality together
At its best, tidyverse code can be read like a story using the pipe operator!

Data wrangling the tidy way

Packages dplyr and tidyr are going to be our main workhorses for data wrangling
Main structure these packages use is the data frame (or tibble, but we won’t go there)
Learning pipes %>% will facilitate learning the dplyr and tidyr verbs (functions)
dplyr functions are analogous to SQL counterparts, so learn dplyr and get SQL for free!

Part II

Mastering the pipe

All behold the glorius pipe

Tidyverse functions are at their best when composed together using the pipe operator
It looks like this: %>%. Shortcut: use ctrl + shift + m in RStudio
This operator actually comes from the magrittr package (automatically included in dplyr)
Piping at its most basic level:

Take one return value and automatically feed it in as an input to another function, to form a flow of results
In unix and related systems, we also have pipes, as in:

ls -l | grep tidy | wc -l

How to read pipes: single arguments

Passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x)))

Key takeaway: in your mind, when you see %>%, read this as “and then”

Simple example

We can write exp(1) with pipes as 1 %>% exp, and log(exp(1)) as 1 %>% exp %>% log

exp(1)

## [1] 2.718282

1 %>% exp

## [1] 2.718282

1 %>% exp %>% log

## [1] 1

How to read pipes: multiple arguments

Now for multi-arguments functions, we interpret something like:

x %>% f(y)

as f(x,y)

Simple example

mtcars %>% head(4)

And what’s the “old school” (base R) way?

head(mtcars, 4)

Notice that, with pipes:

Your code is more readable (arguably)
You can run partial commands more easily

The dot

The command x %>% f(y) can be equivalently written in dot notation as:

x %>% f(., y)

What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:

x %>% f(y, .)

which is equivalent to f(y,x)

Simple example

Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:

state_df = data.frame(state.x77)
state.region %>% 
  tolower %>%
  tapply(state_df$Income, ., summary)

A more complicated example:

x = "Prof Tibs really loves piping"
x %>% 
  strsplit(split = " ") %>% 
  .[[1]] %>% # indexing, could also use `[[`(1)
  nchar %>% 
  max

## [1] 6

Part III

dyplr verbs

`dplyr` verbs

Some of the most important dplyr verbs (functions):

filter(): subset rows based on a condition
group_by(): define groups of rows according to a condition
summarize(): apply computations across groups of rows
arrange(): order rows by value of a column
select(): pick out given columns
mutate(): create new columns
mutate_at(): apply a function to given columns

We’ve learned filter(), group_by(), summarize() in the last lecture. (Go back and rewrite the examples using pipes!)

`arrange()`: order rows by values of a column

mtcars %>% 
  arrange(mpg) %>% 
  head(4)

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

# Base R
mpg_inds = order(mtcars$mpg)
head(mtcars[mpg_inds, ], 4)

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

We can ask for descending order:

mtcars %>% 
  arrange(desc(mpg)) %>% 
  head(4)

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb    hp_wt   mpg_wt
## 1 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1 35.42234 18.47411
## 2 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1 30.00000 14.72727
## 3 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2 32.19814 18.82353
## 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2 74.68605 20.09253

# Base R
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
head(mtcars[mpg_inds_decr, ], 4)

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb    hp_wt   mpg_wt
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1 35.42234 18.47411
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1 30.00000 14.72727
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2 32.19814 18.82353
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2 74.68605 20.09253

We can order by multiple columns too:

mtcars %>% 
  arrange(desc(gear), desc(hp)) %>%
  head(8)

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4

`select()`: pick out given columns

mtcars %>% 
  select(cyl, disp, hp) %>% 
  head(2)

##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

# Base R
head(mtcars[, c("cyl", "disp", "hp")], 2)

##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

Some handy `select()` helpers

mtcars %>% 
  select(starts_with("d")) %>% 
  head(2)

##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

# Base R (yikes!)
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)

##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

We can do many other things as well:

mtcars %>% select(ends_with('t')) %>% head(2)

##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875

mtcars %>% select(ends_with('yl')) %>% head(2)

##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6

mtcars %>% select(contains('ar')) %>% head(2)

##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4

(If you’re interested go and read more here)

`mutate()`: create one or several columns

mtcars = mtcars %>% 
  mutate(hp_wt = hp/wt, 
         mpg_wt = mpg/wt) 

# Base R
mtcars$hp_wt = mtcars$hp/mtcars$wt
mtcars$mpg_wt = mtcars$mpg/mtcars$wt

Newly created variables are useable immediately:

mtcars = mtcars %>% 
  mutate(hp_wt_again = hp/wt,
         hp_wt_cyl = hp_wt_again/cyl) 

# Base R
mtcars$hp_wt_again = mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl = mtcars$hp_wt_again/mtcars$cyl

`mutate_at()`: apply a function to one or several columns

mtcars = mtcars %>% 
  mutate_at(c("hp_wt", "mpg_wt"), log) 

# Base R
mtcars$hp_wt = log(mtcars$hp_wt)
mtcars$mpg_wt = log(mtcars$mpg_wt)

Important note

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame

So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())

`dplyr` and SQL

Once you learn dplyr you should find SQL very natural, and vice versa!
For example, select is SELECT, filter is WHERE, arrange is ORDER BY etc.
This will make it much easier for tasks that require using both R and SQL to munge data and build statistical models
One major link is through powerful verbs like group_by() and summarize(), which are used to aggregate data (next lecture)
Another major link to SQL is through merging/joining data frames, via left_join() and inner_join() verbs (which we’ll learn later)

Part IV

tidyr verbs

`tidyr` verbs

Two of the most important tidyr verbs (functions):

pivot_longer(): make “wide” data longer
pivot_wider(): make “long” data wider

There are many others like spread(), gather(), nest(), unnest(), etc. (If you’re interested go and read about them here)

`pivot_longer()`: make “wide” data longer

#devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

EDAWR::cases %>% 
  pivot_longer(names_to = "year", 
               values_to = "n", 
               cols = 2:4)

## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000

Here we transposed columns 2:4 into a year column
We put the corresponding count values into a column called n
Note tidyr did all the heavy lifting of the transposing work
We just had to declaratively specify the output

# Different approach to do the same thing
EDAWR::cases %>% 
  pivot_longer(names_to = "year", 
               values_to = "n", 
               -country)

## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000

# Could also do: 
# EDAWR::cases %>% 
#   pivot_longer(names_to = "year", 
#                values_to = "n", 
#                c(`2011`, `2012`, `2013`))

`pivot_wider()`: make “long” data wider

EDAWR::pollution

##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

EDAWR::pollution %>% 
  pivot_wider(names_from = "size", 
              values_from = "amount")

## # A tibble: 3 x 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121    56

Here we transposed to a wide format by size
We tabulated the corresponding amount for each size
Note tidyr did all the heavy lifting again
We just had to declaratively specify the output
Note that pivot_wider() and pivot_longer() are inverses

Summary

The tidyverse is a collection of packages for common data science tasks
Tidyverse functionality is greatly enhanced using pipes (%>% operator)
Pipes allow you to string together commands to get a flow of results
dplyr is a package for data wrangling, with several key verbs (functions)
filter(): subset rows based on a condition
group_by(): define groups of rows according to a condition
summarize(): apply computations across groups of rows
arrange(): order rows by value of a column
select(): pick out given columns
mutate(): create new columns
mutate_at(): apply a function to given columns
tidyr is a package for manipulating the structure of data frames
pivot_longer(): make “wide” data longer
pivot_wider(): make “long” data wider

Dplyr, Pipes, and More

Last week: Purrr and a bit of dplyr

Part I

What is the tidyverse?

Why the tidyverse?

Data wrangling the tidy way

Part II

All behold the glorius pipe

How to read pipes: single arguments

Simple example

How to read pipes: multiple arguments

Simple example

The dot

Simple example

Part III

dplyr verbs

arrange(): order rows by values of a column

select(): pick out given columns

Some handy select() helpers

mutate(): create one or several columns

mutate_at(): apply a function to one or several columns

Important note

dplyr and SQL

Part IV

tidyr verbs

pivot_longer(): make “wide” data longer

pivot_wider(): make “long” data wider

Summary

`dplyr` verbs

`arrange()`: order rows by values of a column

`select()`: pick out given columns

Some handy `select()` helpers

`mutate()`: create one or several columns

`mutate_at()`: apply a function to one or several columns

`dplyr` and SQL

`tidyr` verbs

`pivot_longer()`: make “wide” data longer

`pivot_wider()`: make “long” data wider