Statistical Computing, 36-350
Tuesday September 28, 2021
purrr is one such package that provides a consistent family of iteration functions
- map(): list in, list out
- map_dbl(), map_lgl(), map_chr(): list in, vector out (of a particular data type)
- map_dfr(), map_dfc(): list in, data frame out (row-bound or column-bound)

dplyr is another such package that provides functions for data frame computations
- filter(): subset rows based on a condition
- group_by(): define groups of rows according to a condition
- summarize(): apply computations across groups of rows

Motivation: tidyverse, revisited
The tidyverse is a coherent collection of packages in R for data science (and tidyverse is itself actually a package that loads all its constituent packages). Packages include:
- dplyr, tidyr, readr
- purrr
- ggplot2

Last week we covered purrr and a bit of dplyr. This week we’ll do more dplyr, and some tidyr. (Many of you will learn ggplot2 in Statistical Graphics 36-315)

Loading the tidyverse so that we can get all this functionality (plus more):
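A minimal sketch of the loading step, assuming the tidyverse package is already installed (the commented install line and the small sqrt example are illustrative additions):

```r
# install.packages("tidyverse")  # one-time setup, if not yet installed
library(tidyverse)  # attaches dplyr, tidyr, readr, purrr, ggplot2, and more

# The pipe is now available without loading magrittr separately
roots = c(1, 4, 9) %>% sqrt
roots
```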
- The pipe (the %>% operator) allows us to fluidly glue functionality together
- dplyr and tidyr are going to be our main workhorses for data wrangling
- %>% will facilitate learning the dplyr and tidyr verbs (functions)
- dplyr functions are analogous to SQL counterparts, so learn dplyr and get SQL for free!

Mastering the pipe
Tidyverse functions are at their best when composed together using the pipe operator
It looks like this: %>%. Shortcut: use ctrl + shift + m in RStudio
This operator actually comes from the magrittr package (automatically included in dplyr)
Piping at its most basic level:
- Take one return value and automatically feed it in as an input to another function, to form a flow of results
In Unix and related systems, we also have pipes (written with the | character)
Passing a single argument through pipes, we interpret something like x %>% f %>% g %>% h as h(g(f(x)))
Key takeaway: in your mind, when you see %>%, read this as “and then”
We can write exp(1) with pipes as 1 %>% exp, and log(exp(1)) as 1 %>% exp %>% log
exp(1)
## [1] 2.718282
1 %>% exp
## [1] 2.718282
1 %>% exp %>% log
## [1] 1
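As a further check of the nesting rule, here is a slightly longer chain. The vector x and the choice of functions are illustrative assumptions, not part of the lecture.

```r
library(magrittr)  # provides %>% (also attached by dplyr and tidyverse)

x = c(1, 4, 9, 16)

# Read as: take x, and then sqrt, and then sum, and then log
piped = x %>% sqrt %>% sum %>% log

# The same computation written inside-out
nested = log(sum(sqrt(x)))

identical(piped, nested)
```

The piped version lists the steps in the order they run, while the nested version has to be read from the innermost call outward.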
Now for multi-argument functions, we interpret something like x %>% f(y) as f(x,y)
And what’s the “old school” (base R) way?
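A sketch of the comparison, using the built-in mtcars data and head() as an assumed stand-in for f:

```r
library(magrittr)

# With a pipe: mtcars becomes the first argument of head()
piped = mtcars %>% head(4)

# "Old school" base R: the same call written directly
base = head(mtcars, 4)

identical(piped, base)
```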
Notice that, with pipes:
The command x %>% f(y) can be equivalently written in dot notation as x %>% f(., y)
What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in x %>% f(y, .), which is equivalent to f(y,x)
Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:
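One small exercise of this kind, as a hedged sketch: the vector x and the use of paste() are illustrative assumptions, not the lecture's own example.

```r
library(magrittr)

x = c("a", "b", "c")

# The dot sends x to the second argument: paste("item", x)
labels = x %>% paste("item", .)

# Without a dot, x goes first: paste(x, "item")
labels_first = x %>% paste("item")

labels
labels_first
```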
A more complicated example:
x = "Prof Tibs really loves piping"
x %>%
  strsplit(split = " ") %>%
  .[[1]] %>% # indexing, could also use `[[`(1)
  nchar %>%
  max
## [1] 6
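For comparison, the same computation can be written without pipes. This base R version is an added sketch; it has to be read from the innermost call outward.

```r
x = "Prof Tibs really loves piping"

# strsplit() runs first, then [[1]], then nchar(), then max()
longest = max(nchar(strsplit(x, split = " ")[[1]]))
longest
```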
dplyr verbs

Some of the most important dplyr verbs (functions):
- filter(): subset rows based on a condition
- group_by(): define groups of rows according to a condition
- summarize(): apply computations across groups of rows
- arrange(): order rows by value of a column
- select(): pick out given columns
- mutate(): create new columns
- mutate_at(): apply a function to given columns

We’ve learned filter(), group_by(), summarize() in the last lecture. (Go back and rewrite the examples using pipes!)
arrange(): order rows by values of a column
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
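Output like the above can be produced along these lines. The exact calls are a reconstruction from the printed rows (the four lowest mpg values), not the lecture's source code, and the variable names are hypothetical.

```r
library(dplyr)

# dplyr: sort rows by mpg (ascending), then keep the first 4
by_mpg = mtcars %>% arrange(mpg) %>% head(4)

# Base R: compute the row ordering explicitly, then index with it
mpg_inds = order(mtcars$mpg)
by_mpg_base = head(mtcars[mpg_inds, ], 4)

by_mpg
```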
We can ask for descending order:
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb    hp_wt   mpg_wt
## 1 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1 35.42234 18.47411
## 2 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1 30.00000 14.72727
## 3 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2 32.19814 18.82353
## 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2 74.68605 20.09253

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb    hp_wt   mpg_wt
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1 35.42234 18.47411
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1 30.00000 14.72727
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2 32.19814 18.82353
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2 74.68605 20.09253
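A sketch of how descending order can be requested, in dplyr and in base R. The calls are a reconstruction, and the variable names are hypothetical. Note that the printed output above carries two extra columns, hp_wt and mpg_wt, which appear to come from a mutated copy of mtcars; a fresh mtcars will not show them.

```r
library(dplyr)

# dplyr: desc() flips the sort direction for that column
top_mpg = mtcars %>% arrange(desc(mpg)) %>% head(4)

# Base R equivalent
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
top_mpg_base = head(mtcars[mpg_inds_decr, ], 4)

top_mpg
```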
We can order by multiple columns too:
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
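The rows above are sorted by gear and then by hp, both descending. A call of the following shape gives that ordering; the choice of columns is inferred from the printed output rather than taken from the lecture's code, and by_gear_hp is a hypothetical name.

```r
library(dplyr)

# Sort by gear (descending), breaking ties by hp (descending)
by_gear_hp = mtcars %>%
  arrange(desc(gear), desc(hp)) %>%
  head(8)

by_gear_hp
```

Later arguments to arrange() only matter among rows that tie on the earlier ones.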
select(): pick out given columns
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110
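A sketch of calls that pick out these three columns, in dplyr and in base R. This is a reconstruction from the printed output, and the variable names are hypothetical.

```r
library(dplyr)

# dplyr: name the columns to keep (no quotes needed)
picked = mtcars %>% select(cyl, disp, hp) %>% head(2)

# Base R: index by a character vector of column names
picked_base = head(mtcars[, c("cyl", "disp", "hp")], 2)

picked
```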
select() helpers
# dplyr: pick out columns whose names start with "d"
mtcars %>% select(starts_with("d")) %>% head(2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9
# Base R (yikes!)
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9
We can do many other things as well:
##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875
##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6
##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4
(If you’re interested go and read more here)
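The three outputs above are consistent with the helper calls sketched below. The patterns are inferred from which columns were printed, so treat them as an assumption rather than the lecture's exact code.

```r
library(dplyr)

# Columns whose names end in "t": drat, wt
ends_t = mtcars %>% select(ends_with("t")) %>% head(2)

# Columns whose names end in "yl": cyl
ends_yl = mtcars %>% select(ends_with("yl")) %>% head(2)

# Columns whose names contain "ar": gear, carb
has_ar = mtcars %>% select(contains("ar")) %>% head(2)

ends_t
```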
mutate(): create one or several columns
Newly created variables are usable immediately:
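A sketch of both points with the built-in mtcars data. The column names hp_wt and mpg_wt match the printed output earlier in these notes; the data frame names and the hp_wt_cyl column are hypothetical.

```r
library(dplyr)

# Create two new columns from existing ones
mtcars2 = mtcars %>%
  mutate(hp_wt = hp / wt,
         mpg_wt = mpg / wt)

# A new column can be used later in the same mutate() call
mtcars3 = mtcars %>%
  mutate(hp_wt = hp / wt,
         hp_wt_cyl = hp_wt / cyl)

head(mtcars3, 2)
```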
mutate_at(): apply a function to one or several columns
Calling dplyr verbs always outputs a new data frame; it does not alter the existing data frame
So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())
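A sketch of mutate_at() together with the reassignment point. The data frame name my_cars and the choice of log as the function are illustrative assumptions. In dplyr 1.0 and later, mutate_at() still works but is superseded by mutate() combined with across().

```r
library(dplyr)

my_cars = mtcars %>% mutate(hp_wt = hp / wt, mpg_wt = mpg / wt)
before = my_cars$hp_wt

# Calling the verb alone returns a new data frame; my_cars is unchanged
logged = my_cars %>% mutate_at(c("hp_wt", "mpg_wt"), log)
unchanged = identical(my_cars$hp_wt, before)

# Reassign to keep the change
my_cars = my_cars %>% mutate_at(c("hp_wt", "mpg_wt"), log)

head(my_cars, 2)
```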
dplyr and SQL
- Once you learn dplyr you should find SQL very natural, and vice versa!
- For example, select is SELECT, filter is WHERE, arrange is ORDER BY etc.
- One link is group_by() and summarize(), which are used to aggregate data (next lecture)
- Another link is the left_join() and inner_join() verbs (which we’ll learn later)
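To make the correspondence concrete, here is a small dplyr pipeline with a roughly equivalent SQL query shown in comments. The query is only for comparison and is not run; the choice of columns and condition is an illustrative assumption.

```r
library(dplyr)

# SQL, for comparison (not run here):
#   SELECT mpg, cyl, hp
#   FROM mtcars
#   WHERE cyl = 4
#   ORDER BY mpg DESC;
four_cyl = mtcars %>%
  filter(cyl == 4) %>%        # WHERE
  select(mpg, cyl, hp) %>%    # SELECT
  arrange(desc(mpg))          # ORDER BY

head(four_cyl, 3)
```

The pipeline lists the steps in the order they are applied, whereas SQL fixes the clause order regardless of how the query is evaluated.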
tidyr verbs

Two of the most important tidyr verbs (functions):
- pivot_longer(): make “wide” data longer
- pivot_wider(): make “long” data wider

There are many others like spread(), gather(), nest(), unnest(), etc. (If you’re interested go and read about them here)
pivot_longer(): make “wide” data longer
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
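- The year labels 2011, 2012, 2013 moved from the column names into a year column
- The corresponding values went into a column called n
- tidyr did all the heavy lifting of the transposing work

# Different approach to do the same thing
EDAWR::cases %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "n")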
## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
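Since the EDAWR package may not be installed, the sketch below rebuilds the small table by hand from the printed output and lengthens it in two ways. The names cases_df, long_by_name, and long_by_pos are hypothetical.

```r
library(tidyr)
library(dplyr)

# Stand-in for EDAWR::cases, typed in from the printed output above
cases_df = data.frame(country = c("FR", "DE", "US"),
                      `2011` = c(7000, 5800, 15000),
                      `2012` = c(6900, 6000, 14000),
                      `2013` = c(7000, 6200, 13000),
                      check.names = FALSE)

# Select the year columns by name ...
long_by_name = cases_df %>%
  pivot_longer(cols = c("2011", "2012", "2013"),
               names_to = "year", values_to = "n")

# ... or by position
long_by_pos = cases_df %>%
  pivot_longer(cols = 2:4, names_to = "year", values_to = "n")

long_by_name
```

Selecting with -country, by name, or by position all describe the same set of columns here, so the results agree.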
pivot_wider(): make “long” data wider
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

## # A tibble: 3 x 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121    56
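A sketch of a call that produces the wide table above, with the long data typed in by hand from the printed output. The names pollution_df, wide, and back are hypothetical, and the last step is an added check that lengthening recovers the original values.

```r
library(tidyr)
library(dplyr)

# Stand-in for the long data frame printed above
pollution_df = data.frame(
  city = rep(c("New York", "London", "Beijing"), each = 2),
  size = rep(c("large", "small"), times = 3),
  amount = c(23, 14, 22, 16, 121, 56)
)

# One new column per distinct value of size, filled from amount
wide = pollution_df %>%
  pivot_wider(names_from = "size", values_from = "amount")

# Lengthening the wide table recovers the original rows
back = wide %>%
  pivot_longer(cols = c("large", "small"),
               names_to = "size", values_to = "amount")

wide
```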
- tidyr did all the heavy lifting again
- pivot_wider() and pivot_longer() are inverses

Summary
- The pipe (the %>% operator) feeds the output of one function into the next
- dplyr is a package for data wrangling, with several key verbs (functions)
  - filter(): subset rows based on a condition
  - group_by(): define groups of rows according to a condition
  - summarize(): apply computations across groups of rows
  - arrange(): order rows by value of a column
  - select(): pick out given columns
  - mutate(): create new columns
  - mutate_at(): apply a function to given columns
- tidyr is a package for manipulating the structure of data frames
  - pivot_longer(): make “wide” data longer
  - pivot_wider(): make “long” data wider