Statistical Computing, 36-350
Monday November 5, 2018
%>%
operator)dplyr
is a package for data wrangling, with several key verbs (functions)slice()
and filter()
: subset rows based on numbers or conditionsselect()
: select columnsarrange()
: order rows by one or multiple columnsrename()
and mutate()
: rename or create columnsmutate_at()
: apply a function to given columns%>%
)Tidyverse functions are at their best when composed together using the pipe (%>%
) operator!
library(tidyverse)
mtcars %>%
filter(., (mpg >= 14 & disp >= 200) | (drat <= 3)) %>%
head(., 2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Pro tip: use ctrl + shift + m
in RStudio as a shortcut for typing %>%
Mastering the tidyr
verbs
tidyr
verbsOur tidyr
journey starts of with learning the following verbs (functions):
gather()
: make “wide” data longerspread()
: make “long” data widerseparate()
: split a single column into multiple columnsunite()
: combine multiple columns into a single columnKey takeaway: as with dplyr
, think of data frames as nouns and tidyr
verbs as actions that you apply to manipulate them—especially natural when using pipes
gather()
Use gather()
to make “wide” data longer:
library(EDAWR) # Load some nice data sets
EDAWR::cases %>%
head(., 3)
## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
gather(., "year", "n", 2:4) %>%
head(., 5)
## country year n
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
year
columnn
tidyr
did all the heavy lifting of the transposing workspread()
Use spread()
to make “long” data wider:
EDAWR::pollution %>%
head(., 5)
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
spread(., size, amount)
## city large small
## 1 Beijing 121 56
## 2 London 22 16
## 3 New York 23 14
tidyr
did all the heavy lifting ofspread()
and gather()
are inversesseparate()
EDAWR::storms %>%
head(., 3)
## # A tibble: 3 x 4
## storm wind pressure date
## <chr> <int> <int> <date>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
storms2 <- EDAWR::storms %>%
separate(., date, c("y", "m", "d"))
storms2
## # tibble [6 × 6]
## storm wind pressure y m d
## <chr> <int> <int> <chr> <chr> <chr>
## 1 Alberto 110 1007 2000 08 03
## 2 Alex 45 1009 1998 07 27
## 3 Allison 65 1005 1995 06 03
## 4 Ana 40 1013 1997 06 30
## 5 Arlene 50 1010 1999 06 11
## 6 Arthur 45 1010 1996 06 17
unite()
Use unite()
to combine multiple columns into a single column:
storms2 %>%
unite(., date, y, m, d, sep = "-")
## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Note that unite()
and separate()
are inverse operations
dplyr
: group_by()
and summarize()
dplyr
and SQLdplyr
you should find SQL very natural, and vice versa!group_by()
and summarize()
, which are used to aggregate data (now)left_join()
and inner_join()
verbs (shortly)group_by()
Use group_by()
to define a grouping of rows based on a column:
mtcars %>%
group_by(., cyl) %>%
head(., 6)
## # A tibble: 6 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
dplyr
verbs actsummarise()
Use summarise()
(or summarize()
for us Americans) to apply functions to rows—ungrouped or grouped—of a data frame:
# Ungrouped
mtcars %>%
summarise(.,
mpg=mean(mpg),
hp=mean(hp))
## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
group_by(., cyl) %>%
summarise(.,
mpg=mean(mpg),
hp=mean(hp))
## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
mtcars %>%
group_by(., cyl) %>%
summarise(.,
mpg_mean=mean(mpg),
mpg_max=max(mpg),
hp_mean=mean(hp),
hp_max=max(hp))
## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()
Use ungroup()
to remove groupings from a data frame:
mtcars %>%
group_by(., cyl) %>%
ungroup(.) %>%
summarise(.,
hp=mean(hp),
mpg=mean(mpg))
## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
In last week’s lab we encountered the following example:
str.url <- "http://www.stat.cmu.edu/~ryantibs/statcomp-F18/data/trump.txt"
str.url %>%
readLines %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist(x = .) %>%
.[. != ""] %>%
table(.) %>%
sort(decreasing=TRUE) %>%
head(.)
## .
## the and of to our will
## 189 146 127 126 90 83
This required a tricky indexing operation using %>%
on vectors: .[. != ""]
We can perform the same operations using dplyr
if we convert to a data frame:
str.url %>%
readLines(.) %>%
paste(collapse=" ") %>%
strsplit(split="[[:space:]]|[[:punct:]]") %>%
unlist(.) %>%
as_data_frame() %>% # Convert to a data frame (dplyr version)
rename(word_list = value) %>% # Rename the column to `word_list`
filter(word_list != "") %>% # Now indexing is easy!
group_by(word_list) %>%
summarise(word_count = n()) %>%
arrange(desc(word_count)) %>%
slice(1:6)
## # A tibble: 6 x 2
## word_list word_count
## <chr> <int>
## 1 the 189
## 2 and 146
## 3 of 127
## 4 to 126
## 5 our 90
## 6 will 83
dplyr
: left_join()
and inner_join()
A “join” operation in database terminology is a merging of two data frames for us. There are 4 types of joins:
Column values that cannot be filled in are assigned NA
values
It helps to visualize the join types:
tab1 = data.frame(name = c("Alexis", "Bernie", "Charlie"),
children = 1:3,
stringsAsFactors = FALSE)
tab2 = data.frame(name = c("Alexis", "Bernie", "David"),
age = c(54, 34, 63),
stringsAsFactors = FALSE)
tab1
## name children
## 1 Alexis 1
## 2 Bernie 2
## 3 Charlie 3
tab2
## name age
## 1 Alexis 54
## 2 Bernie 34
## 3 David 63
inner_join()
Suppose we want to join tab1
and tab2
by name, but keep only customers in intersection:
inner_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
left_join()
Suppose we want to join tab1
and tab2
by name, but keep all customers from tab1
:
left_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 Charlie 3 NA
right_join()
Suppose we want to join tab1
and tab2
by name, but keep all customers from tab2
:
right_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 David NA 63
full_join()
Finally, suppose we want to join tab1
and tab2
by name, and keep all customers from both:
full_join(x=tab1, y=tab2, by="name")
## name children age
## 1 Alexis 1 54
## 2 Bernie 2 34
## 3 Charlie 3 NA
## 4 David NA 63
tidyr
is a package for manipulating the structure of data framesgather()
: make wide data longerspread()
: make long data widerunite()
and separate()
: combine or split columnsdplyr
has advanced functionality that mirrors SQLgroup_by()
: create groups of rows according to a conditionsummarise()
: apply computations across groups of rows*_join()
where *
= inner
, left
, right
, or full
: join two data frames together according to common values in certain columns, and *
indicates how many rows to keep