Statistical Computing, 36-350
Monday October 28, 2019
Summary:
lm()
allows you to fit a linear model by specifying a formula, in terms of column names of a given data framecoef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
are very handy and should be used over manual access tricksglm()
with family="binomial"
and all the same utility functionsgam()
and utility functionsWelcome to the tidyverse
tidyverse
is a coherent collection of packages for data scienceggplot2
, plyr
)dplyr
, tidyr
, tibble
, readr
ggplot2
purrr
tidyverse
package, we can see what’s in there:%>%
operator) to fluidly glue functionality togetherdplyr
and tidyr
are going to be our main workhorses for data wrangling%>%
dplyr
verbsdplyr
functions are analogous to SQL counterparts, so learn dplyr
and get SQL for free!R
data science communityMastering the pipe
%>%
. Shortcut: use ctrl + shift + m
in RStudiomagrittr
package (automatically included in tidyverse
)Piping at its most basic level:
Take one return value and automatically feed it in as an input to another function, to form a flow of results
In unix and related systems, we also have pipes, as in:
## 9
Passing a single argument through pipes, we interpret something like:
as h(g(f(x)))
Key takeaway: in your mind, when you see %>%
, read this as “and then”
We can write exp(1)
with pipes as 1 %>% exp()
, and log(exp(1))
as 1 %>% exp() %>% log()
## [1] 2.718282
## [1] 2.718282
## [1] 1
Now for multi-arguments functions, we interpret something like:
as f(x,y)
And what’s the “old school” (base R) way?
Notice that, with pipes:
The command x %>% f(y)
can be equivalently written in dot notation as:
What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:
which is equivalent to f(y,x)
Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:
A more complicated example:
x <- "Ben really loves piping"
x %>%
strsplit(split = " ") %>%
.[[1]] %>% # subsetting
nchar() %>%
max # note the lack of paretheses
## [1] 6
Mastering the dyplr
verbs
dplyr
verbsOur dplyr
journey starts of with learning the following verbs (functions):
slice()
: subset rows based on integer indexingfilter()
: subset rows based on logical criteriaselect()
: select certain columnspull()
: pull out a individual columnarrange()
: order rows by value of a columnrename()
: rename columnsmutate()
: create new columnsmutate_at()
: apply a function to given columnsKey takeaway: think of data frames as nouns and dplyr
verbs as actions that you apply to manipulate them—especially natural when using pipes
slice()
Use slice()
when you want to indicate certain row numbers need to be kept:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## 2 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## 3 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## 4 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
We can also do negative slicing:
## [1] 25
## [1] 25
filter()
Use filter()
when you want to subset rows based on logical conditions:
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
select()
Use select()
when you want to pick out certain columns:
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
select()
helpersVery handy selections using dplyr
helper functions:
## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9
# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9
We can do many other things as well:
## drat wt
## Mazda RX4 3.9 2.620
## Mazda RX4 Wag 3.9 2.875
## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
See documentation (either ?select
and then go to the “Useful functions” section or to the website with the same info)
pull()
pull
select
preserves column structure even with a single column## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
arrange()
Use arrange()
to order rows by values of a column:
## mpg disp drat
## 1 10.4 472 2.93
## 2 10.4 460 3.00
# Base R:
drat_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[drat_inds, c("mpg", "disp", "drat")], 2)
## mpg disp drat
## Cadillac Fleetwood 10.4 472 2.93
## Lincoln Continental 10.4 460 3.00
We can order by multiple columns too:
## gear hp mpg cyl disp drat wt qsec vs am carb
## 1 5 335 15.0 8 301.0 3.54 3.570 14.60 0 1 8
## 2 5 264 15.8 8 351.0 4.22 3.170 14.50 0 1 4
## 3 5 175 19.7 6 145.0 3.62 2.770 15.50 0 1 6
## 4 5 113 30.4 4 95.1 3.77 1.513 16.90 1 1 2
## 5 5 91 26.0 4 120.3 4.43 2.140 16.70 0 1 2
## 6 4 123 19.2 6 167.6 3.92 3.440 18.30 1 0 4
## 7 4 123 17.8 6 167.6 3.92 3.440 18.90 1 0 4
## 8 4 110 21.0 6 160.0 3.90 2.620 16.46 0 1 4
mutate()
Use mutate()
when you want to create one or several columns:
data.frame
with updated/added columns## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
mtcars <- mtcars %>%
mutate(hp_wt_correct = hp/wt,
hp_wt_cyl = hp_wt_correct/cyl)
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
mutate_at()
Use mutate_at()
when you want to apply a function to one or several columns:
# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)
mtcars <- mtcars %>%
mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)
Note: again, calling mutate_at()
outputs a new data frame, it does not alter the given data frame, so to keep the column transformations, we have to reassign mtcars
to be the output of the pipe!
rename()
Use rename()
to easily rename columns:
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log mpg_wt_log
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365 0.7330158
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199 0.6873654
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log mpg_wt_log
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365 0.7330158
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199 0.6873654
## hp_wt_correct hp_wt_cyl
## 1 41.98473 6.997455
## 2 38.26087 6.376812
Calling dplyr
verbs always outputs a new data frame, it does not alter the existing data frame
So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate()
and mutate_at()
)
dplyr
and SQLdplyr
you should find SQL very natural, and vice versa!select
is SELECT
, filter
is WHERE
, arrange
is ORDER BY
etc.group_by()
and summarize()
, which are used to aggregate data (next lecture)left_join()
and inner_join()
verbs (also next lecture)%>%
operator)dplyr
is a package for data wrangling, with several key verbs (functions)slice()
and filter()
: subset rows based on numbers or conditionsselect()
and pull
: select columns or pull out as single column vectorarrange()
: order rows by one or multiple columnsrename()
and mutate()
: rename or create columnsmutate_at()
: apply a function to given columns