Last Week: Fitting Models to Data

Summary:

Fitting models is critical to both statistical inference and prediction
Exploratory data analysis is a very good first step and gives you a sense of what you’re dealing with before you start modeling
Linear regression is the most basic modeling tool of all, and one of the most ubiquitous
lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data frame
Utility functions coef(), fitted(), residuals(), summary(), plot(), predict() are very handy and should be used over manual access tricks
Logistic regression is the natural extension of linear regression to binary data; use glm() with family="binomial" and all the same utility functions
Generalized additive models add a level of flexibility in that they allow the predictors to have nonlinear effects; use gam() and utility functions

Part I

Welcome to the tidyverse

What is the tidyverse?

tidyverse is a coherent collection of packages for data science
Originally developed by Hadley Wickham (ggplot2, plyr)
But now a much larger developer community (hadleyverse -> tidyverse)
Packages include:
- Data wrangling: dplyr, tidyr, tibble, readr
- Visualization: ggplot2
- List manipulation: purrr
Loading the tidyverse package, we can see what’s in there:

library(tidyverse)

Why the tidyverse?

Packages have a very consistent API
Function names and commands follow a focused grammar
Extremely powerful when working with data frames and lists (matrices, not so much, yet!)
Allows pipes (%>% operator) to fluidly glue functionality together
Very active developer and user community
Main advantage: at its best, tidyverse data wrangling code can be read like a story using the pipe operator!

Data wrangling the tidy way

dplyr and tidyr are going to be our main workhorses for data wrangling
The main structure these packages use is the data frame (or tibble, but we won’t go there)
Two keys to getting started:
1. learn about pipes %>%
2. learn the dplyr verbs
dplyr functions are analogous to SQL counterparts, so learn dplyr and get SQL for free!
Heavily used in the R data science community

Part II

Mastering the pipe

All behold the glorius pipe

Tidyverse functions are at their best when composed together using the pipe operator
It looks like this: %>%. Shortcut: use ctrl + shift + m in RStudio
This operator actually comes from the magrittr package (automatically included in tidyverse)
So it can be used on its own, completely independently of the tidyverse
Piping at its most basic level:

Take one return value and automatically feed it in as an input to another function, to form a flow of results
In unix and related systems, we also have pipes, as in:

ls -l | grep tidy | wc -l

##        9

How to read pipes: single arguments

Passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x)))

Key takeaway: in your mind, when you see %>%, read this as “and then”

Example

We can write exp(1) with pipes as 1 %>% exp(), and log(exp(1)) as 1 %>% exp() %>% log()

exp(1)

## [1] 2.718282

1 %>% exp()

## [1] 2.718282

1 %>% exp() %>% log()

## [1] 1

How to read pipes: multiple arguments

Now for multi-arguments functions, we interpret something like:

x %>% f(y)

as f(x,y)

Example

mtcars %>% # data.frame
  head(4)

And what’s the “old school” (base R) way?

head(mtcars, 4)

Notice that, with pipes:

Your code is more readable (arguably)
You can run partial commands more easily

The dot

The command x %>% f(y) can be equivalently written in dot notation as:

x %>% f(., y)

What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:

x %>% f(y, .)

which is equivalent to f(y,x)

Example

Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:

state_df <- data.frame(state.x77)
state.region %>% 
  tolower %>%
  tapply(state_df$Income, ., summary)

A more complicated example:

x <- "Ben really loves piping"
x %>% 
  strsplit(split = " ") %>%
  .[[1]] %>% # subsetting
  nchar() %>% 
  max # note the lack of paretheses

## [1] 6

Part III

Mastering the dyplr verbs

`dplyr` verbs

Our dplyr journey starts of with learning the following verbs (functions):

slice(): subset rows based on integer indexing
filter(): subset rows based on logical criteria
select(): select certain columns
pull(): pull out a individual column
arrange(): order rows by value of a column
rename(): rename columns
mutate(): create new columns
mutate_at(): apply a function to given columns

Key takeaway: think of data frames as nouns and dplyr verbs as actions that you apply to manipulate them—especially natural when using pipes

`slice()`

Use slice() when you want to indicate certain row numbers need to be kept:

mtcars %>% 
  slice(c(7,8,14:15))

##    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 1 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## 2 24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## 3 15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## 4 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4

# Base R:
mtcars[c(7,8,14:15),]

##                     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Duster 360         14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D          24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 450SLC        15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4

We can also do negative slicing:

mtcars %>% 
  slice(-c(1:2,19:23)) %>% 
  nrow()

## [1] 25

# Base R:
nrow(mtcars[-c(1:2,19:23),])

## [1] 25

`filter()`

Use filter() when you want to subset rows based on logical conditions:

mtcars %>% 
  filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>% 
  head(2) # note rownames are silenced

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

# Base R:
head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3),], 2)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

`select()`

Use select() when you want to pick out certain columns:

mtcars %>% 
  select(cyl, disp, hp) %>% 
  head(2)

##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)

##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110

Handy `select()` helpers

Very handy selections using dplyr helper functions:

mtcars %>% 
  select(starts_with("d")) %>% 
  head(2)

##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)

##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9

We can do many other things as well:

mtcars %>% select(ends_with('t')) %>% head(2)

##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875

mtcars %>% select(ends_with('yl')) %>% head(2)

##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6

mtcars %>% select(contains('ar')) %>% head(2)

##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4

See documentation (either ?select and then go to the “Useful functions” section or to the website with the same info)

Additional, less important function: `pull()`

You can grab a single column from a data frame and get it back as a vector if you use pull
select preserves column structure even with a single column

mtcars %>% pull(mpg)

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

# Same as: mtcars$mpg

mtcars %>% select(mpg)

##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4

`arrange()`

Use arrange() to order rows by values of a column:

mtcars %>% 
  arrange(desc(disp)) %>% 
  select(mpg, disp, drat) %>%
  head(2)

##    mpg disp drat
## 1 10.4  472 2.93
## 2 10.4  460 3.00

# Base R:
drat_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[drat_inds, c("mpg", "disp", "drat")], 2)

##                      mpg disp drat
## Cadillac Fleetwood  10.4  472 2.93
## Lincoln Continental 10.4  460 3.00

We can order by multiple columns too:

mtcars %>% 
  arrange(desc(gear), desc(hp)) %>%
  select(gear, hp, everything()) %>%
  head(8)

##   gear  hp  mpg cyl  disp drat    wt  qsec vs am carb
## 1    5 335 15.0   8 301.0 3.54 3.570 14.60  0  1    8
## 2    5 264 15.8   8 351.0 4.22 3.170 14.50  0  1    4
## 3    5 175 19.7   6 145.0 3.62 2.770 15.50  0  1    6
## 4    5 113 30.4   4  95.1 3.77 1.513 16.90  1  1    2
## 5    5  91 26.0   4 120.3 4.43 2.140 16.70  0  1    2
## 6    4 123 19.2   6 167.6 3.92 3.440 18.30  1  0    4
## 7    4 123 17.8   6 167.6 3.92 3.440 18.90  1  0    4
## 8    4 110 21.0   6 160.0 3.90 2.620 16.46  0  1    4

`mutate()`

Use mutate() when you want to create one or several columns:

mtcars <- mtcars %>% 
  mutate(hp_wt = hp/wt, 
         mpg_wt = mpg/wt) 

# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt

create a new data.frame with updated/added columns

mtcars <- mtcars %>% 
  mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)

##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348

# base R
mtcars$hp_wt <- 1

newly created variables are useable immediately

mtcars <- mtcars %>% 
  mutate(hp_wt_correct = hp/wt,
         hp_wt_cyl = hp_wt_correct/cyl) 
mtcars %>% head(2)

##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812

# base R
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl

`mutate_at()`

Use mutate_at() when you want to apply a function to one or several columns:

# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)

mtcars <- mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log) 

# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)

Note: again, calling mutate_at() outputs a new data frame, it does not alter the given data frame, so to keep the column transformations, we have to reassign mtcars to be the output of the pipe!

`rename()`

Use rename() to easily rename columns:

mtcars %>% 
  rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
  head(2)

##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812

# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)

##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812

Important note

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame

So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())

`dplyr` and SQL

Once you learn dplyr you should find SQL very natural, and vice versa!
e.g. select is SELECT, filter is WHERE, arrange is ORDER BY etc.
This will make it much easier for tasks that require using both R and SQL to munge data and build statistical models
One major link is through powerful verbs like group_by() and summarize(), which are used to aggregate data (next lecture)
Another major link to SQL is through merging/joining data frames, via left_join() and inner_join() verbs (also next lecture)

Summary

The tidyverse is a collection of packages for common data science tasks
Tidyverse functionality is greatly enhanced using pipes (%>% operator)
Pipes allow you to string together commands to get a flow of results
dplyr is a package for data wrangling, with several key verbs (functions)
slice() and filter(): subset rows based on numbers or conditions
select() and pull: select columns or pull out as single column vector
arrange(): order rows by one or multiple columns
rename() and mutate(): rename or create columns
mutate_at(): apply a function to given columns

Tidyverse I: Pipes and Dplyr

Last Week: Fitting Models to Data

Part I

What is the tidyverse?

Why the tidyverse?

Data wrangling the tidy way

Part II

All behold the glorius pipe

How to read pipes: single arguments

Example

How to read pipes: multiple arguments

Example

The dot

Example

Part III

`dplyr` verbs

`slice()`

`filter()`

`select()`

Handy `select()` helpers

Additional, less important function: `pull()`

`arrange()`

`mutate()`

`mutate_at()`

`rename()`

Important note

`dplyr` and SQL

Summary

References

Tidyverse I: Pipes and Dplyr

Last Week: Fitting Models to Data

Part I

What is the tidyverse?

Why the tidyverse?

Data wrangling the tidy way

Part II

All behold the glorius pipe

How to read pipes: single arguments

Example

How to read pipes: multiple arguments

Example

The dot

Example

Part III

dplyr verbs

slice()

filter()

select()

Handy select() helpers

Additional, less important function: pull()

arrange()

mutate()

mutate_at()

rename()

Important note

dplyr and SQL

Summary

References

`dplyr` verbs

`slice()`

`filter()`

`select()`

Handy `select()` helpers

Additional, less important function: `pull()`

`arrange()`

`mutate()`

`mutate_at()`

`rename()`

`dplyr` and SQL