Summary:
lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data framecoef(), fitted(), residuals(), summary(), plot(), predict() are very handy and should be used over manual access tricksglm() with family="binomial" and all the same utility functionsgam() and utility functionsWelcome to the tidyverse
tidyverse is a coherent collection of packages for data scienceggplot2, plyr)dplyr, tidyr, tibble, readrggplot2purrrtidyverse package, we can see what’s in there:library(tidyverse)%>% operator) to fluidly glue functionality togetherdplyr and tidyr are going to be our main workhorses for data wrangling%>%dplyr verbsdplyr functions are analogous to SQL counterparts, so learn dplyr and get SQL for free!R data science communityMastering the pipe
%>%. Shortcut: use ctrl + shift + m in RStudiomagrittr package (automatically included in tidyverse)Piping at its most basic level:
Take one return value and automatically feed it in as an input to another function, to form a flow of results
In unix and related systems, we also have pipes, as in:
ls -l | grep tidy | wc -l##        7Passing a single argument through pipes, we interpret something like:
x %>% f %>% g %>% has h(g(f(x)))
Key takeaway: in your mind, when you see %>%, read this as “and then”
We can write exp(1) with pipes as 1 %>% exp(), and log(exp(1)) as 1 %>% exp() %>% log()
exp(1)## [1] 2.7182821 %>% exp()## [1] 2.7182821 %>% exp() %>% log()## [1] 1Now for multi-arguments functions, we interpret something like:
x %>% f(y) as f(x,y)
mtcars %>% # data.frame
  head(4) And what’s the “old school” (base R) way?
head(mtcars, 4)Notice that, with pipes:
The command x %>% f(y) can be equivalently written in dot notation as:
x %>% f(., y)What’s the advantage of using dots? Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:
x %>% f(y, .)which is equivalent to f(y,x)
Again, see if you can interpret the code below without running it, then run it in your R console as a way to check your understanding:
state_df <- data.frame(state.x77)
state.region %>% 
  tolower %>%
  tapply(state_df$Income, ., summary)A more complicated example:
x <- "Ben really loves piping"
x %>% 
  strsplit(split = " ") %>%
  .[[1]] %>% # subsetting
  nchar() %>% 
  max # note the lack of paretheses ## [1] 6Mastering the dyplr verbs
dplyr verbsOur dplyr journey starts of with learning the following verbs (functions):
slice(): subset rows based on integer indexingfilter(): subset rows based on logical criteriaselect(): select certain columnspull(): pull out a individual columnarrange(): order rows by value of a columnrename(): rename columnsmutate(): create new columnsmutate_at(): apply a function to given columnsKey takeaway: think of data frames as nouns and dplyr verbs as actions that you apply to manipulate them—especially natural when using pipes
slice()Use slice() when you want to indicate certain row numbers need to be kept:
mtcars %>% 
  slice(c(7,8,14:15))##    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 1 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## 2 24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## 3 15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## 4 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4# Base R:
mtcars[c(7,8,14:15),]##                     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## Duster 360         14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
## Merc 240D          24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2
## Merc 450SLC        15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4We can also do negative slicing:
mtcars %>% 
  slice(-c(1:2,19:23)) %>% 
  nrow()## [1] 25# Base R:
nrow(mtcars[-c(1:2,19:23),])## [1] 25filter()Use filter() when you want to subset rows based on logical conditions:
mtcars %>% 
  filter((mpg >= 14 & disp >= 200) | (drat <= 3)) %>% 
  head(2) # note rownames are silenced##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2# Base R:
head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3),], 2)##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2select()Use select() when you want to pick out certain columns:
mtcars %>% 
  select(cyl, disp, hp) %>% 
  head(2)##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110select() helpersVery handy selections using dplyr helper functions:
mtcars %>% 
  select(starts_with("d")) %>% 
  head(2)##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)##               disp drat
## Mazda RX4      160  3.9
## Mazda RX4 Wag  160  3.9We can do many other things as well:
mtcars %>% select(ends_with('t')) %>% head(2)##               drat    wt
## Mazda RX4      3.9 2.620
## Mazda RX4 Wag  3.9 2.875mtcars %>% select(ends_with('yl')) %>% head(2)##               cyl
## Mazda RX4       6
## Mazda RX4 Wag   6mtcars %>% select(contains('ar')) %>% head(2)##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4See documentation (either ?select and then go to the “Useful functions” section or to the website with the same info)
pull()pullselect preserves column structure even with a single columnmtcars %>% pull(mpg)##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4# Same as: mtcars$mpg
mtcars %>% select(mpg)##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4arrange()Use arrange() to order rows by values of a column:
mtcars %>% 
  arrange(desc(disp)) %>% 
  select(mpg, disp, drat) %>%
  head(2)##    mpg disp drat
## 1 10.4  472 2.93
## 2 10.4  460 3.00# Base R:
drat_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[drat_inds, c("mpg", "disp", "drat")], 2)##                      mpg disp drat
## Cadillac Fleetwood  10.4  472 2.93
## Lincoln Continental 10.4  460 3.00We can order by multiple columns too:
mtcars %>% 
  arrange(desc(gear), desc(hp)) %>%
  select(gear, hp, everything()) %>%
  head(8)##   gear  hp  mpg cyl  disp drat    wt  qsec vs am carb
## 1    5 335 15.0   8 301.0 3.54 3.570 14.60  0  1    8
## 2    5 264 15.8   8 351.0 4.22 3.170 14.50  0  1    4
## 3    5 175 19.7   6 145.0 3.62 2.770 15.50  0  1    6
## 4    5 113 30.4   4  95.1 3.77 1.513 16.90  1  1    2
## 5    5  91 26.0   4 120.3 4.43 2.140 16.70  0  1    2
## 6    4 123 19.2   6 167.6 3.92 3.440 18.30  1  0    4
## 7    4 123 17.8   6 167.6 3.92 3.440 18.90  1  0    4
## 8    4 110 21.0   6 160.0 3.90 2.620 16.46  0  1    4mutate()Use mutate() when you want to create one or several columns:
mtcars <- mtcars %>% 
  mutate(hp_wt = hp/wt, 
         mpg_wt = mpg/wt) 
# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wtdata.frame with updated/added columnsmtcars <- mtcars %>% 
  mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348# base R
mtcars$hp_wt <- 1mtcars <- mtcars %>% 
  mutate(hp_wt_correct = hp/wt,
         hp_wt_cyl = hp_wt_correct/cyl) 
mtcars %>% head(2)##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt   mpg_wt
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4     1 8.015267
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4     1 7.304348
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812# base R
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cylmutate_at()Use mutate_at() when you want to apply a function to one or several columns:
# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)
mtcars <- mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log) 
# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)Note: again, calling mutate_at() outputs a new data frame, it does not alter the given data frame, so to keep the column transformations, we have to reassign mtcars to be the output of the pipe!
rename()Use rename() to easily rename columns:
mtcars %>% 
  rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
  head(2)##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)##   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_wt_log mpg_wt_log
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4  1.318365  0.7330158
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4  1.293199  0.6873654
##   hp_wt_correct hp_wt_cyl
## 1      41.98473  6.997455
## 2      38.26087  6.376812Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame
So to keep the changes, we have to reassign the data frame to be the output of the pipe! (Look back at the examples for mutate() and mutate_at())
dplyr and SQLdplyr you should find SQL very natural, and vice versa!select is SELECT, filter is WHERE, arrange is ORDER BY etc.group_by() and summarize(), which are used to aggregate data (next lecture)left_join() and inner_join() verbs (also next lecture)%>% operator)dplyr is a package for data wrangling, with several key verbs (functions)slice() and filter(): subset rows based on numbers or conditionsselect() and pull: select columns or pull out as single column vectorarrange(): order rows by one or multiple columnsrename() and mutate(): rename or create columnsmutate_at(): apply a function to given columns