Splitting a string

To split a string at each occurrence of a separator, use the strsplit() function

ingredients = "chickpeas, tahini, olive oil, garlic, salt"
strsplit(ingredients, split=",")

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
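
The leading spaces in the pieces above come from splitting on "," alone. As a small sketch (a variation on the example, not part of it), splitting on ", " avoids them. Note also that strsplit() treats split as a regular expression by default, so fixed=TRUE is the safer choice for a literal separator such as "."; the version string below is a made-up illustration.

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"

# Split on comma-plus-space so the pieces carry no leading blank
clean.pieces = strsplit(ingredients, split=", ")[[1]]
clean.pieces

# split is a regular expression by default; fixed=TRUE matches it literally
# ("3.14.15" is a hypothetical input, chosen because "." is a regex wildcard)
version.parts = strsplit("3.14.15", split=".", fixed=TRUE)[[1]]
version.parts
```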

Looks like a weird output format, what is it?

ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.obj = strsplit(ingredients, split=",")
class(split.obj)

## [1] "list"

length(split.obj)

## [1] 1

A list! With just one element, and that element is a vector of strings

`strsplit()` vectorizes

Just like substring(), nchar(), and many others

great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

The returned object is a list with 3 elements, each a vector of strings, with lengths 5, 6, and 4 (now you see why we need a list?)
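
One way to check those lengths without printing the whole list is lengths(), which returns one count per list element. A minimal sketch, rebuilding the same list so it runs on its own (piece.counts is a name introduced here):

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")

# lengths() (plural) reports the length of each list element
piece.counts = lengths(split.list)
piece.counts  # 5 6 4
```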

Reminder: list access tricks

Those darned lists can be tricky beasts, but super useful once you master them

split.list[[1]] # This is a vector

## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

split.list[1] # This is actually a list of length 1

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

split.list[2:3] # Can't do this with [[ ]]

## [[1]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[2]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

split.list[-(2:3)] # Also can't do this with [[ ]]

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

split.list[[2]] = NULL 
split.list # The 2nd element (vector of professor names) was deleted

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "tiger"    " leopard" " jaguar"  " lion"
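
To make the [ ] versus [[ ]] distinction concrete, here is a small sketch on a toy list (demo.list is hypothetical, not from the examples above). It also shows what [[ ]] does with a vector index: rather than returning several elements, it indexes recursively.

```r
demo.list = list(c("a", "b"), c("c", "d", "e"), "f")  # hypothetical toy list

class(demo.list[[2]])   # "character": the vector itself
class(demo.list[2])     # "list": a list of length 1 wrapping that vector
length(demo.list[[2]])  # 3
length(demo.list[2])    # 1

# [[ ]] with a vector index recurses: element 3 of element 2
demo.list[[2:3]]        # "e"
```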

Splitting character-by-character

Finest splitting you can do is character-by-character; for this, use strsplit() with split=""

split.chars = strsplit(ingredients, split="")[[1]]
split.chars

##  [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"

length(split.chars)

## [1] 42

nchar(ingredients) # Matches the previous count

## [1] 42
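
Once a string is split into single characters, ordinary vector tools apply to it. As a brief sketch (num.commas and num.spaces are names introduced here), counting particular characters is just a comparison and a sum:

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.chars = strsplit(ingredients, split="")[[1]]

# Compare every character against a target, then count the TRUEs
num.commas = sum(split.chars == ",")
num.spaces = sum(split.chars == " ")
num.commas  # 4
num.spaces  # 5 (one after each comma, plus the one inside "olive oil")
```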

Combining strings

To join two (or more) strings into one, with a separator between them, use the paste() function

paste("Spider", "Man") # Default is to separate by " "

## [1] "Spider Man"

paste("Spider", "Man", sep="-")

## [1] "Spider-Man"

paste("Spider", "Man", "does whatever", sep=", ")

## [1] "Spider, Man, does whatever"
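
The arguments to paste() need not be strings; anything else is converted to character first. A minimal sketch with a number mixed in (the title is only an illustration):

```r
# The number 2 is converted to "2" before joining
sequel = paste("Spider", "Man", 2, sep="-")
sequel  # "Spider-Man-2"
class(sequel)  # "character"
```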

`paste()` vectorizes

Just like strsplit(), substring(), nchar(), etc. Seeing a theme yet?

presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
paste(presidents, c("D", "R", "R", "D", "R"))

## [1] "Clinton D" "Bush R"    "Reagan R"  "Carter D"  "Ford R"

paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)

## [1] "Clinton D" "Bush R"    "Reagan D"  "Carter R"  "Ford D"

paste(presidents, " (", 42:38, ")", sep="")

## [1] "Clinton (42)" "Bush (41)"    "Reagan (40)"  "Carter (39)" 
## [5] "Ford (38)"
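
Since sep="" comes up so often, base R also provides paste0(), which behaves like paste() with an empty separator. A short sketch comparing the two (with.sep and with.paste0 are names introduced here):

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")

with.sep = paste(presidents, " (", 42:38, ")", sep="")
with.paste0 = paste0(presidents, " (", 42:38, ")")  # same thing, less typing

identical(with.sep, with.paste0)  # TRUE
```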

Condensing a vector of strings

Can condense a vector of strings into one big string by using paste() with the collapse argument

presidents

## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"

paste(presidents, collapse="; ")

## [1] "Clinton; Bush; Reagan; Carter; Ford"

paste(presidents, " (", 42:38, ")", sep="", collapse="; ")

## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"

paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep="", collapse="; ")

## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"

paste(presidents, collapse=NULL) # No condensing

## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
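
It can help to see sep and collapse side by side: sep goes between the arguments, element by element, while collapse then joins the resulting elements into one string. A minimal sketch on toy vectors (first and second are hypothetical):

```r
first = c("a", "b", "c")
second = c("x", "y", "z")

pairwise = paste(first, second, sep="-")                # still a vector of 3
pairwise   # "a-x" "b-y" "c-z"

condensed = paste(first, second, sep="-", collapse="+") # one string
condensed  # "a-x+b-y+c-z"
```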

Another example

Can think of strsplit() and paste() with collapse as inverses of each other

ingredients

## [1] "chickpeas, tahini, olive oil, garlic, salt"

split.ing = strsplit(ingredients, split=",")[[1]]
split.ing

## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

paste(split.ing, collapse=",")

## [1] "chickpeas, tahini, olive oil, garlic, salt"
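
The round trip can be checked with identical(). One caveat worth knowing: the inverse relationship is not exact in every case, because strsplit() does not produce an empty string for a separator at the very end of the input. A small sketch (the "a,b," input is a made-up illustration):

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"

# Split, then collapse with the same separator: we get the original back
round.trip = paste(strsplit(ingredients, split=",")[[1]], collapse=",")
identical(round.trip, ingredients)  # TRUE

# A trailing separator is lost on the way (hypothetical input)
trailing = "a,b,"
pieces = strsplit(trailing, split=",")[[1]]
pieces                                  # "a" "b", with no trailing ""
rejoined = paste(pieces, collapse=",")
rejoined                                # "a,b"
```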

Another example

Here we make use of some list tricks and trimming tricks

c(ingredients, great.profs, favorite.cats)

## [1] "chickpeas, tahini, olive oil, garlic, salt"             
## [2] "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
## [3] "tiger, leopard, jaguar, lion"

split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

split.vec = unlist(split.list)
split.vec

##  [1] "chickpeas"   " tahini"     " olive oil"  " garlic"     " salt"      
##  [6] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [11] " Ventura"    "tiger"       " leopard"    " jaguar"     " lion"

split.scrambled = sample(split.vec)
split.scrambled

##  [1] " salt"       " garlic"     " lion"       " Ventura"    "tiger"      
##  [6] " Greenhouse" " Seltman"    " leopard"    " jaguar"     "Nugent"     
## [11] "chickpeas"   " Shalizi"    " Genovese"   " tahini"     " olive oil"

split.scrambled.trimmed = trimws(split.scrambled) 
split.scrambled.trimmed # Trimmed whitespace for us!

##  [1] "salt"       "garlic"     "lion"       "Ventura"    "tiger"     
##  [6] "Greenhouse" "Seltman"    "leopard"    "jaguar"     "Nugent"    
## [11] "chickpeas"  "Shalizi"    "Genovese"   "tahini"     "olive oil"

paste(split.scrambled.trimmed, collapse=" + ")

## [1] "salt + garlic + lion + Ventura + tiger + Greenhouse + Seltman + leopard + jaguar + Nugent + chickpeas + Shalizi + Genovese + tahini + olive oil"
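
The split, flatten, and trim steps above could be bundled into one small function. This is only a sketch: flatten.trimmed and its delim argument are hypothetical names, and set.seed() is added here so that the shuffle from sample() is reproducible from run to run.

```r
# Hypothetical helper: split each string on delim, flatten the list
# into one vector, and trim leading/trailing whitespace
flatten.trimmed = function(strings, delim=",") {
  trimws(unlist(strsplit(strings, split=delim)))
}

items = flatten.trimmed(c("tiger, leopard", "jaguar, lion"))
items  # "tiger" "leopard" "jaguar" "lion"

set.seed(1)                # fix the random seed so the shuffle repeats
shuffled = sample(items)   # same items, random order
paste(shuffled, collapse=" + ")
```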

Splitting and Combining Strings
