Splitting and Combining Strings

Statistical Computing, 36-350

Friday September 2, 2016

Splitting a string

To split a string wherever a given keyword (separator) appears, use the strsplit() function

ingredients = "chickpeas, tahini, olive oil, garlic, salt"
strsplit(ingredients, split=",")
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

Looks like a weird output format. What is it?

ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.obj = strsplit(ingredients, split=",")
class(split.obj)
## [1] "list"
length(split.obj)
## [1] 1

A list! With just one element, and that element is a vector of strings
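
Notice the leading spaces in the output: strsplit() removes only the split string and keeps everything else. A small sketch, not part of the original example: assuming every comma is followed by exactly one space, splitting on ", " removes both. The name clean.split is made up for illustration

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
# Splitting on comma-plus-space consumes the space along with the comma
clean.split = strsplit(ingredients, split=", ")[[1]]
clean.split
## [1] "chickpeas" "tahini"    "olive oil" "garlic"    "salt"
```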

strsplit() vectorizes

Just like substring(), nchar(), and many others

great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

The returned object is a list with 3 elements, each a vector of strings, with lengths 5, 6, and 4 (now you see why we need a list?)
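
One way to check those lengths without counting by hand is to apply length() to each list element. This is only a sketch; the name elem.lengths is made up for illustration

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
# One length per list element, simplified to an ordinary vector
elem.lengths = sapply(split.list, length)
elem.lengths
## [1] 5 6 4
```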

Reminder: list access tricks

Those darned lists can be tricky beasts, but super useful once you master them

split.list[[1]] # This is a vector
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
split.list[1] # This is actually a list of length 1
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
split.list[2:3] # Can't do this with [[ ]]
## [[1]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[2]]
## [1] "tiger"    " leopard" " jaguar"  " lion"
split.list[-(2:3)] # Also can't do this with [[ ]]
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
split.list[[2]] = NULL 
split.list # The 2nd element (vector of professor names) was deleted
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "tiger"    " leopard" " jaguar"  " lion"
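
To make the [[ ]] versus [ ] difference concrete, here is a small sketch on a fresh list (demo.list is a made-up name, so the split.list above is left alone). It checks the class of each result, then chains [[ ]] and [ ] to reach a single string

```r
demo.list = strsplit(c("tiger, leopard", "chickpeas, tahini, salt"), split=",")
class(demo.list[[2]]) # Double brackets: the element itself
## [1] "character"
class(demo.list[2])   # Single brackets: a list holding that element
## [1] "list"
demo.list[[2]][3]     # Chain the two: 3rd string of the 2nd element
## [1] " salt"
```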

Splitting character-by-character

Finest splitting you can do is character-by-character; for this, use strsplit() with split=""

split.chars = strsplit(ingredients, split="")[[1]]
split.chars
##  [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"
length(split.chars)
## [1] 42
nchar(ingredients) # Matches the previous count
## [1] 42
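
Once a string is a vector of single characters, the usual vector tools apply. A quick sketch, as one possible use: count how often a given character occurs by comparing and summing

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.chars = strsplit(ingredients, split="")[[1]]
sum(split.chars == ",") # Number of commas
## [1] 4
sum(split.chars == " ") # Number of spaces
## [1] 5
```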

Combining strings

To join two (or more) strings into one, separated by a keyword, use the paste() function

paste("Spider", "Man") # Default is to separate by " "
## [1] "Spider Man"
paste("Spider", "Man", sep="-")
## [1] "Spider-Man"
paste("Spider", "Man", "does whatever", sep=", ")
## [1] "Spider, Man, does whatever"

paste() vectorizes

Just like strsplit(), substring(), nchar(), etc. Seeing a theme yet?

presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
paste(presidents, c("D", "R", "R", "D", "R"))
## [1] "Clinton D" "Bush R"    "Reagan R"  "Carter D"  "Ford R"
paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)
## [1] "Clinton D" "Bush R"    "Reagan D"  "Carter R"  "Ford D"
paste(presidents, " (", 42:38, ")", sep="")
## [1] "Clinton (42)" "Bush (41)"    "Reagan (40)"  "Carter (39)" 
## [5] "Ford (38)"
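
Two things worth noticing in that last call, sketched here under made-up names (with.sep, with.paste0): the numbers 42:38 get converted to strings automatically, and base R's paste0() is shorthand for paste() with sep=""

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
with.sep = paste(presidents, " (", 42:38, ")", sep="")
with.paste0 = paste0(presidents, " (", 42:38, ")")
identical(with.sep, with.paste0)
## [1] TRUE
class(42:38)       # Numbers going in ...
## [1] "integer"
class(with.paste0) # ... strings coming out
## [1] "character"
```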

Condensing a vector of strings

Can condense a vector of strings into one big string by using paste() with the collapse argument

presidents
## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
paste(presidents, collapse="; ")
## [1] "Clinton; Bush; Reagan; Carter; Ford"
paste(presidents, " (", 42:38, ")", sep="", collapse="; ")
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep="", collapse="; ")
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
paste(presidents, collapse=NULL) # No condensing
## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
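
It's easy to mix up sep and collapse. As a sketch of the difference: sep goes between the arguments of paste(), while collapse goes between the elements of the result. With a single vector argument, sep has nothing to separate

```r
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
length(paste(presidents, sep="; "))      # Still 5 separate strings
## [1] 5
length(paste(presidents, collapse="; ")) # Condensed into 1 string
## [1] 1
```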

Another example

Can think of strsplit() and paste() with collapse as inverses of each other

ingredients
## [1] "chickpeas, tahini, olive oil, garlic, salt"
split.ing = strsplit(ingredients, split=",")[[1]]
split.ing
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"
paste(split.ing, collapse=",")
## [1] "chickpeas, tahini, olive oil, garlic, salt"
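
A caveat on the inverse idea, shown as a sketch with a made-up string called dotted: strsplit() treats split as a regular expression by default, so a separator like "." (which matches any character as a regex) does not undo the paste(). Passing fixed=TRUE makes strsplit() take the separator literally

```r
dotted = paste(c("a", "b", "c"), collapse=".")
dotted
## [1] "a.b.c"
strsplit(dotted, split=".")[[1]] # "." is a regex matching any character
## [1] "" "" "" "" ""
strsplit(dotted, split=".", fixed=TRUE)[[1]] # Treat "." literally
## [1] "a" "b" "c"
```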

Another example

Here we make use of some list tricks and trimming tricks

c(ingredients, great.profs, favorite.cats)
## [1] "chickpeas, tahini, olive oil, garlic, salt"             
## [2] "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
## [3] "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list
## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [6] " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"
split.vec = unlist(split.list)
split.vec
##  [1] "chickpeas"   " tahini"     " olive oil"  " garlic"     " salt"      
##  [6] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"   
## [11] " Ventura"    "tiger"       " leopard"    " jaguar"     " lion"
split.scrambled = sample(split.vec) # Random order, so your output will differ
split.scrambled
##  [1] " salt"       " garlic"     " lion"       " Ventura"    "tiger"      
##  [6] " Greenhouse" " Seltman"    " leopard"    " jaguar"     "Nugent"     
## [11] "chickpeas"   " Shalizi"    " Genovese"   " tahini"     " olive oil"
split.scrambled.trimmed = trimws(split.scrambled) 
split.scrambled.trimmed # Leading whitespace trimmed for us!
##  [1] "salt"       "garlic"     "lion"       "Ventura"    "tiger"     
##  [6] "Greenhouse" "Seltman"    "leopard"    "jaguar"     "Nugent"    
## [11] "chickpeas"  "Shalizi"    "Genovese"   "tahini"     "olive oil"
paste(split.scrambled.trimmed, collapse=" + ")
## [1] "salt + garlic + lion + Ventura + tiger + Greenhouse + Seltman + leopard + jaguar + Nugent + chickpeas + Shalizi + Genovese + tahini + olive oil"
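
A variation on the same tricks, as a sketch only: skip unlist() and the scrambling, and instead trim and recombine within each list element, so the three groups stay separate. The names trimmed.list and joined are made up for illustration

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
trimmed.list = lapply(split.list, trimws)            # Trim each vector in the list
joined = sapply(trimmed.list, paste, collapse=" + ") # One string per list element
joined
## [1] "chickpeas + tahini + olive oil + garlic + salt"
## [2] "Nugent + Genovese + Greenhouse + Seltman + Shalizi + Ventura"
## [3] "tiger + leopard + jaguar + lion"
```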