Statistical Computing, 36-350
Friday September 2, 2016
To split based on a keyword, use the strsplit() function
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
strsplit(ingredients, split=",")
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
Looks like a weird output format, what is it?
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.obj = strsplit(ingredients, split=",")
class(split.obj)
## [1] "list"
length(split.obj)
## [1] 1
A list! With just one element, and that element is a vector of strings
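A minimal sketch, assuming we would rather not carry around the leading spaces seen above: split on a comma followed by a space, and pull the vector out of the list with [[1]]. The name split.clean is just for illustration.

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
# Split on comma-plus-space, so the pieces carry no leading whitespace
split.clean = strsplit(ingredients, split=", ")[[1]]
class(split.clean)  # A character vector now, not a list
length(split.clean) # One entry per ingredient
split.clean
```

This only works cleanly when every comma is followed by exactly one space; otherwise trimming after the split is the safer route.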
strsplit() vectorizes
Just like substring(), nchar(), and many others
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "Nugent" " Genovese" " Greenhouse" " Seltman" " Shalizi"
## [6] " Ventura"
##
## [[3]]
## [1] "tiger" " leopard" " jaguar" " lion"
Returned object is a list with 3 elements. Each one is a vector of strings, with lengths 5, 6, and 4 respectively (now you see why we need a list?)
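One way to check those lengths without counting by eye is to apply length() to each element of the list. This is a sketch using sapply(); the name piece.counts is made up for illustration.

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
# Apply length() to each element of the list, get back a plain vector
piece.counts = sapply(split.list, length)
piece.counts
```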
Those darned lists can be tricky beasts, but super useful once you master them
split.list[[1]] # This is a vector
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
split.list[1] # This is actually a list of length 1
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
split.list[2:3] # Can't do this with [[ ]]
## [[1]]
## [1] "Nugent" " Genovese" " Greenhouse" " Seltman" " Shalizi"
## [6] " Ventura"
##
## [[2]]
## [1] "tiger" " leopard" " jaguar" " lion"
split.list[-(2:3)] # Also can't do this with [[ ]]
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
split.list[[2]] = NULL
split.list # The 2nd element (vector of Professor names) was deleted
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "tiger" " leopard" " jaguar" " lion"
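To see the difference between [[ ]] and [ ] directly, we can ask for the class and length of each result. A small sketch on a toy list (demo.list is a hypothetical name, not from the examples above):

```r
demo.list = strsplit(c("a,b", "c,d,e"), split=",")
class(demo.list[[1]])  # "character": the vector itself
class(demo.list[1])    # "list": a list of length 1 holding that vector
length(demo.list[[2]]) # 3, the number of strings in the 2nd vector
length(demo.list[2])   # 1, a list with a single element
```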
The finest splitting you can do is character-by-character; for this, use strsplit() with split=""
split.chars = strsplit(ingredients, split="")[[1]]
split.chars
## [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i"
## [18] "," " " "o" "l" "i" "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l"
## [35] "i" "c" "," " " "s" "a" "l" "t"
length(split.chars)
## [1] 42
nchar(ingredients) # Matches the previous count
## [1] 42
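Once a string is split into single characters, ordinary vector tools apply to it. As a sketch, here we count characters by comparing against a value and by tabulating with table(); char.counts is an illustrative name.

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.chars = strsplit(ingredients, split="")[[1]]
sum(split.chars == ",") # Number of commas in the string
sum(split.chars == " ") # Number of spaces in the string
# Tabulate how often each distinct character appears
char.counts = table(split.chars)
char.counts
```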
To join two (or more) strings into one, separated by a keyword, use the paste() function
paste("Spider", "Man") # Default is to separate by " "
## [1] "Spider Man"
paste("Spider", "Man", sep="-")
## [1] "Spider-Man"
paste("Spider", "Man", "does whatever", sep=", ")
## [1] "Spider, Man, does whatever"
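When no separator is wanted at all, base R also offers paste0(), which behaves like paste() with sep="". A quick sketch comparing the two (joined.a and joined.b are illustrative names):

```r
joined.a = paste("Spider", "Man", sep="") # Explicitly empty separator
joined.b = paste0("Spider", "Man")        # Same thing, shorter to type
joined.a
identical(joined.a, joined.b)
```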
paste() vectorizes
Just like strsplit(), substring(), nchar(), etc. Seeing a theme yet?
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
paste(presidents, c("D", "R", "R", "D", "R"))
## [1] "Clinton D" "Bush R" "Reagan R" "Carter D" "Ford R"
paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)
## [1] "Clinton D" "Bush R" "Reagan D" "Carter R" "Ford D"
paste(presidents, " (", 42:38, ")", sep="")
## [1] "Clinton (42)" "Bush (41)" "Reagan (40)" "Carter (39)"
## [5] "Ford (38)"
Can condense a vector of strings into one big string by using paste() with the collapse argument
presidents
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
paste(presidents, collapse="; ")
## [1] "Clinton; Bush; Reagan; Carter; Ford"
paste(presidents, " (", 42:38, ")", sep="", collapse="; ")
## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"
paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep="", collapse="; ")
## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"
paste(presidents, collapse=NULL) # No condensing
## [1] "Clinton" "Bush" "Reagan" "Carter" "Ford"
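It can help to keep sep and collapse apart: sep goes between the arguments within each element, collapse goes between the elements of the result. A small sketch on toy inputs (letters.vec, pairs, and one.string are made-up names):

```r
letters.vec = c("a", "b", "c")
# sep joins across arguments, element by element: still a vector of length 3
pairs = paste(letters.vec, 1:3, sep="-")
pairs
# collapse then joins the resulting elements into a single string
one.string = paste(letters.vec, 1:3, sep="-", collapse="+")
one.string
```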
Can think of strsplit() and paste() with collapse as inverses of each other
ingredients
## [1] "chickpeas, tahini, olive oil, garlic, salt"
split.ing = strsplit(ingredients, split=",")[[1]]
split.ing
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
paste(split.ing, collapse=",")
## [1] "chickpeas, tahini, olive oil, garlic, salt"
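The round trip recovers the original string as long as split and collapse use the same keyword. A sketch checking this with identical(), here with ", " for both (round.trip is an illustrative name):

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
pieces = strsplit(ingredients, split=", ")[[1]]
# Rejoin with the same keyword that was used for splitting
round.trip = paste(pieces, collapse=", ")
identical(round.trip, ingredients)
```

Note that strsplit() treats split as a regular expression by default, so this simple inverse relationship is only guaranteed for plain keywords like the one used here.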
Here we make use of some list tricks and trimming tricks
c(ingredients, great.profs, favorite.cats)
## [1] "chickpeas, tahini, olive oil, garlic, salt"
## [2] "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
## [3] "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list
## [[1]]
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
##
## [[2]]
## [1] "Nugent" " Genovese" " Greenhouse" " Seltman" " Shalizi"
## [6] " Ventura"
##
## [[3]]
## [1] "tiger" " leopard" " jaguar" " lion"
split.vec = unlist(split.list)
split.vec
## [1] "chickpeas" " tahini" " olive oil" " garlic" " salt"
## [6] "Nugent" " Genovese" " Greenhouse" " Seltman" " Shalizi"
## [11] " Ventura" "tiger" " leopard" " jaguar" " lion"
split.scrambled = sample(split.vec)
split.scrambled
## [1] " salt" " garlic" " lion" " Ventura" "tiger"
## [6] " Greenhouse" " Seltman" " leopard" " jaguar" "Nugent"
## [11] "chickpeas" " Shalizi" " Genovese" " tahini" " olive oil"
split.scrambled.trimmed = trimws(split.scrambled)
split.scrambled.trimmed # Trimmed whitespaces for us!
## [1] "salt" "garlic" "lion" "Ventura" "tiger"
## [6] "Greenhouse" "Seltman" "leopard" "jaguar" "Nugent"
## [11] "chickpeas" "Shalizi" "Genovese" "tahini" "olive oil"
paste(split.scrambled.trimmed, collapse=" + ")
## [1] "salt + garlic + lion + Ventura + tiger + Greenhouse + Seltman + leopard + jaguar + Nugent + chickpeas + Shalizi + Genovese + tahini + olive oil"
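The split, unlist, trim, and paste steps above can be wrapped into one small function. This is only a sketch: rejoin() is a hypothetical helper, not a base R function, and it leaves out the sample() step so that the result is reproducible.

```r
# Hypothetical helper: split each string, flatten, trim, and rejoin
rejoin = function(strings, split=",", collapse=" + ") {
  pieces = unlist(strsplit(strings, split=split)) # List -> one vector
  pieces = trimws(pieces)                         # Drop stray whitespace
  paste(pieces, collapse=collapse)                # Condense to one string
}
rejoined = rejoin(c("tiger, leopard", "jaguar, lion"))
rejoined
```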