Statistical Computing, 36-350
Monday September 12, 2016
In last week’s lectures, we computed word tables from documents of interest by splitting up the text and counting unique words. A snippet:
> clinton.wordtab[1:5]
  —   …  “a “do “go
 37  26   1   1   1
Not all of these are actual words: some are punctuation marks, and others have punctuation attached. To split text better, we need regular expressions. These will also help us search text better.
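As a preview of where this is headed, here is a minimal sketch of splitting on a regex rather than on a single space. The sample string and the names txt and words are made up for illustration (they are not from the lecture's documents), and the pattern is just one reasonable choice of delimiter.

```r
# Hypothetical sample text, not one of the lecture's documents
txt = "Hello, world! This is -- well -- a test."

# Treat any run of whitespace and/or punctuation as a single delimiter
words = strsplit(txt, split="[[:space:][:punct:]]+")[[1]]
words
## [1] "Hello" "world" "This"  "is"    "well"  "a"     "test"
```

Splitting this way keeps punctuation from being counted as words or sticking to them.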
Scan a vector of strings for matches to a regex, using grep()
str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")
grep("fly", str.vec)
## [1] 2
grep("fly", str.vec, value=TRUE)
## [1] "time does not fly in 350, because it's not fun"
grep("fly|flies", str.vec, value=TRUE)
## [1] "time flies when you're having fun in 350"
## [2] "time does not fly in 350, because it's not fun"
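Note that "Flyers" was never matched above, because matching is case-sensitive by default. A short sketch of two related options, using the same str.vec: the ignore.case argument of grep(), and grepl(), which returns one logical per string instead of indices. The names idx, idx.nocase and hits are only for illustration.

```r
str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")

# Default: case-sensitive, so "Flyers" does not match "fly"
idx = grep("fly", str.vec)
idx
## [1] 2

# Ignoring case, the "Fly" in "Flyers" matches too
idx.nocase = grep("fly", str.vec, ignore.case=TRUE)
idx.nocase
## [1] 2 3

# grepl() gives a logical vector, one entry per string
hits = grepl("fly", str.vec)
hits
## [1] FALSE  TRUE FALSE
```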
str.vec.2 = c("time flies when you're having fun in 350",
              "fruit flies when you throw it",
              "a fruit fly is a beautiful creature",
              "how do you spell fruitfly?")
grep("(time|fruit)(fly|flies)", str.vec.2, value=TRUE)
## [1] "how do you spell fruitfly?"
grep("(time|fruit) (fly|flies)", str.vec.2, value=TRUE)
## [1] "time flies when you're having fun in 350"
## [2] "fruit flies when you throw it"
## [3] "a fruit fly is a beautiful creature"
grep("(time|fruit)  (fly|flies)", str.vec.2, value=TRUE) # two spaces between the groups
## character(0)
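The parentheses matter here, because "|" applies to everything on either side of it unless grouped. A small sketch of the difference, on a made-up vector str.vec.6 (hypothetical, chosen so that the two patterns disagree):

```r
# Hypothetical strings, chosen so the two patterns give different answers
str.vec.6 = c("time is money",
              "butterflies",
              "fruit flies like bananas")

# Grouped: (time or fruit), a space, then (fly or flies)
grouped = grep("(time|fruit) (fly|flies)", str.vec.6, value=TRUE)
grouped
## [1] "fruit flies like bananas"

# Ungrouped: read as "time" or "fruit fly" or "flies"
ungrouped = grep("time|fruit fly|flies", str.vec.6, value=TRUE)
ungrouped
## [1] "time is money"            "butterflies"
## [3] "fruit flies like bananas"
```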
str.vec.3 = c("R2D2","r2d2","RJD2","RT85")
grep("[A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "RJD2" "RT85"
grep("[A-Z][0-9][A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2"
grep("[A-Za-z][0-9][A-Za-z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "r2d2"
grep("[A-Z][^0-9][^0-9][0-9]", str.vec.3, value=TRUE)
## [1] "RJD2"
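grep() only says which strings match, not where. To see what a bracket pattern actually picked up, one option is regexpr() together with regmatches(); this is a sketch on the same str.vec.3, with m and pieces as illustrative names.

```r
str.vec.3 = c("R2D2","r2d2","RJD2","RT85")

# Position of the first match in each string (-1 means no match)
m = regexpr("[A-Z][0-9]", str.vec.3)
as.integer(m)
## [1]  1 -1  3  2

# The matched substrings themselves (strings with no match are dropped)
pieces = regmatches(str.vec.3, m)
pieces
## [1] "R2" "D2" "T8"
```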
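In R, an abbreviated character class such as “[:punct:]” must itself sit inside a bracket expression, which gives double brackets: “[[:punct:]]”. Written alone, “[:punct:]” is read as an ordinary bracket expression, matching any one of the characters “:”, “p”, “u”, “n”, “c”, “t”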
str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")
grep("[:punct:]", str.vec.4, value=TRUE)
## [1] "im simple i dont like punctuation"
## [2] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
grep("[[:punct:]]", str.vec.4, value=TRUE)
## [1] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
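Back to the word table problem: the same class can be used to strip punctuation rather than just detect it. A minimal sketch with gsub(), replacing every punctuation character by the empty string (no.punct is an illustrative name):

```r
str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")

# Delete every character in the punctuation class
no.punct = gsub("[[:punct:]]", "", str.vec.4)
no.punct
## [1] "im simple i dont like punctuation"
## [2] "Im all about punctuation"
```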
In R, we have to double every backslash in a regex, because the backslash is itself an escape character in an R string: to match a literal “+”, the regex \+ is typed as "\\+"
str.vec.5 = c("Stat + Computing = Magic",
              "Stat - Computing = Boring Theorems",
              "Do you have the time?")
grep("Stat \\+|Stat -", str.vec.5, value=TRUE)
## [1] "Stat + Computing = Magic"
## [2] "Stat - Computing = Boring Theorems"
grep("time\\?", str.vec.5, value=TRUE)
## [1] "Do you have the time?"
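To see why the escaping is needed, here is a sketch comparing an unescaped "+", an escaped one, and the fixed=TRUE argument of grep(), which treats the whole pattern as a literal string. The names unescaped, escaped and literal are only for illustration.

```r
str.vec.5 = c("Stat + Computing = Magic",
              "Stat - Computing = Boring Theorems",
              "Do you have the time?")

# The R string "\\+" holds two characters: a backslash and a plus
nchar("\\+")
## [1] 2

# Unescaped, " +" means "one or more spaces", so both Stat strings match
unescaped = grep("Stat +", str.vec.5)
unescaped
## [1] 1 2

# Escaped, the plus is a literal character
escaped = grep("Stat \\+", str.vec.5)
escaped
## [1] 1

# fixed=TRUE switches off regex interpretation entirely
literal = grep("Stat +", str.vec.5, fixed=TRUE)
literal
## [1] 1
```

Using fixed=TRUE is convenient when the search string has no regex features at all, but it gives up alternation, character classes, and everything else above.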