Regex Patterns

Statistical Computing, 36-350

Monday September 12, 2016

Why do we need regex patterns?

In last week’s lectures, we computed word tables for documents of interest by splitting up the text and counting the unique words. Snippet:

> clinton.wordtab[1:5]
  —   …  “a “do “go 
 37  26   1   1   1 

These are not all actual words: some entries are punctuation marks, and others have punctuation attached. We need to learn how to split text better, and for this we need regular expressions. This will also help us search text better.
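
As a rough sketch of where this is heading, the code below compares a naive split on spaces with a split on a regex. The sentence in txt is made up for illustration (it is not from the lecture's documents), and the names words.naive and words.regex are hypothetical.

```r
# Made-up example sentence (an assumption, not one of the lecture's documents)
txt = "Time flies, they say... but do fruit flies?"

# Naive split on single spaces: punctuation stays glued to the words
words.naive = strsplit(txt, split=" ")[[1]]
words.naive

# Split on runs of whitespace and/or punctuation instead
words.regex = strsplit(txt, split="[[:space:][:punct:]]+")[[1]]
words.regex

# Word table built from the cleaner split
table(tolower(words.regex))
```

With the regex split, "flies," and "flies?" both become "flies", so they are counted as the same word.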

What are regex patterns?

Scanning for matches to a regex

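Scan a vector of strings for matches to a regex, using grep(). By default it returns the indices of the matching elements; with value=TRUE it returns the matching strings themselves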

str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")
grep("fly", str.vec) 
## [1] 2
grep("fly", str.vec, value=TRUE)
## [1] "time does not fly in 350, because it's not fun"
grep("fly|flies", str.vec, value=TRUE)
## [1] "time flies when you're having fun in 350"      
## [2] "time does not fly in 350, because it's not fun"
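
Note that "Flyers" was not matched above, since matching is case sensitive by default. A small sketch of two related options, using the same str.vec: grepl() returns one TRUE/FALSE per string, and the ignore.case argument of grep() turns off case sensitivity. The names hits.logical and hits.nocase are hypothetical.

```r
str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")

# One logical value per string, instead of the indices of the matches
hits.logical = grepl("fly", str.vec)
hits.logical

# Ignore case, so that "Flyers" now counts as a match for "fly"
hits.nocase = grep("fly", str.vec, ignore.case=TRUE)
hits.nocase
```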

More examples

str.vec.2 = c("time flies when you're having fun in 350",
              "fruit flies when you throw it",
              "a fruit fly is a beautiful creature",
              "how do you spell fruitfly?")
grep("(time|fruit)(fly|flies)", str.vec.2, value=TRUE)
## [1] "how do you spell fruitfly?"
grep("(time|fruit) (fly|flies)", str.vec.2, value=TRUE)
## [1] "time flies when you're having fun in 350"
## [2] "fruit flies when you throw it"           
## [3] "a fruit fly is a beautiful creature"
grep("(time|fruit)  (fly|flies)", str.vec.2, value=TRUE)
## character(0)
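
The three patterns above differ only in the number of spaces between the groups, and each one matches a different subset. One way to cover both the spaced and unspaced spellings, sketched here under the assumption that the ? quantifier (meaning "zero or one of the preceding item") is acceptable, is to make the space optional. The name idx.optional is hypothetical.

```r
str.vec.2 = c("time flies when you're having fun in 350",
              "fruit flies when you throw it",
              "a fruit fly is a beautiful creature",
              "how do you spell fruitfly?")

# " ?" matches zero or one space between the two groups
idx.optional = grep("(time|fruit) ?(fly|flies)", str.vec.2)
idx.optional
```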

Metacharacters

More metacharacters

More examples

str.vec.3 = c("R2D2","r2d2","RJD2","RT85")
grep("[A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "RJD2" "RT85"
grep("[A-Z][0-9][A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2"
grep("[A-Za-z][0-9][A-Za-z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "r2d2"
grep("[A-Z][^0-9][^0-9][0-9]", str.vec.3, value=TRUE)
## [1] "RJD2"
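
A regex only needs to match somewhere inside a string, which is why "RJD2" and "RT85" match "[A-Z][0-9]" above. To see which part of each string was matched, one option is regexpr() together with regmatches(); this is a sketch, and the names m and matched are hypothetical.

```r
str.vec.3 = c("R2D2","r2d2","RJD2","RT85")

# Position of the first match in each string (-1 means no match)
m = regexpr("[A-Z][0-9]", str.vec.3)
as.integer(m)

# The matched substrings, for the strings that have a match
matched = regmatches(str.vec.3, m)
matched
```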

More examples

In R, we need to use double brackets for the special abbreviated character classes, as in “[[:punct:]]”. With single brackets, “[:punct:]” has its own interpretation: it is read as an ordinary character set that matches any one of the characters :, p, u, n, c, t

str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")
grep("[:punct:]", str.vec.4, value=TRUE)
## [1] "im simple i dont like punctuation"      
## [2] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
grep("[[:punct:]]", str.vec.4, value=TRUE)
## [1] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
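
The same double-bracket class can be used to strip punctuation, which ties back to the word tables from earlier. A minimal sketch with gsub(), which replaces every match of a regex; the name no.punct is hypothetical.

```r
str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")

# Replace every punctuation character by the empty string
no.punct = gsub("[[:punct:]]", "", str.vec.4)
no.punct
```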

Escape sequences

More examples

In R, we have to double every backslash in a regex that is written as an ordinary string, because the backslash itself is a special character in an R string. For example, the regex \+ is typed as "\\+"

str.vec.5 = c("Stat + Computing = Magic", 
              "Stat - Computing = Boring Theorems",
              "Do you have the time?")
grep("Stat \\+|Stat -", str.vec.5, value=TRUE)
## [1] "Stat + Computing = Magic"          
## [2] "Stat - Computing = Boring Theorems"
grep("time\\?", str.vec.5, value=TRUE)
## [1] "Do you have the time?"
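
To see what the doubled backslash is doing, the sketch below compares the unescaped and escaped patterns, and also shows the fixed=TRUE argument of grep(), which treats the pattern as a literal string rather than a regex. The names idx.unescaped, idx.escaped and idx.fixed are hypothetical.

```r
str.vec.5 = c("Stat + Computing = Magic",
              "Stat - Computing = Boring Theorems",
              "Do you have the time?")

# The R string "\\+" holds just two characters: a backslash and a plus
nchar("\\+")

# Unescaped, "+" means "one or more of the preceding item" (here, a space)
idx.unescaped = grep("Stat +", str.vec.5)
idx.unescaped

# Escaped, "\\+" matches a literal plus sign
idx.escaped = grep("Stat \\+", str.vec.5)
idx.escaped

# fixed=TRUE matches the pattern literally, so no escaping is needed
idx.fixed = grep("Stat +", str.vec.5, fixed=TRUE)
idx.fixed
```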