Splitting and Searching with Regexes

Statistical Computing, 36-350

Wednesday September 16, 2016

Splitting on a regex

As we’ve seen, regexes are powerful tools for matching patterns in text. It turns out we can also split strings with them: call strsplit() with the split argument set to a regex

strsplit("Just .. gotta  .... keep ......  going.......", 
  split=" *\\.+ *")[[1]]
## [1] "Just"  "gotta" "keep"  "going"
strsplit("A semicolon; is used to; join two; or more ideas; in a sentence",
         split=";? +")
## [[1]]
##  [1] "A"         "semicolon" "is"        "used"      "to"       
##  [6] "join"      "two"       "or"        "more"      "ideas"    
## [11] "in"        "a"         "sentence"
strsplit("Mercedes S550 iPhone 5s Titleist 915D Stat Computing v2", 
         split=" *[[:alpha:]]*[0-9]+[[:alpha:]]* *")
## [[1]]
## [1] "Mercedes"       "iPhone"         "Titleist"       "Stat Computing"
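
A small sketch of two things the examples above rely on: strsplit() returns a list with one element per input string (hence the [[1]] in the first example), and the text matched by the regex is consumed as the delimiter. The input strings here are toy data, not from the lecture.

```r
# Toy inputs (made up for illustration)
x = c("apples, pears;plums", "kiwi ;  figs")

# Split on a comma or semicolon, together with any surrounding spaces
parts = strsplit(x, split=" *[,;] *")

length(parts)  # One list element per input string
parts[[1]]     # Pieces of the first string
parts[[2]]     # Pieces of the second string
```

Since the spaces are part of the match, they are dropped along with the punctuation, so the pieces come back already trimmed.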

Example: Trump’s RNC speech

trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words.naive = strsplit(trump.text, split=" ")[[1]]
head(trump.words.naive, 10)
##  [1] "Friends,"   "delegates"  "and"        "fellow"     "Americans:"
##  [6] "I"          "humbly"     "and"        "gratefully" "accept"
trump.words = strsplit(trump.text, split="[[:space:]]|[[:punct:]]")[[1]]
head(trump.words, 10)
##  [1] "Friends"   ""          "delegates" "and"       "fellow"   
##  [6] "Americans" ""          "I"         "humbly"    "and"

The second looks better! And a sanity check:

grep("[[:punct:]]", trump.words, value=TRUE)
## character(0)

So there are really no punctuation marks in the second set of words
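
Why the empty strings in the second set of words? The pattern matches one space or one punctuation mark at a time, so a comma followed by a space counts as two delimiters with nothing in between. A possible alternative, sketched here on a short excerpt of the speech, is to split on runs of such characters by adding a + quantifier. This is a variation on the pattern, not what the slides do.

```r
s = "Friends, delegates and fellow Americans: I humbly"

# One delimiter character at a time: ", " produces an empty string
one.at.a.time = strsplit(s, split="[[:space:]]|[[:punct:]]")[[1]]
sum(one.at.a.time == "")  # Two empty strings, after "Friends" and "Americans"

# Runs of delimiter characters: ", " is treated as a single delimiter
runs = strsplit(s, split="([[:space:]]|[[:punct:]])+")[[1]]
sum(runs == "")           # No empty strings
```

A string that begins with a delimiter would still give one leading empty string, so filtering out "" afterwards remains a reasonable safety net.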

(Continued)

Let’s get rid of the empty strings and explore the word counts a bit

sum(trump.words == "")
## [1] 592
trump.words = trump.words[trump.words != ""] # Get rid of empty strings
trump.wordtab = table(trump.words)
head(sort(trump.wordtab, decreasing=TRUE), 5)
## trump.words
## the and  of  to our 
## 189 146 127 126  90
plot(trump.wordtab)
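
To see what table() and sort() are doing here, a minimal sketch on a toy word vector (made up, small enough to check by hand):

```r
# Toy word vector (made up for illustration)
toy.words = c("the", "and", "the", "of", "the", "and")

toy.tab = table(toy.words)                # Count occurrences of each distinct word
toy.top = sort(toy.tab, decreasing=TRUE)  # Most frequent words first

names(toy.top)      # The words, ordered by count
as.vector(toy.top)  # The counts themselves
```

The names of the table are the distinct words, and the entries are their counts, which is why head() on the sorted table shows the most common words.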

Searching with a regex

Scraping data from the web is not easy. Often there is a lot of HTML code and/or other “junk” mixed in with the text or data that we care about

Unfortunately, there is no general recipe for extracting data from a generic webpage, which is what makes web scraping hard. But regexes can help a lot (recall the earthquake data from Wednesday’s lab)

(Continued)

General steps:

Example: fastest 100m sprint times

Getting data lines is pretty easy, because the data is contiguous

sprint.lines = readLines("http://www.alltime-athletics.com/m_100ok.htm")
length(sprint.lines)
## [1] 3077
head(sprint.lines, 5)
## [1] "<HTML>"                                                                                                                                                                                                                                  
## [2] "<HEAD>"                                                                                                                                                                                                                                  
## [3] "<META Name=\"description\" Content=\"100m, 100 meter, 100 metres, statistics\">"                                                                                                                                                         
## [4] "<META Name=\"keywords\" Content=\"friidrott, friidrottsstatistik, track, field, athletics, sprints, running, marathon, high jump, long jump, triple jump, shot put, discus, javelin, hammer, pole vault, heptathlon, decahtlon, relay\">"
## [5] "<TITLE>Men's 100m</TITLE>"
(i.first = min(grep("Usain Bolt", sprint.lines)))
## [1] 81
(i.last = max(grep("Julian Forte", sprint.lines)))
## [1] 2921
i.last - i.first + 1 # Sanity check
## [1] 2841
sprint.lines[i.first:(i.first+4)]
## [1] "        1      9.58       +0.9    Usain Bolt                     JAM     21.08.86    1      Berlin                        16.08.2009"       
## [2] "        2      9.63       +1.5    Usain Bolt                     JAM     21.08.86    1      London                        05.08.2012"       
## [3] "        3      9.69       &plusmn;0.0    Usain Bolt                     JAM     21.08.86    1      Beijing                       16.08.2008"
## [4] "        3      9.69       +2.0    Tyson Gay                      USA     09.08.82    1      Shanghai                      20.09.2009"       
## [5] "        3      9.69       -0.1    Yohan Blake                    JAM     26.12.89    1      Lausanne                      23.08.2012"
sprint.lines = sprint.lines[i.first:i.last] # Throw away everything else
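
The same trick on a toy page, as a sketch: find the first and last data lines with grep(), then keep everything in between. The toy.page vector is made up (heavily abbreviated lines in the spirit of the real page) so the example runs without a network connection.

```r
# Toy "web page": junk lines around a contiguous block of data lines
toy.page = c("<HTML>", "<PRE>",
             "  1  9.58  Usain Bolt",
             "  2  9.63  Usain Bolt",
             "  3  9.69  Tyson Gay",
             "</PRE>", "</HTML>")

toy.first = min(grep("Usain Bolt", toy.page))  # Index of first data line
toy.last = max(grep("Tyson Gay", toy.page))    # Index of last data line
toy.data = toy.page[toy.first:toy.last]        # Throw away everything else
length(toy.data)
```

This only works when we already know which records come first and last, and when the data lines are contiguous. If a search string is not found, grep() returns an empty vector and min() gives Inf with a warning, so it is worth checking the indices before subsetting.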

(Continued)

Now let’s extract the times themselves

time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}[[:space:]]"
test.vec = c("  9.58 ", " 10.05  ", "13.81")
as.numeric(grep(time.pattern, test.vec, value=TRUE))
## [1]  9.58 10.05
sprint.times = as.numeric(regmatches(sprint.lines, regexpr(time.pattern, sprint.lines)))
sprint.times[1:10]
##  [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
length(sprint.times) # Sanity check ... uh oh
## [1] 2766
head(grep("Obadele Thompson", sprint.lines, value=TRUE), 1)
## [1] "        131    9.87A      -0.2    Obadele Thompson               BAR     30.03.76    1      Johannesburg                  11.09.1998"
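
Why did the length come up short? regmatches() only returns the matches, so lines where regexpr() found nothing are dropped without complaint. A sketch of how one might track down the lines that were missed, using toy lines modeled on the page's format (abbreviated, and "Toy Runner" is made up):

```r
# Toy lines (abbreviated; the third one is invented for illustration)
toy.lines = c("  1      9.58       +0.9    Usain Bolt",
              "  131    9.87A      -0.2    Obadele Thompson",
              "  200    10.01      +1.0    Toy Runner")

first.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}[[:space:]]"
m = regexpr(first.pattern, toy.lines)  # Equals -1 wherever there is no match

matched = regmatches(toy.lines, m)     # Non-matching lines are silently dropped
length(matched)                        # Shorter than length(toy.lines)

missed = toy.lines[m == -1]            # The lines the pattern failed on
missed
```

Looking at the missed lines directly shows the culprit: the time is followed by "A" rather than by whitespace.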

(Continued)

Second attempt at extracting times: also catch a trailing “A” (which marks a time run at altitude)

time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}([[:space:]]|A)"
sprint.times = regmatches(sprint.lines, regexpr(time.pattern, sprint.lines))
length(sprint.times) # Sanity check 
## [1] 2841
sprint.times = as.numeric(regmatches(sprint.times, regexpr("(9|(10))\\.[0-9]{2}", sprint.times)))
length(sprint.times) # One more sanity check
## [1] 2841
sprint.times[1:10]
##  [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
plot(sprint.times, type="l", xlab="Rank", ylab="Time") 
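
Why two rounds of regmatches()? The first round grabs the time together with its surrounding whitespace or trailing "A", and as.numeric() cannot handle the "A". The second round strips the match down to just the number. A sketch on toy lines (abbreviated from the page's format, and "Toy Runner" is made up):

```r
# Toy lines (abbreviated; the third one is invented for illustration)
toy.lines = c("  1      9.58       +0.9    Usain Bolt",
              "  131    9.87A      -0.2    Obadele Thompson",
              "  200    10.01      +1.0    Toy Runner")

second.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}([[:space:]]|A)"
raw = regmatches(toy.lines, regexpr(second.pattern, toy.lines))
raw                                     # Times with padding, one ends in "A"

naive = suppressWarnings(as.numeric(raw))
naive                                   # NA for the entry with the trailing "A"

# Second round: keep only the numeric part of each match
clean = as.numeric(regmatches(raw, regexpr("(9|(10))\\.[0-9]{2}", raw)))
clean
```

Leading and trailing whitespace is harmless to as.numeric(), which is why the first attempt could convert its matches directly.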

(Continued)

Some motivation for learning cool plot tools, which we’ll cover next week!

sprint.times.unique = unique(sprint.times)
sprint.times.table = table(sprint.times)
n.unique = length(sprint.times.unique)
plot(sprint.times, type="n", xlab="Rank", ylab="Time", yaxt="n",
     main=paste("The", length(sprint.times), "fastest 100m dash times"))
axis(side=2, at=sprint.times.unique, labels=sprint.times.unique)
(usr = par("usr")) # Limits of the plot region: x1, x2, y1, y2
## [1] -112.6000 2954.6000    9.5596   10.1104
xleft = usr[1]; xright = usr[2]
rect(xleft, sprint.times.unique[-n.unique], xright, sprint.times.unique[-1],
     col=c("pink","orange","lightblue"), border=NA, density=20)
abline(h=sprint.times.unique)
lines(sprint.times, lwd=2)
text(2700+50*rnorm(n.unique-1), 
     (sprint.times.unique[-n.unique]+sprint.times.unique[-1])/2,
     labels=sprint.times.table[-n.unique], cex=0.75)
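
A note on par("usr"), used above to make the shaded bands span the full width of the plot: it returns the limits of the plot region in user coordinates, in the order x1, x2, y1, y2. With the default axis style, R pads the data range by 4% on each side, which is why the left limit above is negative. A small sketch with toy times (made up), drawn to an off-screen device so that it runs without opening a window:

```r
pdf(NULL)  # Off-screen graphics device, no file is written

toy.times = c(9.58, 9.63, 9.69, 9.69, 9.71)  # Toy data
plot(toy.times, type="n", xlab="Rank", ylab="Time")

usr = par("usr")  # x1, x2, y1, y2 of the plot region
usr               # x range 1 to 5 padded by 4% of 4, so 0.84 to 5.16

# Shade one band across the full width of the plot region
rect(usr[1], 9.63, usr[2], 9.69, col="pink", border=NA)
lines(toy.times, lwd=2)

invisible(dev.off())
```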