Statistical Computing, 36-350
Wednesday September 16, 2016
Regexes, we’ve seen, are powerful tools for matching patterns of text. Turns out we can also split strings based on them: use strsplit() with split equal to a regex
strsplit("Just .. gotta .... keep ...... going.......",
split=" *\\.+ *")[[1]]
## [1] "Just" "gotta" "keep" "going"
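Note that split is treated as a regex by default, which is why the periods above had to be escaped; with an unescaped ".", every character would match and we would be left with nothing but empty strings. A quick sketch of two ways to split on a literal period:
strsplit("a.b.c", split="\\.")[[1]]   # escape the dot to split on literal periods
## [1] "a" "b" "c"
strsplit("a.b.c", split=".", fixed=TRUE)[[1]]   # or turn off regex matching entirely
## [1] "a" "b" "c"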
strsplit("A semicolon; is used to; join two; or more ideas; in a sentence",
split=";? +")
## [[1]]
## [1] "A" "semicolon" "is" "used" "to"
## [6] "join" "two" "or" "more" "ideas"
## [11] "in" "a" "sentence"
strsplit("Mercedes S550 iPhone 5s Titleist 915D Stat Computing v2",
split=" *[[:alpha:]]*[0-9]+[[:alpha:]]* *")
## [[1]]
## [1] "Mercedes" "iPhone" "Titleist" "Stat Computing"
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words.naive = strsplit(trump.text, split=" ")[[1]]
head(trump.words.naive, 10)
## [1] "Friends," "delegates" "and" "fellow" "Americans:"
## [6] "I" "humbly" "and" "gratefully" "accept"
trump.words = strsplit(trump.text, split="[[:space:]]|[[:punct:]]")[[1]]
head(trump.words, 10)
## [1] "Friends" "" "delegates" "and" "fellow"
## [6] "Americans" "" "I" "humbly" "and"
The second looks better! And a sanity check:
grep("[[:punct:]]", trump.words, value=TRUE)
## character(0)
So there really are no punctuation marks in the second set of words
Let’s get rid of the empty strings and explore the word counts a bit
sum(trump.words == "")
## [1] 592
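As a sketch (trump.words.alt is just an illustrative name), we could also avoid most of these empty strings up front by letting the split pattern gobble up runs of adjacent delimiters, aside from a possible empty string at the very start:
trump.words.alt = strsplit(trump.text, split="([[:space:]]|[[:punct:]])+")[[1]]
head(trump.words.alt, 10)   # should match trump.words once its empty strings are dropped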
trump.words = trump.words[trump.words != ""] # Get rid of empty strings
trump.wordtab = table(trump.words)
head(sort(trump.wordtab, decreasing=TRUE), 5)
## trump.words
## the and of to our
## 189 146 127 126 90
plot(trump.wordtab)
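Before moving on, a few more things we could ask of the word table, just as a sketch (output not shown):
length(trump.wordtab)      # how many distinct words?
sum(trump.wordtab == 1)    # how many words appear exactly once?
head(sort(trump.wordtab, decreasing=TRUE), 20)   # a longer list of the most common words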
Scraping data from the web is not easy. Oftentimes there is a lot of HTML code and/or other “junk” mixed in with the text or data that we care about
Unfortunately there is no general recipe for extracting data from a generic webpage; this is what makes web scraping hard. But regexes can help a lot (recall the earthquake data from Wednesday’s lab)
General steps:
1. Read in the webpage with readLines(), print out some of the results
2. Find the data of interest using grep() and a regex literal
3. Narrow things down using grep() and a regex pattern
4. Use regexpr() and regmatches() with a regex pattern to get the data
Getting data lines is pretty easy, because the data is contiguous
sprint.lines = readLines("http://www.alltime-athletics.com/m_100ok.htm")
length(sprint.lines)
## [1] 3077
head(sprint.lines, 5)
## [1] "<HTML>"
## [2] "<HEAD>"
## [3] "<META Name=\"description\" Content=\"100m, 100 meter, 100 metres, statistics\">"
## [4] "<META Name=\"keywords\" Content=\"friidrott, friidrottsstatistik, track, field, athletics, sprints, running, marathon, high jump, long jump, triple jump, shot put, discus, javelin, hammer, pole vault, heptathlon, decahtlon, relay\">"
## [5] "<TITLE>Men's 100m</TITLE>"
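Much of this is HTML markup rather than data. We won’t need to parse the HTML here, but as a sketch, a crude regex can strip simple tags:
gsub("<[^>]+>", "", sprint.lines[5])   # delete anything of the form <...>
## [1] "Men's 100m"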
(i.first = min(grep("Usain Bolt", sprint.lines)))
## [1] 81
(i.last = max(grep("Julian Forte", sprint.lines)))
## [1] 2921
i.last - i.first + 1 # Sanity check
## [1] 2841
sprint.lines[i.first:(i.first+4)]
## [1] " 1 9.58 +0.9 Usain Bolt JAM 21.08.86 1 Berlin 16.08.2009"
## [2] " 2 9.63 +1.5 Usain Bolt JAM 21.08.86 1 London 05.08.2012"
## [3] " 3 9.69 ±0.0 Usain Bolt JAM 21.08.86 1 Beijing 16.08.2008"
## [4] " 3 9.69 +2.0 Tyson Gay USA 09.08.82 1 Shanghai 20.09.2009"
## [5] " 3 9.69 -0.1 Yohan Blake JAM 26.12.89 1 Lausanne 23.08.2012"
sprint.lines = sprint.lines[i.first:i.last] # Throw away everything else
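As an extra sanity check (just a sketch, not strictly needed): each retained line should start with a rank number.
all(grepl("^[[:space:]]*[0-9]+[[:space:]]", sprint.lines))   # hopefully TRUE: every line is a data row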
Now let’s extract the times themselves
time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}[[:space:]]"
test.vec = c(" 9.58 ", " 10.05 ", "13.81")
as.numeric(grep(time.pattern, test.vec, value=TRUE))
## [1] 9.58 10.05
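We can also check what regexpr() and regmatches() give back on this small test vector; the third element has no leading whitespace, so it has no match and is simply dropped:
regmatches(test.vec, regexpr(time.pattern, test.vec))
## [1] " 9.58 "  " 10.05 "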
sprint.times = as.numeric(regmatches(sprint.lines, regexpr(time.pattern, sprint.lines)))
sprint.times[1:10]
## [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
length(sprint.times) # Sanity check ... uh oh
## [1] 2766
head(grep("Obadele Thompson", sprint.lines, value=TRUE), 1)
## [1] " 131 9.87A -0.2 Obadele Thompson BAR 30.03.76 1 Johannesburg 11.09.1998"
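Given the counts above, the first pattern must be missing 2841 - 2766 = 75 lines; here’s a quick check:
sum(!grepl(time.pattern, sprint.lines))   # lines the first pattern misses
## [1] 75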
Second attempt at extracting times: catch the trailing “A” (which marks a time run at high altitude)
time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}([[:space:]]|A)"
sprint.times = regmatches(sprint.lines, regexpr(time.pattern, sprint.lines))
length(sprint.times) # Sanity check
## [1] 2841
sprint.times = as.numeric(regmatches(sprint.times, regexpr("(9|(10))\\.[0-9]{2}", sprint.times)))
length(sprint.times) # One more sanity check
## [1] 2841
sprint.times[1:10]
## [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
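An alternative to the two rounds of regmatches(), just as a sketch (sprint.matches and sprint.times.alt are illustrative names): strip the surrounding whitespace and the “A” from the matched strings with gsub(), then convert.
sprint.matches = regmatches(sprint.lines, regexpr(time.pattern, sprint.lines))
sprint.times.alt = as.numeric(gsub("[[:space:]]|A", "", sprint.matches))
identical(sprint.times.alt, sprint.times)   # should be TRUE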
plot(sprint.times, type="l", xlab="Rank", ylab="Time")
Some motivation for learning cool plot tools, which we’ll cover next week!
sprint.times.unique = unique(sprint.times)   # the distinct times achieved
sprint.times.table = table(sprint.times)     # how many runs at each distinct time
n.unique = length(sprint.times.unique)
# Empty plot, with custom y-axis ticks at the distinct times
plot(sprint.times, type="n", xlab="Rank", ylab="Time", yaxt="n",
     main=paste("The", length(sprint.times), "fastest 100m dash times"))
axis(side=2, at=sprint.times.unique, labels=sprint.times.unique)
(mar = par("usr"))   # extent of the plotting region, as (x1, x2, y1, y2)
## [1] -112.6000 2954.6000 9.5596 10.1104
xleft = mar[1]; xright = mar[2]
# Shaded horizontal bands between consecutive distinct times
rect(xleft, sprint.times.unique[-n.unique], xright, sprint.times.unique[-1],
     col=c("pink","orange","lightblue"), border=NA, density=20)
abline(h=sprint.times.unique)   # horizontal line at each distinct time
lines(sprint.times, lwd=2)      # the times themselves, in rank order
# Counts of runs at each time, jittered horizontally and placed mid-band
text(2700+50*rnorm(n.unique-1),
     (sprint.times.unique[-n.unique]+sprint.times.unique[-1])/2,
     labels=sprint.times.table[-n.unique], cex=0.75)