Statistical Computing, 36-350
Wednesday September 16, 2016
Regexes, we’ve seen, are powerful tools for matching patterns of text. Turns out we can also split strings based on them: use strsplit() with split equal to a regex
strsplit("Just .. gotta .... keep ...... going.......",
split=" *\\.+ *")[[1]]
## [1] "Just" "gotta" "keep" "going"
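Note that split is treated as a regex by default, which is why the periods above had to be escaped; with an unescaped ".", every character would match and we would be left with nothing but empty strings. A quick sketch of two ways to split on a literal period:
strsplit("a.b.c", split="\\.")[[1]]   # escape the dot to split on literal periods
## [1] "a" "b" "c"
strsplit("a.b.c", split=".", fixed=TRUE)[[1]]   # or turn off regex matching entirely
## [1] "a" "b" "c"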
strsplit("A semicolon; is used to; join two; or more ideas; in a sentence",
split=";? +")
## [[1]]
## [1] "A" "semicolon" "is" "used" "to"
## [6] "join" "two" "or" "more" "ideas"
## [11] "in" "a" "sentence"
strsplit("Mercedes S550 iPhone 5s Titleist 915D Stat Computing v2",
split=" *[[:alpha:]]*[0-9]+[[:alpha:]]* *")
## [[1]]
## [1] "Mercedes" "iPhone" "Titleist" "Stat Computing"
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words.naive = strsplit(trump.text, split=" ")[[1]]
head(trump.words.naive, 10)
## [1] "Friends," "delegates" "and" "fellow" "Americans:"
## [6] "I" "humbly" "and" "gratefully" "accept"
trump.words = strsplit(trump.text, split="[[:space:]]|[[:punct:]]")[[1]]
head(trump.words, 10)
## [1] "Friends" "" "delegates" "and" "fellow"
## [6] "Americans" "" "I" "humbly" "and"
The second looks better! And a sanity check:
grep("[[:punct:]]", trump.words, value=TRUE)
## character(0)
So there really are no punctuation marks in the second set of words
Let’s get rid of the empty strings and explore the word counts a bit
sum(trump.words == "")
## [1] 592
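As a sketch (trump.words.alt is just an illustrative name), we could also avoid most of these empty strings up front by letting the split pattern gobble up runs of adjacent delimiters, aside from a possible empty string at the very start:
trump.words.alt = strsplit(trump.text, split="([[:space:]]|[[:punct:]])+")[[1]]
head(trump.words.alt, 10)   # should match trump.words once its empty strings are dropped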
trump.words = trump.words[trump.words != ""] # Get rid of empty strings
trump.wordtab = table(trump.words)
head(sort(trump.wordtab, decreasing=TRUE), 5)
## trump.words
## the and of to our
## 189 146 127 126 90
plot(trump.wordtab)
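Before moving on, a few more things we could ask of the word table, just as a sketch (output not shown):
length(trump.wordtab)      # how many distinct words?
sum(trump.wordtab == 1)    # how many words appear exactly once?
head(sort(trump.wordtab, decreasing=TRUE), 20)   # a longer list of the most common words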
Scraping data from the web is not easy. Oftentimes there is a lot of HTML code and/or other “junk” mixed in with the text or data that we care about
Unfortunately there is no general recipe for extracting data from a generic webpage; this is what makes web scraping hard. But regexes can help a lot (recall the earthquake data from Wednesday’s lab)
General steps:
1. Read in the webpage with readLines(), print out some of the results
2. Find the data of interest using grep() and a regex literal
3. Narrow things down using grep() and a regex pattern
4. Use regexpr() and regmatches() with a regex pattern to get the data
Getting data lines is pretty easy, because the data is contiguous
sprint.lines = readLines("http://www.alltime-athletics.com/m_100ok.htm")
length(sprint.lines)
## [1] 3077
head(sprint.lines, 5)
## [1] "<HTML>"
## [2] "<HEAD>"
## [3] "<META Name=\"description\" Content=\"100m, 100 meter, 100 metres, statistics\">"
## [4] "<META Name=\"keywords\" Content=\"friidrott, friidrottsstatistik, track, field, athletics, sprints, running, marathon, high jump, long jump, triple jump, shot put, discus, javelin, hammer, pole vault, heptathlon, decahtlon, relay\">"
## [5] "<TITLE>Men's 100m</TITLE>"
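Much of this is HTML markup rather than data. We won’t need to parse the HTML here, but as a sketch, a crude regex can strip simple tags:
gsub("<[^>]+>", "", sprint.lines[5])   # delete anything of the form <...>
## [1] "Men's 100m"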
(i.first = min(grep("Usain Bolt", sprint.lines)))
## [1] 81
(i.last = max(grep("Julian Forte", sprint.lines)))
## [1] 2921
i.last - i.first + 1 # Sanity check
## [1] 2841
sprint.lines[i.first:(i.first+4)]
## [1] " 1 9.58 +0.9 Usain Bolt JAM 21.08.86 1 Berlin 16.08.2009"
## [2] " 2 9.63 +1.5 Usain Bolt JAM 21.08.86 1 London 05.08.2012"
## [3] " 3 9.69 ±0.0 Usain Bolt JAM 21.08.86 1 Beijing 16.08.2008"
## [4] " 3 9.69 +2.0 Tyson Gay USA 09.08.82 1 Shanghai 20.09.2009"
## [5] " 3 9.69 -0.1 Yohan Blake JAM 26.12.89 1 Lausanne 23.08.2012"
sprint.lines = sprint.lines[i.first:i.last] # Throw away everything else
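As an extra sanity check (just a sketch, not strictly needed): each retained line should start with a rank number.
all(grepl("^[[:space:]]*[0-9]+[[:space:]]", sprint.lines))   # hopefully TRUE: every line is a data row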
Now let’s extract the times themselves
time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}[[:space:]]"
test.vec = c(" 9.58 ", " 10.05 ", "13.81")
as.numeric(grep(time.pattern, test.vec, value=TRUE))
## [1] 9.58 10.05
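We can also check what regexpr() and regmatches() give back on this small test vector; the third element has no leading whitespace, so it has no match and is simply dropped:
regmatches(test.vec, regexpr(time.pattern, test.vec))
## [1] " 9.58 "  " 10.05 "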
sprint.times = as.numeric(regmatches(sprint.lines, regexpr(time.pattern, sprint.lines)))
sprint.times[1:10]
## [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
length(sprint.times) # Sanity check ... uh oh
## [1] 2766
head(grep("Obadele Thompson", sprint.lines, value=TRUE), 1)
## [1] " 131 9.87A -0.2 Obadele Thompson BAR 30.03.76 1 Johannesburg 11.09.1998"
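Given the counts above, the first pattern must be missing 2841 - 2766 = 75 lines; here’s a quick check:
sum(!grepl(time.pattern, sprint.lines))   # lines the first pattern misses
## [1] 75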
Second attempt at extracting times: catch the trailing “A” (which marks a time run at high altitude)
time.pattern = "[[:space:]]+(9|(10))\\.[0-9]{2}([[:space:]]|A)"
sprint.times = regmatches(sprint.lines, regexpr(time.pattern, sprint.lines))
length(sprint.times) # Sanity check
## [1] 2841
sprint.times = as.numeric(regmatches(sprint.times, regexpr("(9|(10))\\.[0-9]{2}", sprint.times)))
length(sprint.times) # One more sanity check
## [1] 2841
sprint.times[1:10]
## [1] 9.58 9.63 9.69 9.69 9.69 9.71 9.72 9.72 9.74 9.74
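An alternative to the two rounds of regmatches(), just as a sketch (sprint.matches and sprint.times.alt are illustrative names): strip the surrounding whitespace and the “A” from the matched strings with gsub(), then convert.
sprint.matches = regmatches(sprint.lines, regexpr(time.pattern, sprint.lines))
sprint.times.alt = as.numeric(gsub("[[:space:]]|A", "", sprint.matches))
identical(sprint.times.alt, sprint.times)   # should be TRUE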
plot(sprint.times, type="l", xlab="Rank", ylab="Time")
Some motivation for learning cool plot tools, which we’ll cover next week!
sprint.times.unique = unique(sprint.times)   # the distinct times achieved
sprint.times.table = table(sprint.times)     # how many runs at each distinct time
n.unique = length(sprint.times.unique)
# Empty plot, with custom y-axis ticks at the distinct times
plot(sprint.times, type="n", xlab="Rank", ylab="Time", yaxt="n",
     main=paste("The", length(sprint.times), "fastest 100m dash times"))
axis(side=2, at=sprint.times.unique, labels=sprint.times.unique)
(mar = par("usr"))   # extent of the plotting region, as (x1, x2, y1, y2)
## [1] -112.6000 2954.6000 9.5596 10.1104
xleft = mar[1]; xright = mar[2]
# Shaded horizontal bands between consecutive distinct times
rect(xleft, sprint.times.unique[-n.unique], xright, sprint.times.unique[-1],
     col=c("pink","orange","lightblue"), border=NA, density=20)
abline(h=sprint.times.unique)   # horizontal line at each distinct time
lines(sprint.times, lwd=2)      # the times themselves, in rank order
# Counts of runs at each time, jittered horizontally and placed mid-band
text(2700+50*rnorm(n.unique-1),
     (sprint.times.unique[-n.unique]+sprint.times.unique[-1])/2,
     labels=sprint.times.table[-n.unique], cex=0.75)