Our most basic tool for summarizing text: word counts, retrieved using table()
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words = strsplit(trump.text, split=" ")[[1]]
trump.wordtab = table(trump.words)
What did we get? The unique words, sorted alphabetically, each with its count (the number of times it appears)
class(trump.wordtab)
## [1] "table"
length(trump.wordtab)
## [1] 1604
trump.wordtab[1:10]
## trump.words
##                         –           'I   “extremely         “I’m
##            1           34            1            1            1
##         “I’M “negligent,”         $150          $19           $2
##            1            1            1            1            1
plot(trump.wordtab, xlab="Words", ylab="Frequency") # Table objects can be specially plotted
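To see the same steps on something small enough to check by hand, here is a minimal sketch on a made-up sentence (the names toy.text, toy.words, toy.wordtab are ours, for illustration only):

```r
# Toy text, made up for illustration (not from the speech)
toy.text = "the cat and the dog and the bird"
toy.words = strsplit(toy.text, split=" ")[[1]]
toy.wordtab = table(toy.words)
toy.wordtab          # Names are the unique words, values are their counts
length(toy.wordtab)  # Number of unique words
toy.wordtab["the"]   # Count for one particular word
```

Here "the" appears 3 times and "and" appears 2 times, so those are the counts we should see in the table.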
Note: this is actually a vector of numbers, and the words are the names of the vector
trump.wordtab[1:5]
## trump.words
##                     –         'I “extremely       “I’m
##          1         34          1          1          1
trump.wordtab[2] == 34
## –
## TRUE
names(trump.wordtab)[2] == "–"
## [1] TRUE
Reminder: you can index a vector by names (not just by positions)
trump.wordtab[100]
## already
## 1
trump.wordtab["already"] # Same thing
## already
## 1
We can now use this to look up whatever words we want
trump.wordtab["America"]
## America
## 19
trump.wordtab["great"]
## great
## 7
trump.wordtab["wall"]
## wall
## 1
trump.wordtab["Canada"] # NA means "Canada" is not among the names, i.e., it never appears as a word on its own
## <NA>
## NA
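If we would rather get a count of 0 than an NA for a missing word, one option is a small wrapper. This is only a sketch: count.word() is a hypothetical helper of ours, not a base R function, and the toy table is made up.

```r
# Hypothetical helper (not part of base R): return the count of a word,
# or 0 when the word is not among the names of the table
count.word = function(wordtab, word) {
  if (word %in% names(wordtab)) as.integer(wordtab[word]) else 0L
}

toy.wordtab = table(c("wall", "great", "great"))  # Made-up words
count.word(toy.wordtab, "great")   # Word is present
count.word(toy.wordtab, "Canada")  # Word is absent, so we get 0 instead of NA
```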
Let’s sort in decreasing order, to get the most frequent words
trump.wordtab.sorted = sort(trump.wordtab, decreasing=TRUE)
length(trump.wordtab.sorted)
## [1] 1604
head(trump.wordtab.sorted, 20) # First 20
## trump.words
## the and of to our will in I have a that for
## 189 145 127 126 90 82 69 64 57 51 48 46
## is are we – their be on was
## 40 39 35 34 28 26 26 26
tail(trump.wordtab.sorted, 20) # Last 20
## trump.words
## wonder workers workforce works, worth wouldn’t
## 1 1 1 1 1 1
## wounded years-old, years. yet Yet Yet,
## 1 1 1 1 1 1
## YOU you, You, you: YOU. youngest
## 1 1 1 1 1 1
## YOUR youth
## 1 1
Notice that punctuation matters, e.g., “Yet” and “Yet,” are treated as separate words (not ideal)
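One possible fix, sketched on made-up words: strip punctuation with gsub() and convert to lower case with tolower() before tabulating. This is an assumption about what cleaning we might want, not something done to the counts in these notes. How non-ASCII marks such as curly quotes are handled can depend on the locale, and stripping apostrophes also changes words like "wouldn't".

```r
# Made-up words that differ only in case and trailing punctuation
toy.words = c("Yet", "Yet,", "yet", "you,", "You,", "you:")
# Remove punctuation characters, then convert to lower case
toy.clean = tolower(gsub("[[:punct:]]", "", toy.words))
table(toy.words)  # Six separate "words", each appearing once
table(toy.clean)  # Two words, "yet" and "you", each appearing three times
```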
Let’s plot the sorted frequencies against their rank to visualize them
nw = length(trump.wordtab.sorted)
plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")
A pretty drastic-looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law
For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=215\) and \(a=0.57\)
C = 215; a = 0.57
trump.wordtab.zipf = C*(1/(1:nw))^a # Parentheses make the precedence of : over / explicit
cbind(trump.wordtab.sorted[1:8], trump.wordtab.zipf[1:8])
## [,1] [,2]
## the 189 215.00000
## and 145 144.82761
## of 127 114.94216
## to 126 97.55831
## our 90 85.90641
## will 82 77.42697
## in 69 70.91410
## I 64 65.71691
Not perfect, but not bad
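Where might values like \(C\) and \(a\) come from? One common approach (not necessarily how the values above were chosen) is to take logs, so that \(\log \mathrm{Frequency} \approx \log C - a \log \mathrm{Rank}\), and fit a line with lm(). A minimal sketch on synthetic frequencies that follow a power law exactly; toy.rank, toy.freq, and the values 200 and 0.6 are made up:

```r
# Synthetic data: frequencies following an exact power law (made-up values)
toy.rank = 1:50
toy.freq = 200 * (1/toy.rank)^0.6
# On the log scale this is a straight line in log(rank)
fit = lm(log(toy.freq) ~ log(toy.rank))
C.hat = exp(coef(fit)[1])  # Intercept estimates log(C)
a.hat = -coef(fit)[2]      # Slope estimates -a
c(C.hat, a.hat)            # Should recover roughly 200 and 0.6
```

On real word counts the fit will not be exact, and the many words that appear only once can pull the fitted line around, so the estimates are worth checking against a plot.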
We can plot the original sorted word counts, with those estimated by our formula drawn on top
plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)
We’ll learn about plotting tools in detail in a couple of weeks