Summarizing Text

Statistical Computing, 36-350

Wednesday September 7, 2016

Counting words

Our most basic tool for summarizing text: word counts, retrieved using table()

trump.lines = readLines("")
trump.text = paste(trump.lines, collapse=" ")
trump.words = strsplit(trump.text, split=" ")[[1]]
trump.wordtab = table(trump.words)
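
To see what each step produces without needing the speech file, here is a minimal sketch on a made-up sentence. The string toy.text and the names toy.words and toy.wordtab are illustrative assumptions, not part of the lecture data.

```r
# Toy stand-in for the speech text (hypothetical, for illustration only)
toy.text = "the cat and the dog and the bird"
toy.words = strsplit(toy.text, split=" ")[[1]]  # Character vector, one word per element
toy.wordtab = table(toy.words)                  # One count per unique word
toy.wordtab                                     # "the" appears 3 times, "and" twice
```

Note that splitting on a single space means two spaces in a row would produce an empty string "" as a word.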

The output of table()

What did we get? The unique words, sorted alphabetically, each with its count (the number of times it appears)

class(trump.wordtab)
## [1] "table"
length(trump.wordtab)
## [1] 1604
head(trump.wordtab, 10) # First 10
## trump.words
##                         –           'I   “extremely         “I’m 
##            1           34            1            1            1 
##         “I’M “negligent,”         $150          $19           $2 
##            1            1            1            1            1
plot(trump.wordtab, xlab="Words", ylab="Frequency") # Table objects can be specially plotted

The names are words, the entries are counts

Note: this is actually a vector of numbers, and the words are the names of the vector

head(trump.wordtab, 5) # First 5
## trump.words
##                     –         'I “extremely       “I’m 
##          1         34          1          1          1
trump.wordtab[2] == 34
##    – 
## TRUE
names(trump.wordtab)[2] == "–"
## [1] TRUE
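
A small sketch of the same idea on a toy table (the data here are made up for illustration): the words and the counts can be pulled apart with names() and as.vector().

```r
# Toy table built from a hypothetical vector of words
toy.wordtab = table(c("b", "a", "b", "c", "b"))
toy.names = names(toy.wordtab)       # The words, in alphabetical order
toy.counts = as.vector(toy.wordtab)  # The counts, as a plain integer vector
toy.names
toy.counts
toy.wordtab[2] == 3                  # The result is still labeled with the word "b"
```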

Fetching specific word counts

Reminder: you can index a vector by names (not just by positions)

trump.wordtab[which(names(trump.wordtab) == "already")] # Index by position
## already 
##       1
trump.wordtab["already"] # Same thing
## already 
##       1

We can now use this to look up whatever words we want

trump.wordtab["America"]
## America 
##      19
trump.wordtab["great"]
## great 
##     7
trump.wordtab["wall"]
## wall 
##    1
trump.wordtab["Canada"] # NA means "Canada" never appears as a word on its own
## <NA> 
##   NA
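
An NA is awkward if we want to do arithmetic with counts. One possible workaround is a small helper that returns 0 for absent words. This is a sketch only: word.count is a hypothetical function, not something defined in the lecture, and the toy table is made up.

```r
# Hypothetical helper: look up a word's count, returning 0 if the word is absent
word.count = function(wordtab, word) {
  if (word %in% names(wordtab)) as.vector(wordtab[word]) else 0L
}

toy.wordtab = table(c("great", "wall", "great"))
word.count(toy.wordtab, "great")   # Present: returns its count
word.count(toy.wordtab, "Canada")  # Absent: returns 0 instead of NA
```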

Most frequent words

Let’s sort in decreasing order, to get the most frequent words

trump.wordtab.sorted = sort(trump.wordtab, decreasing=TRUE)
length(trump.wordtab.sorted)
## [1] 1604
head(trump.wordtab.sorted, 20) # First 20
## trump.words
##   the   and    of    to   our  will    in     I  have     a  that   for 
##   189   145   127   126    90    82    69    64    57    51    48    46 
##    is   are    we     – their    be    on   was 
##    40    39    35    34    28    26    26    26
tail(trump.wordtab.sorted, 20) # Last 20
## trump.words
##     wonder    workers  workforce     works,      worth   wouldn’t 
##          1          1          1          1          1          1 
##    wounded years-old,     years.        yet        Yet       Yet, 
##          1          1          1          1          1          1 
##        YOU       you,       You,       you:       YOU.   youngest 
##          1          1          1          1          1          1 
##       YOUR      youth 
##          1          1

Notice that punctuation and capitalization matter, e.g., “yet”, “Yet” and “Yet,” are treated as separate words (not ideal)
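
One possible fix, sketched here on a made-up vector of words, is to strip punctuation and convert to lower case before calling table(). This is an assumption about how one might clean the text, not the approach taken in the lecture.

```r
# Hypothetical words showing the same problem as the tail of the sorted table
toy.words = c("Yet", "Yet,", "yet", "YOU.", "you,", "you:")
# Remove punctuation characters, then convert everything to lower case
toy.clean = tolower(gsub("[[:punct:]]", "", toy.words))
table(toy.clean)  # Only two distinct words remain: "yet" and "you"
```

A caveat: [[:punct:]] also removes apostrophes and hyphens inside words, which may or may not be what we want.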

Visualizing frequencies

Let’s plot the sorted frequencies against their rank, as a line plot

nw = length(trump.wordtab.sorted)
plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")

A pretty drastic-looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)
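
A quick way to check for a power law: if \(\mathrm{Frequency} = C(1/\mathrm{Rank})^a\), then \(\log \mathrm{Frequency} = \log C - a \log \mathrm{Rank}\), a straight line with slope \(-a\) on the log-log scale. The sketch below uses an exact power law with made-up constants (C.toy and a.toy are illustrative, not estimated from the speech).

```r
# Exact power law with hypothetical constants, for illustration only
C.toy = 100; a.toy = 0.5
toy.rank = 1:100
toy.freq = C.toy*(1/toy.rank)^a.toy
# On the log-log scale this is a straight line
plot(log(toy.rank), log(toy.freq), type="l", xlab="log(Rank)", ylab="log(Frequency)")
# Slopes between successive points on the log-log scale all equal -a.toy
toy.slopes = diff(log(toy.freq)) / diff(log(toy.rank))
range(toy.slopes)
```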

Zipf’s law

This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law

For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=215\) and \(a=0.57\)

C = 215; a = 0.57
trump.wordtab.zipf = C*(1/(1:nw))^a # Same as C*(1/1:nw)^a, since : binds tighter than /
cbind(trump.wordtab.sorted[1:8], trump.wordtab.zipf[1:8])
##      [,1]      [,2]
## the   189 215.00000
## and   145 144.82761
## of    127 114.94216
## to    126  97.55831
## our    90  85.90641
## will   82  77.42697
## in     69  70.91410
## I      64  65.71691

Not perfect, but not bad
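
The slides do not say how \(C\) and \(a\) were chosen. One common approach (an assumption here, not necessarily the method used) is least squares on the log-log scale, since \(\log \mathrm{Frequency} \approx \log C - a \log \mathrm{Rank}\). The sketch below runs on synthetic data so that it is self-contained; sim.rank, sim.freq, C.hat and a.hat are illustrative names.

```r
set.seed(1)
sim.rank = 1:200
# Synthetic frequencies: a power law with C = 215, a = 0.57, times mild noise
sim.freq = 215*(1/sim.rank)^0.57 * exp(rnorm(200, sd=0.05))
# Regress log frequency on log rank
sim.fit = lm(log(sim.freq) ~ log(sim.rank))
C.hat = exp(coef(sim.fit)[[1]])  # Intercept estimates log(C)
a.hat = -coef(sim.fit)[[2]]      # Slope estimates -a
c(C.hat, a.hat)                  # Should land close to 215 and 0.57
```

With real word counts, the many ties at low frequencies (lots of words appearing once) can pull such a fit around, so the estimates deserve a visual check against the data.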

Visualizing Zipf’s law

We can plot the original sorted word counts, and overlay the counts estimated by our Zipf’s law formula

plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)

We’ll learn about plotting tools in detail in a couple of weeks