Counting words

Our most basic tool for summarizing text: word counts, retrieved using `table()`

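trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt") # One string per line of the file
trump.text = paste(trump.lines, collapse=" ") # Join all lines into one long string
trump.words = strsplit(trump.text, split=" ")[[1]] # Split on single spaces into a vector of words
trump.wordtab = table(trump.words) # Count the appearances of each unique word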
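
To see what each step does, here is a minimal sketch of the same pipeline on a toy text. The two toy lines and the `toy.*` names are made up for illustration and are not part of the speech data.

```r
# Toy text standing in for the speech (an assumption, not the real data)
toy.lines = c("the cat sat on the mat", "the dog sat")
toy.text = paste(toy.lines, collapse=" ")        # One long string
toy.words = strsplit(toy.text, split=" ")[[1]]   # Character vector of 9 words
toy.wordtab = table(toy.words)                   # Counts per unique word
toy.wordtab

## toy.words
## cat dog mat  on sat the 
##   1   1   1   1   2   3
```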

The output of `table()`

What did we get? The unique words, sorted alphabetically, each with its count (the number of times it appears)

class(trump.wordtab)

## [1] "table"

length(trump.wordtab)

## [1] 1604

trump.wordtab[1:10]

## trump.words
##                         –           'I   “extremely         “I’m 
##            1           34            1            1            1 
##         “I’M “negligent,”         $150          $19           $2 
##            1            1            1            1            1

plot(trump.wordtab, xlab="Words", ylab="Frequency") # Table objects can be specially plotted

The names are words, the entries are counts

Note: this is actually a vector of numbers, and the words are the names of the vector

trump.wordtab[1:5]

## trump.words
##                     –         'I “extremely       “I’m 
##          1         34          1          1          1

trump.wordtab[2] == 34

##    – 
## TRUE

names(trump.wordtab)[2] == "–"

## [1] TRUE
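
A small sketch of pulling the two pieces apart, on a toy table (the `toy.*` names and letters are illustrative only): `names()` gives the words, and `as.vector()` gives the bare counts.

```r
toy.wordtab = table(c("b", "a", "b", "c", "b"))
toy.names = names(toy.wordtab)        # The words: "a" "b" "c"
toy.counts = as.vector(toy.wordtab)   # The counts, with names dropped: 1 3 1
toy.wordtab[toy.wordtab > 1]          # Logical indexing works as for any named vector
```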

Fetching specific word counts

Reminder: you can index a vector by names (not just by positions)

trump.wordtab[100]

## already 
##       1

trump.wordtab["already"] # Same thing

## already 
##       1

We can now use this to look up whatever words we want

trump.wordtab["America"]

## America 
##      19

trump.wordtab["great"]

## great 
##     7

trump.wordtab["wall"]

## wall 
##    1

trump.wordtab["Canada"] # NA means the exact word "Canada" is not in the table

## <NA> 
##   NA
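
If a count of 0 is more convenient than NA for a missing word, one option is a small wrapper. This is a sketch only: `word.count()` is a hypothetical helper, not a function from the lecture or from base R, and the toy table is made up.

```r
# Hypothetical helper (an assumption, not from the lecture):
# returns the count of a word, or 0 if the word is not in the table
word.count = function(wordtab, word) {
  if (word %in% names(wordtab)) as.vector(wordtab[word]) else 0L
}

toy.wordtab = table(c("wall", "great", "great"))
word.count(toy.wordtab, "great")

## [1] 2

word.count(toy.wordtab, "Canada")

## [1] 0
```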

Most frequent words

Let’s sort in decreasing order, to get the most frequent words

trump.wordtab.sorted = sort(trump.wordtab, decreasing=TRUE)
length(trump.wordtab.sorted)

## [1] 1604

head(trump.wordtab.sorted, 20) # First 20

## trump.words
##   the   and    of    to   our  will    in     I  have     a  that   for 
##   189   145   127   126    90    82    69    64    57    51    48    46 
##    is   are    we     – their    be    on   was 
##    40    39    35    34    28    26    26    26

tail(trump.wordtab.sorted, 20) # Last 20

## trump.words
##     wonder    workers  workforce     works,      worth   wouldn’t 
##          1          1          1          1          1          1 
##    wounded years-old,     years.        yet        Yet       Yet, 
##          1          1          1          1          1          1 
##        YOU       you,       You,       you:       YOU.   youngest 
##          1          1          1          1          1          1 
##       YOUR      youth 
##          1          1

Notice that punctuation matters, e.g., “Yet” and “Yet,” are treated as separate words (not ideal)
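
One possible cleanup, sketched on toy words (the `toy.*` names are illustrative): strip punctuation and lowercase everything before calling `table()`. This is a rough approach, not the lecture's method. It also removes apostrophes inside words, and whether `[[:punct:]]` matches curly quotes such as “ can depend on the locale.

```r
toy.words = c("Yet", "Yet,", "yet", "you:", "YOU.", "you")
toy.clean = tolower(gsub("[[:punct:]]", "", toy.words))  # Drop punctuation, then lowercase
toy.clean = toy.clean[toy.clean != ""]                   # Drop tokens that were only punctuation
toy.cleantab = table(toy.clean)
toy.cleantab

## toy.clean
## yet you 
##   3   3
```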

Visualizing frequencies

Let’s plot the sorted frequencies against their rank (1 = most frequent word)

nw = length(trump.wordtab.sorted)
plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")

A pretty drastic-looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)

Zipf’s law

This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law

For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=215\) and \(a=0.57\)

C = 215; a = 0.57
trump.wordtab.zipf = C*(1/(1:nw))^a # Same as (1/1:nw)^a, since ":" binds tighter than "/"
cbind(trump.wordtab.sorted[1:8], trump.wordtab.zipf[1:8])

##      [,1]      [,2]
## the   189 215.00000
## and   145 144.82761
## of    127 114.94216
## to    126  97.55831
## our    90  85.90641
## will   82  77.42697
## in     69  70.91410
## I      64  65.71691

Not perfect, but not bad
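
The values of \(C\) and \(a\) are given above without saying how they were chosen. One common way to estimate them, sketched here as an assumption rather than the lecture's method: taking logs gives \(\log \mathrm{Frequency} \approx \log C - a \log \mathrm{Rank}\), which is linear, so we can fit it with `lm()`. The toy frequencies below are synthetic and follow the power law exactly, so the fit recovers the values we put in.

```r
# Synthetic frequencies that follow a power law exactly (illustrative values)
toy.rank = 1:200
toy.freq = 215 * (1/toy.rank)^0.57

# Least squares on the log-log scale: log(freq) = log(C) - a*log(rank)
toy.fit = lm(log(toy.freq) ~ log(toy.rank))
C.hat = exp(coef(toy.fit)[1])   # Intercept estimates log(C)
a.hat = -coef(toy.fit)[2]       # Slope estimates -a
c(C.hat, a.hat)                 # Close to 215 and 0.57 for this synthetic data
```

On real word counts the fit will not be exact, and a log-scale fit over all ranks may give values different from those quoted above.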

Visualizing Zipf’s law

We can plot the original sorted word counts, with those estimated by our Zipf’s law formula drawn on top

plot(1:nw, trump.wordtab.sorted, type="l", xlab="Rank", ylab="Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)

We’ll learn about plotting tools in detail in a couple of weeks

Summarizing Text

Statistical Computing, 36-350

Wednesday September 7, 2016

