Two documents of interest

Before we can start comparing them, let’s get our two documents of interest into R and calculate word counts for each. (Note: we convert all words to lower case)

# First one
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.words = strsplit(paste(trump.lines, collapse=" "), split=" ")[[1]]
trump.wordtab = table(tolower(trump.words))

# Second one
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
clinton.words = strsplit(paste(clinton.lines, collapse=" "), split=" ")[[1]]
clinton.wordtab = table(tolower(clinton.words))

Simple pre-processing

It turns out that both tables actually have an entry for the empty string, “”; let’s get rid of these entries

trump.wordtab[names(trump.wordtab) == ""] # We have 1 empty word with Trump
##   
## 1
clinton.wordtab[names(clinton.wordtab) == ""] # We have 312 empty words with Clinton
##     
## 312
trump.wordtab = trump.wordtab[names(trump.wordtab) != ""] 
clinton.wordtab = clinton.wordtab[names(clinton.wordtab) != ""] # Let's get rid of them

Basic comparisons

Let’s make some basic comparisons. (Note: we know these are imperfect, because our words may contain punctuation marks, etc. We’ll ignore this for now; next week we’ll see a nice way to clean this up, though a rough preview of one possible fix appears after the comparisons below)

# Who had the longer speech (spoke more words)?
sum(trump.wordtab)
## [1] 4436
sum(clinton.wordtab)
## [1] 5407
# Who used more unique words?
length(trump.wordtab)
## [1] 1508
length(clinton.wordtab)
## [1] 1717
# Who repeated themselves less (higher percentage of unique words)?
length(trump.wordtab) / sum(trump.wordtab) * 100
## [1] 33.99459
length(clinton.wordtab) / sum(clinton.wordtab) * 100
## [1] 31.75513
# Who used "great" more? 
trump.wordtab["great"]
## great 
##     8
clinton.wordtab["great"]
## great 
##     5
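
As a rough preview of one possible cleanup (just a sketch using base R’s gsub() with a POSIX character class, not necessarily the approach we’ll see next week), we could strip punctuation from each word before tabulating

# A sketch of one possible cleanup: remove punctuation from each (lower-case)
# word, then drop any words that become empty and re-tabulate
trump.words.clean = gsub("[[:punct:]]", "", tolower(trump.words))
trump.wordtab.clean = table(trump.words.clean[trump.words.clean != ""])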

How to go beyond the basics?

Suppose we want to make more advanced comparisons. E.g., given a new (third) document, is it more like the first or the second?

We must think of a way of representing our two documents. We want our representation to:

  1. Be easy to generate from the raw documents, and be easy to work with
  2. Highlight important aspects of the documents, and suppress unimportant aspects

Think we need some fancy representation? Think again! We pretty much already have what we need

Bag-of-words model

The bag-of-words model is central to information retrieval and natural language processing. The idea: a document is nothing more than a bag of words, i.e., order doesn’t matter, only the word counts

This is just like what we’re already doing! The key difference is that we must calculate word counts for words that appear across all documents

# Let's set up our third document, the query document
query.text = "Make America great again."
query.words = strsplit(query.text, split=" ")[[1]]
query.wordtab = table(tolower(query.words))

# Let's get all the words, then just consider the unique ones, sorted alphabetically
all.words = c(names(trump.wordtab), names(clinton.wordtab), names(query.wordtab))
all.words.unique = sort(unique(all.words))
length(all.words.unique)
## [1] 2686

Document-term matrix

This is a matrix that is \((\text{# of documents}) \times (\text{# of unique words})\), or for us, 3 x 2686. Each row contains the word counts for one document

dt.mat = matrix(0, 3, length(all.words.unique))
rownames(dt.mat) = c("Trump", "Clinton", "Query")
colnames(dt.mat) = all.words.unique
dt.mat[1, names(trump.wordtab)] = trump.wordtab
dt.mat[2, names(clinton.wordtab)] = clinton.wordtab
dt.mat[3, names(query.wordtab)] = query.wordtab
dt.mat[, 1:10]
##         -  –  — . … 'i "let "midnight "morning "planting
## Trump   0 34  0 0 0  1    0         0        0         0
## Clinton 1 26 13 1 3  0    1         1        1         1
## Query   0  0  0 0 0  0    0         0        0         0
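
As a quick sanity check (an extra step, not part of the original construction), the row sums of dt.mat should reproduce the total word counts we computed earlier

# Total words per document from the document-term matrix; the first two values
# should match sum(trump.wordtab) and sum(clinton.wordtab) from before, and the
# query ("Make America great again.") contributes 4 words
rowSums(dt.mat)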

Computing word count differences

Which words had the biggest (absolute) differences in counts between our two documents?

dt.diff = dt.mat[1,] - dt.mat[2,]
dt.diff[sample(length(dt.diff), 5)]
##   families.           .    america" autoworkers          10 
##          -1          -1          -1          -1           0
dt.diff.abs.sorted = sort(abs(dt.diff), decreasing=TRUE) 
dt.diff.abs.sorted[1:20] # Top 20 biggest absolute differences
##     and     you       a    will      to      we      of      he    have 
##      66      58      51      49      42      37      31      30      29 
##  people      it    just     our     not believe      as    what    been 
##      23      21      21      19      18      17      16      16      15 
##  should      am 
##      15      14

But now we don’t know the direction of the differences, e.g., who said “believe” more?
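
(As an aside, for a single word we can just index dt.diff by name, as in the sketch below; a positive value means Trump used the word more, a negative value means Clinton did. The systematic way to recover the signs for all the top words uses order(), reviewed next)

# Direction for one particular word: positive means Trump said it more,
# negative means Clinton said it more
dt.diff["believe"]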

Reminder: order()

The sort() function gives us the values in sorted order; its close friend, the order() function gives us the indices that correspond to sorted order

x = c(5, 2, 7)
sort(x, decreasing=TRUE)
## [1] 7 5 2
order(x, decreasing=TRUE)
## [1] 3 1 2
x[order(x, decreasing=TRUE)]
## [1] 7 5 2

Now back to our running data set

inds.sorted = order(abs(dt.diff), decreasing=TRUE)
dt.diff[inds.sorted[1:20]] # Top 20 biggest differences, with signs
##     and     you       a    will      to      we      of      he    have 
##     -66     -58     -51      49     -42     -37      31     -30      29 
##  people      it    just     our     not believe      as    what    been 
##     -23     -21     -21      19     -18     -17     -16     -16      15 
##  should      am 
##     -15      14

Computing correlations

Which of our two documents is “closest” to the query? Depends on how we define “closest”; but one reasonable way is to use correlation, easily computed with the cor() function

x = 1:10; y = 1:10; z = 5*(1:10)
cor(x, y)
## [1] 1
cor(x, z)
## [1] 1
(y = 1:10 + rnorm(10))
##  [1]  0.5758977  2.2368037  0.6572769  4.9616966  4.3955743  5.2471227
##  [7]  5.4443884  6.5461063  9.0563318 10.5093694
cor(x, y)
## [1] 0.9494454

Now back to our running data set

cor(dt.mat[1,], dt.mat[3,])
## [1] 0.05231936
cor(dt.mat[2,], dt.mat[3,])
## [1] 0.03750096

IDF weighting

How well does our bag-of-words model meet our two goals: 1. easy to set up and use, 2. helpful at teasing out important aspects, and muting unimportant ones?

Not so strong on the second goal (recall the word count differences). But an extension called inverse-document-frequency or IDF weighting can help a lot with this

Helps when we have many documents in our collection (also called our corpus). Suppose we have \(D\) documents total. IDF weighting works as follows: e.g., the word count \[ (\text{# of times Trump said great}) \] becomes the weighted word count \[ (\text{# of times Trump said great}) \times \log\bigg(\frac{D}{\text{# of documents with word great}}\bigg) \] (Note: you’ll see this as TF-IDF, where the TF stands for term-frequency)

So when all documents contain the word in question, the weight is zero. Better to have a large corpus, say \(D\) in the hundreds or thousands, not \(D=2\) like us
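
Even so, here is a minimal sketch of the computation, using the document-term matrix built above and treating all three of its rows (including the query) as the corpus, so \(D=3\); with a corpus this small the weights aren’t very informative, but the mechanics are the same for any \(D\)

# IDF weighting on our (tiny) corpus: D = 3 documents, one per row of dt.mat
D = nrow(dt.mat)
n.docs.with.word = colSums(dt.mat > 0)          # For each word, # of documents containing it
idf.weights = log(D / n.docs.with.word)         # IDF weight per word
tfidf.mat = sweep(dt.mat, 2, idf.weights, "*")  # Weighted word counts (TF-IDF)

# "great" appears in all three documents, so its weight is log(3/3) = 0
idf.weights["great"]

With many documents, this up-weights words that are specific to a few documents and zeroes out words that appear everywhere, which is exactly the kind of suppression of unimportant aspects we were after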