Before we can start comparing our two documents of interest, let’s get them into R and calculate word counts for each. (Note: we convert all words to lower case)
# First one
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.words = strsplit(paste(trump.lines, collapse=" "), split=" ")[[1]]
trump.wordtab = table(tolower(trump.words))
# Second one
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
clinton.words = strsplit(paste(clinton.lines, collapse=" "), split=" ")[[1]]
clinton.wordtab = table(tolower(clinton.words))
Turns out that both tables actually have an entry for the empty string, “”; let’s get rid of these entries
trump.wordtab[names(trump.wordtab) == ""] # We have 1 empty word with Trump
##
## 1
clinton.wordtab[names(clinton.wordtab) == ""] # We have 312 empty words with Clinton
##
## 312
trump.wordtab = trump.wordtab[names(trump.wordtab) != ""]
clinton.wordtab = clinton.wordtab[names(clinton.wordtab) != ""] # Let's get rid of them
Let’s make some basic comparisons. (Note: we know these are imperfect, because our words may contain punctuation marks, etc. Ignore for now, but next week we’ll see a nice way to clean this up)
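As a rough, hypothetical preview of what such a cleanup could look like (just a sketch, not necessarily the approach we’ll see next week), we could strip punctuation with gsub() and retabulate; the .clean variable names are only for illustration
# Sketch: strip punctuation before tabulating (illustration only)
trump.words.clean = gsub("[[:punct:]]", "", tolower(trump.words))
trump.wordtab.clean = table(trump.words.clean[trump.words.clean != ""])
trump.wordtab.clean["great"] # Variants like "great," or "great." would now collapse into "great"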
# Who had the longer speech (spoke more words)?
sum(trump.wordtab)
## [1] 4436
sum(clinton.wordtab)
## [1] 5407
# Who used more unique words?
length(trump.wordtab)
## [1] 1508
length(clinton.wordtab)
## [1] 1717
# Who repeated themselves less (higher percentage of unique words)?
length(trump.wordtab) / sum(trump.wordtab) * 100
## [1] 33.99459
length(clinton.wordtab) / sum(clinton.wordtab) * 100
## [1] 31.75513
# Who used "great" more?
trump.wordtab["great"]
## great
## 8
clinton.wordtab["great"]
## great
## 5
Suppose we want to make more advanced comparisons. E.g., given a new (third) document, is it more like the first or the second?
We must think of a way of representing our two documents. We want our representation to: 1. be easy to set up and use, and 2. be helpful at teasing out important aspects of the documents, and muting unimportant ones
Think we need some fancy representation? Think again! We pretty much already have what we need
The bag-of-words model is central to information retrieval and natural language processing. The idea: a document is nothing more than a bag of words, i.e., order doesn’t matter, only the word counts
This is a lot like what we’re already doing! The key difference is that we must calculate word counts for words that appear across all documents
# Let's set up our third document, the query document
query.text = "Make America great again."
query.words = strsplit(query.text, split=" ")[[1]]
query.wordtab = table(tolower(query.words))
# Let's get all the words, then just consider the unique ones, sorted alphabetically
all.words = c(names(trump.wordtab), names(clinton.wordtab), names(query.wordtab))
all.words.unique = sort(unique(all.words))
length(all.words.unique)
## [1] 2686
We build a document-term matrix: a matrix of dimension \((\text{# of documents}) \times (\text{# of unique words})\), or for us, 3 x 2686. Each row contains the word counts for one document
dt.mat = matrix(0, 3, length(all.words.unique))
rownames(dt.mat) = c("Trump", "Clinton", "Query")
colnames(dt.mat) = all.words.unique
dt.mat[1, names(trump.wordtab)] = trump.wordtab
dt.mat[2, names(clinton.wordtab)] = clinton.wordtab
dt.mat[3, names(query.wordtab)] = query.wordtab
dt.mat[, 1:10]
## - – — . … 'i "let "midnight "morning "planting
## Trump 0 34 0 0 0 1 0 0 0 0
## Clinton 1 26 13 1 3 0 1 1 1 1
## Query 0 0 0 0 0 0 0 0 0 0
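As a quick sanity check, the row sums of dt.mat should recover each document’s total word count, matching what we computed earlier (plus 4 words for the query)
rowSums(dt.mat) # Should agree with sum(trump.wordtab), sum(clinton.wordtab), and the 4 query words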
Which words had the biggest (absolute) differences in counts between our two documents?
dt.diff = dt.mat[1,] - dt.mat[2,]
dt.diff[sample(length(dt.diff), 5)]
## families. . america" autoworkers 10
## -1 -1 -1 -1 0
dt.diff.abs.sorted = sort(abs(dt.diff), decreasing=TRUE)
dt.diff.abs.sorted[1:20] # Top 20 biggest absolute differences
## and you a will to we of he have
## 66 58 51 49 42 37 31 30 29
## people it just our not believe as what been
## 23 21 21 19 18 17 16 16 15
## should am
## 15 14
But now we don’t know the direction of the differences, e.g., who said “believe” more?
The sort() function gives us the values in sorted order; its close friend, the order() function gives us the indices that correspond to sorted order
x = c(5, 2, 7)
sort(x, decreasing=TRUE)
## [1] 7 5 2
order(x, decreasing=TRUE)
## [1] 3 1 2
x[order(x, decreasing=TRUE)]
## [1] 7 5 2
Now back to our running data set. Recall dt.diff = dt.mat[1,] - dt.mat[2,], so a positive entry means Trump used the word more, and a negative entry means Clinton did
inds.sorted = order(abs(dt.diff), decreasing=TRUE)
dt.diff[inds.sorted[1:20]] # Top 20 biggest differences, with signs
## and you a will to we of he have
## -66 -58 -51 49 -42 -37 31 -30 29
## people it just our not believe as what been
## -23 -21 -21 19 -18 -17 -16 -16 15
## should am
## -15 14
Which of our two documents is “closest” to the query? Depends on how we define “closest”; but one reasonable way is to use correlation, easily computed with the cor() function
x = 1:10; y = 1:10; z = 5*(1:10)
cor(x, y)
## [1] 1
cor(x, z)
## [1] 1
(y = 1:10 + rnorm(10))
## [1] 0.5758977 2.2368037 0.6572769 4.9616966 4.3955743 5.2471227
## [7] 5.4443884 6.5461063 9.0563318 10.5093694
cor(x, y)
## [1] 0.9494454
Now back to our running data set
cor(dt.mat[1,], dt.mat[3,])
## [1] 0.05231936
cor(dt.mat[2,], dt.mat[3,])
## [1] 0.03750096
How well does our bag-of-words model meet our two goals: 1. easy to set up and use, 2. helpful at teasing out important aspects, and muting unimportant ones?
Not so strong on the second goal (recall the word count differences). But an extension called inverse-document-frequency or IDF weighting can help a lot with this
Helps when we have many documents in our collection (also called our corpus). Suppose we have \(D\) documents total. IDF weighting works as follows: e.g., the word count \[ (\text{# of times Trump said great}) \] becomes the weighted word count \[ (\text{# of times Trump said great}) \times \log\bigg(\frac{D}{\text{# of documents with word great}}\bigg) \] (Note: you’ll see this as TF-IDF, where the TF stands for term-frequency)
So when all documents contain the word in question, the weight is zero. Better to have a large corpus, say \(D\) in the hundreds or thousands, not \(D=2\) like us
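To make the mechanics concrete, here is a minimal sketch of IDF weighting applied to our document-term matrix, treating all three rows (the two speeches plus the query) as the corpus, so \(D=3\); with a corpus this tiny the weights aren’t very meaningful, but the computation is the same for a big one. The variable names below are just for illustration
# Minimal sketch of TF-IDF weighting on dt.mat (corpus = all 3 rows, so D = 3)
D = nrow(dt.mat)
n.docs.with.word = colSums(dt.mat > 0) # In how many documents does each word appear?
idf.weights = log(D / n.docs.with.word) # IDF weight for each word
tfidf.mat = sweep(dt.mat, 2, idf.weights, "*") # Weighted word counts (TF times IDF)
tfidf.mat[, "great"] # All zeros: "great" appears in every document, so its weight is log(3/3) = 0
We could then redo the correlations with tfidf.mat in place of dt.mat; in a real corpus of hundreds or thousands of documents, common filler words like “and” and “the” would be down-weighted toward zero, which is exactly the help with goal 2 we were after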