Information Retrieval 2 — Queries and Relevance

36-462/662

Lecture 3, 4 September 2019

What we know how to do

Take a big collection of items (e.g., documents)
Represent each item as a vector of features (e.g., bag of words)
Calculate distances between items (e.g., Euclidean distance with normalization and IDF)
Find nearest items to a given item

source("http://www.stat.cmu.edu/~cshalizi/dm/19/hw/01/nytac-and-bow.R")
music.stories <- read.directory("nyt_corpus/music")
art.stories <- read.directory("nyt_corpus/art")
art.BoW.list <- lapply(art.stories, table)
music.BoW.list <- lapply(music.stories, table)
nyt.BoW.frame <- make.BoW.frame(c(art.BoW.list, music.BoW.list), row.names = c(paste("art", 
    1:length(art.BoW.list), sep = "."), paste("music", 1:length(music.BoW.list), 
    sep = ".")))
dim(nyt.BoW.frame)

## [1]  102 4431

The trick: Queries are documents

Turn the query string into a bag of words vector
Find distances to other vectors in the data base
Return the closest items
- Optional: weigh distance against other measures of quality

The trick in action

query.by.similarity <- function(query, BoW.frame) {
    query.vec = strip.text(query)
    query.BoW = table(query.vec)
    lexicon = colnames(BoW.frame)
    query.vocab = names(query.BoW)
    query.lex = query.BoW[intersect(query.vocab, lexicon)]
    query.lex[setdiff(lexicon, query.vocab)] = 0
    query.lex = query.lex[lexicon]
    q = t(as.matrix(query.lex))
    idf = get.idf.weights(BoW.frame)
    BoW = scale.cols(BoW.frame, idf)
    q = q * idf
    BoW = div.by.euc.length(BoW)
    q = q/sqrt(sum(q^2))
    best.index = nearest.points(q, BoW)$which
    best.name = rownames(BoW)[best.index]
    return(list(best.index = best.index, best.name = best.name))
}
get.idf.weights <- function(x) {
    doc.freq <- colSums(x > 0)
    doc.freq[doc.freq == 0] <- 1
    w <- log(nrow(x)/doc.freq)
    return(w)
}

query.by.similarity("jazz lincoln center", nyt.BoW.frame)

## $best.index
## [1] 96
## 
## $best.name
## [1] "music.39"

paste(music.stories[[39]][1:25])

##  [1] "perched"     "five"        "stories"     "above"       "columbus"   
##  [6] "circle"      "in"          "the"         "time"        "warner"     
## [11] "center"      "rafael"      "vi"          "olys"        "new"        
## [16] "design"      "for"         "jazz"        "at"          "lincoln"    
## [21] "center"      "has"         "a"           "cool"        "ethereality"

query.by.similarity("painting sale", nyt.BoW.frame)

## $best.index
## [1] 30
## 
## $best.name
## [1] "art.30"

paste(art.stories[[30]][1:25])

##  [1] "xl"          "xavier"      "laboulbenne" "gallery"     "#"          
##  [6] "west"        "#nd"         "street"      "chelsea"     "through"    
## [11] "feb"         "#popular"    "culture"     "may"         "be"         
## [16] "the"         "mainspring"  "for"         "a"           "lot"        
## [21] "of"          "new"         "art"         "but"         "it"

Evaluating queries

Usually done in terms of relevance
Something is relevant (to the user) if it makes a difference to what they think, how they act, etc.
Want all the relevant items and only relevant items
Precision: What fraction of returned items are relevant \(= \frac{\mathrm{number\ of\ hits}}{\mathrm{number\ of\ items\ returned}}\)
Recall: What fraction of relevant items are returned \(= \frac{\mathrm{number\ of\ hits}}{\mathrm{number\ of\ relevant\ items}}\)
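
As a minimal sketch of the two definitions above, the helper below computes precision and recall for a single query. The function name `precision.recall` and the item names in the example are made up for illustration; they are not part of the course code.

```r
# Hypothetical helper (not from the course code): precision and recall
# for one query, given the identifiers of returned and relevant items
precision.recall <- function(returned, relevant) {
    hits <- length(intersect(returned, relevant))   # returned AND relevant
    c(precision = hits / length(returned),
      recall = hits / length(relevant))
}

# Made-up example: 4 items returned, 5 truly relevant, 3 in common
returned <- c("music.39", "music.2", "art.7", "music.11")
relevant <- c("music.39", "music.2", "music.11", "music.20", "music.31")
pr <- precision.recall(returned, relevant)
pr   # precision 3/4, recall 3/5
```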

Precision-recall curve

Trade-off:
- Returning everything guarantees 100% recall
- If you’re any good at all, returning fewer, higher-ranked items improves precision
The precision-recall curve:
- Threshold in terms of number of items returned, or confidence in the relevance, etc.
- For each value of the threshold, calculate precision and recall of the query
- Plot precision vs. recall
- Connect the dots
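
The steps above can be sketched as follows, assuming the threshold is the number of items returned and that we already know which ranked items are relevant. The function `pr.curve` and the ranking are hypothetical, for illustration only.

```r
# Hypothetical sketch: precision and recall at every cut-off k = 1, ..., n
# is.relevant: logical vector in ranked order (best match first)
pr.curve <- function(is.relevant) {
    hits <- cumsum(is.relevant)     # number of hits among the top k
    k <- seq_along(is.relevant)     # number of items returned
    data.frame(k = k, precision = hits / k, recall = hits / sum(is.relevant))
}

# Made-up ranking of 6 items, 3 of which are relevant
ranked.relevance <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
curve <- pr.curve(ranked.relevance)
curve
```

Plotting `curve$precision` against `curve$recall` with `type = "b"` in `plot()` then connects the dots.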

Expanding on the query: Rocchio’s algorithm

Get users to show you what’s relevant, instead of trying to tell you
Run a query with vector \(\vec{q}_t\), show user results
User marks results as relevant (\(R\)) or not-relevant (\(N\))
\(\vec{q}_{t+1} = \alpha \vec{q}_t + \frac{\beta}{|R|}\sum_{\vec{x} \in R}{\vec{x}} - \frac{\gamma}{|N|}\sum_{\vec{y}\in N}{\vec{y}}\)
- \(\alpha\): continuity between old query and new
- \(\beta\): amps up recall (more like the relevant stuff!)
- \(\gamma\): amps up precision (less like the irrelevant stuff!)
- Control settings, not parameters
Iterate with new query vector \(\vec{q}_{t+1}\)
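
A minimal sketch of one update step, under stated assumptions: the function name `rocchio.update`, the default values of \(\alpha\), \(\beta\), \(\gamma\), and the choice to drop a term when \(R\) or \(N\) is empty are all illustrative choices, not something the formula above prescribes.

```r
# Hypothetical sketch of one Rocchio update (names and defaults are assumptions)
# q: current query vector; X: matrix with one row per returned item
# relevant: logical vector of user marks, one per row of X
rocchio.update <- function(q, X, relevant, alpha = 1, beta = 0.75, gamma = 0.25) {
    # Average of the relevant rows; the term is dropped (set to 0) if R is empty
    mean.R <- if (any(relevant)) colMeans(X[relevant, , drop = FALSE]) else 0
    # Average of the not-relevant rows; dropped if N is empty
    mean.N <- if (any(!relevant)) colMeans(X[!relevant, , drop = FALSE]) else 0
    alpha * q + beta * mean.R - gamma * mean.N
}

# Made-up example: three returned items, the first two marked relevant
q <- c(1, 0, 0)
X <- rbind(c(1, 1, 0), c(1, 0, 1), c(0, 0, 1))
q.new <- rocchio.update(q, X, relevant = c(TRUE, TRUE, FALSE))
q.new   # 1.75 0.375 0.125
```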

Rocchio’s algorithm and adaptation

Basic strategy: “do more of what worked and less of what didn’t”
Needs feedback about what worked and what didn’t
Other instances of the basic strategy:
- Lots of online / incremental estimation procedures
- Psychological conditioning
- Reinforcement learning (Sutton and Barto 1998)
- Natural selection and evolutionary optimization procedures (Mitchell 1996)
- Bayesian inference (Shalizi 2009)

Evaluating relevance is hard

Conceptually: it’s not just binary but at least scalar
- Excellent if subtle psychology: Sperber and Wilson (1995)
Practically: how do you tell whether document X was relevant to query Q?
- Users don’t usually tell you!
- You could ask them (but that costs time, money, trouble…)
Substitutes for relevance: engagement, clicks, payments, …
- “If you’re not paying for it, you’re not the customer; you’re the product being sold”
- Rocchio’s algorithm also “works” for these substitutes
- “This is what will keep Alice watching” (even if it melts her brain)
- “This is what Babur keeps buying” (even if he can’t afford it)

Classification

Assign \(X\) a binary label, “positive” or “negative”:
- Relevant / irrelevant
- Spam / not-spam
- Cancerous cell / healthy cell
- Fraudulent transaction / genuine transaction
- (3+ classes are similar but extra notation)
Two kinds of errors:
- False positive: Classifier says “positive” when it’s not (false alarm)
- False negative: Classifier misses a true positive (miss)
- False positive = lack of precision
- False negative = lack of recall
Confusion matrix: \(2\times 2\) table of true class vs. guessed class
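
To make the link between the confusion matrix and precision/recall concrete, here is a small sketch on made-up labels (the vectors `truth` and `guess` are invented for illustration):

```r
# Made-up true and guessed labels for 10 items
truth <- c(rep("pos", 4), rep("neg", 6))
guess <- c("pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "neg", "neg")
confusion <- table(truth, guess)    # rows: true class; columns: guessed class

true.pos <- confusion["pos", "pos"]
false.pos <- confusion["neg", "pos"]    # false alarms
false.neg <- confusion["pos", "neg"]    # misses

# More false positives lower precision; more false negatives lower recall
precision <- true.pos / (true.pos + false.pos)
recall <- true.pos / (true.pos + false.neg)
```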

Some classification methods

Nearest neighbors
- Guess same class as most similar (labeled) item
\(k\) nearest neighbors
- Majority vote among the \(k\) most similar items
- Higher \(k\) \(\Rightarrow\) less noise, more (possible) bias
Linear classifiers
- Everything on one side of a linear hyperplane is in one class
- Mathematically, is \(\vec{x} \cdot \vec{b} + b_0 \geq 0\)?
- The hyperplane \(\vec{x} \cdot \vec{b} + b_0 = 0\) is the decision boundary
Prototype method
- Each class has a prototype point (often the class average)
- Assign new points to the class with the closer prototype
- A linear classifier (see backup)
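
A self-contained sketch of \(k\)-nearest-neighbor voting on toy data, assuming plain Euclidean distance. The function `knn.classify` is hypothetical and does not use the course's `nearest.points` helper.

```r
# Hypothetical sketch of k-nearest-neighbor classification by majority vote
# train: matrix with one row per labeled item; labels: class of each row
# x: new point to classify; k: number of neighbors to consult
knn.classify <- function(x, train, labels, k = 3) {
    dists <- sqrt(rowSums(sweep(train, 2, x)^2))    # Euclidean distance to each row
    neighbors <- order(dists)[1:k]                  # indices of the k closest rows
    votes <- table(labels[neighbors])
    names(votes)[which.max(votes)]                  # ties go to the first class listed
}

# Made-up two-dimensional data with two well-separated classes
train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- c("art", "art", "art", "music", "music", "music")
class.a <- knn.classify(c(0.5, 0.5), train, labels, k = 3)
class.b <- knn.classify(c(5.5, 5.5), train, labels, k = 3)
```

With \(k = 1\) this reduces to the plain nearest-neighbor rule.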

Nearest-neighbor-method demo:

# Which story is in which class?
story.classes <- c(rep("art", times = length(art.stories)), rep("music", times = length(music.stories)))
# Despite the name, nyt.similarities holds pairwise distances, not similarities
nyt.similarities <- distances(div.by.euc.length(idf.weight(nyt.BoW.frame)))

NNs <- nearest.points(nyt.BoW.frame, d = nyt.similarities)$which
NN.classes <- story.classes[NNs]
# Average error rate
mean(NN.classes != story.classes)

## [1] 0.1862745

# 'Confusion matrix'
table(story.classes, NN.classes)

##              NN.classes
## story.classes art music
##         art    45    12
##         music   7    38

Exercise:

Say “art” is positive and “music” is negative

What’s the false positive rate, i.e., the probability that a story about music will be falsely classified as about art?

What’s the false negative rate?

What’s the positive predictive value, i.e., the probability that a story classified as “art” is actually about art?

Prototype-method demo:

nyt.BoW.normed.idf <- div.by.euc.length(idf.weight(nyt.BoW.frame))
dim(nyt.BoW.normed.idf)

## [1]  102 4431

prototypical.art <- colMeans(nyt.BoW.normed.idf[story.classes == "art", ])
prototypical.music <- colMeans(nyt.BoW.normed.idf[story.classes == "music", 
    ])
prototypes <- rbind(prototypical.art, prototypical.music)
prototype.matches <- nearest.points(nyt.BoW.normed.idf, prototypes)$which
prototype.classes <- c("art", "music")[prototype.matches]
mean(prototype.classes != story.classes)

## [1] 0

table(story.classes, prototype.classes)

##              prototype.classes
## story.classes art music
##         art    57     0
##         music   0    45

(Don’t expect the prototype method to always out-perform nearest neighbors!)

Summing up

We represent our database of items as feature vectors
We do similarity search by looking at distance in feature space
Queries are also (represented by) feature vectors, so similar items are (represented by) near-by vectors
Good searches have high precision (everything they return is relevant) and high recall (they return everything relevant), but there’s usually a trade-off
Users are bad at describing what they want (\(\Rightarrow\) adapt) and we’re bad at evaluating actual relevance (\(\Rightarrow\) proxies)
Search is a kind of classification; there are others

Looking forward

dim(nyt.BoW.frame)

## [1]  102 4431

4431 features
- Too many for us to grasp
- Lots of parameters for any model
- Many features are useless
- Many are redundant
- And real data sets have even larger lexicons
In other settings we may start with only weak representations with even more features
- e.g., pixels
Next up: dimension reduction
- Which means: linear algebra

Backup: The Prototype method is a linear classifier

Say the two prototypes are \(\vec{p}_0\) and \(\vec{p}_1\)
We assign \(\vec{x}\) to class 1 if \(\| \vec{x} - \vec{p}_1\| \leq \|\vec{x}-\vec{p}_0\|\); otherwise it’s assigned to class 0
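Since distances are non-negative, the same inequality holds for squared distances: \[\begin{eqnarray} \| \vec{x} - \vec{p}_1\| & \leq & \|\vec{x}-\vec{p}_0\|\\ \| \vec{x} - \vec{p}_1\|^2 & \leq & \|\vec{x}-\vec{p}_0\|^2\\ \| \vec{x}\|^2 - 2 \vec{x}\cdot\vec{p}_1 + \|\vec{p}_1\|^2 & \leq & \|\vec{x}\|^2 - 2 \vec{x}\cdot\vec{p}_0 + \|\vec{p}_0\|^2\\ 0 & \leq & \vec{x}\cdot 2(\vec{p}_1-\vec{p}_0) + \|\vec{p}_0\|^2 - \|\vec{p}_1\|^2 \end{eqnarray}\]
This has the form \(\vec{x} \cdot \vec{b} + b_0 \geq 0\), with \(\vec{b} = 2(\vec{p}_1-\vec{p}_0)\) and \(b_0 = \|\vec{p}_0\|^2 - \|\vec{p}_1\|^2\), so the prototype method is a linear classifier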
Query: Can every linear classifier be written as a prototype method for some choice of prototypes?
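
As a numerical sanity check on the derivation (not a proof), the sketch below compares the prototype rule with the linear rule \(\vec{x}\cdot 2(\vec{p}_1-\vec{p}_0) + \|\vec{p}_0\|^2 - \|\vec{p}_1\|^2 \geq 0\) on random points. The prototypes, the seed, and the data are all made up.

```r
set.seed(462)
p0 <- c(0, 0, 1)                     # made-up prototype for class 0
p1 <- c(2, 1, 0)                     # made-up prototype for class 1
x <- matrix(rnorm(300), ncol = 3)    # 100 random points in 3 dimensions

# Prototype rule: class 1 if x is at least as close to p1 as to p0
dist.to <- function(p) sqrt(rowSums(sweep(x, 2, p)^2))
by.prototype <- dist.to(p1) <= dist.to(p0)

# Linear rule read off the last line of the derivation
b <- 2 * (p1 - p0)
b0 <- sum(p0^2) - sum(p1^2)
by.linear <- as.vector(x %*% b + b0) >= 0

all(by.prototype == by.linear)       # the two rules should agree on every point
```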

Backup: Time complexity of nearest neighbor vs. prototype methods

\(n\) data points, 2 classes

Nearest neighbors:
- Each prediction requires the distance to all \(n\) points, so it takes \(O(n)\) distance calculations
- No set-up cost
Prototype method:
- Each prediction requires calculating only 2 distances
- Set-up requires two averages, each of which takes \(O(n)\) to compute

References

Mitchell, Melanie. 1996. An Introduction to Genetic Algorithms. Cambridge, Massachusetts: MIT Press.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics 3:1039–74. https://doi.org/10.1214/09-EJS485.

Sperber, Dan, and Deirdre Wilson. 1995. Relevance: Communication and Cognition. 2nd ed. Oxford: Basil Blackwell.

Sutton, Richard S., and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html.