\[
\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!
\]
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
Recommender systems
The basic idea
“You may also like”, “Customers also bought”, feeds in social media, …
Generically, two stages:
- Predict some outcome for user / item interactions
- Ratings (a la Netflix)
- Purchases
- Clicks
- “Engagement”
- Maximize the prediction
- Don’t bother telling people what they won’t like
- (Usually)
- Subtle issues with prediction vs. action which we’ll get to next time
Very simple (dumb) baselines
- The best-seller / most-popular list
- Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
- We’ve been doing this for at least 100 years
- Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
- Co-purchases, association lists
- Not much user modeling
- Problems of really common items
- (For a while, Amazon recommended Harry Potter books to everyone after everything)
- Also problems for really rare items
- (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
- (You can imagine your own privacy-destroying nightmare here)
Common approaches: nearest neighbors, matrix factorization, social recommendation
- Nearest neighbors
- PCA-like dimension reduction, matrix factorization
- Social recommendation: what did your friends like?
Nearest neighbors
Content-based nearest neighbors
- Represent each item as a \(p\)-dimensional feature vector
- Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
- Take the items user \(i\) has liked
- Treat the user as a vector:
- Find the average item vector for user \(i\)
- What are the items closest to that average?
- Refinements:
- Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
- Use dis-likes to filter
- Do a more general regression of ratings on features
- Drawback: need features on the items which track what users actually care about
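A minimal sketch of the average-vector idea above, in R; `item_features` (an item-by-\(p\) feature matrix) and `liked` (indices of the items user \(i\) liked) are assumed inputs, not any particular package’s interface.

```r
# Sketch only: represent user i by the average feature vector of their liked
# items, then return the items closest to that average.
recommend_content <- function(item_features, liked, n_rec = 10) {
  profile <- colMeans(item_features[liked, , drop = FALSE])   # user i as a vector
  dists <- sqrt(rowSums(sweep(item_features, 2, profile)^2))  # distance to each item
  dists[liked] <- Inf                   # don't re-recommend what's already liked
  order(dists)[1:n_rec]                 # indices of the closest items
}
```

The refinements above slot into this template, e.g. running the neighbor search once per liked item instead of once per user, or regressing ratings on the features.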
Item-based nearest neighbors
- Items are features
- For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\)
- With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
- Use all the tricks for finding nearest neighbors quickly
- Only make predictions for items highly similar to items user \(i\) has already rated
- Items are similar when they get similar ratings from different users (i.e., users are features for items)
- Or even: only make predictions for items highly similar to items user \(i\) has already liked
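A bare-bones version of the item-based scheme, assuming the ratings sit in an \([n \times p]\) matrix `x` with `NA` for unrated entries; restricting attention to items similar to ones user \(i\) has already rated is what keeps this affordable. Illustration only, not any package’s API.

```r
# Predict x[i, k] from the items i has rated that are most similar to item k,
# where similarity = correlation of the two items' ratings across users
# (i.e., users as features for items).
predict_item_knn <- function(x, i, k, n_neighbors = 20) {
  rated <- setdiff(which(!is.na(x[i, ])), k)
  sims <- sapply(rated, function(l)
    cor(x[, k], x[, l], use = "pairwise.complete.obs"))
  keep <- which(!is.na(sims) & sims > 0)            # only usefully-similar items
  keep <- keep[order(sims[keep], decreasing = TRUE)]
  keep <- keep[seq_len(min(n_neighbors, length(keep)))]
  weighted.mean(x[i, rated[keep]], w = sims[keep])  # similarity-weighted average
}
```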
Dimension reduction
- Again, items are features
- Fix a number of (latent) factors \(q\)
- Minimize \[
\sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}
\]
- \(r\) runs along the latent dimensions/factors
- \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
- \(g_{rk}\) is how much item \(k\) loads on factor \(r\)
- Could tweak this to let each item have its own variance
- Matrix factorization because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
- Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\)
- See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
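A bare-bones version of that alternating gradient descent, on a ratings matrix with `NA` for unobserved entries; the rank `q`, step size `lr` and iteration count below are illustrative choices, not anything prescribed in the notes.

```r
# Minimize the sum of squared errors over observed (i,k) pairs by alternating
# gradient steps in f (user scores) and g (item weights).
factorize <- function(x, q = 2, lr = 0.01, n_iter = 500) {
  obs <- !is.na(x)
  f <- matrix(rnorm(nrow(x) * q, sd = 0.1), ncol = q)   # [n x q] user scores
  g <- matrix(rnorm(q * ncol(x), sd = 0.1), nrow = q)   # [q x p] item weights
  for (it in 1:n_iter) {
    resid <- x - f %*% g; resid[!obs] <- 0   # unobserved entries contribute nothing
    f <- f + lr * resid %*% t(g)             # gradient step in f, holding g fixed
    resid <- x - f %*% g; resid[!obs] <- 0
    g <- g + lr * t(f) %*% resid             # gradient step in g, holding f fixed
  }
  list(f = f, g = g)
}
```

With `fit <- factorize(x)`, the predicted rating for user \(i\) on item \(k\) is `(fit$f %*% fit$g)[i, k]`.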
Interpreting factor models
- The latent, inferred dimensions (the \(r\) index in \(f_{ir}\) and \(g_{rk}\)) are the factors
- To be concrete, think of movies
- Each movie loads on to each factor
- E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
- Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
- Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
- Each user also has a score on each factor
- E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
- Ratings are inner products plus noise
- Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
Social recommendations
- We persuade/trick the users to give us a social network
- \(a_{ij} =\) user \(i\) follows user \(j\)
- Presume that people are similar to those they follow
- So estimate: \[
\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}}
\]
Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?
- Refinements:
- Some information from neighbors’ neighbors, etc.
- Some information from neighbors’ ratings of similar items
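The estimate above can be computed by just minimizing the criterion numerically (which also avoids giving away the closed form asked for in the exercise); `a` is the follow matrix and `x` the ratings matrix, both assumed available.

```r
# x_ik-hat = the m minimizing sum_j a_ij (m - x_jk)^2, found numerically
# over the range of observed ratings of item k.
predict_social <- function(a, x, i, k) {
  optimize(function(m) sum(a[i, -i] * (m - x[-i, k])^2, na.rm = TRUE),
           interval = range(x[, k], na.rm = TRUE))$minimum
}
```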
Combining approaches
- Nothing says you have to pick just one method!
- Fit multiple models and predict with a weighted average of their predictions
- E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
- Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
- E.g., use factor model as a base and then kNN on its residuals
- Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
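Both combination schemes are short in code; the component predictions below are assumed to be \([n \times p]\) matrices of predicted ratings from whatever models were fit, and the names are placeholders rather than anything standard.

```r
# Weighted average of several models' predictions
blend_predictions <- function(pred_list, weights) {
  Reduce(`+`, Map(`*`, weights, pred_list))
}
# Base model plus a second model fit to the base model's residuals
stack_on_residuals <- function(x, base_pred, fit_residual_model) {
  resid <- x - base_pred                  # what the base model misses (NA where x is NA)
  base_pred + fit_residual_model(resid)   # second stage returns an [n x p] prediction
}
```

For instance, `blend_predictions(list(pred_nn, pred_factor, pred_social), c(0.5, 0.25, 0.25))` would give the 50/25/25 blend mentioned above, with those three matrices standing in for fitted models’ predictions.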
Some obstacles to all approaches
- The “cold start” problem: what to do for new users/items?
- New users: global averages, or social averaging if that’s available
- Maybe present them with items with high information first?
- New items: content-based predictions, or hope that everyone isn’t relying on your system completely
- Missing values are informative
- Tastes change
Tastes change
- Down-weight old data
- Easy but abrupt: don’t care about ratings more than, say, 100 days old
- Or: only use the last 100 ratings
- Or: make weights on items a gradually-decaying function of age
- Could also try to explicitly model change in tastes, but that adds to the computational burden
- One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
Maximization
- Once you have predicted ratings, pick the highest-predicted ratings
- Finding the maximum of \(p\) predicted ratings takes \(O(p)\) time, so it helps if you can cut this down
- Sorting
- Early stopping if it looks like the predicted rating will be low
- We’ve noted some tricks for only predicting ratings for items likely to be good
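Once the predictions exist, the maximization itself is a top-\(k\) selection; a partial sort keeps it at roughly \(O(p)\) rather than the \(O(p \log p)\) of a full sort. `pred_i` below is just a stand-in for user \(i\)’s predicted ratings over candidate items.

```r
pred_i <- runif(1e5)                       # stand-in predicted ratings for user i
k <- 10
cutoff <- -sort(-pred_i, partial = k)[k]   # k-th largest prediction, via partial sorting
top_k <- which(pred_i >= cutoff)           # the k best-looking items (plus any ties)
```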
Summing up
- Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
- Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
- More or less elaborate models
- Different notions of similarity
- Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)
Backup: Factor models in data mining
Factor models take off from PCA
- Start with \(n\) items in a database (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues
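A quick numerical check of those equivalences, using the \(\frac{1}{n}\) convention above (`prcomp` divides by \(n-1\), which rescales the eigenvalues but not the eigenvectors):

```r
set.seed(1)
n <- 500; p <- 10
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)  # centered data
eig <- eigen(crossprod(X) / n, symmetric = TRUE)   # eigendecompose the sample variance
scores <- X %*% eig$vectors                        # PC scores = projections on eigenvectors
pc <- prcomp(X, center = FALSE)
max(abs(abs(scores) - abs(pc$x)))                  # tiny: same scores, up to column signs
```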
PCA is not a model
- PCA says nothing about what the data should look like
- PCA makes no predictions about new data (or old data!)
- PCA just finds a linear approximation to these data
- What would be a PCA-like model?
This is where factor analysis comes in
Remember PCA: \[
\S = \X \w
\] and \[
\X = \S \w^T
\]
(because \(\w^T = \w^{-1}\))
If we use only \(q\) PCs, then \[
\S_q = \X \w_q
\] but \[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when the equations don’t hold: the error is random noise
The factor model
\(\vec{X}\) is \(p\)-dimensional, manifest, unhidden or observable
\(\vec{F}\) is \(q\)-dimensional, \(q < p\) but latent or hidden or unobserved
The model: \[\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}\]
- \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
- Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
- Analogous to \(\beta\) in a linear regression
- Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
- \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
- Sometimes called the uniquenesses or the unique variance components
- Analogous to \(\sigma^2\) in a linear regression
- Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
- Means: all correlation between observables comes from the factors
- Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
- Not an assumption because we could always de-correlate, as in homework 2
- Assumption: \(\vec{\epsilon}\) is uncorrelated across units
- As we assume in linear regression…
Summary of the factor model assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}\]
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}\]
- Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}\]
- \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
- or low-rank-plus-noise
- Contrast with PCA: that approximates the variance matrix as purely low-rank
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}\]
Geometry
As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space
Then \(\vec{\epsilon}\) perturbs out of this subspace
- If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
- (Modulo the rotation problem, to be discussed)
- If the noise isn’t zero, factors \(\neq\) PCs
- In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all
How do we estimate a factor model?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)
Suppose we knew \(\Uniquenesses\)
- we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent
Suppose we knew \(\FactorLoadings\)
then we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}\]
“One person’s vicious circle is another’s iterative approximation”:
- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)
- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores
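A naive but runnable version of that iteration (real implementations, such as the maximum-likelihood fit in `factanal`, are more refined; this is a sketch of the logic, and it assumes \(n > p\) so the initial regressions make sense):

```r
factor_iterate <- function(x, q, n_iter = 100) {
  S <- cov(x); p <- ncol(x)
  # initial guess: residual variance of each observable regressed on the others
  psi <- sapply(1:p, function(j) mean(residuals(lm(x[, j] ~ x[, -j]))^2))
  for (it in 1:n_iter) {
    eig <- eigen(S - diag(psi), symmetric = TRUE)     # the "eigen-magic" step
    lambda <- pmax(eig$values[1:q], 0)                # guard against tiny negative values
    Gamma <- eig$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(lambda), q)
    psi <- pmax(diag(S - Gamma %*% t(Gamma)), 1e-6)   # uniquenesses by subtraction
  }
  list(loadings = Gamma, uniquenesses = psi)
}
```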
Estimating factor scores
- PC scores were just projection
Estimating factor scores isn’t so easy!
- Factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
It’d be convenient to estimate factor scores as \[
\FactorLoadings^{-1} \vec{X}
\] but \(\FactorLoadings^{-1}\) doesn’t exist!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
Here \[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\] and \[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\] so the optimal linear factor score estimates are \[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
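Continuing the sketch above: with `Gamma` and `psi` from an estimate like the one just outlined, and centered data, the optimal linear scores are one solve away.

```r
factor_scores <- function(x, Gamma, psi) {
  x_c <- scale(x, center = TRUE, scale = FALSE)   # center the observables
  V <- Gamma %*% t(Gamma) + diag(psi)             # implied Var[X]
  t(t(Gamma) %*% solve(V, t(x_c)))                # [n x q] matrix of estimated scores
}
```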
Example: Back to the dresses from HW 7
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
- Positive and negative images along that factor:
- Now fit a five-factor model:
## Length Class Mode
## Gamma 72000 -none- numeric
## Z 1025 -none- numeric
## Sigma 14400 -none- numeric
- Positive and negative images along each factor:
Dress vs model, width of dress, pose, pose, pose (?)
We can recover images from the factor scores, e.g., image no. 1:
Factor models and high-dimensional variance estimation
- With \(p\) observable features, a variance matrix has \(p(p+1)/2\) entries (by symmetry)
- Ordinarily, to estimate \(k\) parameters requires \(n \geq k\) data points, so we’d need at least \(p(p+1)/2\) data points to get a variance matrix
- So it looks like we need \(n=O(p^2)\) data points to estimate variance matrices
- Trouble if \(p=10^6\) or even \(10^4\)
- A \(q\)-factor model only has \(pq+p=p(q+1)\) parameters
- So we can get away with only \(O(p)\) data points
- What’s going on in the data example above, where \(p= 14400\) but \(n = 205\)?
Checking assumptions
- Can’t check assumptions about \(\vec{F}\) or \(\vec{\epsilon}\) directly
- Can check whether \(\Var{\vec{X}}\) is low-rank-plus-noise
- Need to know how far we should expect \(\Var{\vec{X}}\) to be from low-rank-plus-noise
- Can simulate
- Exact theory if you assume everything’s Gaussian
- Other models can also give low-rank-plus-noise covariance
- See readings from Shalizi (n.d.)
Caution: the rotation problem
Remember the factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\] with \(\Var{\vec{F}} = \mathbf{I}\), \(\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\), \(\Var{\vec{\epsilon}} = \Uniquenesses\)
- Now consider \(\vec{G} = \mathbf{r} \vec{F}\) for any orthogonal matrix \(\mathbf{r}\) \[\begin{eqnarray}
\vec{G} & = & \mathbf{r} \vec{F}\\
\Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\
& = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\
\Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\
\vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\
\vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\
& = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon}\\
\end{eqnarray}\]
- Once we’ve found one factor solution, we can rotate to another, and nothing observable changes
- In other words: we’re free to use any coordinate system we like for the latent variables
- Really a problem if we want to interpret the factors
- Different rotations make exactly the same predictions about the data
- If we prefer one over another, it cannot be because one of them fits the data better or has more empirical support (at least not this data)
- On the other hand, if our initial estimate of \(\FactorLoadings\) is hard to interpret, we can always try rotating to make it easier to tell stories about
Rotation is no problem at all if we just want to predict
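A short numerical illustration of the point (the orthogonal matrix here is just a random one):

```r
set.seed(42)
p <- 6; q <- 2
Gamma <- matrix(rnorm(p * q), p, q)                # some loadings
psi <- runif(p, 0.5, 1.5)                          # some uniquenesses
r <- qr.Q(qr(matrix(rnorm(q^2), q, q)))            # a random orthogonal q x q matrix
Gamma_rot <- Gamma %*% t(r)                        # the rotated loadings
V1 <- Gamma %*% t(Gamma) + diag(psi)
V2 <- Gamma_rot %*% t(Gamma_rot) + diag(psi)
max(abs(V1 - V2))                                  # effectively 0: same implied Var[X]
```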
Applications
- Factor analysis begins with Spearman (1904) (IQ testing)
- Thurstone (1934) made it a general tool for psychometrics
- “Five factor model” of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
- A complicated story of dictionaries, lunatic asylums, and linear algebra (Paul 2004)
- Fails goodness-of-fit tests but that doesn’t stop the psychologists (Borsboom 2006)
- More recent applications:
- Netflix again
- Cambridge Analytica
- Not covered here: spatial and especially spatio-temporal data
Cambridge Analytica
- UK political-operations firm
- Starting point: Data sets of Facebook likes plus a five-factor personality test
- Ran regressions to link likes to personality factors
- Then snarfed a lot of data from other people about their Facebook likes
- Then extrapolated to personality scores
- Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
- Five-factor personality scores do correlate to political preferences (see, e.g., here), but so do education and IQ, which are all correlated with each other
- Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
- Not clear how well it worked or even how much of what they actually did used the estimated personality scores
- At a technical level, Cambridge Analytica made (or claimed to make) a lot of extrapolations
- From Facebook likes among initial app users to latent factor scores
- From Facebook likes among friends of app users to latent factor scores
- From factor scores to ad effectiveness
- How did modeling error and noise propagate along this chain?
- Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
- As with lots of data firms, a big contrast in rhetoric:
- To customers, claims of doing magic
- To regulators / governments, claims of being just a polling / advisory agency
- The recent (Netflix!) documentary takes their earlier PR at face value…
- Clearly they were shady, but they don’t seem to have been very effective
- “They meant ill, but they were incompetent” is not altogether comforting
- Further readings: linked to from the course homepage
References (in addition to the background reading on the course homepage)
Marlin, Benjamin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. 2007. “Collaborative Filtering and the Missing at Random Assumption.” In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence [UAI 2007]. https://arxiv.org/abs/1206.5267.
Paul, Annie Murphy. 2004. The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves. New York: Free Press.
Salganik, Matthew J., and Duncan J. Watts. 2008. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” Social Psychology Quarterly 71:338–55. http://www.princeton.edu/~mjs3/salganik_watts08.pdf.