\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \]

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \newcommand{\FactorLoadings}{\mathbf{\Gamma}} \newcommand{\Uniquenesses}{\mathbf{\psi}} \]

1 Recommender systems

1.1 The basic idea

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

  • Predict some outcome for user / item interactions
    • Ratings (a la Netflix)
    • Purchases
    • Clicks
    • “Engagement”
  • Maximize the prediction
    • Don’t bother telling people what they won’t like
    • (Usually)
  • Subtle issues with prediction vs. action which we’ll get to next time

1.2 Very simple (dumb) baselines

  • The best-seller / most-popular list
    • Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
    • We’ve been doing this for at least 100 years
    • Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
  • Co-purchases, association lists
    • Not much user modeling
    • Problems of really common items
      • (For a while, Amazon recommended Harry Potter books to everyone after everything)
    • Also problems for really rare items
      • (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
      • (You can imagine your own privacy-destroying nightmare here)

2 Common approaches: nearest neighbors, matrix factorization, social recommendation

  • Nearest neighbors
    • Content-based
    • Item-based
  • PCA-like dimension reduction, matrix factorization
  • Social recommendation: what did your friends like?

2.1 Nearest neighbors

2.1.1 Content-based nearest neighbors

  • Represent each item as a \(p\)-dimensional feature vector
    • Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
  • Take the items user \(i\) has liked
  • Treat the user as a vector:
    • Find the average item vector for user \(i\)
    • What are the items closest to that average? (This recipe is sketched in code after the list.)
  • Refinements:
    • Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
    • Use dis-likes to filter
    • Do a more general regression of ratings on features
  • Drawback: need features on the items which track what users actually care about
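
A minimal R sketch of this recipe, where `item.features` (a matrix with one row per item and one column per feature) and `liked` (indices of the items user \(i\) liked) are hypothetical stand-ins for whatever your data actually look like:

# hypothetical inputs: item.features = matrix, one row per item, one column per feature;
# liked = indices of the items user i has liked
recommend.content <- function(item.features, liked, n.recs = 10) {
  profile <- colMeans(item.features[liked, , drop = FALSE])  # user i's average item vector
  dists <- sqrt(rowSums(sweep(item.features, 2, profile)^2)) # distance from every item to the profile
  dists[liked] <- Inf                                        # don't re-recommend what's already liked
  order(dists)[1:n.recs]                                     # indices of the closest items
}

Dis-likes as a filter, or a regression of ratings on the features, would slot in where the plain average is computed.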

2.1.2 Item-based nearest neighbors

  • Items are features
  • For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\) (a minimal sketch follows this list)
  • With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
    • Use all the tricks for finding nearest neighbors quickly
    • Only make predictions for items highly similar to items \(i\) has already rated
      • Items are similar when they get similar ratings from different users (i.e., users are features for items)
      • Or even: only make predictions for items highly similar to items \(i\) has already liked
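
A minimal sketch of item-based prediction along these lines, assuming the ratings sit in an \(n \times p\) matrix with NA for unrated user/item pairs (the matrix and function names here are made up for illustration):

# hypothetical input: ratings = n x p matrix, NA where a user hasn't rated an item
predict.item.nn <- function(ratings, i, k, n.neighbors = 20) {
  # items are similar when different users give them similar ratings,
  # so correlate item k's column of ratings with every other column
  sims <- suppressWarnings(cor(ratings, ratings[, k], use = "pairwise.complete.obs"))
  sims[k] <- NA                                  # don't use item k to predict itself
  rated <- which(!is.na(ratings[i, ]))           # items user i has already rated
  neighbors <- head(intersect(order(sims, decreasing = TRUE), rated), n.neighbors)
  # similarity-weighted average of user i's ratings on the most-similar items
  weighted.mean(ratings[i, neighbors], w = pmax(sims[neighbors], 0))
}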

2.2 Dimension reduction

  • Again, items are features
  • Fix a number of (latent) factors \(q\)
  • Minimize \[ \sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2} \]
    • \(r\) runs along the latent dimensions/factors
    • \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
    • \(g_{rk}\) is how much item \(k\) weights on factor \(r\)
    • Could tweak this to let each item have its own variance
  • Matrix factorization because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
  • Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\) (an alternating least-squares variant is sketched after this list)
  • See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
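
The list above mentions alternating gradient descent; here is a closely related sketch that instead solves each alternating update exactly by least squares (so, alternating least squares). The matrix `x` and all the names are hypothetical, `lm.fit` is base R’s bare-bones regression routine, and a real implementation would add regularization and a convergence check:

# hypothetical input: x = n x p ratings matrix, NA where (i,k) is unobserved
factorize.ratings <- function(x, q, n.iter = 50) {
  n <- nrow(x); p <- ncol(x)
  f <- matrix(rnorm(n * q, sd = 0.1), n, q)      # user scores (n x q)
  g <- matrix(rnorm(q * p, sd = 0.1), q, p)      # item weights (q x p)
  for (iter in 1:n.iter) {
    for (i in 1:n) {                             # regress user i's observed ratings on g
      obs <- which(!is.na(x[i, ]))
      if (length(obs) >= q)
        f[i, ] <- lm.fit(t(g[, obs, drop = FALSE]), x[i, obs])$coefficients
    }
    for (k in 1:p) {                             # regress item k's observed ratings on f
      obs <- which(!is.na(x[, k]))
      if (length(obs) >= q)
        g[, k] <- lm.fit(f[obs, , drop = FALSE], x[obs, k])$coefficients
    }
  }
  list(f = f, g = g, fitted = f %*% g)           # fitted = predicted ratings for every pair
}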

2.3 Interpreting factor models

  • The latent, inferred dimensions of the \(f_{ir}\) and \(g_{rj}\) values are the factors
  • To be concrete, think of movies
  • Each movie loads on to each factor
    • E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
    • Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
    • Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
  • Each user also has a score on each factor
    • E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
  • Ratings are inner products plus noise
    • Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings

2.4 Social recommendations

  • We persuade/trick the users to give us a social network
    • \(a_{ij} = 1\) if user \(i\) follows user \(j\), and \(0\) otherwise
  • Presume that people are similar to those they follow
  • So estimate: \[ \widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}} \]

Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?

  • Refinements:
    • Some information from neighbors’ neighbors, etc.
    • Some information from neighbors’ ratings of similar items (the basic estimate is sketched below)
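
A minimal sketch of the basic estimate, with hypothetical objects `a` (the \(n \times n\) follow matrix) and `x` (the \(n \times p\) ratings matrix, NA where unrated):

# a[i, j] = 1 if user i follows user j; x = ratings matrix
predict.social <- function(a, x, i, k) {
  j <- which(a[i, ] == 1 & !is.na(x[, k]))   # followed users who have rated item k
  if (length(j) == 0) return(NA)             # the network gives us nothing to go on
  mean(x[j, k])                              # minimizes the weighted squared error above
}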

2.5 Combining approaches

  • Nothing says you have to pick just one method!
  • Fit multiple models and predict a weighted average of the models
  • E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
  • Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
    • E.g., use factor model as a base and then kNN on its residuals
    • Or: use average ratings as a base, then a factor model on the residuals from that, then kNN on the residuals from the factor model (the first two stages are sketched below)
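
A minimal sketch of that last recipe’s first two stages, reusing the (hypothetical) factorize.ratings sketch from section 2.2: item averages as the base, then a factor model fit to the residuals.

# hypothetical input: x = n x p ratings matrix, NA where unobserved
stack.on.residuals <- function(x, q = 5) {
  base <- matrix(colMeans(x, na.rm = TRUE), nrow(x), ncol(x), byrow = TRUE)  # item-average baseline
  mf <- factorize.ratings(x - base, q)   # factor model fit to the residuals
  base + mf$fitted                       # combined prediction for every user/item pair
}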

2.6 Some obstacles to all approaches

  • The “cold start” problem: what to do for new users/items?
    • New users: global averages, or social averaging if that’s available
      • Maybe present them with items with high information first?
    • New items: content-based predictions, or hope that everyone isn’t relying on your system completely
  • Missing values are informative
  • Tastes change

2.6.1 Missing values are informative

  • Both factorization and NNs can handle missing values
    • Factorization: We saw how to do this
    • NNs: when computing distances, only use the variables observed for the user we want to make predictions for
  • BUT not rating something is informative
    • You may not have heard of it…
    • … or it may be the kind of thing you don’t like
      • I rate mystery novels, not Christian parenting guides or how-to books on accounting software
  • Often substantial improvements from explicitly modeling missingness (Marlin et al. 2007)

2.6.2 Tastes change

  • Down-weight old data
    • Easy but abrupt: don’t care about ratings more than, say, 100 days old
    • Or: only use the last 100 ratings
    • Or: make the weights on ratings a gradually-decaying function of their age (sketched below)
  • Could also try to explicitly model change in tastes, but that adds to the computational burden
    • One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
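
Two minimal sketches, one for each idea; all the names and default constants are made up:

# (1) exponentially decaying weights on one item's ratings
decayed.average <- function(ratings, age.in.days, halflife = 100) {
  weighted.mean(ratings, w = 0.5^(age.in.days / halflife))  # weight halves every `halflife` days
}
# (2) the recursive update for a user's vector of factor scores
update.scores <- function(f.old, f.new.estimate, alpha = 0.9) {
  alpha * f.old + (1 - alpha) * f.new.estimate
}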

3 Maximization

  • Once you have predicted ratings, pick the highest-predicted ratings
    • Finding the maximum of \(p\) items takes \(O(p)\) time in the worst case, so it helps if you can cut this down
      • Sorting
      • Early stopping if it looks like the predicted rating will be low
    • We’ve noted some tricks for only predicting ratings for items likely to be good (the selection step itself is sketched below)
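
The selection step itself is one line in R, with `pred` standing in for a vector of predicted ratings over the candidate items:

top.k <- function(pred, k = 10) head(order(pred, decreasing = TRUE), k)  # indices of the k best items

order() still touches every candidate, which is why the tricks above aim to avoid computing most of the entries of pred in the first place.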

4 Summing up

  • Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
  • Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
    • More or less elaborate models
    • Different notions of similarity
  • Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)

5 Backup: Factor models in data mining

5.1 Factor models take off from PCA

  • Start with \(n\) items in a data base (\(n\) big)
  • Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
  • Principal components analysis:
    • Find the best \(q\)-dimensional linear approximation to the data
    • Equivalent to finding \(q\) directions of maximum variance through the data
    • Equivalent to finding the top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) the sample variance matrix of the (centered) data
    • New features = PC scores = projections on to the eigenvectors
    • Variances along those directions = eigenvalues (a short code sketch follows this list)
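
A short sketch of this recipe, for a hypothetical data matrix `x`; R’s built-in prcomp does the same job more carefully:

pca.sketch <- function(x, q) {
  x <- scale(x, center = TRUE, scale = FALSE)   # center each feature
  eig <- eigen(crossprod(x) / nrow(x))          # eigendecompose the sample variance matrix
  w <- eig$vectors[, 1:q, drop = FALSE]         # top q eigenvectors = directions of max variance
  list(scores = x %*% w,                        # new features = projections on the eigenvectors
       variances = eig$values[1:q])             # variances along those directions
}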

5.2 PCA is not a model

  • PCA says nothing about what the data should look like
  • PCA makes no predictions about new data (or old data!)
  • PCA just finds a linear approximation to these data
  • What would be a PCA-like model?

5.3 This is where factor analysis comes in

Remember PCA: \[ \S = \X \w \] and \[ \X = \S \w^T \]

(because \(\w^T = \w^{-1}\))

If we use only \(q\) PCs, then \[ \S_q = \X \w_q \] but \[ \X \neq \S_q \w_q^T \]

  • Usual approach in statistics when the equations don’t hold: the error is random noise

5.4 The factor model

  • \(\vec{X}\) is \(p\)-dimensional, manifest, unhidden or observable

  • \(\vec{F}\) is \(q\)-dimensional, \(q < p\) but latent or hidden or unobserved

  • The model: \[\begin{eqnarray*} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ (\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise}) \end{eqnarray*}\]

  • \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
    • Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
    • Analogous to \(\beta\) in a linear regression
  • Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
    • \(p\)-dimensional vector (unlike the scalar noise in linear regression)
  • Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
    • Sometimes called the uniquenesses or the unique variance components
    • Analogous to \(\sigma^2\) in a linear regression
    • Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
    • Means: all correlation between observables comes from the factors
  • Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
    • Not an assumption because we could always de-correlate, as in homework 2
  • Assumption: \(\vec{\epsilon}\) is uncorrelated across units
    • As we assume in linear regression…

5.4.1 Summary of the factor model assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\ \Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\ \Expect{\vec{\epsilon}} & = & \vec{0}\\ \Var{\vec{F}} & = & \mathbf{I} \end{eqnarray}\]

5.4.2 Some consequences of the assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}} \end{eqnarray}\]
  • Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\ & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses \end{eqnarray}\]
  • \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
    • or low-rank-plus-noise
    • Contrast with PCA: that approximates the variance matrix as purely low-rank
\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\ \Cov{X_i, X_j} & = & \text{what?} \end{eqnarray}\]

5.5 Geometry

  • As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space

  • Then \(\vec{\epsilon}\) perturbs out of this subspace

  • If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
    • (Modulo the rotation problem, to be discussed)
  • If the noise isn’t zero, factors \(\neq\) PCs
    • In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all

5.6 How do we estimate a factor model?

\[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]

  • We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)

5.6.1 Suppose we knew \(\Uniquenesses\)

  • we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T \end{eqnarray}\]
  • LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
  • \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray} \Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\ & = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T \end{eqnarray}\]
    • \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
  • Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent

5.6.2 Suppose we knew \(\FactorLoadings\)

then we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses \end{eqnarray}\]

5.6.3 “One person’s vicious circle is another’s iterative approximation”:

  • Start with a guess about \(\Uniquenesses\)
    • Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)
  • Until the estimates converge:
    • Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
    • Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
  • Once we have the loadings (and uniquenesses), we can estimate the scores (the iteration is sketched below)
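
A minimal sketch of the iteration (essentially the “principal factor” method), assuming a hypothetical centered data matrix `x` with \(n\) comfortably bigger than \(p\); a serious implementation, like R’s factanal, fits by maximum likelihood and tests for convergence instead of running a fixed number of steps:

factor.iterate <- function(x, q, n.iter = 100) {
  v <- var(x)                                   # sample variance matrix of the observables
  p <- ncol(x)
  # initial guess at the uniquenesses: residual variance of each observable
  # when regressed on all the others
  psi <- sapply(1:p, function(j) var(lm.fit(cbind(1, x[, -j]), x[, j])$residuals))
  for (iter in 1:n.iter) {
    eig <- eigen(v - diag(psi), symmetric = TRUE)              # the eigen-magic step
    lambda <- pmax(eig$values[1:q], 0)                         # guard against small negative values
    loadings <- eig$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(lambda), q)
    psi <- diag(v - loadings %*% t(loadings))                  # the subtraction step
  }
  list(loadings = loadings, uniquenesses = psi)
}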

5.7 Estimating factor scores

  • PC scores were just projection
  • Estimating factor scores isn’t so easy!

  • Factor model: \[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]
  • It’d be convenient to estimate factor scores as \[ \FactorLoadings^{-1} \vec{X} \] but \(\FactorLoadings^{-1}\) doesn’t exist!

  • Typical approach: optimal linear estimator
  • We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[ \Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z} \]
    • (ignoring the intercept because everything’s centered)
    • i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
  • Here \[ \Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings \] and \[ \Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses \] so the optimal linear factor score estimates are \[ \FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X} \] (sketched in code below)
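
In code, with `loadings` and `uniquenesses` as returned by the sketch above and a centered data matrix `x`:

factor.scores <- function(x, loadings, uniquenesses) {
  v <- loadings %*% t(loadings) + diag(uniquenesses)  # implied Var[X]
  x %*% solve(v, loadings)   # row i = Gamma^T (Gamma Gamma^T + Psi)^{-1} x_i
}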

5.8 Example: Back to the dresses from HW 7

  • Fit a one-factor model:
##       Length Class  Mode   
## Gamma 14400  -none- numeric
## Z       205  -none- numeric
## Sigma 14400  -none- numeric
  • Positive and negative images along that factor:

  • Now fit a five-factor model:
##       Length Class  Mode   
## Gamma 72000  -none- numeric
## Z      1025  -none- numeric
## Sigma 14400  -none- numeric
  • Positive and negative images along each factor:

  • Dress vs model, width of dress, pose, pose, pose (?)

  • We can recover images from the factor scores, e.g., image no. 1:

5.9 Factor models and high-dimensional variance estimation

  • With \(p\) observable features, a variance matrix has \(p(p+1)/2\) entries (by symmetry)
  • Ordinarily, to estimate \(k\) parameters requires \(n \geq k\) data points, so we’d need at least \(p(p+1)/2\) data points to get a variance matrix
    • So it looks like we need \(n=O(p^2)\) data points to estimate variance matrices
    • Trouble if \(p=10^6\) or even \(10^4\)
  • A \(q\)-factor model only has \(pq+p=p(q+1)\) parameters
    • So we can get away with only \(O(p)\) data points
    • What’s going on in the data example above, where \(p= 14400\) but \(n = 205\)?

5.10 Checking assumptions

  • Can’t check assumptions about \(\vec{F}\) or \(\vec{\epsilon}\) directly
  • Can check whether \(\Var{\vec{X}}\) is low-rank-plus-noise
    • Need to know how far we should expect \(\Var{\vec{X}}\) to be from low-rank-plus-noise
    • Can simulate
    • Exact theory if you assume everything’s Gaussian
  • Other models can also give low-rank-plus-noise covariance
    • See readings from Shalizi (n.d.)

5.11 Caution: the rotation problem

  • Remember the factor model: \[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \] with \(\Var{\vec{F}} = \mathbf{I}\), \(\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\), \(\Var{\vec{\epsilon}} = \Uniquenesses\)

  • Now consider \(\vec{G} = \mathbf{r} \vec{F}\) for any orthogonal matrix \(\mathbf{r}\) \[\begin{eqnarray} \vec{G} & = & \mathbf{r} \vec{F}\\ \Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\ & = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\ \Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\ \vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\ \vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\ & = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon}\\ \end{eqnarray}\]
  • Once we’ve found one factor solution, we can rotate to another, and nothing observable changes
  • In other words: we’re free to use any coordinate system we like for the latent variables
  • Really a problem if we want to interpret the factors
    • Different rotations make exactly the same predictions about the data
    • If we prefer one over another, it cannot be because one of them fits the data better or has more empirical support (at least not this data)
    • On the other hand, if our initial estimate of \(\FactorLoadings\) is hard to interpret, we can always try rotating to make it easier to tell stories about
  • Rotation is no problem at all if we just want to predict (a quick numerical check follows)
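
A quick numerical check of the rotation argument, with an arbitrary made-up loadings matrix: the implied covariance contribution \(\FactorLoadings \FactorLoadings^T\) doesn’t change when we rotate.

set.seed(42)
p <- 6; q <- 2
loadings <- matrix(rnorm(p * q), p, q)                       # an arbitrary loadings matrix
r <- qr.Q(qr(matrix(rnorm(q * q), q, q)))                    # a random orthogonal q x q matrix
rotated <- loadings %*% t(r)                                 # Gamma' = Gamma r^T
max(abs(loadings %*% t(loadings) - rotated %*% t(rotated)))  # ~ 0, up to rounding error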

5.12 Applications

  • Factor analysis begins with Spearman (1904) (IQ testing)
  • Thurstone (1934) made it a general tool for psychometrics
  • “Five factor model” of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
    • A complicated story of dictionaries, lunatic asylums, and linear algebra (Paul 2004)
    • Fails goodness-of-fit tests but that doesn’t stop the psychologists (Borsboom 2006)
  • More recent applications:
    • Netflix again
    • Cambridge Analytica
  • Not covered here: spatial and especially spatio-temporal data

5.13 Cambridge Analytica

  • UK political-operations firm
  • Starting point: Data sets of Facebook likes plus a five-factor personality test
    • Ran regressions to link likes to personality factors
  • Then snarfed a lot of data from other people about their Facebook likes
  • Then extrapolated to personality scores
  • Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
    • Five-factor personality scores do correlate to political preferences (see, e.g., here), but so do education and IQ, which are all correlated with each other
    • Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
    • Not clear how well it worked or even how much of what they actually did used the estimated personality scores
  • At a technical level, Cambridge Analytica made (or claimed to make) a lot of extrapolations
    • From Facebook likes among initial app users to latent factor scores
    • From Facebook likes among friends of app users to latent factor scores
    • From factor scores to ad effectiveness
    • How did modeling error and noise propagate along this chain?
  • Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
  • As with lots of data firms, a big contrast in rhetoric:
    • To customers, claims of doing magic
    • To regulators / governments, claims of being just polling / advising agency
    • The recent (Netflix!) documentary takes their earlier PR at face value…
  • Clearly they were shady, but they don’t seem to have been very effective
    • “They meant ill, but they were incompetent” is not altogether comforting
  • Further readings: linked to from the course homepage

References (in addition to the background reading on the course homepage)

Borsboom, Denny. 2006. “The Attack of the Psychometricians.” Psychometrika 71:425–40. https://doi.org/10.1007/s11336-006-1447-6.

Marlin, Benjamin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. 2007. “Collaborative Filtering and the Missing at Random Assumption.” In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence [UAI 2007]. https://arxiv.org/abs/1206.5267.

Paul, Annie Murphy. 2004. The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves. New York: Free Press.

Salganik, Matthew J., Peter S. Dodds, and Duncan J. Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” Science 311:854–56. http://www.princeton.edu/~mjs3/musiclab.shtml.

Salganik, Matthew J., and Duncan J. Watts. 2008. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” Social Psychology Quarterly 71:338–55. http://www.princeton.edu/~mjs3/salganik_watts08.pdf.

Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Spearman, Charles. 1904. “‘General Intelligence,’ Objectively Determined and Measured.” American Journal of Psychology 15:201–93. http://psychclassics.yorku.ca/Spearman/.

Thurstone, L. L. 1934. “The Vectors of Mind.” Psychological Review 41:1–32. http://psychclassics.yorku.ca/Thurstone/.