\[
\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!
\]
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
Recommender systems
The basic idea
“You may also like”, “Customers also bought”, feeds in social media, …
Generically, two stages:
- Predict some outcome for user / item interactions
- Ratings (a la Netflix)
- Purchases
- Clicks
- “Engagement”
- Maximize the prediction
- Don’t bother telling people what they won’t like
- (Usually)
- Subtle issues with prediction vs. action which we’ll get to next time
Very simple (dumb) baselines
- The best-seller / most-popular list
- Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
- We’ve been doing this for at least 100 years
- Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
- Co-purchases, association lists
- Not much user modeling
- Problems of really common items
- (For a while, Amazon recommended Harry Potter books to everyone after everything)
- Also problems for really rare items
- (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
- (You can imagine your own privacy-destroying nightmare here)
Common approaches: nearest neighbors, matrix factorization, social recommendation
- Nearest neighbors
- PCA-like dimension reduction, matrix factorization
- Social recommendation: what did your friends like?
Nearest neighbors
Content-based nearest neighbors
- Represent each item as a \(p\)-dimensional feature vector
- Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
- Take the items user \(i\) has liked
- Treat the user as a vector:
- Find the average item vector for user \(i\)
- What are the items closest to that average?
- Refinements:
- Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
- Use dis-likes to filter
- Do a more general regression of ratings on features
- Drawback: need features on the items which track what users actually care about
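A minimal sketch of the average-vector idea above, in R; `item_features` (an item-by-\(p\) feature matrix) and `liked` (indices of the items user \(i\) liked) are assumed inputs, not any particular package’s interface.

```r
# Sketch only: represent user i by the average feature vector of their liked
# items, then return the items closest to that average.
recommend_content <- function(item_features, liked, n_rec = 10) {
  profile <- colMeans(item_features[liked, , drop = FALSE])   # user i as a vector
  dists <- sqrt(rowSums(sweep(item_features, 2, profile)^2))  # distance to each item
  dists[liked] <- Inf                   # don't re-recommend what's already liked
  order(dists)[1:n_rec]                 # indices of the closest items
}
```

The refinements above slot into this template, e.g. running the neighbor search once per liked item instead of once per user, or regressing ratings on the features.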
Item-based nearest neighbors
- Items are features
- For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\)
- With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
- Use all the tricks for finding nearest neighbors quickly
- Only make predictions for items highly similar to items user \(i\) has already rated
- Items are similar when they get similar ratings from different users (i.e., users are features for items)
- Or even: only make predictions for items highly similar to items user \(i\) has already liked
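A bare-bones version of the item-based scheme, assuming the ratings sit in an \([n \times p]\) matrix `x` with `NA` for unrated entries; restricting attention to items similar to ones user \(i\) has already rated is what keeps this affordable. Illustration only, not any package’s API.

```r
# Predict x[i, k] from the items i has rated that are most similar to item k,
# where similarity = correlation of the two items' ratings across users
# (i.e., users as features for items).
predict_item_knn <- function(x, i, k, n_neighbors = 20) {
  rated <- setdiff(which(!is.na(x[i, ])), k)
  sims <- sapply(rated, function(l)
    cor(x[, k], x[, l], use = "pairwise.complete.obs"))
  keep <- which(!is.na(sims) & sims > 0)            # only usefully-similar items
  keep <- keep[order(sims[keep], decreasing = TRUE)]
  keep <- keep[seq_len(min(n_neighbors, length(keep)))]
  weighted.mean(x[i, rated[keep]], w = sims[keep])  # similarity-weighted average
}
```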
Dimension reduction
- Again, items are features
- Fix a number of (latent) factors \(q\)
- Minimize \[
\sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}
\]
- \(r\) runs along the latent dimensions/factors
- \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
- \(g_{rk}\) is how much item \(k\) loads on factor \(r\)
- Could tweak this to let each item have its own variance
- Matrix factorization because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
- Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\)
- See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
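A bare-bones version of that alternating gradient descent, on a ratings matrix with `NA` for unobserved entries; the rank `q`, step size `lr` and iteration count below are illustrative choices, not anything prescribed in the notes.

```r
# Minimize the sum of squared errors over observed (i,k) pairs by alternating
# gradient steps in f (user scores) and g (item weights).
factorize <- function(x, q = 2, lr = 0.01, n_iter = 500) {
  obs <- !is.na(x)
  f <- matrix(rnorm(nrow(x) * q, sd = 0.1), ncol = q)   # [n x q] user scores
  g <- matrix(rnorm(q * ncol(x), sd = 0.1), nrow = q)   # [q x p] item weights
  for (it in 1:n_iter) {
    resid <- x - f %*% g; resid[!obs] <- 0   # unobserved entries contribute nothing
    f <- f + lr * resid %*% t(g)             # gradient step in f, holding g fixed
    resid <- x - f %*% g; resid[!obs] <- 0
    g <- g + lr * t(f) %*% resid             # gradient step in g, holding f fixed
  }
  list(f = f, g = g)
}
```

With `fit <- factorize(x)`, the predicted rating for user \(i\) on item \(k\) is `(fit$f %*% fit$g)[i, k]`.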
Interpreting factor models
- The latent, inferred dimensions (the \(r\) index in \(f_{ir}\) and \(g_{rk}\)) are the factors
- To be concrete, think of movies
- Each movie loads on to each factor
- E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
- Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
- Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
- Each user also has a score on each factor
- E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
- Ratings are inner products plus noise
- Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
Social recommendations
- We persuade/trick the users to give us a social network
- \(a_{ij} =\) user \(i\) follows user \(j\)
- Presume that people are similar to those they follow
- So estimate: \[
\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}}
\]
Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?
- Refinements:
- Some information from neighbors’ neighbors, etc.
- Some information from neighbors’ ratings of similar items
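The estimate above can be computed by just minimizing the criterion numerically (which also avoids giving away the closed form asked for in the exercise); `a` is the follow matrix and `x` the ratings matrix, both assumed available.

```r
# x_ik-hat = the m minimizing sum_j a_ij (m - x_jk)^2, found numerically
# over the range of observed ratings of item k.
predict_social <- function(a, x, i, k) {
  optimize(function(m) sum(a[i, -i] * (m - x[-i, k])^2, na.rm = TRUE),
           interval = range(x[, k], na.rm = TRUE))$minimum
}
```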
Combining approaches
- Nothing says you have to pick just one method!
- Fit multiple models and predict with a weighted average of their predictions
- E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
- Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
- E.g., use factor model as a base and then kNN on its residuals
- Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
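Both combination schemes are short in code; the component predictions below are assumed to be \([n \times p]\) matrices of predicted ratings from whatever models were fit, and the names are placeholders rather than anything standard.

```r
# Weighted average of several models' predictions
blend_predictions <- function(pred_list, weights) {
  Reduce(`+`, Map(`*`, weights, pred_list))
}
# Base model plus a second model fit to the base model's residuals
stack_on_residuals <- function(x, base_pred, fit_residual_model) {
  resid <- x - base_pred                  # what the base model misses (NA where x is NA)
  base_pred + fit_residual_model(resid)   # second stage returns an [n x p] prediction
}
```

For instance, `blend_predictions(list(pred_nn, pred_factor, pred_social), c(0.5, 0.25, 0.25))` would give the 50/25/25 blend mentioned above, with those three matrices standing in for fitted models’ predictions.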
Some obstacles to all approaches
- The “cold start” problem: what to do for new users/items?
- New users: global averages, or social averaging if that’s available
- Maybe present them with items with high information first?
- New items: content-based predictions, or hope that everyone isn’t relying on your system completely
- Missing values are informative
- Tastes change
Tastes change
- Down-weight old data
- Easy but abrupt: don’t care about ratings more than, say, 100 days old
- Or: only use the last 100 ratings
- Or: make weights on items a gradually-decaying function of age
- Could also try to explicitly model change in tastes, but that adds to the computational burden
- One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
Maximization
- Once you have predicted ratings, pick the highest-predicted ratings
- Finding the maximum of \(p\) predicted ratings takes \(O(p)\) time, so it helps if you can cut this down
- Sorting
- Early stopping if it looks like the predicted rating will be low
- We’ve noted some tricks for only predicting ratings for items likely to be good
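Once the predictions exist, the maximization itself is a top-\(k\) selection; a partial sort keeps it at roughly \(O(p)\) rather than the \(O(p \log p)\) of a full sort. `pred_i` below is just a stand-in for user \(i\)’s predicted ratings over candidate items.

```r
pred_i <- runif(1e5)                       # stand-in predicted ratings for user i
k <- 10
cutoff <- -sort(-pred_i, partial = k)[k]   # k-th largest prediction, via partial sorting
top_k <- which(pred_i >= cutoff)           # the k best-looking items (plus any ties)
```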
Summing up
- Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
- Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
- More or less elaborate models
- Different notions of similarity
- Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)
Backup: Factor models in data mining
Factor models take off from PCA
- Start with \(n\) items in a database (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues
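A quick numerical check of those equivalences, using the \(\frac{1}{n}\) convention above (`prcomp` divides by \(n-1\), which rescales the eigenvalues but not the eigenvectors):

```r
set.seed(1)
n <- 500; p <- 10
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)  # centered data
eig <- eigen(crossprod(X) / n, symmetric = TRUE)   # eigendecompose the sample variance
scores <- X %*% eig$vectors                        # PC scores = projections on eigenvectors
pc <- prcomp(X, center = FALSE)
max(abs(abs(scores) - abs(pc$x)))                  # tiny: same scores, up to column signs
```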
PCA is not a model
- PCA says nothing about what the data should look like
- PCA makes no predictions about new data (or old data!)
- PCA just finds a linear approximation to these data
- What would be a PCA-like model?
This is where factor analysis comes in
Remember PCA: \[
\S = \X \w
\] and \[
\X = \S \w^T
\]
(because \(\w^T = \w^{-1}\))
If we use only \(q\) PCs, then \[
\S_q = \X \w_q
\] but \[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when the equations don’t hold: the error is random noise
The factor model
\(\vec{X}\) is \(p\)-dimensional, manifest, unhidden or observable
\(\vec{F}\) is \(q\)-dimensional, \(q < p\) but latent or hidden or unobserved
The model: \[\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}\]
- \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
- Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
- Analogous to \(\beta\) in a linear regression
- Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
- \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
- Sometimes called the uniquenesses or the unique variance components
- Analogous to \(\sigma^2\) in a linear regression
- Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
- Means: all correlation between observables comes from the factors
- Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
- Not an assumption because we could always de-correlate, as in homework 2
- Assumption: \(\vec{\epsilon}\) is uncorrelated across units
- As we assume in linear regression…
Summary of the factor model assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}\]
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}\]
- Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}\]
- \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
- or low-rank-plus-noise
- Contrast with PCA: that approximates the variance matrix as purely low-rank
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}\]
Geometry
As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space
Then \(\vec{\epsilon}\) perturbs out of this subspace
- If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
- (Modulo the rotation problem, to be discussed)
- If the noise isn’t zero, factors \(\neq\) PCs
- In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all
How do we estimate a factor model?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)
Suppose we knew \(\Uniquenesses\)
- we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent
Suppose we knew \(\FactorLoadings\)
then we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}\]
“One person’s vicious circle is another’s iterative approximation”:
- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)
- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores
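A naive but runnable version of that iteration (real implementations, such as the maximum-likelihood fit in `factanal`, are more refined; this is a sketch of the logic, and it assumes \(n > p\) so the initial regressions make sense):

```r
factor_iterate <- function(x, q, n_iter = 100) {
  S <- cov(x); p <- ncol(x)
  # initial guess: residual variance of each observable regressed on the others
  psi <- sapply(1:p, function(j) mean(residuals(lm(x[, j] ~ x[, -j]))^2))
  for (it in 1:n_iter) {
    eig <- eigen(S - diag(psi), symmetric = TRUE)     # the "eigen-magic" step
    lambda <- pmax(eig$values[1:q], 0)                # guard against tiny negative values
    Gamma <- eig$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(lambda), q)
    psi <- pmax(diag(S - Gamma %*% t(Gamma)), 1e-6)   # uniquenesses by subtraction
  }
  list(loadings = Gamma, uniquenesses = psi)
}
```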
Estimating factor scores
- PC scores were just projection
Estimating factor scores isn’t so easy!
- Factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
It’d be convenient to estimate factor scores as \[
\FactorLoadings^{-1} \vec{X}
\] but \(\FactorLoadings^{-1}\) doesn’t exist!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
Here \[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\] and \[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\] so the optimal linear factor score estimates are \[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
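Continuing the sketch above: with `Gamma` and `psi` from an estimate like the one just outlined, and centered data, the optimal linear scores are one solve away.

```r
factor_scores <- function(x, Gamma, psi) {
  x_c <- scale(x, center = TRUE, scale = FALSE)   # center the observables
  V <- Gamma %*% t(Gamma) + diag(psi)             # implied Var[X]
  t(t(Gamma) %*% solve(V, t(x_c)))                # [n x q] matrix of estimated scores
}
```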
Example: Back to the dresses from HW 7
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
- Positive and negative images along that factor:
- Now fit a five-factor model:
## Length Class Mode
## Gamma 72000 -none- numeric
## Z 1025 -none- numeric
## Sigma 14400 -none- numeric
- Positive and negative images along each factor:
Dress vs model, width of dress, pose, pose, pose (?)
We can recover images from the factor scores, e.g., image no. 1:
Factor models and high-dimensional variance estimation
- With \(p\) observable features, a variance matrix has \(p(p+1)/2\) entries (by symmetry)
- Ordinarily, to estimate \(k\) parameters requires \(n \geq k\) data points, so we’d need at least \(p(p+1)/2\) data points to get a variance matrix
- So it looks like we need \(n=O(p^2)\) data points to estimate variance matrices
- Trouble if \(p=10^6\) or even \(10^4\)
- A \(q\)-factor model only has \(pq+p=p(q+1)\) parameters
- So we can get away with only \(O(p)\) data points
- What’s going on in the data example above, where \(p= 14400\) but \(n = 205\)?
Checking assumptions
- Can’t check assumptions about \(\vec{F}\) or \(\vec{\epsilon}\) directly
- Can check whether \(\Var{\vec{X}}\) is low-rank-plus-noise
- Need to know how far we should expect \(\Var{\vec{X}}\) to be from low-rank-plus-noise
- Can simulate
- Exact theory if you assume everything’s Gaussian
- Other models can also give low-rank-plus-noise covariance
- See readings from Shalizi (n.d.)
Caution: the rotation problem
Remember the factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\] with \(\Var{\vec{F}} = \mathbf{I}\), \(\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\), \(\Var{\vec{\epsilon}} = \Uniquenesses\)
- Now consider \(\vec{G} = \mathbf{r} \vec{F}\) for any orthogonal matrix \(\mathbf{r}\) \[\begin{eqnarray}
\vec{G} & = & \mathbf{r} \vec{F}\\
\Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\
& = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\
\Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\
\vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\
\vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\
& = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon}\\
\end{eqnarray}\]
- Once we’ve found one factor solution, we can rotate to another, and nothing observable changes
- In other words: we’re free to use any coordinate system we like for the latent variables
- Really a problem if we want to interpret the factors
- Different rotations make exactly the same predictions about the data
- If we prefer one over another, it cannot be because one of them fits the data better or has more empirical support (at least not this data)
- On the other hand, if our initial estimate of \(\FactorLoadings\) is hard to interpret, we can always try rotating to make it easier to tell stories about
Rotation is no problem at all if we just want to predict
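A short numerical illustration of the point (the orthogonal matrix here is just a random one):

```r
set.seed(42)
p <- 6; q <- 2
Gamma <- matrix(rnorm(p * q), p, q)                # some loadings
psi <- runif(p, 0.5, 1.5)                          # some uniquenesses
r <- qr.Q(qr(matrix(rnorm(q^2), q, q)))            # a random orthogonal q x q matrix
Gamma_rot <- Gamma %*% t(r)                        # the rotated loadings
V1 <- Gamma %*% t(Gamma) + diag(psi)
V2 <- Gamma_rot %*% t(Gamma_rot) + diag(psi)
max(abs(V1 - V2))                                  # effectively 0: same implied Var[X]
```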
Applications
- Factor analysis begins with Spearman (1904) (IQ testing)
- Thurstone (1934) made it a general tool for psychometrics
- “Five factor model” of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
- A complicated story of dictionaries, lunatic asylums, and linear algebra (Paul 2004)
- Fails goodness-of-fit tests but that doesn’t stop the psychologists (Borsboom 2006)
- More recent applications:
- Netflix again
- Cambridge Analytica
- Not covered here: spatial and especially spatio-temporal data
Cambridge Analytica
- UK political-operations firm
- Starting point: Data sets of Facebook likes plus a five-factor personality test
- Ran regressions to link likes to personality factors
- Then snarfed a lot of data from other people about their Facebook likes
- Then extrapolated to personality scores
- Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
- Five-factor personality scores do correlate to political preferences (see, e.g., here), but so do education and IQ, which are all correlated with each other
- Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
- Not clear how well it worked or even how much of what they actually did used the estimated personality scores
- At a technical level, Cambridge Analytica made (or claimed to make) a lot of extrapolations
- From Facebook likes among initial app users to latent factor scores
- From Facebook likes among friends of app users to latent factor scores
- From factor scores to ad effectiveness
- How did modeling error and noise propagate along this chain?
- Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
- As with lots of data firms, a big contrast in rhetoric:
- To customers, claims of doing magic
- To regulators / governments, claims of being just a polling / advisory agency
- The recent (Netflix!) documentary takes their earlier PR at face value…
- Clearly they were shady, but they don’t seem to have been very effective
- “They meant ill, but they were incompetent” is not altogether comforting
- Further readings: linked to from the course homepage
References (in addition to the background reading on the course homepage)
Marlin, Benjamin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. 2007. “Collaborative Filtering and the Missing at Random Assumption.” In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence [UAI 2007]. https://arxiv.org/abs/1206.5267.
Paul, Annie Murphy. 2004. The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves. New York: Free Press.
Salganik, Matthew J., and Duncan J. Watts. 2008. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” Social Psychology Quarterly 71:338–55. http://www.princeton.edu/~mjs3/salganik_watts08.pdf.