Factor Analysis
36-462/662, Fall 2019
16 September 2019
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
Recap
- Start with \(n\) items in a database (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues
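As a refresher on the mechanics, here is a minimal R sketch (on made-up data, not anything from the lecture) confirming that the PC scores and their variances come from the eigendecomposition of the sample variance matrix:
# PCA by eigendecomposition, on simulated data (illustration only)
n <- 500; p <- 10
x <- matrix(rnorm(n * p), nrow = n) %*% matrix(rnorm(p * p), nrow = p)
x <- scale(x, center = TRUE, scale = FALSE)    # center the features
eig <- eigen(cov(x))      # cov() uses n-1 rather than n; same eigenvectors
w <- eig$vectors          # columns = principal directions
scores <- x %*% w         # PC scores = projections on to the eigenvectors
# variances along those directions = the eigenvalues
all.equal(apply(scores, 2, var), eig$values, check.attributes = FALSE)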
PCA is not a model
- PCA says nothing about what the data should look like
- PCA makes no predictions about new data (or old data!)
- PCA just finds a linear approximation to these data
- What would be a PCA-like model?
This is where factor analysis comes in
Remember PCA: \[
\S = \X \w
\] and \[
\X = \S \w^T
\]
(because \(\w^T = \w^{-1}\))
If we use only \(q\) PCs, then \[
\S_q = \X \w_q
\] but \[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when the equations don’t hold exactly: treat the error as random noise
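A quick numerical check of this point, again on made-up data (my own illustration, reusing the centered x from the sketch above): reconstruction from all \(p\) PCs is exact, but reconstruction from \(q < p\) PCs is only an approximation.
# Reconstructing the data from its PC scores
q <- 2
pca <- prcomp(x, center = FALSE)     # x already centered above
w <- pca$rotation                    # full p x p matrix of eigenvectors
s <- x %*% w
max(abs(x - s %*% t(w)))             # essentially 0: X = S w^T
s.q <- x %*% w[, 1:q]
max(abs(x - s.q %*% t(w[, 1:q])))    # > 0: X != S_q w_q^T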
The factor model
\(\vec{X}\) is \(p\)-dimensional and manifest, i.e., observable
\(\vec{F}\) is \(q\)-dimensional (\(q < p\)) and latent, i.e., hidden or unobserved
The model: \[\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}\]
The factor model
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
- Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
- Analogous to \(\beta\) in a linear regression
- Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
- \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
- Sometimes called the uniquenesses or the unique variance components
- Analogous to \(\sigma^2\) in a linear regression
- Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
- Means: all correlation between observables comes from the factors
- Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
- Not an assumption because we could always de-correlate, as in homework 2
- Assumption: \(\vec{\epsilon}\) is uncorrelated across units
- As we assume in linear regression…
The factor model: summary
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}\]
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}\]
- Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}\]
- \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
- or low-rank-plus-noise
- Contrast with PCA: that approximates the variance matrix as purely low-rank
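To make the low-rank-plus-diagonal structure concrete, here is a small simulation sketch (arbitrary made-up loadings and uniquenesses, my own illustration): draw \(\vec{F}\) and \(\vec{\epsilon}\) as in the model, form \(\vec{X}\), and compare the sample variance matrix of \(\vec{X}\) to \(\FactorLoadings\FactorLoadings^T + \Uniquenesses\).
# Simulate from a q = 2 factor model and check Var(X) = Gamma Gamma^T + Psi
n <- 1e4; p <- 5; q <- 2
Gamma <- matrix(rnorm(p * q), nrow = p)       # made-up factor loadings
psi <- runif(p, 0.5, 1.5)                     # diagonal of the uniquenesses
f <- matrix(rnorm(n * q), nrow = n)           # factor scores, Var(F) = I
epsilon <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(psi))   # noise
X <- f %*% t(Gamma) + epsilon                 # rows = items, columns = features
max(abs(cov(X) - (Gamma %*% t(Gamma) + diag(psi))))   # small; shrinks as n grows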
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}\]
Geometry
As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace of the \(p\)-dimensional feature space
Then \(\vec{\epsilon}\) perturbs the data out of this subspace
- If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
- (Modulo the rotation problem, to be discussed)
- If the noise isn’t zero, factors \(\neq\) PCs
- In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all
How do we estimate?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
Can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)
Suppose we knew \(\Uniquenesses\)
- we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent
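Continuing the simulation sketch from above (my own illustration, not the lecture's code), here is the eigen-decomposition step: given the true \(\Uniquenesses\), the reconstructed loadings reproduce \(\FactorLoadings\FactorLoadings^T\), though \(\FactorLoadings\) itself is only recovered up to rotation (more on that below).
# Recover loadings from Var(X) - Psi by eigendecomposition
Sigma.X <- Gamma %*% t(Gamma) + diag(psi)     # population Var(X)
eig <- eigen(Sigma.X - diag(psi))             # = Gamma Gamma^T, rank q
Gamma.hat <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]), q)
max(abs(Gamma.hat %*% t(Gamma.hat) - Gamma %*% t(Gamma)))   # essentially 0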
Suppose we knew \(\FactorLoadings\)
then we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}\]
“One person’s vicious circle is another’s iterative approximation”:
- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on all the others, and use the residual variance as \(\Uniquenesses_{ii}\) (see the sketch after this list)
- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores
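Here is a minimal R sketch of that iteration (my own implementation of the idea on the slides, not production code; Gamma, psi and X are the simulated quantities from above):
# Alternate between solving for the loadings given the uniquenesses and vice versa
S <- cov(X)                         # sample variance matrix of the simulated data
psi.hat <- 1 / diag(solve(S))       # initial guess: residual variance of each
                                    # observable regressed on all the others
for (iteration in 1:100) {
    eig <- eigen(S - diag(psi.hat))
    Gamma.hat <- eig$vectors[, 1:q, drop = FALSE] %*%
        diag(sqrt(pmax(eig$values[1:q], 0)), q)
    psi.new <- pmax(diag(S - Gamma.hat %*% t(Gamma.hat)), 1e-6)
    if (max(abs(psi.new - psi.hat)) < 1e-8) break
    psi.hat <- psi.new
}
# Fitted low-rank-plus-diagonal variance vs. the truth: small for large n
max(abs((Gamma.hat %*% t(Gamma.hat) + diag(psi.hat)) - (Gamma %*% t(Gamma) + diag(psi))))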
small.height <- 80
small.width <- 60
source("../../hw/03/eigendresses.R")
dress.images <- image.directory.df(path = "../../hw/03/amazon-dress-jpgs/",
    pattern = "*.jpg", width = small.width, height = small.height)
library("cate") # For high-dimensional (p > n) factor models
# The function is a little fussy and needs its input to be a matrix rather
# than a data frame. Also, it has a bunch of estimation methods, but the
# default one doesn't work nicely when some observables have zero variance
# (here, white pixels at the edges of every image) --- so use something a
# little more robust
dresses.fa.1 <- factor.analysis(as.matrix(dress.images), r = 1, method = "pc")
summary(dresses.fa.1) # Factor loadings, factor scores, uniquenesses
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
par(mfrow = c(1, 2))
plot(vector.to.image(dresses.fa.1$Gamma, height = small.height, width = small.width))
plot(vector.to.image(-dresses.fa.1$Gamma, height = small.height, width = small.width))
par(mfrow = c(1, 1))
dresses.fa.5 <- factor.analysis(as.matrix(dress.images), r = 5, method = "pc")
summary(dresses.fa.5)
## Length Class Mode
## Gamma 72000 -none- numeric
## Z 1025 -none- numeric
## Sigma 14400 -none- numeric
Rough interpretations of the five factors: dress vs. model, width of dress, pose, pose, pose (?)
Recover image no. 1 from the factor scores
par(mfrow = c(1, 2))
plot(vector.to.image(dress.images[1, ], height = small.height, width = small.width))
plot(vector.to.image(dresses.fa.5$Gamma %*% dresses.fa.5$Z[1, ],
    height = small.height, width = small.width))
par(mfrow = c(1, 1))
Checking assumptions
- Can’t check assumptions about \(\vec{F}\) or \(\vec{\epsilon}\) directly
- Can check whether \(\Var{\vec{X}}\) is low-rank-plus-noise
- Need to know how far we should expect \(\Var{\vec{X}}\) to be from low-rank-plus-noise
- Can simulate
- Exact theory if you assume everything’s Gaussian
- Other models can also give low-rank-plus-noise covariance
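One way to operationalize the simulation idea (a sketch of my own, assuming Gaussian factors and noise, and reusing Gamma.hat, psi.hat and X from the sketches above): simulate many data sets from the fitted model and ask whether the observed discrepancy between the sample variance matrix and the fitted low-rank-plus-diagonal matrix is unusually large.
# Parametric-bootstrap-flavored check of the low-rank-plus-noise assumption
discrepancy <- function(x, loadings, uniq) {
    max(abs(cov(x) - (loadings %*% t(loadings) + diag(uniq))))
}
observed <- discrepancy(X, Gamma.hat, psi.hat)
simulated <- replicate(100, {
    f.sim <- matrix(rnorm(n * q), nrow = n)
    eps.sim <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(psi.hat))
    discrepancy(f.sim %*% t(Gamma.hat) + eps.sim, Gamma.hat, psi.hat)
})
mean(simulated >= observed)    # very small values suggest a poor fit
A more careful version would re-estimate \(\FactorLoadings\) and \(\Uniquenesses\) on each simulated data set before computing the discrepancy.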
Caution: the rotation problem
Remember the factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\] with \(\Var{\vec{F}} = \mathbf{I}\), \(\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\), \(\Var{\vec{\epsilon}} = \Uniquenesses\)
- Now consider \(\vec{G} = \mathbf{r} \vec{F}\) for any orthogonal matrix \(\mathbf{r}\) \[\begin{eqnarray}
\vec{G} & = & \mathbf{r} \vec{F}\\
\Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\
& = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\
\Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\
\vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\
\vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\
& = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon}, \quad \text{where } \FactorLoadings^{\prime} \equiv \FactorLoadings \mathbf{r}^T
\end{eqnarray}\]
- Once we’ve found one factor solution, we can rotate to another, and nothing observable changes
- In other words: we’re free to use any coordinate system we like for the latent variables
- Really a problem if we want to interpret the factors
- Different rotations make exactly the same predictions about the data
- If we prefer one over another, it cannot be because one of them fits the data better or has more empirical support (at least not from these data)
- On the other hand, if our initial estimate of \(\FactorLoadings\) is hard to interpret, we can always try rotating to make it easier to tell stories about
Rotation is no problem at all if we just want to predict
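A tiny numerical illustration of the rotation problem (my own sketch, reusing the simulated Gamma and psi from above): rotate the loadings by a random orthogonal matrix and check that the implied variance matrix of the observables is unchanged.
# Rotated loadings imply exactly the same Var(X)
r <- qr.Q(qr(matrix(rnorm(q * q), nrow = q)))   # random orthogonal q x q matrix
max(abs(t(r) %*% r - diag(q)))                  # essentially 0, so r is orthogonal
Gamma.rot <- Gamma %*% t(r)                     # loadings for G = r F
max(abs((Gamma.rot %*% t(Gamma.rot)) - (Gamma %*% t(Gamma))))   # essentially 0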
Applications
- Factor analysis begins with Spearman (1904) (IQ testing)
- Thurstone (1934) made it a general tool for psychometrics
- “Five factor model” of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
- A complicated story of dictionaries, lunatic asylums, and linear algebra (Paul 2004)
- Fails goodness-of-fit tests but that doesn’t stop the psychologists (Borsboom 2006)
- More recent applications:
- Netflix again
- Cambridge Analytica
- Not covered here: spatial and especially spatio-temporal data
Netflix
- Each movie loads on to each factor
- E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding” but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
- Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
- Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
- Rotation: because of the rotation problem, any such labels are only one choice among many observationally equivalent ones
- Each user also has a score on each factor
- E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
- Ratings are inner products of user factor scores and movie loadings, plus noise (in symbols, see the display after this list)
- Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
- Further reading: see readings for last lecture
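In symbols (my notation, not from the slides): if \(\vec{\gamma}_i\) is the vector of loadings for movie \(i\) and \(\vec{f}_u\) the vector of factor scores for user \(u\), the modeled rating of movie \(i\) by user \(u\) is \[
X_{ui} = \vec{\gamma}_i \cdot \vec{f}_u + \epsilon_{ui}
\]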
Cambridge Analytica
- UK political-operations firm
- Starting point: Data sets of Facebook likes plus a five-factor personality test
- Ran regressions to link likes to personality factors
- Then snarfed a lot of data from other people about their Facebook likes
- Then extrapolated to personality scores
- Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
- Five-factor personality scores do correlate to political preferences (see, e.g., here), but so do education and IQ, which are all correlated with each other
- Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
- Not clear how well it worked or even how much of what they actually did used the estimated personality scores
Cambridge Analytica
- At a technical level, Cambridge Analytica made (or claimed to make) a lot of extrapolations
- From Facebook likes among initial app users to latent factor scores
- From Facebook likes among friends of app users to latent factor scores
- From factor scores to ad effectiveness
- How did modeling error and noise propagate along this chain?
- Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
- As with lots of data firms, a big contrast in rhetoric:
- To customers, claims of doing magic
- To regulators / governments, claims of being just polling / advising agency
- The recent (Netflix!) documentary takes their earlier PR at face value…
- Clearly they were shady, but they don’t seem to have been very effective
- “They meant ill, but they were incompetent” is not altogether comforting
- Further readings: linked to from the course homepage
Summary
- Factor analysis models the data as linear functions of a few latent variables plus noise
- Extends principal components to an actual statistical model
- Not the only way to get linear-plus-noise structure in the data
- Next time: Fast linear summaries, start of nonlinear structure
Backup: Estimating factor scores
- PC scores were just projections
Estimating factor scores isn’t so easy!
- Factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
It’d be convenient to estimate factor scores as \[
\FactorLoadings^{-1} \vec{X}
\] but \(\FactorLoadings^{-1}\) doesn’t exist (\(\FactorLoadings\) isn’t even square)!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
Here \[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\] and \[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\] so the optimal linear factor score estimates are \[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
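As a sketch in R (my own code, reusing the simulated X, Gamma, psi and f from above), the estimator applied row-by-row to the data matrix is:
# Optimal linear estimates of the factor scores for every item at once:
# each row of f.hat is Gamma^T (Gamma Gamma^T + Psi)^{-1} x for that item's x
f.hat <- X %*% solve(Gamma %*% t(Gamma) + diag(psi), Gamma)
dim(f.hat)                  # n x q
cor(f.hat[, 1], f[, 1])     # typically high, but not 1: scores are estimated with error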
Backup: Factor models and high-dimensional variance estimation
- With \(p\) observable features, a variance matrix has \(p(p+1)/2\) entries (by symmetry)
- Ordinarily, to estimate \(k\) parameters requires \(n \geq k\) data points, so we’d need at least \(p(p+1)/2\) data points to get a variance matrix
- So it looks like we need \(n=O(p^2)\) data points to estimate variance matrices
- Trouble if \(p=10^6\) or even \(10^4\)
- A \(q\)-factor model only has \(pq+p=p(q+1)\) parameters
- So we can get away with only \(O(p)\) data points
- What’s going on in the data example above, where \(p= 14400\) but \(n = 205\)?
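For the dress-image example above (with \(q = 5\), as in the five-factor fit), the counts work out to \[
\frac{p(p+1)}{2} = \frac{14400 \times 14401}{2} \approx 1.04 \times 10^{8}
\quad \text{vs.} \quad
p(q+1) = 14400 \times 6 = 86400
\] so the factor model needs vastly fewer parameters, though even 86,400 is far more than \(n = 205\).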
References
Paul, Annie Murphy. 2004. The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves. New York: Free Press.