Factor Analysis
36-462/662, Fall 2019
16 September 2019
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
Recap
- Start with \(n\) items in a database (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues
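As a refresher on the mechanics, here is a minimal R sketch (on made-up data, not anything from the lecture) confirming that the PC scores and their variances come from the eigendecomposition of the sample variance matrix:
# PCA by eigendecomposition, on simulated data (illustration only)
n <- 500; p <- 10
x <- matrix(rnorm(n * p), nrow = n) %*% matrix(rnorm(p * p), nrow = p)
x <- scale(x, center = TRUE, scale = FALSE)    # center the features
eig <- eigen(cov(x))      # cov() uses n-1 rather than n; same eigenvectors
w <- eig$vectors          # columns = principal directions
scores <- x %*% w         # PC scores = projections on to the eigenvectors
# variances along those directions = the eigenvalues
all.equal(apply(scores, 2, var), eig$values, check.attributes = FALSE)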
PCA is not a model
- PCA says nothing about what the data should look like
- PCA makes no predictions about new data (or old data!)
- PCA just finds a linear approximation to these data
- What would be a PCA-like model?
This is where factor analysis comes in
Remember PCA: \[
\S = \X \w
\] and \[
\X = \S \w^T
\]
(because \(\w^T = \w^{-1}\))
If we use only \(q\) PCs, then \[
\S_q = \X \w_q
\] but \[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when the equations don’t hold exactly: treat the error as random noise
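A quick numerical check of this point, again on made-up data (my own illustration, reusing the centered x from the sketch above): reconstruction from all \(p\) PCs is exact, but reconstruction from \(q < p\) PCs is only an approximation.
# Reconstructing the data from its PC scores
q <- 2
pca <- prcomp(x, center = FALSE)     # x already centered above
w <- pca$rotation                    # full p x p matrix of eigenvectors
s <- x %*% w
max(abs(x - s %*% t(w)))             # essentially 0: X = S w^T
s.q <- x %*% w[, 1:q]
max(abs(x - s.q %*% t(w[, 1:q])))    # > 0: X != S_q w_q^T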
The factor model
\(\vec{X}\) is \(p\)-dimensional and manifest, i.e., observable
\(\vec{F}\) is \(q\)-dimensional (\(q < p\)) and latent, i.e., hidden or unobserved
The model: \[\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}\]
The factor model
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
- Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
- Analogous to \(\beta\) in a linear regression
- Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
- \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
- Sometimes called the uniquenesses or the unique variance components
- Analogous to \(\sigma^2\) in a linear regression
- Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
- Means: all correlation between observables comes from the factors
- Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
- Not an assumption because we could always de-correlate, as in homework 2
- Assumption: \(\vec{\epsilon}\) is uncorrelated across units
- As we assume in linear regression…
The factor model: summary
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}\]
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}\]
- Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}\]
- \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
- or low-rank-plus-noise
- Contrast with PCA: that approximates the variance matrix as purely low-rank
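To make the low-rank-plus-diagonal structure concrete, here is a small simulation sketch (arbitrary made-up loadings and uniquenesses, my own illustration): draw \(\vec{F}\) and \(\vec{\epsilon}\) as in the model, form \(\vec{X}\), and compare the sample variance matrix of \(\vec{X}\) to \(\FactorLoadings\FactorLoadings^T + \Uniquenesses\).
# Simulate from a q = 2 factor model and check Var(X) = Gamma Gamma^T + Psi
n <- 1e4; p <- 5; q <- 2
Gamma <- matrix(rnorm(p * q), nrow = p)       # made-up factor loadings
psi <- runif(p, 0.5, 1.5)                     # diagonal of the uniquenesses
f <- matrix(rnorm(n * q), nrow = n)           # factor scores, Var(F) = I
epsilon <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(psi))   # noise
X <- f %*% t(Gamma) + epsilon                 # rows = items, columns = features
max(abs(cov(X) - (Gamma %*% t(Gamma) + diag(psi))))   # small; shrinks as n grows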
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}\]
Geometry
As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace of the \(p\)-dimensional feature space
Then \(\vec{\epsilon}\) perturbs the data out of this subspace
- If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
- (Modulo the rotation problem, to be discussed)
- If the noise isn’t zero, factors \(\neq\) PCs
- In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all
How do we estimate?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
Can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)
Suppose we knew \(\Uniquenesses\)
- we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent
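Continuing the simulation sketch from above (my own illustration, not the lecture's code), here is the eigen-decomposition step: given the true \(\Uniquenesses\), the reconstructed loadings reproduce \(\FactorLoadings\FactorLoadings^T\), though \(\FactorLoadings\) itself is only recovered up to rotation (more on that below).
# Recover loadings from Var(X) - Psi by eigendecomposition
Sigma.X <- Gamma %*% t(Gamma) + diag(psi)     # population Var(X)
eig <- eigen(Sigma.X - diag(psi))             # = Gamma Gamma^T, rank q
Gamma.hat <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]), q)
max(abs(Gamma.hat %*% t(Gamma.hat) - Gamma %*% t(Gamma)))   # essentially 0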
Suppose we knew \(\FactorLoadings\)
then we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}\]
“One person’s vicious circle is another’s iterative approximation”:
- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on all the others, and use the residual variance as \(\Uniquenesses_{ii}\) (see the sketch after this list)
- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores
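Here is a minimal R sketch of that iteration (my own implementation of the idea on the slides, not production code; Gamma, psi and X are the simulated quantities from above):
# Alternate between solving for the loadings given the uniquenesses and vice versa
S <- cov(X)                         # sample variance matrix of the simulated data
psi.hat <- 1 / diag(solve(S))       # initial guess: residual variance of each
                                    # observable regressed on all the others
for (iteration in 1:100) {
    eig <- eigen(S - diag(psi.hat))
    Gamma.hat <- eig$vectors[, 1:q, drop = FALSE] %*%
        diag(sqrt(pmax(eig$values[1:q], 0)), q)
    psi.new <- pmax(diag(S - Gamma.hat %*% t(Gamma.hat)), 1e-6)
    if (max(abs(psi.new - psi.hat)) < 1e-8) break
    psi.hat <- psi.new
}
# Fitted low-rank-plus-diagonal variance vs. the truth: small for large n
max(abs((Gamma.hat %*% t(Gamma.hat) + diag(psi.hat)) - (Gamma %*% t(Gamma) + diag(psi))))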
small.height <- 80
small.width <- 60
source("../../hw/03/eigendresses.R")
dress.images <- image.directory.df(path = "../../hw/03/amazon-dress-jpgs/",
    pattern = "*.jpg", width = small.width, height = small.height)
library("cate") # For high-dimensional (p > n) factor models
# The function is a little fussy and needs its input to be a matrix rather
# than a data frame. Also, it has a bunch of estimation methods, but the
# default one doesn't work nicely when some observables have zero variance
# (here, white pixels at the edges of every image) --- so use something a
# little more robust
dresses.fa.1 <- factor.analysis(as.matrix(dress.images), r = 1, method = "pc")
summary(dresses.fa.1) # Factor loadings, factor scores, uniquenesses
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
par(mfrow = c(1, 2))
plot(vector.to.image(dresses.fa.1$Gamma, height = small.height, width = small.width))
plot(vector.to.image(-dresses.fa.1$Gamma, height = small.height, width = small.width))
par(mfrow = c(1, 1))
dresses.fa.5 <- factor.analysis(as.matrix(dress.images), r = 5, method = "pc")
summary(dresses.fa.5)
## Length Class Mode
## Gamma 72000 -none- numeric
## Z 1025 -none- numeric
## Sigma 14400 -none- numeric
Rough interpretations of the five factors: dress vs. model, width of dress, pose, pose, pose (?)
Recover image no. 1 from the factor scores
par(mfrow = c(1, 2))
plot(vector.to.image(dress.images[1, ], height = small.height, width = small.width))
plot(vector.to.image(dresses.fa.5$Gamma %*% dresses.fa.5$Z[1, ],
    height = small.height, width = small.width))
par(mfrow = c(1, 1))
Checking assumptions
- Can’t check assumptions about \(\vec{F}\) or \(\vec{\epsilon}\) directly
- Can check whether \(\Var{\vec{X}}\) is low-rank-plus-noise
- Need to know how far we should expect \(\Var{\vec{X}}\) to be from low-rank-plus-noise
- Can simulate
- Exact theory if you assume everything’s Gaussian
- Other models can also give low-rank-plus-noise covariance
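One way to operationalize the simulation idea (a sketch of my own, assuming Gaussian factors and noise, and reusing Gamma.hat, psi.hat and X from the sketches above): simulate many data sets from the fitted model and ask whether the observed discrepancy between the sample variance matrix and the fitted low-rank-plus-diagonal matrix is unusually large.
# Parametric-bootstrap-flavored check of the low-rank-plus-noise assumption
discrepancy <- function(x, loadings, uniq) {
    max(abs(cov(x) - (loadings %*% t(loadings) + diag(uniq))))
}
observed <- discrepancy(X, Gamma.hat, psi.hat)
simulated <- replicate(100, {
    f.sim <- matrix(rnorm(n * q), nrow = n)
    eps.sim <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(psi.hat))
    discrepancy(f.sim %*% t(Gamma.hat) + eps.sim, Gamma.hat, psi.hat)
})
mean(simulated >= observed)    # very small values suggest a poor fit
A more careful version would re-estimate \(\FactorLoadings\) and \(\Uniquenesses\) on each simulated data set before computing the discrepancy.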
Caution: the rotation problem
Remember the factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\] with \(\Var{\vec{F}} = \mathbf{I}\), \(\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\), \(\Var{\vec{\epsilon}} = \Uniquenesses\)
- Now consider \(\vec{G} = \mathbf{r} \vec{F}\) for any orthogonal matrix \(\mathbf{r}\) \[\begin{eqnarray}
\vec{G} & = & \mathbf{r} \vec{F}\\
\Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\
& = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\
\Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\
\vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\
\vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\
& = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon}, \quad \text{where } \FactorLoadings^{\prime} \equiv \FactorLoadings \mathbf{r}^T
\end{eqnarray}\]
- Once we’ve found one factor solution, we can rotate to another, and nothing observable changes
- In other words: we’re free to use any coordinate system we like for the latent variables
- Really a problem if we want to interpret the factors
- Different rotations make exactly the same predictions about the data
- If we prefer one over another, it cannot be because one of them fits the data better or has more empirical support (at least not from these data)
- On the other hand, if our initial estimate of \(\FactorLoadings\) is hard to interpret, we can always try rotating to make it easier to tell stories about
Rotation is no problem at all if we just want to predict
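A tiny numerical illustration of the rotation problem (my own sketch, reusing the simulated Gamma and psi from above): rotate the loadings by a random orthogonal matrix and check that the implied variance matrix of the observables is unchanged.
# Rotated loadings imply exactly the same Var(X)
r <- qr.Q(qr(matrix(rnorm(q * q), nrow = q)))   # random orthogonal q x q matrix
max(abs(t(r) %*% r - diag(q)))                  # essentially 0, so r is orthogonal
Gamma.rot <- Gamma %*% t(r)                     # loadings for G = r F
max(abs((Gamma.rot %*% t(Gamma.rot)) - (Gamma %*% t(Gamma))))   # essentially 0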
Applications
- Factor analysis begins with Spearman (1904) (IQ testing)
- Thurstone (1934) made it a general tool for psychometrics
- “Five factor model” of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
- A complicated story of dictionaries, lunatic asylums, and linear algebra (Paul 2004)
- Fails goodness-of-fit tests but that doesn’t stop the psychologists (Borsboom 2006)
- More recent applications:
- Netflix again
- Cambridge Analytica
- Not covered here: spatial and especially spatio-temporal data
Netflix
- Each movie loads on to each factor
- E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding” but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
- Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
- Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
- Rotation: because of the rotation problem, any such labels are only one choice among many observationally equivalent ones
- Each user also has a score on each factor
- E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
- Ratings are inner products of user factor scores and movie loadings, plus noise (in symbols, see the display after this list)
- Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
- Further reading: see readings for last lecture
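In symbols (my notation, not from the slides): if \(\vec{\gamma}_i\) is the vector of loadings for movie \(i\) and \(\vec{f}_u\) the vector of factor scores for user \(u\), the modeled rating of movie \(i\) by user \(u\) is \[
X_{ui} = \vec{\gamma}_i \cdot \vec{f}_u + \epsilon_{ui}
\]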
Cambridge Analytica
- UK political-operations firm
- Starting point: Data sets of Facebook likes plus a five-factor personality test
- Ran regressions to link likes to personality factors
- Then snarfed a lot of data from other people about their Facebook likes
- Then extrapolated to personality scores
- Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
- Five-factor personality scores do correlate to political preferences (see, e.g., here), but so do education and IQ, which are all correlated with each other
- Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
- Not clear how well it worked or even how much of what they actually did used the estimated personality scores
Cambridge Analytica
- At a technical level, Cambridge Analytica made (or claimed to make) a lot of extrapolations
- From Facebook likes among initial app users to latent factor scores
- From Facebook likes among friends of app users to latent factor scores
- From factor scores to ad effectiveness
- How did modeling error and noise propagate along this chain?
- Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
- As with lots of data firms, a big contrast in rhetoric:
- To customers, claims of doing magic
- To regulators / governments, claims of being just polling / advising agency
- The recent (Netflix!) documentary takes their earlier PR at face value…
- Clearly they were shady, but they don’t seem to have been very effective
- “They meant ill, but they were incompetent” is not altogether comforting
- Further readings: linked to from the course homepage
Summary
- Factor analysis models the data as linear functions of a few latent variables plus noise
- Extends principal components to an actual statistical model
- Not the only way to get linear-plus-noise structure in the data
- Next time: Fast linear summaries, start of nonlinear structure
Backup: Estimating factor scores
- PC scores were just projections
Estimating factor scores isn’t so easy!
- Factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
It’d be convenient to estimate factor scores as \[
\FactorLoadings^{-1} \vec{X}
\] but \(\FactorLoadings^{-1}\) doesn’t exist (\(\FactorLoadings\) isn’t even square)!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
Here \[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\] and \[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\] so the optimal linear factor score estimates are \[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
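As a sketch in R (my own code, reusing the simulated X, Gamma, psi and f from above), the estimator applied row-by-row to the data matrix is:
# Optimal linear estimates of the factor scores for every item at once:
# each row of f.hat is Gamma^T (Gamma Gamma^T + Psi)^{-1} x for that item's x
f.hat <- X %*% solve(Gamma %*% t(Gamma) + diag(psi), Gamma)
dim(f.hat)                  # n x q
cor(f.hat[, 1], f[, 1])     # typically high, but not 1: scores are estimated with error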
Backup: Factor models and high-dimensional variance estimation
- With \(p\) observable features, a variance matrix has \(p(p+1)/2\) entries (by symmetry)
- Ordinarily, to estimate \(k\) parameters requires \(n \geq k\) data points, so we’d need at least \(p(p+1)/2\) data points to get a variance matrix
- So it looks like we need \(n=O(p^2)\) data points to estimate variance matrices
- Trouble if \(p=10^6\) or even \(10^4\)
- A \(q\)-factor model only has \(pq+p=p(q+1)\) parameters
- So we can get away with only \(O(p)\) data points
- What’s going on in the data example above, where \(p= 14400\) but \(n = 205\)?
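For the dress-image example above (with \(q = 5\), as in the five-factor fit), the counts work out to \[
\frac{p(p+1)}{2} = \frac{14400 \times 14401}{2} \approx 1.04 \times 10^{8}
\quad \text{vs.} \quad
p(q+1) = 14400 \times 6 = 86400
\] so the factor model needs vastly fewer parameters, though even 86,400 is far more than \(n = 205\).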
References
Paul, Annie Murphy. 2004. The Cult of Personality: How Personality Tests Are Leading Us to Miseducate Our Children, Mismanage Our Companies, and Misunderstand Ourselves. New York: Free Press.