Factor Analysis

36-462/662, Fall 2019

16 September 2019

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\FactorLoadings}{\mathbf{\Gamma}} \newcommand{\Uniquenesses}{\mathbf{\psi}} \]

Recap

PCA is not a model

This is where factor analysis comes in

Remember PCA: \[ \S = \X \w \] and \[ \X = \S \w^T \]

(because \(\w^T = \w^{-1}\))

If we use only \(q\) PCs, then \[ \S_q = \X \w_q \] but \[ \X \neq \S_q \w_q^T \]
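To make the truncation concrete, here is a small self-contained sketch on synthetic data (not the course's examples): with all \(p\) PCs the reconstruction is exact, with \(q < p\) it is only an approximation.

```r
# Sketch: PCA reconstruction from the first q PCs (synthetic data for illustration)
set.seed(42)
x <- matrix(rnorm(100 * 5), nrow = 100) %*% matrix(rnorm(25), 5, 5)
w <- prcomp(x, center = FALSE)$rotation   # loadings, p x p orthogonal matrix
s <- x %*% w                              # scores: S = X w
all.equal(x, s %*% t(w))                  # TRUE: full reconstruction, X = S w^T
q <- 2
s.q <- x %*% w[, 1:q]                     # keep only the first q PCs
x.hat <- s.q %*% t(w[, 1:q])              # rank-q approximation
mean((x - x.hat)^2)                       # > 0: truncated reconstruction is not X
```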

The factor model

\(\vec{X}\) is \(p\)-dimensional, and manifest, i.e., observable

\(\vec{F}\) is \(q\)-dimensional, \(q < p\), but latent, i.e., hidden or unobserved

The model: \[\begin{eqnarray*} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ (\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise}) \end{eqnarray*}\]

The factor model: summary

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Expect{\vec{\epsilon}} & = & \vec{0}\\ \Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\ \Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\ \Var{\vec{F}} & = & \mathbf{I} \end{eqnarray}\]
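As a concrete illustration, we can simulate from the model (all parameter values below are made up) and check that the sample moments match the assumptions:

```r
# Simulate n draws from the factor model, X = Gamma F + epsilon, in row form;
# the loadings and uniquenesses here are arbitrary illustrative values
set.seed(1)
n <- 1e5; p <- 4; q <- 1
Gamma <- matrix(c(0.9, 0.8, 0.7, 0.6), p, q)           # factor loadings
psi <- c(0.5, 0.4, 0.3, 0.2)                           # uniquenesses
f <- matrix(rnorm(n * q), n, q)                         # Var[F] = I
eps <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(psi))   # Var[eps] = Psi, diagonal
X <- f %*% t(Gamma) + eps
round(cov(f, eps), 2)   # ~ 0: factors uncorrelated with noise
var(f)                  # ~ 1
```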

Some consequences of the assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}} \end{eqnarray}\]

Some consequences of the assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\ & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses \end{eqnarray}\]
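The variance identity can be checked numerically on simulated data (again with made-up parameter values):

```r
# Numerical check of Var[X] = Gamma Gamma^T + Psi; parameters are illustrative
set.seed(2)
n <- 2e5; p <- 3
Gamma <- c(0.9, 0.7, 0.5)
psi <- c(0.3, 0.5, 0.7)
f <- rnorm(n)
X <- outer(f, Gamma) + matrix(rnorm(n * p), n, p) %*% diag(sqrt(psi))
V.theory <- outer(Gamma, Gamma) + diag(psi)
max(abs(var(X) - V.theory))   # small, and shrinks as n grows
```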

Some consequences of the assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\ \Cov{X_i, X_j} & = & \text{what?} \end{eqnarray}\]

Geometry


How do we estimate?

\[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]

Can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)

Suppose we knew \(\Uniquenesses\): then we could get the loadings from \(\FactorLoadings\FactorLoadings^T = \Var{\vec{X}} - \Uniquenesses\)

Suppose we knew \(\FactorLoadings\)

then we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses \end{eqnarray}\]

“One person’s vicious circle is another’s iterative approximation”:
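Before turning to real data, the alternating idea can be sketched as a "principal factors" iteration: guess \(\Uniquenesses\), get \(\FactorLoadings\) from the top eigenvectors of \(\Var{\vec{X}} - \Uniquenesses\), recompute \(\Uniquenesses\) from the diagonal, and repeat. This is a bare-bones illustration of the idea, not the algorithm any particular package uses.

```r
# Alternate between updating loadings given uniquenesses and vice versa
principal.factors <- function(V, q, iters = 200) {
  psi <- diag(V) / 2                        # initial guess at the uniquenesses
  for (i in 1:iters) {
    eig <- eigen(V - diag(psi), symmetric = TRUE)
    # Loadings from the top q eigenvectors, scaled by sqrt(eigenvalue)
    Gamma <- eig$vectors[, 1:q, drop = FALSE] %*%
      diag(sqrt(pmax(eig$values[1:q], 0)), nrow = q)
    # New uniquenesses from the diagonal of V - Gamma Gamma^T
    psi <- pmax(diag(V) - rowSums(Gamma^2), 1e-6)
  }
  list(Gamma = Gamma, psi = psi)
}
# Check on an exactly factor-structured V (made-up loadings):
Gamma0 <- c(0.9, 0.8, 0.7, 0.6)
psi0 <- c(0.4, 0.3, 0.5, 0.6)
V <- outer(Gamma0, Gamma0) + diag(psi0)
fit <- principal.factors(V, q = 1)
max(abs(abs(fit$Gamma) - Gamma0))   # small: loadings recovered up to sign
```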

small.height <- 80
small.width <- 60
source("../../hw/03/eigendresses.R")
dress.images <- image.directory.df(path = "../../hw/03/amazon-dress-jpgs/", 
    pattern = "*.jpg", width = small.width, height = small.height)
library("cate")  # For high-dimensional (p > n) factor models
# factor.analysis() is a little fussy and needs its input to be a matrix
# rather than a data frame. Also, it has a bunch of estimation methods, but
# the default one doesn't work nicely when some observables have zero
# variance (here, white pixels at the edges of every image), so use
# something a little more robust
dresses.fa.1 <- factor.analysis(as.matrix(dress.images), r = 1, method = "pc")
summary(dresses.fa.1)  # Factor loadings, factor scores, uniquenesses
##       Length Class  Mode   
## Gamma 14400  -none- numeric
## Z       205  -none- numeric
## Sigma 14400  -none- numeric

par(mfrow = c(1, 2))
plot(vector.to.image(dresses.fa.1$Gamma, height = small.height, width = small.width))
plot(vector.to.image(-dresses.fa.1$Gamma, height = small.height, width = small.width))

par(mfrow = c(1, 1))

dresses.fa.5 <- factor.analysis(as.matrix(dress.images), r = 5, method = "pc")
summary(dresses.fa.5)
##       Length Class  Mode   
## Gamma 72000  -none- numeric
## Z      1025  -none- numeric
## Sigma 14400  -none- numeric

Possible interpretations of the five factors: dress vs. model, width of dress, pose, pose, pose (?)

Recover image no. 1 from the factor scores

par(mfrow = c(1, 2))
plot(vector.to.image(dress.images[1, ], height = small.height, width = small.width))
plot(vector.to.image(dresses.fa.5$Gamma %*% dresses.fa.5$Z[1, ], height = small.height, 
    width = small.width))

par(mfrow = c(1, 1))
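For comparison, base R's `factanal()` fits the factor model by maximum likelihood, but it needs \(n > p\), so it cannot handle the dress images (\(p = 14400\) pixels, \(n = 205\) images). Here is a sketch on simulated data with made-up parameters:

```r
# Maximum-likelihood factor analysis with base R's factanal()
set.seed(3)
n <- 500
Gamma <- c(0.9, 0.8, 0.7, 0.6)
psi <- c(0.3, 0.4, 0.5, 0.6)
f <- rnorm(n)
X <- outer(f, Gamma) + matrix(rnorm(n * 4), n, 4) %*% diag(sqrt(psi))
fit <- factanal(X, factors = 1)
# factanal() works with the correlation matrix, so its loadings are for
# standardized variables: compare |loadings| to Gamma / sd of each column
fit$loadings
fit$uniquenesses
```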

Checking assumptions

Caution: the rotation problem
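The core of the rotation problem: for any orthogonal matrix \(\mathbf{R}\), replacing \(\FactorLoadings\) with \(\FactorLoadings \mathbf{R}\) leaves \(\FactorLoadings\FactorLoadings^T\), and hence \(\Var{\vec{X}}\), unchanged. A quick numerical illustration with made-up two-factor loadings:

```r
# (Gamma R)(Gamma R)^T = Gamma Gamma^T for any orthogonal R,
# so the loadings are only identified up to rotation
Gamma <- matrix(c(0.9, 0.8, 0.7, 0.6,
                  0.1, -0.2, 0.3, -0.4), ncol = 2)
theta <- pi / 6
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # rotation by theta
max(abs(Gamma %*% t(Gamma) - (Gamma %*% R) %*% t(Gamma %*% R)))  # effectively zero
```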

Applications

Netflix

Cambridge Analytica

Summary

Backup: Estimating factor scores
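The backup slide's details are not reproduced here, but one standard approach is the regression ("Thomson") estimator, \(\hat{\vec{F}} = \FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}\), sketched below with made-up parameters:

```r
# Regression ("Thomson") estimator of the factor scores; parameters illustrative
set.seed(4)
Gamma <- c(0.9, 0.8, 0.7, 0.6)
psi <- c(0.3, 0.4, 0.5, 0.6)
n <- 1e4
f <- rnorm(n)
X <- outer(f, Gamma) + matrix(rnorm(n * 4), n, 4) %*% diag(sqrt(psi))
V <- outer(Gamma, Gamma) + diag(psi)   # Var[X] implied by the model
f.hat <- X %*% solve(V, Gamma)         # one estimated score per row of X
cor(f, f.hat)                          # high, but < 1: scores are noisy estimates
```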

Backup: Factor models and high-dimensional variance estimation
