\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

In our last episode…

We have multivariate data \(\X\) (dimension = \([n\times p]\))
We want the best \(q\)-dimensional linear approximation
Solution: Principal components analysis
- Take the \(q\) leading eigenvectors of \(\V \equiv \frac{1}{n}\X^T \X =\) sample/empirical covariance matrix
- These eigenvectors = the principal components
- Project the data on to the principal components
Assemble the eigenvectors into \(\w\) (\([p\times q]\))
Scores for the data are \(\X\w \equiv \S\) (\([n\times q]\))
- There are \(n\) scores on PC \(i\), \(\vec{x}_1 \cdot \vec{w}_i, \vec{x}_2 \cdot \vec{w}_i, \ldots \vec{x}_n \cdot \vec{w}_i\)
- The matrix \(\S\) assembles all the scores
Approximations are \((\X\w) \w^T = \S \w^T\) (\([n\times p]\))
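The recap above can be sketched in R on simulated data. This is only an illustration under stated assumptions, not the lecture's own code: the simulated data and the names `x`, `v`, `w`, `s` and `x.approx` are made up for the example.

```r
# Minimal sketch of PCA "by hand" on simulated data (data and names are assumptions)
set.seed(1)
n <- 200; p <- 5; q <- 2
x <- matrix(rnorm(n * p), nrow = n) %*% matrix(rnorm(p * p), nrow = p)
x <- scale(x, center = TRUE, scale = FALSE)  # center each column
v <- crossprod(x) / n                        # V = (1/n) X^T X, [p x p]
eig <- eigen(v, symmetric = TRUE)            # eigenvalues come back in decreasing order
w <- eig$vectors[, 1:q]                      # the q leading eigenvectors, [p x q]
s <- x %*% w                                 # scores, [n x q]
x.approx <- s %*% t(w)                       # approximations, [n x p]
dim(s)
dim(x.approx)
```

Note that `prcomp` divides by \(n-1\) rather than \(n\), so its squared `sdev` values differ from these eigenvalues by a factor of \(n/(n-1)\); the eigenvectors agree up to sign.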

Some properties of the PCs

The principal components are orthonormal
- \(\vec{w}_i \cdot \vec{w}_i = 1\)
- \(\vec{w}_i \cdot \vec{w}_j = 0\) (unless \(i=j\))
- Or in matrix form: \(\w^T\w = \mathbf{I}\)
PC1 is the direction of maximum variance through the data
- That variance is \(\lambda_1\), biggest eigenvalue of \(\V\)
PC \(i+1\) is the direction of maximum variance \(\perp\) PC1, PC2, \(\ldots\) PC \(i\)
- That variance is \(\lambda_{i+1}\)
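These properties can be checked numerically. The sketch below uses the built-in `state.x77` data; the variable names and the comparison against 1000 random directions are assumptions made for illustration, not a proof.

```r
# Check orthonormality and the maximum-variance property on built-in data
x <- scale(state.x77)                       # centered and scaled
n <- nrow(x)
eig <- eigen(crossprod(x) / n, symmetric = TRUE)
w <- eig$vectors
# Orthonormality: w^T w should be the identity, so this should be ~0
max(abs(crossprod(w) - diag(ncol(x))))
# Variance of the projection on PC1 should equal the largest eigenvalue
pc1.var <- mean((x %*% w[, 1])^2)
c(pc1.var, eig$values[1])
# No unit vector should beat PC1; try some random directions
set.seed(2)
random.dirs <- matrix(rnorm(ncol(x) * 1000), nrow = ncol(x))
random.dirs <- sweep(random.dirs, 2, sqrt(colSums(random.dirs^2)), "/")
max(colMeans((x %*% random.dirs)^2)) <= pc1.var
```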

Some properties of the eigenvalues

All eigenvalues \(\geq 0\)
In general, \(p\) non-zero eigenvalues (but not always)
If the data are exactly in a \(q\)-dimensional subspace, then exactly \(q\) non-zero eigenvalues
If \(n < p\), at most \(n\) non-zero eigenvalues (and at most \(n-1\) once the data are centered)
- Two points define a line, three define a plane, …
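A quick way to see the \(n < p\) case is to count the non-zero eigenvalues for a small simulated data set. The data, the names and the tolerance `1e-10` are assumptions for this sketch. Because the columns are centered, the points span at most \(n-1\) dimensions, in line with "two points define a line".

```r
# Count non-zero eigenvalues when there are fewer observations than variables
set.seed(3)
n <- 4; p <- 10
x <- scale(matrix(rnorm(n * p), nrow = n), center = TRUE, scale = FALSE)
lambda <- eigen(crossprod(x) / n, symmetric = TRUE)$values
# Eigenvalues that are non-zero up to numerical tolerance
sum(lambda > 1e-10)   # 3 here, i.e. n - 1
```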

Some properties of PCA as a whole

If we use all \(p\) principal components, we have the eigendecomposition of \(\V\): \[ \V = \w \mathbf{\Lambda} \mathbf{w}^T \] \(\mathbf{\Lambda}=\) diagonal matrix of eigenvalues \(\lambda_1, \ldots \lambda_p\)
If we use all \(p\) principal components, \[ \X = \S\w^T \]
If we use only the top \(q\) PCs, we get:
- the best rank-\(q\) approximation to \(\V\)
- the best dimension-\(q\) approximation to \(\X\)
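Both statements about \(\X\) can be illustrated on the built-in `state.x77` data. The names and the choice \(q = 2\) are assumptions for this sketch. It relies on the standard fact that the mean squared error of the rank-\(q\) reconstruction equals the sum of the discarded eigenvalues.

```r
# Exact reconstruction with all p PCs, and the error left by the top q
x <- scale(state.x77)
n <- nrow(x); p <- ncol(x)
eig <- eigen(crossprod(x) / n, symmetric = TRUE)
w <- eig$vectors
# All p components: X = S w^T, so this should be ~0
max(abs(x - (x %*% w) %*% t(w)))
# Top q components: per-observation squared error vs. sum of discarded eigenvalues
q <- 2
w.q <- w[, 1:q]
x.q <- (x %*% w.q) %*% t(w.q)
c(sum((x - x.q)^2) / n, sum(eig$values[(q + 1):p]))
```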

Some properties of PC scores

Average score on each PC \(=0\) (b/c we centered the data)
Variance of score on PC \(i\) \(=\lambda_i\) (by construction)
Covariance of score on PC \(i\) with score on PC \(j\) \(=0\)

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w ~\text{ by definition of} ~ \V\\ & = & \w^T ( \w \mathbf{\Lambda} \mathbf{w}^T) \w ~\text{by eigendecomposition}\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]
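The derivation can be checked numerically with `prcomp` on the built-in `state.x77` data (a sketch; the name `score.cov` is an assumption). Both `prcomp` and `cov` divide by \(n-1\) rather than \(n\), so they are consistent with each other.

```r
# The covariance matrix of the scores should be diagonal, with the eigenvalues on the diagonal
state.pca <- prcomp(state.x77, scale. = TRUE)
score.cov <- cov(state.pca$x)
# Largest discrepancy from diag(eigenvalues); should be ~0
max(abs(score.cov - diag(state.pca$sdev^2)))
# Average score on each PC; should be ~0
round(colMeans(state.pca$x), 10)
```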

Another way to think about PCA

The original coordinates are correlated
There is always another coordinate system with uncorrelated coordinates
We’re rotating to that coordinate system
- Rotating to new coordinates \(\Rightarrow\) multiplying by an orthogonal matrix
- That matrix is \(\mathbf{w}\)
- The new coordinates are the scores

PCA can be used for any multivariate data

Nothing in the math of PCA cares about where the data came from
- \(n\) measurements on \(p\) variables is all that matters
Areas of application:
- Reducing multiple measurements
- Dealing with collinearity or high-dimensional covariates in regression (see backup)
- Recommendation engines (e.g. Netflix (Feuerverger, He, and Khatri 2012))
  - \(n\) users' ratings of, or engagement with, \(p\) items of content, used to predict which users will like/engage with which items
  - Your feed on Facebook, Twitter, etc. is a recommendation engine that’s trying to maximize your engagement
- Gene expression levels in molecular biology (Wall, Rechtsteiner, and Rocha 2003)
- Fashion
… and, of course, spatio-temporal data

PCA with spatial data

\(n\) locations for \(p\) variables
- We saw this last time with the states
- Each PC is a \(p\)-dimensional vector (in “feature space”)
- Scores are distributed over physical space
  - That is, each location has a score (on each PC)
vs. \(n\) variables at \(p\) locations
- Each PC is a spatial pattern
- One score for each original variable
- See backup for examples of doing this with the states from last time
These are almost interchangeable
- Again, see backups for the math

A famous example

For each of \(n\) variant genes (“alleles”):
- Measure prevalence of allele at \(p\) different locations
- Originally by blood tests, now gene sequencing
- Each observation is a fraction \(\in [0,1]\)
Accumulated over many thousands of genes (\(n\)) and of locations (\(p\))
Outstanding reference: Cavalli-Sforza, Menozzi, and Piazza (1994)
- or the good-parts version, Cavalli-Sforza (2000)

Some maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

World PC1

(\(\approx 35\%\) of between-population variance)

Some maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

World PC2

(\(\approx 18\%\) of between-population variance)

Some maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

World PC3

(\(\approx 12\%\) of between-population variance)

Some maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

green = PC1, blue = PC2, red = PC3
No sharp breaks…
- This is \(\approx 65\%\) of between-population variance
- \(\approx 90\%\) of variance across people is within populations
Can do similar analyses on smaller geographic scales and learn more about history (see backups)

PCA with multiple time series

One approach: \(n\) time points for \(p\) variables
- Each PC is a \(p\)-dimensional vector \(\Rightarrow\) a pattern across variables
Flip perspective: \(n\) variables for \(p\) time-points
- Each PC is a \(p\)-dimensional vector (one entry per time point) \(\Rightarrow\) a pattern across time
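The two layouts can be tried on simulated series. Everything here is a made-up illustration: a shared seasonal signal `time.signal`, a hypothetical per-variable `amplitude`, and a small amount of noise. With time points as rows, PC1 has one entry per variable and the scores have one entry per time point. After transposing, each PC has one entry per time point.

```r
# Simulated multiple time series: rows are time points, columns are variables
set.seed(4)
n.times <- 365; n.vars <- 6
time.signal <- sin(2 * pi * (1:n.times) / n.times)   # shared temporal pattern
amplitude <- seq(1, 2, length.out = n.vars)          # hypothetical pattern across variables
x <- outer(time.signal, amplitude) +
  matrix(rnorm(n.times * n.vars, sd = 0.1), nrow = n.times)
sim.pca <- prcomp(x)
# PC1: one entry per variable, close to the amplitude pattern (up to sign and scale)
length(sim.pca$rotation[, 1])
abs(cor(sim.pca$rotation[, 1], amplitude))
# Scores on PC1: one entry per time point, tracking the shared signal
length(sim.pca$x[, 1])
abs(cor(sim.pca$x[, 1], time.signal))
# Flipped layout: rows are variables, so each PC has one entry per time point
flipped.pca <- prcomp(t(x))
nrow(flipped.pca$rotation)
```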

Irish wind data

##   year month day   RPT   VAL   ROS   KIL   SHA  BIR   DUB   CLA   MUL   CLO
## 1   61     1   1 15.04 14.96 13.17  9.29 13.96 9.87 13.67 10.25 10.83 12.58
## 2   61     1   2 14.71 16.88 10.83  6.50 12.62 7.67 11.50 10.04  9.79  9.67
## 3   61     1   3 18.50 16.88 12.33 10.13 11.17 6.17 11.25  8.04  8.50  7.67
## 4   61     1   4 10.58  6.63 11.75  4.58  4.54 2.88  8.63  1.79  5.83  5.88
## 5   61     1   5 13.33 13.25 11.42  6.17 10.71 8.21 11.92  6.54 10.92 10.34
## 6   61     1   6 13.21  8.12  9.96  6.67  5.37 4.50 10.67  4.42  7.17  7.50
##     BEL   MAL                time
## 1 18.50 15.04 1961-01-01 12:00:00
## 2 17.54 13.83 1961-01-02 12:00:00
## 3 12.75 12.71 1961-01-03 12:00:00
## 4  5.46 10.88 1961-01-04 12:00:00
## 5 12.92 11.83 1961-01-05 12:00:00
## 6  8.12 13.17 1961-01-06 12:00:00

Irish wind data — one time series

Irish wind data — all the time series

PCA: \(n = 6574\), \(p=12\)

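# Columns 4:15 hold the wind speeds at the 12 stations (RPT through MAL)
wind.pca.1 <- prcomp(wind[, 4:15])
# Standard deviations of the scores, i.e. square roots of the eigenvalues
wind.pca.1$sdev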

##  [1] 15.149749  4.806761  3.848214  2.840283  2.796445  1.932717  1.809999
##  [8]  1.559231  1.408849  1.355770  1.164033  1.079990

PC1: The eigenvector

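# The sign of a PC is arbitrary; negate PC1 so its entries plot as positive
plot(-wind.pca.1$rotation[, 1], ylim = c(0, 1))
# Label each point with its station name
text(1:12, -wind.pca.1$rotation[, 1], pos = 3, labels = colnames(wind)[4:15])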

A pattern over space

PC1: The eigenvector

A function of space

PC1: The scores

A function of time

Try to describe the first component here

PCA with spatio-temporal data

… we just did this!
\(n\) time points for \(p\) locations (as with the wind data)
- PCs are spatial patterns; the scores are time series
\(n\) spatial locations, each with a time series of length \(p\)
- PCs are time series / temporal patterns; the scores are spatial patterns
- Trends, or components of trends?

Interpreting PCA results

PCs are linear combinations of the original coordinates
- \(\Rightarrow\) PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
  - Sometimes this is even what you want (Zeller and Carmines 1980)
Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA (Glymour 1998)
- Smoothing artifacts can be deadly (Novembre and Stephens 2008)
  - In essence: Yule and Slutsky strike again

PCA is exploratory analysis, not statistical inference

We assumed no model
- Pro: That’s the best linear approximation to these data, no matter what
- Con: doesn’t tell us where the data came from, what other data will look like, or how much our results are driven by noise
Prediction: PCA predicts nothing
Inference: If \(\V \rightarrow \Var{X}\) then PCs \(\rightarrow\) eigenvectors of \(\Var{X}\)
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
- We will see ways to tackle this by simulation

Some alternatives to PCA

Independent component analysis (ICA)
- PCA analyzes data into uncorrelated components
- But uncorrelated \(\neq\) independent (unless everything’s Gaussian)
- ICA tries to break data into statistically-independent additive components
- Measure dependence between components, minimize
- Good overview: Stone (2004)
Nonlinear approximation
- PCA finds low-dimensional linear approximation
- What if the real structure has curves?
- Locally linear embedding, spectral component analysis, …
- Could-do-worse overview: Shalizi (n.d.)
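The ICA point that uncorrelated \(\neq\) independent has a standard small illustration. The variables `a` and `b` below are made up for this sketch: `b` is a deterministic function of `a`, yet their correlation is close to zero.

```r
# Uncorrelated but clearly dependent: b is completely determined by a
set.seed(5)
a <- runif(1e4, min = -1, max = 1)
b <- a^2
cor(a, b)      # close to 0: (nearly) uncorrelated in the sample
cor(a^2, b)    # exactly 1: the dependence shows up after a nonlinear transformation
```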

Summing up

PCA rotates to new, uncorrelated coordinates
Using the first \(q\) PCs gives the best \(q\)-dimensional approximation to the data
These are the \(q\) directions of largest variance
We can make either the basis vectors or the scores into spatio-temporal patterns
Interpretation needs domain knowledge, and some caution
PCA does no inference or prediction

Some more maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

PC1 for Europe and Southwest Asia
- Agriculture starts in the Fertile Crescent
- Farmers (not just farming) spread out

Some more maps from Cavalli-Sforza, Menozzi, and Piazza (1993)

PC3 for Europe and Southwest Asia
- Something centered on the plains (“steppes”) north of the Caucasus mountains and the Black Sea
- Archaeology: Domestication of the horse, chariots
- Linguistics: origins of the Indo-European languages
- Most likely explanation: Indo-European-speaking barbarians using horses and chariots to expand out from the steppes (Anthony 2007)
- Similar patterns for Central and South Asia

Principal Components Regression

Say we want to linearly regress \(Y\) on \(X\), with \(X\) being \(p\)-dimensional
If \(p\) is too big, for whatever reason, we could do PCA on \(X\) and get a \(q\)-dimensional \(S\)
Then we regress \(Y\) on \(S\) and get our predictions that way
This is principal components regression
Variant: \(X=(Z,U)\) where \(Z\) is the variable we really care about and \(U\) is a big mess of controls
- Do PCA on \(U\), get \(q\)-dimensional scores \(S\), and then regress \(Y\) on \(Z\) and \(S\)
- This is less common
Major application of both forms of PC regression: \(n < p\), we’ve measured too many variables on each observation to get a fit by ordinary least squares
- Also called high-dimensional regression
Another major application: dealing with collinearity or multi-collinearity
- Multicollinearity means that the \(p\) predictor variables really live in a lower, \(q\)-dimensional subspace
- Could drop predictor variables until they’re no longer collinear
- Or you could just find the scores on the \(q\) non-zero principal components
No guarantee that the directions of maximum variance through the predictors are the directions along which \(Y\) varies \(\Rightarrow\) PC regression might not be very good
There are other ways of dealing with high-dimensional or collinear predictors…
- Ridge regression, the lasso, the elastic net, etc.
- But some recent work suggests PC regression is surprisingly good compared to those modern methods (Dhillon et al. 2013)
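A minimal sketch of the basic form of principal components regression, on simulated data. The data, the choice \(q = 3\) and all the names are assumptions for illustration; in practice \(q\) would have to be chosen somehow, e.g. by cross-validation.

```r
# Principal components regression: PCA on the predictors, then regress y on the scores
set.seed(6)
n <- 100; p <- 20; q <- 3
x <- matrix(rnorm(n * p), nrow = n)
y <- drop(x %*% rnorm(p)) + rnorm(n)
x.pca <- prcomp(x, scale. = TRUE)
s <- x.pca$x[, 1:q]                     # scores on the first q PCs, [n x q]
pcr.fit <- lm(y ~ s)
length(coef(pcr.fit))                   # intercept plus q slopes
# New data: project on the same PCs, then apply the fitted coefficients
x.new <- matrix(rnorm(5 * p), nrow = 5)
s.new <- predict(x.pca, newdata = x.new)[, 1:q]
y.hat <- drop(cbind(1, s.new) %*% coef(pcr.fit))
y.hat
```

The projection of the new points reuses the centering, scaling and rotation estimated from the original `x`; only the regression step ever looks at `y`.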

Orthogonal matrices

Matrix \(\mathbf{o}\) is orthogonal iff \(\mathbf{o}^T = \mathbf{o}^{-1}\)
\(\Leftrightarrow\) the columns of \(\mathbf{o}\) are orthonormal vectors
- Why we say “orthogonal matrix” rather than “orthonormal matrix” is lost in the mists of 19th century German mathematics
Every rotation (around the origin) corresponds to an orthogonal matrix
- E.g. to rotate by an angle \(\theta\) in two dimensions, the matrix is \(\mathbf{o} = \left[ \begin{array}{cc} \cos{\theta} & \sin{\theta} \\ -\sin{\theta} & \cos{\theta} \end{array}\right]\)
  - If that looks funny compared to what you saw in an earlier class, remember we’re writing vectors as \([1 \times p]\) matrices, so the rotation works as \(\X\mathbf{o}\)
- You can check that \(\mathbf{o}^T \mathbf{o} = \left[\begin{array}{cc} 1 & 0 \\ 0 & 1\end{array}\right] = \mathbf{I}\) (because \(\cos^2{\theta} + \sin^2{\theta} = 1\))
Are there orthogonal matrices which aren’t rotations?
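One way to explore this numerically is sketched below. The angle and the names `o` and `refl` are assumptions for the example. It checks \(\mathbf{o}^T\mathbf{o} = \mathbf{I}\) for the rotation matrix given above, then does the same for a reflection, which also passes the check but has determinant \(-1\) rather than \(+1\).

```r
# The 2D rotation matrix from the text, for one particular angle
theta <- pi / 6
o <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2, byrow = TRUE)
max(abs(t(o) %*% o - diag(2)))        # ~0, so o^T is the inverse of o
det(o)                                 # +1 for a rotation
# A reflection across the first axis: also orthogonal, but not a rotation
refl <- diag(c(1, -1))
max(abs(t(refl) %*% refl - diag(2)))  # 0
det(refl)                              # -1
```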

Recall the states…

state.pca <- prcomp(state.x77, scale. = TRUE)
signif(state.pca$rotation[, 1:2], 2)

##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

states are locations, PCs are patterns of variables

Each score is spatially distributed

Try it the other way

Turn the data on its side

state.vars.pca <- prcomp(t(scale(state.x77)))  # What's t()?
length(state.vars.pca$sdev)  # Why 8?

## [1] 8

head(signif(state.vars.pca$rotation[, 1:2]), 4)

##                 PC1        PC2
## Alabama  -0.2801370 0.03161830
## Alaska    0.0147876 0.56532600
## Arizona  -0.0700666 0.00872764
## Arkansas -0.1653660 0.03283480

signif(state.vars.pca$x[, 1], 2)

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##      -2.60       2.90      -6.80       4.90      -6.70       4.80       4.30 
##       Area 
##      -0.69

The states turned on their sides…

This is the same as the other map (up to rounding errors)
This is no coincidence (see below)

PCA of \(\X\) vs. PCA of \(\X^T\)

Starting from \(\X\) (an \(n\times p\) matrix)
- Eigenvectors \(\vec{w}_i\), eigenvalues \(\lambda_i\)
- \(\X = \S \w^T\)
- Eigendecomposition of the variance matrix is \(n^{-1} \X^T \X = \w \mathbf{\Lambda} \w^T\) (a \(p\times p\) matrix)
Starting from \(\X^T\) (a \(p\times n\) matrix)
- Eigenvectors \(\vec{u}_i\), eigenvalues \(\psi_i\), diagonal matrix \(\mathbf{\Psi}\)
- Eigendecomposition of this variance is \(p^{-1} \X \X^T = \mathbf{u}\mathbf{\Psi}\mathbf{u}^T\) (an \(n\times n\) matrix)

\[\begin{eqnarray} \mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & p^{-1} \X \X^T\\ \mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & p^{-1} \S \w^T (\S \w^T)^T\\ \mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & p^{-1} \S \w^T \w \S^T\\ \mathbf{u}\mathbf{\Psi}^{1/2} \mathbf{\Psi}^{1/2}\mathbf{u}^T & = & p^{-1/2} \S \w^T \w \S^T p^{-1/2}\\ (\mathbf{u}\mathbf{\Psi}^{1/2}) (\mathbf{u}\mathbf{\Psi}^{1/2})^T & = & p^{-1/2} \S \S^T p^{-1/2}\\ \mathbf{u}\mathbf{\Psi}^{1/2} & = & p^{-1/2} \S ~\text{(one solution)}\\ \mathbf{u} & = & p^{-1/2} \S \mathbf{\Psi}^{-1/2} \end{eqnarray}\]

New PC1 vector \(\propto\) old scores on PC1, etc.
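The proportionality can be checked directly on the states data. This sketch follows the derivation above literally, so it takes the eigenvectors of \(p^{-1}\X\X^T\) without re-centering the transposed matrix (an assumption of the sketch; `prcomp(t(x))` would additionally center each state's row). The names `w`, `s` and `u` mirror the notation of the slides.

```r
# Eigenvectors of X X^T / p versus the scores from the PCA of X
x <- scale(state.x77)
p <- ncol(x)
w <- eigen(crossprod(x) / nrow(x), symmetric = TRUE)$vectors
s <- x %*% w                                              # scores from the PCA of X
u <- eigen(tcrossprod(x) / p, symmetric = TRUE)$vectors   # eigenvectors of X X^T / p
# Proportional vectors have correlation +1 or -1 (the sign of an eigenvector is arbitrary)
abs(cor(u[, 1], s[, 1]))
abs(cor(u[, 2], s[, 2]))
```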

No, really, PCA doesn’t do statistical inference

PCA resembles a statistical model called factor analysis
The factor model is \(\vec{X} = \mathbf{\omega} \vec{F} + \vec{\epsilon}\)
- \(\vec{X}\) is the \(p\)-dimensional random vector we observe
- \(\vec{F}\) is the \(q\)-dimensional random vector of hidden (“latent”) factors or factor scores
  - Usually assume \(q < p\)
  - Usually assume \(\Var{\vec{F}} = \mathbf{I}\)
- \(\mathbf{\omega}\) is a \([p\times q]\) matrix of loadings
- \(\vec{\epsilon}\) is random noise uncorrelated with \(\vec{F}\)
This is a statistical model which can generate new data
It can also make predictions:
- Distribution of new data points
- If \(\vec{X}\) has some missing values, can still estimate \(\vec{F}\) and then use that to predict the unobserved entries in \(\vec{X}\)
  - This is really what recommendation engines do
This looks similar, but PCs \(\neq\) factors
- \(\mathbf{\omega} \neq \mathbf{w}\)
- Scores on PCs aren’t even estimates of factor values
We’ll come back to factor models later, but you could do worse than to read the factor-analysis chapter of Shalizi (n.d.)
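To make "a statistical model which can generate new data" concrete, here is a sketch that simulates from a one-factor model. The loadings `omega`, the common noise standard deviation `noise.sd` and the sample size are all hypothetical values chosen for illustration. The model implies \(\Var{\vec{X}} = \mathbf{\omega}\mathbf{\omega}^T + \Var{\vec{\epsilon}}\), which the sample covariance should roughly match.

```r
# Simulate from a one-factor model X = omega F + epsilon (all parameter values hypothetical)
set.seed(7)
n <- 5000; p <- 4; q <- 1
omega <- matrix(c(0.9, 0.8, 0.7, 0.6), nrow = p)    # loadings, [p x q]
noise.sd <- 0.5
f <- matrix(rnorm(n * q), nrow = n)                 # latent factor scores, Var[F] = I
x <- f %*% t(omega) + matrix(rnorm(n * p, sd = noise.sd), nrow = n)
# Covariance implied by the model vs. sample covariance of the simulated data
implied <- omega %*% t(omega) + diag(noise.sd^2, p)
max(abs(cov(x) - implied))                          # small, up to sampling error
# PC1 of the same data is a unit vector, so it cannot equal the loadings
w1 <- prcomp(x)$rotation[, 1]
rbind(loadings = drop(omega), pc1 = abs(w1))
```

In this example the noise variances are all equal, so PC1 ends up pointing in roughly the same direction as the loadings and differs mainly in scale. With unequal noise variances the directions would generally differ as well.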

Other alternatives to PCA

Slow feature analysis
- PCs of multiple time series are trend-ish
- Trends ought to change slowly
- SFA finds components with high correlation over time
Forecastable component analysis (Goerg 2013)
- Finds highly predictable components

References

Anthony, David W. 2007. The Horse, the Wheel and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World. Princeton: Princeton University Press.

Cavalli-Sforza, Luigi L. 2000. Genes, Peoples, and Languages. New York: North Point Press.

Cavalli-Sforza, Luigi L., Paolo Menozzi, and Alberto Piazza. 1993. “Demic Expansions and Human Evolution.” Science 259:639–46. https://doi.org/10.1126/science.8430313.

———. 1994. The History and Geography of Human Genes. Princeton: Princeton University Press.

Dhillon, Paramveer S., Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. 2013. “A Risk Comparison of Ordinary Least Squares Vs Ridge Regression.” Journal of Machine Learning Research 14:1505–11. http://jmlr.org/papers/v14/dhillon13a.html.

Feuerverger, Andrey, Yu He, and Shashi Khatri. 2012. “Statistical Significance of the Netflix Challenge.” Statistical Science 27:202–31. https://doi.org/10.1214/11-STS368.

Glymour, Clark. 1998. “What Went Wrong? Reflections on Science by Observation and The Bell Curve.” Philosophy of Science 65:1–32. http://www.hss.cmu.edu/philosophy/glymour/glymour1998.pdf.

Goerg, Georg M. 2013. “Forecastable Component Analysis (ForeCA).” In Proceedings of the 30th International Conference on Machine Learning [ICML 2013], edited by Sanjoy Dasgupta and David McAllester, 28:64–72. 2. http://proceedings.mlr.press/v28/goerg13.html.

Novembre, John, and Matthew Stephens. 2008. “Interpreting Principal Component Analyses of Spatial Population Genetic Variation.” Nature Genetics 40:646–49. https://doi.org/10.1038/ng.139.

Shalizi, Cosma Rohilla. n.d. Advanced Data Analysis from an Elementary Point of View. Cambridge, England: Cambridge University Press. http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV.

Stone, James V. 2004. Independent Component Analysis: A Tutorial Introduction. Cambridge, Massachusetts: MIT Press.

Wall, Michael E., Andreas Rechtsteiner, and Luis M. Rocha. 2003. “Singular Value Decomposition and Principal Component Analysis.” In A Practical Approach to Microarray Data Analysis, edited by D. P. Berrar, W. Dubitsky, and M. Granzow, 91–109. Norwell, Massachusetts: Kluwer. https://arxiv.org/abs/physics/0208101.

Zeller, Richard A., and Edward G. Carmines. 1980. Measurement in the Social Sciences: The Link Between Theory and Data. Cambridge, England: Cambridge University Press.

Principal Components Analysis II
