Principal Components Analysis II
36-467/667
11 September 2018
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\dof}{DoF}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our last episode…
- We have multivariate data \(\X\) (dimension = \([n\times p]\))
- We want the best \(q\)-dimensional linear approximation
- Solution: Principal components analysis
- Take the \(q\) leading eigenvectors of \(\V \equiv \frac{1}{n}\X^T \X =\) sample/empirical covariance matrix
- These eigenvectors = the principal components
- Project the data onto the principal components
- Assemble the eigenvectors into \(\w\) (\([p\times q]\))
- Scores for the data are \(\X\w \equiv \S\) (\([n\times q]\))
- Approximations are \((\X\w) \w^T = \S \w^T\) (\([n\times p]\))
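A concrete check on this recap: a minimal sketch in R, doing PCA "by hand" from the eigendecomposition of the covariance matrix and comparing the directions to prcomp(). It uses the built-in state.x77 data, which reappears below; the 1/n vs. 1/(n-1) convention changes the eigenvalues but not the eigenvectors.

x <- scale(state.x77)              # centered (and scaled) data matrix, [n x p]
n <- nrow(x)
v <- (1/n) * t(x) %*% x            # sample covariance matrix
eig <- eigen(v)
q <- 2
w <- eig$vectors[, 1:q]            # leading eigenvectors, [p x q]
s <- x %*% w                       # scores, [n x q]
x.approx <- s %*% t(w)             # rank-q approximation, [n x p]
# Same directions as prcomp, up to the arbitrary sign of each eigenvector:
state.pca <- prcomp(state.x77, scale. = TRUE)
all.equal(abs(w), abs(state.pca$rotation[, 1:q]), check.attributes = FALSE)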
Some properties of the PCs
- The principal components are orthonormal
- \(\vec{w}_i \cdot \vec{w}_i = 1\)
- \(\vec{w}_i \cdot \vec{w}_j = 0\) (unless \(i=j\))
- \(\w^T\w = \mathbf{I}\)
- PC1 is the direction of maximum variance through the data
- That variance is \(\lambda_1\), biggest eigenvalue of \(\V\)
- PC \(i+1\) is the direction of maximum variance \(\perp\) PC1, PC2, \(\ldots\) PC \(i\)
- That variance is \(\lambda_{i+1}\)
Some properties of the eigenvalues
- All eigenvalues \(\geq 0\)
- In general, \(p\) non-zero eigenvalues
- If the data are exactly in a \(q\)-dimensional subspace, then exactly \(q\) non-zero eigenvalues
- If \(n < p\), at most \(n\) non-zero eigenvalues
- Two points define a line, three define a plane, …
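A quick made-up illustration of the last two points: three (centered) points in five dimensions leave only two non-zero eigenvalues.

set.seed(467)                              # hypothetical example: 3 points in 5 dimensions
x.small <- matrix(rnorm(3 * 5), nrow = 3)
x.small <- scale(x.small, scale = FALSE)   # center the columns
v.small <- (1/nrow(x.small)) * t(x.small) %*% x.small
round(eigen(v.small)$values, 10)           # two non-zero eigenvalues: three points define a plane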
Some properties of PC scores
- Average score on each PC \(=0\) (b/c we centered the data)
- Variance of score on PC \(i\) \(=\lambda_i\) (by construction)
- Covariance of score on PC \(i\) with score on PC \(j\) \(=0\)
\[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
& = & \frac{1}{n} (\X\w)^T(\X\w)\\
& = & \w^T \left(\frac{1}{n} \X^T \X\right) \w\\
& = & \w^T \V \w = \w^T \w \mathbf{\Lambda}\\
& = & \mathbf{\Lambda}
\end{eqnarray}\]
(using \(\V\w = \w\mathbf{\Lambda}\) and \(\w^T\w = \mathbf{I}\))
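A minimal numerical check with prcomp() on the state data (prcomp and cov() both use the 1/(n-1) convention, so the diagonal matches sdev^2):

state.pca <- prcomp(state.x77, scale. = TRUE)
round(cov(state.pca$x), 10)    # diagonal: scores on different PCs are uncorrelated
state.pca$sdev^2               # the diagonal entries are the eigenvalues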
Some properties of PCA as a whole
- If we use all \(p\) principal components, we have the eigendecomposition of \(\V\): \[
\V = \w \mathbf{\Lambda} \mathbf{w}^T
\] \(\mathbf{\Lambda}=\) diagonal matrix of eigenvalues \(\lambda_1, \ldots \lambda_p\)
- If we use all \(p\) principal components, \[
\X = \S\w^T
\]
- If we use only the top \(q\) PCs, we get:
- the best rank-\(q\) approximation to \(\V\)
- the best dimension-\(q\) approximation to \(\X\)
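A minimal sketch of both reconstruction claims, again with the (scaled) state data:

state.pca <- prcomp(state.x77, scale. = TRUE)
x <- scale(state.x77)                      # what prcomp actually decomposed
s <- state.pca$x                           # scores on all p PCs
w <- state.pca$rotation                    # all p eigenvectors
all.equal(s %*% t(w), x, check.attributes = FALSE)   # exact recovery with all p PCs
x.approx <- s[, 1:2] %*% t(w[, 1:2])       # best rank-2 approximation
mean((x - x.approx)^2)                     # its mean squared error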
Another way to think about PCA
- The original coordinates are correlated
- There is always another coordinate system with uncorrelated coordinates
- We’re rotating to that coordinate system
- Rotating to new coordinates \(\Rightarrow\) multiplying by an orthogonal matrix
- That matrix is \(\mathbf{w}\)
- The new coordinates are the scores
PCA can be used for any multivariate data
- Nothing in the math of PCA cares about where the data came from
- \(n\) measurements on \(p\) variables is all that matters
- Areas of application:
- Reducing multiple measurements
- Dealing with collinearity in regression
- Recommendation engines (e.g. Netflix)
- Gene expression levels in molecular biology
- Fashion
- … and, of course, spatio-temporal data
PCA with spatial data
- \(n\) locations for \(p\) variables
- We saw this!
- Each PC is a \(p\)-dimensional vector
- Scores are distributed over space
- vs. \(n\) variables at \(p\) locations
- Each PC is a spatial pattern
- One score for each original variable
Recall the states…
state.pca <- prcomp(state.x77, scale. = TRUE)
signif(state.pca$rotation[, 1:2], 2)
## PC1 PC2
## Population 0.130 0.410
## Income -0.300 0.520
## Illiteracy 0.470 0.053
## Life Exp -0.410 -0.082
## Murder 0.440 0.310
## HS Grad -0.420 0.300
## Frost -0.360 -0.150
## Area -0.033 0.590
states are locations, PCs are patterns of variables
Each score is spatially distributed
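(The map itself is not reproduced here. A guess at the mechanics, not the original plotting code: put each state's PC1 score at its geographic center, using the state.center coordinates that ship with R.)

state.pca <- prcomp(state.x77, scale. = TRUE)
pc1.scores <- state.pca$x[, 1]
plot(state.center$x, state.center$y, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(state.center$x, state.center$y, labels = state.abb, cex = 0.8,
     col = ifelse(pc1.scores > 0, "red", "blue"))   # color by sign of the score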
Try it the other way
Turn the data on its side
state.vars.pca <- prcomp(t(scale(state.x77))) # What's t()?
length(state.vars.pca$sdev) # Why 8?
## [1] 8
head(signif(state.vars.pca$rotation[, 1:2]), 2)
## PC1 PC2
## Alabama -0.2801370 0.0316183
## Alaska 0.0147876 0.5653260
signif(state.vars.pca$x[, 1], 2)
## Population Income Illiteracy Life Exp Murder HS Grad
## -2.60 2.90 -6.80 4.90 -6.70 4.80
## Frost Area
## 4.30 -0.69
The states turned on their sides…
Not exactly the same
… but pretty close. No coincidence! (See end)
2nd principal component
A famous example
- For each of \(n\) variant genes (“alleles”):
- Measure prevalence of allele at \(p\) different locations
- Originally by blood tests, now gene sequencing
- Each observation is a fraction
- Accumulated over many thousands of genes and of locations
- Outstanding reference: Cavalli-Sforza, Menozzi, and Piazza (1994)
- or the good-parts version, Cavalli-Sforza (2000)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC1
(\(\approx 35\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC2
(\(\approx 18\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC3
(\(\approx 12\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- green = PC1, blue = PC2, red = PC3
- No sharp breaks…
- This is \(\approx 65\%\) of between-population variance
- \(\approx 90\%\) of variance across people is within populations
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- PC1 for Europe and Southwest Asia
- Agriculture starts in the Fertile Crescent
- Farmers (not just farming) spread
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- PC3 for Europe and Southwest Asia
- Something centered on the steppes north of the Caucasus mountains and the Black Sea
- Archaeology: Domestication of the horse, chariots
- Linguistics: origins of the Indo-European languages
- Most likely: Indo-European speaking barbarians expanding out from the steppes (Anthony 2007)
- Similar patterns for Asia
PCA with multiple time series
- Obvious approach: \(n\) time points for \(p\) variables
- Each PC is a \(p\)-dimensional vector \(\Rightarrow\) a pattern across variables
- Flip perspective: \(n\) variables for \(p\) time-points
- Each PC is a \(n\)-dimensional vector \(\Rightarrow\) a pattern across time
Irish wind data
Irish wind data
## year month day RPT VAL ROS KIL SHA BIR DUB CLA MUL
## 1 61 1 1 15.04 14.96 13.17 9.29 13.96 9.87 13.67 10.25 10.83
## 2 61 1 2 14.71 16.88 10.83 6.50 12.62 7.67 11.50 10.04 9.79
## 3 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 8.04 8.50
## 4 61 1 4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83
## 5 61 1 5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92
## 6 61 1 6 13.21 8.12 9.96 6.67 5.37 4.50 10.67 4.42 7.17
## CLO BEL MAL time
## 1 12.58 18.50 15.04 1961-01-01 12:00:00
## 2 9.67 17.54 13.83 1961-01-02 12:00:00
## 3 7.67 12.75 12.71 1961-01-03 12:00:00
## 4 5.88 5.46 10.88 1961-01-04 12:00:00
## 5 10.34 12.92 11.83 1961-01-05 12:00:00
## 6 7.50 8.12 13.17 1961-01-06 12:00:00
Irish wind data
Irish wind data — one time series
Irish wind data — all the time series
PCA: \(n = 6574\), \(p=12\)
wind.pca.1 <- prcomp(wind[, 4:15])
wind.pca.1$sdev
## [1] 15.149749 4.806761 3.848214 2.840283 2.796445 1.932717 1.809999
## [8] 1.559231 1.408849 1.355770 1.164033 1.079990
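(Not in the original output, but a natural follow-up: the share of variance captured by each component, from the squared sdev values.)

wind.var <- wind.pca.1$sdev^2
signif(cumsum(wind.var) / sum(wind.var), 3)   # cumulative fraction of variance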
PC1: The eigenvector
plot(-wind.pca.1$rotation[, 1], ylim = c(0, 1))
text(1:12, -wind.pca.1$rotation[, 1], pos = 3, labels = colnames(wind)[4:15])
A pattern over space
PC1: The eigenvector
A function of space
PC1: The scores
A function of time
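(The figure is not reproduced here; a minimal sketch of how a scores-versus-time plot could be drawn, using the time column seen in the data above. The sign of a PC is arbitrary, so flip it if needed to match the eigenvector plot.)

plot(wind$time, wind.pca.1$x[, 1], type = "l",
     xlab = "Date", ylab = "Score on PC1")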
Break for in-class exercise:
Describe the first component
PCA with spatio-temporal data
- … we just did this!
- \(n\) spatial locations, each with a time series of length \(p\)
- \(n\) time points for \(p\) locations
- PCs are time series / temporal patterns
- Trends, or components of trends?
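A minimal sketch of the flipped orientation on the wind data (not run in the original slides): transpose so the 12 stations are the rows, and each PC becomes a length-6574 pattern over time.

wind.pca.2 <- prcomp(t(wind[, 4:15]))      # 12 stations as observations
# (as with the states, one could scale() the columns before transposing)
dim(wind.pca.2$rotation)                   # 6574 x 12: each column is a temporal pattern
plot(wind$time, wind.pca.2$rotation[, 1], type = "l",
     xlab = "Date", ylab = "PC1 (temporal pattern)")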
Interpreting PCA results
- PCs are linear combinations of the original coordinates
- \(\Rightarrow\) PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
- Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA
- Smoothing artifacts can be deadly (Novembre and Stephens 2008)
PCA is exploratory analysis, not statistical inference
- We assumed no model
- Pro: That’s the best linear approximation, no matter what
- Con: doesn’t tell us how much our results are driven by noise
- Prediction: PCA predicts nothing
- Inference: If \(\V \rightarrow \Var{X}\), then the PCs \(\rightarrow\) eigenvectors of \(\Var{X}\)
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
- Will see ways to tackle this by simulation
Some alternatives to PCA
- Independent component analysis (ICA)
- PCA analyzes data into uncorrelated components
- But uncorrelated \(\neq\) independent (unless everything’s Gaussian)
- ICA analyzes data into statistically-independent components
- Measure dependence between components, minimize
- Good overview: Stone (2004)
- Nonlinear approximation
- PCA finds low-dimensional linear approximation
- What if things are curved?
- Locally linear embedding, spectral component analysis, …
- A could-do-worse overview: Shalizi (n.d.)
Summing up
- PCA rotates to new, uncorrelated coordinates
- Using the first \(q\) PCs gives the best \(q\)-dimensional approximation to the data
- These are the \(q\) directions of largest variance
- We can make either the basis vectors or the scores into spatio-temporal patterns
- Interpretation needs domain knowledge, and some caution
- No inference or prediction: that comes next
Orthogonal matrices
- Matrix \(\mathbf{o}\) is orthogonal iff \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \(\Leftrightarrow\) the columns of \(\mathbf{o}\) are orthonormal vectors
- Why we say “orthogonal matrix” rather than “orthonormal matrix” is lost in the mists of 19th century German mathematics
- Every rotation (around the origin) corresponds to an orthogonal matrix
- Are there orthogonal matrices which aren’t rotations?
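A minimal numerical check that the PCA rotation matrix is orthogonal in this sense, using the state example:

state.pca <- prcomp(state.x77, scale. = TRUE)
w <- state.pca$rotation
round(t(w) %*% w, 10)                               # the identity matrix
all.equal(t(w), solve(w), check.attributes = FALSE) # transpose = inverse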
PCA of \(\X\) vs. PCA of \(\X^T\)
- Starting from \(\X\):
- Eigenvectors \(\vec{w}_i\), eigenvalues \(\lambda_i\)
- \(\X = \S \w^T\)
- \(n^{-1} \X^T \X = \w \mathbf{\Lambda} \w^T\)
- Starting from \(\X^T\)
- Eigenvectors \(\vec{u}_i\), eigenvalues \(\psi_i\)
- \(n^{-1} \X \X^T = \mathbf{u}\mathbf{\Psi}\mathbf{u}^T\)
PCA of \(\X\) vs. PCA of \(\X^T\)
\[\begin{eqnarray}
\mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & n^{-1} \X \X^T\\
\mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & n^{-1} \S \w^T \w \S^T\\
\mathbf{u}\mathbf{\Psi}^{1/2} \mathbf{\Psi}^{1/2}\mathbf{u}^T & = & n^{-1/2} \S \w^T \w \S^T n^{-1/2}\\
(\mathbf{u}\mathbf{\Psi}^{1/2}) (\mathbf{u}\mathbf{\Psi}^{1/2})^T & = & n^{-1/2} \S \S^T n^{-1/2}\\
\mathbf{u}\mathbf{\Psi}^{1/2} & = & n^{-1/2} \S\\
\mathbf{u} & = & n^{-1/2} \S \mathbf{\Psi}^{-1/2}
\end{eqnarray}\]
New PC1 \(\propto\) old scores on PC1, etc.
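A rough numerical check with the state data (only approximate, because prcomp re-centers the transposed data; this is the "pretty close, no coincidence" from the states-on-their-sides slide):

state.pca <- prcomp(state.x77, scale. = TRUE)
state.vars.pca <- prcomp(t(scale(state.x77)))
cor(state.vars.pca$rotation[, 1], state.pca$x[, 1])  # large in absolute value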
No, really, PCA doesn’t do statistical inference
- PCA is close to a statistical model called factor analysis
- but PCs \(\neq\) factors
- Will see factor models late in the course
Other alternatives to PCA
- Slow feature analysis
- PCs of multiple time series are trend-ish
- Trends ought to change slowly
- SFA finds components with high correlation over time
- Forecastable component analysis (Goerg 2013)
- we’ll come back to this after looking at prediction
References
Anthony, David W. 2007. The Horse, the Wheel and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World. Princeton: Princeton University Press.
Cavalli-Sforza, Luigi L. 2000. Genes, Peoples, and Languages. New York: North Point Press.
Cavalli-Sforza, Luigi L., Paolo Menozzi, and Alberto Piazza. 1994. The History and Geography of Human Genes. Princeton: Princeton University Press.
Goerg, Georg M. 2013. “Forecastable Component Analysis (ForeCA).” In Proceedings of the 30th International Conference on Machine Learning [ICML 2013], edited by Sanjoy Dasgupta and David McAllester, 28 (2): 64–72. http://proceedings.mlr.press/v28/goerg13.html.
Novembre, John, and Matthew Stephens. 2008. “Interpreting Principal Component Analyses of Spatial Population Genetic Variation.” Nature Genetics 40:646–49. https://doi.org/10.1038/ng.139.
Stone, James V. 2004. Independent Component Analysis: A Tutorial Introduction. Cambridge, Massachusetts: MIT Press.