Principal Components Analysis II
36-467/667
11 September 2018
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\dof}{DoF}
\DeclareMathOperator{\det}{det}
\newcommand{\TrueNoise}{\epsilon}
\newcommand{\EstNoise}{\widehat{\TrueNoise}}
\]
In our last episode…
- We have multivariate data \(\X\) (dimension = \([n\times p]\))
- We want the best \(q\)-dimensional linear approximation
- Solution: Principal components analysis
- Take the \(q\) leading eigenvectors of \(\V \equiv \frac{1}{n}\X^T \X =\) sample/empirical covariance matrix
- These eigenvectors = the principal components
- Project the data onto the principal components
- Assemble the eigenvectors into \(\w\) (\([p\times q]\))
- Scores for the data are \(\X\w \equiv \S\) (\([n\times q]\))
- Approximations are \((\X\w) \w^T = \S \w^T\) (\([n\times p]\))
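A concrete check on this recap: a minimal sketch in R, doing PCA "by hand" from the eigendecomposition of the covariance matrix and comparing the directions to prcomp(). It uses the built-in state.x77 data, which reappears below; the 1/n vs. 1/(n-1) convention changes the eigenvalues but not the eigenvectors.

x <- scale(state.x77)              # centered (and scaled) data matrix, [n x p]
n <- nrow(x)
v <- (1/n) * t(x) %*% x            # sample covariance matrix
eig <- eigen(v)
q <- 2
w <- eig$vectors[, 1:q]            # leading eigenvectors, [p x q]
s <- x %*% w                       # scores, [n x q]
x.approx <- s %*% t(w)             # rank-q approximation, [n x p]
# Same directions as prcomp, up to the arbitrary sign of each eigenvector:
state.pca <- prcomp(state.x77, scale. = TRUE)
all.equal(abs(w), abs(state.pca$rotation[, 1:q]), check.attributes = FALSE)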
Some properties of the PCs
- The principal components are orthonormal
- \(\vec{w}_i \cdot \vec{w}_i = 1\)
- \(\vec{w}_i \cdot \vec{w}_j = 0\) (unless \(i=j\))
- \(\w^T\w = \mathbf{I}\)
- PC1 is the direction of maximum variance through the data
- That variance is \(\lambda_1\), biggest eigenvalue of \(\V\)
- PC \(i+1\) is the direction of maximum variance \(\perp\) PC1, PC2, \(\ldots\) PC \(i\)
- That variance is \(\lambda_{i+1}\)
Some properties of the eigenvalues
- All eigenvalues \(\geq 0\)
- In general, \(p\) non-zero eigenvalues
- If the data are exactly in a \(q\)-dimensional subspace, then exactly \(q\) non-zero eigenvalues
- If \(n < p\), at most \(n\) non-zero eigenvalues
- Two points define a line, three define a plane, …
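A quick made-up illustration of the last two points: three (centered) points in five dimensions leave only two non-zero eigenvalues.

set.seed(467)                              # hypothetical example: 3 points in 5 dimensions
x.small <- matrix(rnorm(3 * 5), nrow = 3)
x.small <- scale(x.small, scale = FALSE)   # center the columns
v.small <- (1/nrow(x.small)) * t(x.small) %*% x.small
round(eigen(v.small)$values, 10)           # two non-zero eigenvalues: three points define a plane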
Some properties of PC scores
- Average score on each PC \(=0\) (b/c we centered the data)
- Variance of score on PC \(i\) \(=\lambda_i\) (by construction)
- Covariance of score on PC \(i\) with score on PC \(j\) \(=0\)
\[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
& = & \frac{1}{n} (\X\w)^T(\X\w)\\
& = & \w^T \left(\frac{1}{n} \X^T \X\right) \w\\
& = & \w^T \V \w = \w^T \w \mathbf{\Lambda}\\
& = & \mathbf{\Lambda}
\end{eqnarray}\]
(using \(\V\w = \w\mathbf{\Lambda}\) and \(\w^T\w = \mathbf{I}\))
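A minimal numerical check with prcomp() on the state data (prcomp and cov() both use the 1/(n-1) convention, so the diagonal matches sdev^2):

state.pca <- prcomp(state.x77, scale. = TRUE)
round(cov(state.pca$x), 10)    # diagonal: scores on different PCs are uncorrelated
state.pca$sdev^2               # the diagonal entries are the eigenvalues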
Some properties of PCA as a whole
- If we use all \(p\) principal components, we have the eigendecomposition of \(\V\): \[
\V = \w \mathbf{\Lambda} \mathbf{w}^T
\] \(\mathbf{\Lambda}=\) diagonal matrix of eigenvalues \(\lambda_1, \ldots \lambda_p\)
- If we use all \(p\) principal components, \[
\X = \S\w^T
\]
- If we use only the top \(q\) PCs, we get:
- the best rank-\(q\) approximation to \(\V\)
- the best dimension-\(q\) approximation to \(\X\)
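A minimal sketch of both reconstruction claims, again with the (scaled) state data:

state.pca <- prcomp(state.x77, scale. = TRUE)
x <- scale(state.x77)                      # what prcomp actually decomposed
s <- state.pca$x                           # scores on all p PCs
w <- state.pca$rotation                    # all p eigenvectors
all.equal(s %*% t(w), x, check.attributes = FALSE)   # exact recovery with all p PCs
x.approx <- s[, 1:2] %*% t(w[, 1:2])       # best rank-2 approximation
mean((x - x.approx)^2)                     # its mean squared error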
Another way to think about PCA
- The original coordinates are correlated
- There is always another coordinate system with uncorrelated coordinates
- We’re rotating to that coordinate system
- Rotating to new coordinates \(\Rightarrow\) multiplying by an orthogonal matrix
- That matrix is \(\mathbf{w}\)
- The new coordinates are the scores
PCA can be used for any multivariate data
- Nothing in the math of PCA cares about where the data came from
- \(n\) measurements on \(p\) variables is all that matters
- Areas of application:
- Reducing multiple measurements
- Dealing with collinearity in regression
- Recommendation engines (e.g. Netflix)
- Gene expression levels in molecular biology
- Fashion
- … and, of course, spatio-temporal data
PCA with spatial data
- \(n\) locations for \(p\) variables
- We saw this!
- Each PC is a \(p\)-dimensional vector
- Scores are distributed over space
- vs. \(n\) variables at \(p\) locations
- Each PC is a spatial pattern
- One score for each original variable
Recall the states…
state.pca <- prcomp(state.x77, scale. = TRUE)
signif(state.pca$rotation[, 1:2], 2)
## PC1 PC2
## Population 0.130 0.410
## Income -0.300 0.520
## Illiteracy 0.470 0.053
## Life Exp -0.410 -0.082
## Murder 0.440 0.310
## HS Grad -0.420 0.300
## Frost -0.360 -0.150
## Area -0.033 0.590
states are locations, PCs are patterns of variables
Each score is spatially distributed
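(The map itself is not reproduced here. A guess at the mechanics, not the original plotting code: put each state's PC1 score at its geographic center, using the state.center coordinates that ship with R.)

state.pca <- prcomp(state.x77, scale. = TRUE)
pc1.scores <- state.pca$x[, 1]
plot(state.center$x, state.center$y, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(state.center$x, state.center$y, labels = state.abb, cex = 0.8,
     col = ifelse(pc1.scores > 0, "red", "blue"))   # color by sign of the score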
Try it the other way
Turn the data on its side
state.vars.pca <- prcomp(t(scale(state.x77))) # What's t()?
length(state.vars.pca$sdev) # Why 8?
## [1] 8
head(signif(state.vars.pca$rotation[, 1:2]), 2)
## PC1 PC2
## Alabama -0.2801370 0.0316183
## Alaska 0.0147876 0.5653260
signif(state.vars.pca$x[, 1], 2)
## Population Income Illiteracy Life Exp Murder HS Grad
## -2.60 2.90 -6.80 4.90 -6.70 4.80
## Frost Area
## 4.30 -0.69
The states turned on their sides…
Not exactly the same
… but pretty close. No coincidence! (See end)
2nd principal component
A famous example
- For each of \(n\) variant genes (“alleles”):
- Measure prevalence of allele at \(p\) different locations
- Originally by blood tests, now gene sequencing
- Each observation is a fraction
- Accumulated over many thousands of genes and of locations
- Outstanding reference: Cavalli-Sforza, Menozzi, and Piazza (1994)
- or the good-parts version, Cavalli-Sforza (2000)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC1
(\(\approx 35\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC2
(\(\approx 18\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
World PC3
(\(\approx 12\%\) of between-population variance)
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- green = PC1, blue = PC2, red = PC3
- No sharp breaks…
- This is \(\approx 65\%\) of between-population variance
- \(\approx 90\%\) of variance across people is within populations
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- PC1 for Europe and Southwest Asia
- Agriculture starts in the Fertile Crescent
- Farmers (not just farming) spread
Some maps from Cavalli-Sforza, Menozzi, and Piazza (1994)
- PC3 for Europe and Southwest Asia
- Something centered on the steppes north of the Caucasus mountains and the Black Sea
- Archaeology: Domestication of the horse, chariots
- Linguistics: origins of the Indo-European languages
- Most likely: Indo-European speaking barbarians expanding out from the steppes (Anthony 2007)
- Similar patterns for Asia
PCA with multiple time series
- Obvious approach: \(n\) time points for \(p\) variables
- Each PC is a \(p\)-dimensional vector \(\Rightarrow\) a pattern across variables
- Flip perspective: \(n\) variables for \(p\) time-points
- Each PC is a \(n\)-dimensional vector \(\Rightarrow\) a pattern across time
Irish wind data
Irish wind data
## year month day RPT VAL ROS KIL SHA BIR DUB CLA MUL
## 1 61 1 1 15.04 14.96 13.17 9.29 13.96 9.87 13.67 10.25 10.83
## 2 61 1 2 14.71 16.88 10.83 6.50 12.62 7.67 11.50 10.04 9.79
## 3 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 8.04 8.50
## 4 61 1 4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83
## 5 61 1 5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92
## 6 61 1 6 13.21 8.12 9.96 6.67 5.37 4.50 10.67 4.42 7.17
## CLO BEL MAL time
## 1 12.58 18.50 15.04 1961-01-01 12:00:00
## 2 9.67 17.54 13.83 1961-01-02 12:00:00
## 3 7.67 12.75 12.71 1961-01-03 12:00:00
## 4 5.88 5.46 10.88 1961-01-04 12:00:00
## 5 10.34 12.92 11.83 1961-01-05 12:00:00
## 6 7.50 8.12 13.17 1961-01-06 12:00:00
Irish wind data
Irish wind data — one time series
Irish wind data — all the time series
PCA: \(n = 6574\), \(p=12\)
wind.pca.1 <- prcomp(wind[, 4:15])
wind.pca.1$sdev
## [1] 15.149749 4.806761 3.848214 2.840283 2.796445 1.932717 1.809999
## [8] 1.559231 1.408849 1.355770 1.164033 1.079990
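(Not in the original output, but a natural follow-up: the share of variance captured by each component, from the squared sdev values.)

wind.var <- wind.pca.1$sdev^2
signif(cumsum(wind.var) / sum(wind.var), 3)   # cumulative fraction of variance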
PC1: The eigenvector
plot(-wind.pca.1$rotation[, 1], ylim = c(0, 1))
text(1:12, -wind.pca.1$rotation[, 1], pos = 3, labels = colnames(wind)[4:15])
A pattern over space
PC1: The eigenvector
A function of space
PC1: The scores
A function of time
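(The figure is not reproduced here; a minimal sketch of how a scores-versus-time plot could be drawn, using the time column seen in the data above. The sign of a PC is arbitrary, so flip it if needed to match the eigenvector plot.)

plot(wind$time, wind.pca.1$x[, 1], type = "l",
     xlab = "Date", ylab = "Score on PC1")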
Break for in-class exercise:
Describe the first component
PCA with spatio-temporal data
- … we just did this!
- \(n\) spatial locations, each with a time series of length \(p\)
- \(n\) time points for \(p\) locations
- PCs are time series / temporal patterns
- Trends, or components of trends?
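A minimal sketch of the flipped orientation on the wind data (not run in the original slides): transpose so the 12 stations are the rows, and each PC becomes a length-6574 pattern over time.

wind.pca.2 <- prcomp(t(wind[, 4:15]))      # 12 stations as observations
# (as with the states, one could scale() the columns before transposing)
dim(wind.pca.2$rotation)                   # 6574 x 12: each column is a temporal pattern
plot(wind$time, wind.pca.2$rotation[, 1], type = "l",
     xlab = "Date", ylab = "PC1 (temporal pattern)")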
Interpreting PCA results
- PCs are linear combinations of the original coordinates
- \(\Rightarrow\) PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
- Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA
- Smoothing artifacts can be deadly (Novembre and Stephens 2008)
PCA is exploratory analysis, not statistical inference
- We assumed no model
- Pro: That’s the best linear approximation, no matter what
- Con: doesn’t tell us how much our results are driven by noise
- Prediction: PCA predicts nothing
- Inference: If \(\V \rightarrow \Var{X}\), then the PCs \(\rightarrow\) eigenvectors of \(\Var{X}\)
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
- Will see ways to tackle this by simulation
Some alternatives to PCA
- Independent component analysis (ICA)
- PCA analyzes data into uncorrelated components
- But uncorrelated \(\neq\) independent (unless everything’s Gaussian)
- ICA analyzes data into statistically-independent components
- Measure dependence between components, minimize
- Good overview: Stone (2004)
- Nonlinear approximation
- PCA finds low-dimensional linear approximation
- What if things are curved?
- Locally linear embedding, spectral component analysis, …
- A could-do-worse overview: Shalizi (n.d.)
Summing up
- PCA rotates to new, uncorrelated coordinates
- Using the first \(q\) PCs gives the best \(q\)-dimensional approximation to the data
- These are the \(q\) directions of largest variance
- We can make either the basis vectors or the scores into spatio-temporal patterns
- Interpretation needs domain knowledge, and some caution
- No inference or prediction: that comes next
Orthogonal matrices
- Matrix \(\mathbf{o}\) is orthogonal iff \(\mathbf{o}^T = \mathbf{o}^{-1}\)
- \(\Leftrightarrow\) the columns of \(\mathbf{o}\) are orthonormal vectors
- Why we say “orthogonal matrix” rather than “orthonormal matrix” is lost in the mists of 19th century German mathematics
- Every rotation (around the origin) corresponds to an orthogonal matrix
- Are there orthogonal matrices which aren’t rotations?
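A minimal numerical check that the PCA rotation matrix is orthogonal in this sense, using the state example:

state.pca <- prcomp(state.x77, scale. = TRUE)
w <- state.pca$rotation
round(t(w) %*% w, 10)                               # the identity matrix
all.equal(t(w), solve(w), check.attributes = FALSE) # transpose = inverse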
PCA of \(\X\) vs. PCA of \(\X^T\)
- Starting from \(\X\):
- Eigenvectors \(\vec{w}_i\), eigenvalues \(\lambda_i\)
- \(\X = \S \w^T\)
- \(n^{-1} \X^T \X = \w \mathbf{\Lambda} \w^T\)
- Starting from \(\X^T\)
- Eigenvectors \(\vec{u}_i\), eigenvalues \(\psi_i\)
- \(n^{-1} \X \X^T = \mathbf{u}\mathbf{\Psi}\mathbf{u}^T\)
PCA of \(\X\) vs. PCA of \(\X^T\)
\[\begin{eqnarray}
\mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & n^{-1} \X \X^T\\
\mathbf{u}\mathbf{\Psi}\mathbf{u}^T & = & n^{-1} \S \w^T \w \S^T\\
\mathbf{u}\mathbf{\Psi}^{1/2} \mathbf{\Psi}^{1/2}\mathbf{u}^T & = & n^{-1/2} \S \w^T \w \S^T n^{-1/2}\\
(\mathbf{u}\mathbf{\Psi}^{1/2}) (\mathbf{u}\mathbf{\Psi}^{1/2})^T & = & n^{-1/2} \S \S^T n^{-1/2}\\
\mathbf{u}\mathbf{\Psi}^{1/2} & = & n^{-1/2} \S\\
\mathbf{u} & = & n^{-1/2} \S \mathbf{\Psi}^{-1/2}
\end{eqnarray}\]
New PC1 \(\propto\) old scores on PC1, etc.
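A rough numerical check with the state data (only approximate, because prcomp re-centers the transposed data; this is the "pretty close, no coincidence" from the states-on-their-sides slide):

state.pca <- prcomp(state.x77, scale. = TRUE)
state.vars.pca <- prcomp(t(scale(state.x77)))
cor(state.vars.pca$rotation[, 1], state.pca$x[, 1])  # large in absolute value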
No, really, PCA doesn’t do statistical inference
- PCA is close to a statistical model called factor analysis
- but PCs \(\neq\) factors
- Will see factor models late in the course
Other alternatives to PCA
- Slow feature analysis
- PCs of multiple time series are trend-ish
- Trends ought to change slowly
- SFA finds components with high correlation over time
- Forecastable component analysis (Goerg 2013)
- we’ll come back to this after looking at prediction
References
Anthony, David W. 2007. The Horse, the Wheel and Language: How Bronze-Age Riders from the Eurasian Steppes Shaped the Modern World. Princeton: Princeton University Press.
Cavalli-Sforza, Luigi L. 2000. Genes, Peoples, and Languages. New York: North Point Press.
Cavalli-Sforza, Luigi L., Paolo Menozzi, and Alberto Piazza. 1994. The History and Geography of Human Genes. Princeton: Princeton University Press.
Goerg, Georg M. 2013. “Forecastable Component Analysis (ForeCA).” In Proceedings of the 30th International Conference on Machine Learning [ICML 2013], edited by Sanjoy Dasgupta and David McAllester, 28 (2): 64–72. http://proceedings.mlr.press/v28/goerg13.html.
Novembre, John, and Matthew Stephens. 2008. “Interpreting Principal Component Analyses of Spatial Population Genetic Variation.” Nature Genetics 40:646–49. https://doi.org/10.1038/ng.139.
Stone, James V. 2004. Independent Component Analysis: A Tutorial Introduction. Cambridge, Massachusetts: MIT Press.