Principal Components Analysis

36-467/667

11 September 2019

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Previously…

What the data looks like

Finding the “principal” component

Projections

How well does the projection approximate the original?

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & = & \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \vec{x_i}\cdot\vec{x_i} - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]

(The last step uses the constraint that \(\vec{w}\) is a unit vector, so \(\vec{w}\cdot\vec{w} = 1\).)

How well does the projection approximate the original?

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]
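
As a quick numerical check (not part of the derivation), here is a minimal R sketch on made-up, centered 2D data: the directly computed mean squared residual matches the formula above for an arbitrary unit vector \(\vec{w}\).

## Numerical check of the MSE identity on made-up, centered 2D data
## (the data and the direction w are illustrative, not from the lecture)
set.seed(467)
x <- matrix(rnorm(200), ncol = 2)            # n = 100 points in R^2
x <- scale(x, center = TRUE, scale = FALSE)  # center, as assumed throughout
w <- c(1, 2); w <- w/sqrt(sum(w^2))          # an arbitrary unit vector
proj.len <- x %*% w                          # lengths of projections on w
resid <- x - proj.len %*% t(w)               # x_i - (w . x_i) w
mean(rowSums(resid^2))                       # MSE, computed directly
mean(rowSums(x^2)) - mean(proj.len^2)        # formula from the slide; same number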

Minimizing MSE is maximizing variance

\[\begin{eqnarray} L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] (Why?)

Minimizing MSE is maximizing variance

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance

The direction which gives us the best 1D approximation to the data is the direction along which the projected data have the greatest variance

OK, how do we find this magic direction?

Matrix form: all lengths of projections \(=\mathbf{x}\mathbf{w}\) \([n\times 1]\)

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: \(\V \equiv \frac{\X^T \X}{n} =\) the sample covariance matrix of the (centered) data vectors
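
A one-line sanity check (toy data, not the lecture's): for a centered data matrix, crossprod(x)/n agrees with R's cov(), which uses divisor \(n-1\) rather than \(n\).

set.seed(467)
x <- scale(matrix(rnorm(200), ncol = 2), center = TRUE, scale = FALSE)
v <- crossprod(x)/nrow(x)                    # t(x) %*% x / n
all.equal(v, cov(x) * (nrow(x) - 1)/nrow(x),
          check.attributes = FALSE)          # TRUE: same matrix, different divisor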

OK, how do we find this magic direction?

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\] Setting both derivatives to zero at the optimum gives \[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the maximizer is the leading eigenvector of \(\V\) (the eigenvector with the largest eigenvalue), and the maximized variance is that eigenvalue
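
A hedged R sketch (made-up data, not from the lecture): eigen() on the sample covariance matrix recovers, up to sign, the same leading direction as prcomp(), and the leading eigenvalue is the sample variance of the projections onto it.

set.seed(467)
x <- scale(matrix(rnorm(300), ncol = 3), center = TRUE, scale = FALSE)
v <- crossprod(x)/nrow(x)                         # sample covariance (divisor n)
eig <- eigen(v, symmetric = TRUE)
w1 <- eig$vectors[, 1]                            # leading eigenvector
max(abs(abs(w1) - abs(prcomp(x)$rotation[, 1])))  # ~ 0: same direction, up to sign
eig$values[1]                                     # leading eigenvalue
mean((x %*% w1)^2)                                # = variance of the projections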

About the sample covariance matrix

\(\V\) is a special matrix: symmetric and non-negative definite, so its eigenvalues are all real and \(\geq 0\), and its eigenvectors can be chosen to be orthonormal.

\[\begin{eqnarray} \text{Lead eigenvector of } \V & = & 1^{\mathrm{st}}\text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]

Multiple principal components

Some properties of the eigenvalues

Some properties of the PCs

Some properties of PC scores

\[ \S = \X \w \]

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ \end{eqnarray}\]

Show that this \(=\mathbf{\Lambda}\)

(Hint: \(\V = \w \mathbf{\Lambda} \w^T\))

Some properties of PC scores

\[ \S = \X \w \]

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w = \w^T (\w \mathbf{\Lambda} \w^T) \w\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]
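
The same fact can be checked numerically; a minimal sketch with made-up data (divisor \(n\) throughout, to match the slides, whereas prcomp's sdev uses \(n-1\)):

set.seed(467)
x <- scale(matrix(rnorm(500), ncol = 5), center = TRUE, scale = FALSE)
pca <- prcomp(x)
s <- pca$x                                            # score matrix S = Xw, n x p
round(crossprod(s)/nrow(s), 10)                       # diagonal: the eigenvalues Lambda
eigen(crossprod(x)/nrow(x), symmetric = TRUE)$values  # same numbers down the diagonal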

Some properties of PCA as a whole

Another way to think about PCA

Interpreting PCA results

PCA is exploratory analysis, not statistical inference

In R

Example: The axis of historical complexity

Turchin et al. (2018)

9 variables measuring social complexity, over time, for dozens of locations across the world

Dates from -9600 to 1900

summary(soccomp[, complexities], digits = 2)
##      PolPop       PolTerr          CapPop        levels      government  
##  Min.   :1.4   Min.   :-0.22   Min.   :1.4   Min.   :0.0   Min.   :0.00  
##  1st Qu.:4.2   1st Qu.: 3.65   1st Qu.:3.5   1st Qu.:1.8   1st Qu.:0.24  
##  Median :6.0   Median : 5.18   Median :4.3   Median :3.0   Median :0.62  
##  Mean   :5.5   Mean   : 4.78   Mean   :4.2   Mean   :2.9   Mean   :0.55  
##  3rd Qu.:6.8   3rd Qu.: 5.97   3rd Qu.:5.1   3rd Qu.:4.0   3rd Qu.:0.86  
##  Max.   :8.5   Max.   : 7.40   Max.   :6.3   Max.   :6.6   Max.   :1.00  
##     infrastr       writing         texts          money    
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.34   1st Qu.:0.26   1st Qu.:0.10   1st Qu.:1.8  
##  Median :0.75   Median :0.82   Median :0.93   Median :4.0  
##  Mean   :0.64   Mean   :0.65   Mean   :0.63   Mean   :3.4  
##  3rd Qu.:0.90   3rd Qu.:0.86   3rd Qu.:0.97   3rd Qu.:5.0  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :6.0

Example: The axis of historical complexity

soccomp.pca <- prcomp(soccomp[, complexities], scale = TRUE)

What are the parts of the return value?

str(soccomp.pca)
## List of 5
##  $ sdev    : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
##  $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ scale   : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ x       : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Example: The axis of historical complexity

plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2), xlab = "Number of components used", 
    ylab = expression(R^2), ylim = c(0, 1))
abline(h = 0.75, lty = "dashed")
abline(h = 0.9, lty = "dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)
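
The same proportions can be read off summary(), which tabulates the standard deviation, proportion of variance, and cumulative proportion for each component of the soccomp.pca object fit above:

summary(soccomp.pca)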

Example: The axis of historical complexity

plot(soccomp.pca$rotation[, 1], xlab = "", main = "PCs of the complexity measures", 
    xaxt = "n", ylab = "PC weight", ylim = c(-1, 1), type = "l")
axis(side = 1, at = 1:9, labels = colnames(soccomp)[complexities], las = 2, 
    cex.axis = 0.5)
lines(soccomp.pca$rotation[, 2], col = "red")
lines(soccomp.pca$rotation[, 3], col = "blue")
legend("bottomright", legend = c("PC1", "PC2", "PC3"), lty = "solid", col = c("black", 
    "red", "blue"))
abline(h = 0, col = "grey")

Example: The axis of historical complexity

soccomp$PC1 <- soccomp.pca$x[, 1]
with(soccomp, plot(PC1 ~ Time))

Example: The axis of historical complexity

plot(soccomp.pca$x[, 1], soccomp.pca$x[, 2], xlab = "Score on PC1", ylab = "Score on PC2")

Some more data-mining-y applications

Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)

LSI for the New York Times stories

source("http://www.stat.cmu.edu/~cshalizi/dm/19/hw/01/nytac-and-bow.R")
art.stories <- read.directory("nyt_corpus/art")
music.stories <- read.directory("nyt_corpus/music")
art.BoW.list <- lapply(art.stories, table)
music.BoW.list <- lapply(music.stories, table)
nyt.BoW.frame <- make.BoW.frame(c(art.BoW.list, music.BoW.list), row.names = c(paste("art", 
    1:length(art.BoW.list), sep = "."), paste("music", 1:length(music.BoW.list), 
    sep = ".")))
nyt.nice.frame <- div.by.euc.length(idf.weight(nyt.BoW.frame))
nyt.pca <- prcomp(nyt.nice.frame)

LSI for the New York Times stories

First component:

nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[, 1], decreasing = TRUE)[1:30], 2)
##       music        trio     theater   orchestra   composers       opera 
##       0.110       0.084       0.083       0.067       0.059       0.058 
##    theaters           m    festival        east     program           y 
##       0.055       0.054       0.051       0.049       0.048       0.048 
##      jersey     players   committee      sunday        june     concert 
##       0.047       0.047       0.046       0.045       0.045       0.045 
##    symphony       organ     matinee   misstated instruments           p 
##       0.044       0.044       0.043       0.042       0.041       0.041 
##         X.d       april      samuel        jazz     pianist     society 
##       0.041       0.040       0.040       0.039       0.038       0.038
signif(sort(nyt.latent.sem[, 1], decreasing = FALSE)[1:30], 2)
##       she       her        ms         i      said    mother    cooper 
##    -0.260    -0.240    -0.200    -0.150    -0.130    -0.110    -0.100 
##        my  painting   process paintings        im        he       mrs 
##    -0.094    -0.088    -0.071    -0.070    -0.068    -0.065    -0.065 
##        me  gagosian       was   picasso     image sculpture      baby 
##    -0.063    -0.062    -0.058    -0.057    -0.056    -0.056    -0.055 
##   artists      work    photos       you    nature    studio       out 
##    -0.055    -0.054    -0.051    -0.051    -0.050    -0.050    -0.050 
##      says      like 
##    -0.050    -0.049

LSI for the New York Times stories

Second component:

signif(sort(nyt.latent.sem[, 2], decreasing = TRUE)[1:30], 2)
##         art      museum      images     artists   donations     museums 
##       0.150       0.120       0.095       0.092       0.075       0.073 
##    painting         tax   paintings   sculpture     gallery  sculptures 
##       0.073       0.070       0.065       0.060       0.055       0.051 
##     painted       white    patterns      artist      nature     service 
##       0.050       0.050       0.047       0.047       0.046       0.046 
##  decorative        feet     digital      statue       color    computer 
##       0.043       0.043       0.043       0.042       0.042       0.041 
##       paris         war collections     diamond       stone     dealers 
##       0.041       0.041       0.041       0.041       0.041       0.040
signif(sort(nyt.latent.sem[, 2], decreasing = FALSE)[1:30], 2)
##          her          she      theater        opera           ms 
##       -0.220       -0.220       -0.160       -0.130       -0.130 
##            i         hour   production         sang     festival 
##       -0.083       -0.081       -0.075       -0.075       -0.074 
##        music      musical        songs        vocal    orchestra 
##       -0.070       -0.070       -0.068       -0.067       -0.067 
##           la      singing      matinee  performance         band 
##       -0.065       -0.065       -0.061       -0.061       -0.060 
##       awards    composers         says           my           im 
##       -0.058       -0.058       -0.058       -0.056       -0.056 
##         play     broadway       singer       cooper performances 
##       -0.056       -0.055       -0.052       -0.051       -0.051

Visualization

story.classes <- c(rep("art", times = length(art.stories)), rep("music", times = length(music.stories)))
plot(nyt.pca$x[, 1:2], pch = ifelse(story.classes == "music", "m", "a"), col = ifelse(story.classes == 
    "music", "blue", "red"))

Multidimensional scaling

Netflix

Low-rank approximation

Low-rank approximation

Back to Netflix (Feuerverger, He, and Khatri 2012)

Summing up

Backup: Gory details of PCA in matrix form

Use \(k\) directions in a \(p\times k\) matrix \(\w\)

Require: \(\mathbf{w}^T\mathbf{w} = \mathbf{I}\), the basis vectors are orthonormal

\(\X \w =\) matrix of projection lengths \([n\times k]\)

\(\X \w \w^T =\) matrix of projected vectors \([n\times p]\)

\(\X - \X \w \w^T =\) matrix of vector residuals \([n\times p]\)

\((\X-\X\w\w^T)(\X-\X\w\w^T)^T =\) matrix of inner products of vector residuals \([n\times n]\)

\(\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =\) sum of squared errors \([1\times 1]\)
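
A numerical spot-check of that last identity, with made-up data and an arbitrary choice of \(k\) (here the top two PC directions stand in for a generic orthonormal \(\w\)):

set.seed(467)
x <- scale(matrix(rnorm(400), ncol = 4), center = TRUE, scale = FALSE)
w <- prcomp(x)$rotation[, 1:2]              # any p x k matrix with orthonormal columns
resid <- x - x %*% w %*% t(w)               # matrix of vector residuals, n x p
sum(resid^2)                                # sum of squared errors, directly
sum(diag(resid %*% t(resid)))               # trace form from the slide; identical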

Backup: The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

Since the first term does not involve \(\w\), minimizing the MSE means maximizing \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)

Backup: The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\), and the trace is maximized by taking the eigenvectors with the \(k\) largest eigenvalues.
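
As a check on that conclusion (made-up data; qr() merely supplies a random orthonormal basis for comparison), the top-\(k\) eigenvectors of \(\V\) give a larger value of \(\tr{(\w^T\V\w)}\) than an arbitrary orthonormal \(\w\) of the same size:

set.seed(467)
x <- scale(matrix(rnorm(600), ncol = 6), center = TRUE, scale = FALSE)
v <- crossprod(x)/nrow(x)
k <- 2
w.eig <- eigen(v, symmetric = TRUE)$vectors[, 1:k]   # top-k eigenvectors
w.rand <- qr.Q(qr(matrix(rnorm(6 * k), ncol = k)))   # random orthonormal p x k matrix
sum(diag(t(w.eig) %*% v %*% w.eig))    # = sum of the k largest eigenvalues
sum(diag(t(w.rand) %*% v %*% w.rand))  # smaller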

Backup: The Lagrange multiplier trick

\[ \max_{\w, \lambda}{\mathcal{L}(\w,\lambda)} \]


References

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41:391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Feuerverger, Andrey, Yu He, and Shashi Khatri. 2012. “Statistical Significance of the Netflix Challenge.” Statistical Science 27:202–31. https://doi.org/10.1214/11-STS368.

Landauer, Thomas K., and Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104:211–40. http://lsa.colorado.edu/papers/plato/plato.annote.html.

Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.