\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Reminders

Data is $n$ vectors, each of dimension $p$, stacked into the $[n\times p]$ matrix $\X$
- So $x_{ij} =$ value of feature $j$ in data-point $i$
- Assume we’ve centered the data
The (sample) variance-covariance matrix of the features is $\V = n^{-1}\X^T\X$ (dimension $[p\times p]$)
- So $v_{jk} = n^{-1}\sum_{i=1}^{n}{x_{ij} x_{ik}} =$ sample covariance between feature $j$ and feature $k$
$\V$ has eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p \geq 0$ and eigenvectors $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_p$
- The eigenvectors are orthonormal, $\vec{w}_i \cdot \vec{w}_j = \delta_{ij}$ (where “Kronecker” $\delta_{ij} = 1$ if $i=j$, $=0$ if $i\neq j$)
We can stack the eigenvectors into a $[p \times p]$ matrix $\w$
- Each column is a different eigenvector, and $\w^T\w = \w\w^T = \mathbf{I}$
We can stack the eigenvalues into a diagonal $[p \times p]$ matrix $\mathbf{\Lambda}$
Then $\V = \w \mathbf{\Lambda} \w^T$
The scores are $\S = \X \w =$ projections of each data vector on to each eigenvector
- Variance matrix of the scores $= n^{-1}\S^T \S = \mathbf{\Lambda}$
- Scores on different PCs are uncorrelated
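
These identities can be checked numerically. What follows is a minimal sketch in R, assuming the built-in state.x77 data and standardizing the features (an assumption of this sketch, so that no single feature dominates). The names X, V, W, Lambda and S are just stand-ins for $\X$, $\V$, $\w$, $\mathbf{\Lambda}$ and $\S$.

```r
# Center (and, as an assumption of this sketch, scale) the features
X <- scale(state.x77)
n <- nrow(X)

# Sample variance-covariance matrix, V = n^{-1} X^T X
V <- crossprod(X) / n

# Eigendecomposition; eigen() returns eigenvalues in decreasing order
eig <- eigen(V, symmetric = TRUE)
W <- eig$vectors            # columns are the eigenvectors
Lambda <- diag(eig$values)  # eigenvalues on the diagonal

# Scores: projections of each data vector on to each eigenvector
S <- X %*% W

# Each comparison should be TRUE, up to floating-point error
check.orthonormal <- all.equal(crossprod(W), diag(ncol(X)))
check.decomposition <- all.equal(V, W %*% Lambda %*% t(W),
                                 check.attributes = FALSE)
check.score.variance <- all.equal(crossprod(S) / n, Lambda,
                                  check.attributes = FALSE)
c(check.orthonormal, check.decomposition, check.score.variance)
```

The last check covers both bullets about the scores at once: the diagonal of $n^{-1}\S^T\S$ holds the eigenvalues, and the off-diagonal zeros say that scores on different PCs are uncorrelated.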

USA, $\approx 1977$

Dataset pre-loaded in R:

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Principal components of the USA, $\approx 1977$

state.pca <- prcomp(state.x77,scale.=TRUE)
str(state.pca)

## List of 5
##  $ sdev    : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
##  $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ scale   : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ x       : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The principal component vectors

The weight/loading matrix $\w$ is called rotation in the prcomp output (why?):

signif(state.pca$rotation[,1:2], 2)

##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of $\V$ (computed from the centered and scaled features, since we set scale.=TRUE)
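
One way to check the eigenvector claim is to compare the loadings against a direct eigendecomposition. This is only a sketch: it uses the correlation matrix of the raw features, which for standardized data is proportional to $\V$ (the divisor is $n-1$ rather than $n$) and so has the same eigenvectors. The names eig and max.diff are introduced here for illustration.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# With scale. = TRUE the features are standardized, so the relevant
# matrix is the correlation matrix of the original features
eig <- eigen(cor(state.x77), symmetric = TRUE)

# Eigenvectors are only defined up to sign, so compare absolute values
max.diff <- max(abs(abs(eig$vectors) - abs(state.pca$rotation)))
max.diff  # should be numerically zero
```

The sign ambiguity matters in practice: another routine (or another machine) may flip the sign of a whole column, which flips the sign of every score on that PC without changing the analysis.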

Break for in-class exercise

What kind of state would get a large positive score on the 1st PC, and what kind of state would get a large negative score?
What kind of state would get a large positive score on the 2nd PC, and what kind of state would get a large negative score?

The eigenvalues / variances along each PC

signif(state.pca$sdev, 2)

## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component $=\sqrt{\lambda_i}$

If we keep $k$ components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]

(Denominator $=\tr{\V}$ — why?)
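
The $R^2$ formula is easy to evaluate from the prcomp output. A short sketch, where lambda, r2, total.variance and trace.of.cor are names made up for this example; note that prcomp divides by $n-1$ rather than $n$, which rescales every $\lambda_i$ by the same factor and leaves the ratio unchanged.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Eigenvalues = variances along each PC
# (prcomp divides by n - 1 rather than n; the ratio below is unaffected)
lambda <- state.pca$sdev^2

# R^2 for k = 1, ..., p retained components
r2 <- cumsum(lambda) / sum(lambda)
signif(r2, 2)

# The denominator: total variance, compared with the trace of the
# correlation matrix (the features were standardized)
total.variance <- sum(lambda)
trace.of.cor <- sum(diag(cor(state.x77)))
c(total.variance, trace.of.cor)
```

Because the features were standardized, each contributes variance 1, so both numbers in the last line should equal $p = 8$.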

Scree plot (1)

plot(state.pca$sdev^2, xlab="PC", ylab="Variance", type="b")

Scree plot (2)

plot(cumsum(state.pca$sdev^2), xlab="Number of PCs", ylab="Cumulative variance",
     type="b", ylim=c(0, sum(state.pca$sdev^2)))

Scores on the principal components

signif(state.pca$x[1:10, 1:2], 2)

##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns here are $\vec{x}_i \cdot \vec{w}_1$ and $\vec{x}_i \cdot \vec{w}_2$

So, for instance, \[ s_{\text{Alabama}, 1} = 0.13 x_{\text{Alabama}, \text{Population}} - 0.30 x_{\text{Alabama}, \text{Income}} + \ldots - 0.033 x_{\text{Alabama}, \text{Area}} \]

(after centering and scaling the features)
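
The same calculation can be done by hand for every state at once. A minimal sketch, reusing the centers and scales stored in the prcomp object; Z and scores.by.hand are names introduced for this example.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Center and scale the features the same way prcomp did
Z <- scale(state.x77, center = state.pca$center, scale = state.pca$scale)

# Scores = standardized data times the loading matrix
scores.by.hand <- Z %*% state.pca$rotation

# Alabama's score on PC1, computed two ways
c(by.hand = scores.by.hand["Alabama", "PC1"],
  prcomp  = state.pca$x["Alabama", "PC1"])

# The full score matrices should agree
all.equal(scores.by.hand, state.pca$x, check.attributes = FALSE)
```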

PC1 is kinda southern (1)

signif(head(sort(state.pca$x[,1])),2)

##    Minnesota North Dakota         Iowa         Utah     Nebraska     Colorado 
##         -2.4         -2.4         -2.3         -2.3         -2.2         -2.1

signif(tail(sort(state.pca$x[,1])),2)

## North Carolina        Georgia South Carolina        Alabama    Mississippi 
##            2.7            3.3            3.7            3.8            4.0 
##      Louisiana 
##            4.2

PC1 is kinda southern (2)

size of state abbreviation $\propto$ projection on to PC1
coordinates = state capitols, except for AK and HI

PC1 is kinda southern (3)

(Correlation of PC1 with having been in the Confederacy is 0.8)

PCA + regression

Run PCA on the features in $\mathbf{x}$
Take the top $q$ principal components
Then regress $y$ on the scores on PC1, PC2, $\ldots$, PC$q$
Can lose information, but:
- New features are uncorrelated
- Can be a good way to deal with multicollinearity
- Can be a good way to deal with high-dimensional, $n < p$ problems
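
The recipe above can be sketched in a few lines of R. Everything specific here is an assumption made for illustration, not something from these slides: the response (life expectancy), the choice $q = 3$, and the names y, features, pcr.data, pcr.fit and max.offdiag.

```r
# Hypothetical setup: predict life expectancy from the other seven features
y <- state.x77[, "Life Exp"]
features <- state.x77[, colnames(state.x77) != "Life Exp"]

# Step 1: PCA on the (standardized) features only
pca <- prcomp(features, scale. = TRUE)

# Step 2: keep the scores on the top q components (q = 3 is arbitrary)
q <- 3
pcr.data <- data.frame(y = y, pca$x[, 1:q])

# Step 3: ordinary least squares of y on those scores
pcr.fit <- lm(y ~ ., data = pcr.data)
coef(pcr.fit)

# The new features are uncorrelated: off-diagonal correlations are ~0
max.offdiag <- max(abs(cor(pca$x[, 1:q])[upper.tri(diag(q))]))
max.offdiag
```

Note that the PCA step never looks at $y$, so the top components are the directions of greatest variance in the features, not necessarily the directions most useful for prediction.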

Once More with PCA