Once More with PCA

36-462/662, Spring 2020

27 February 2020

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Reminders

USA, \(\approx 1977\)

Dataset pre-loaded in R:

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Principal components of the USA, \(\approx 1977\)

state.pca <- prcomp(state.x77,scale.=TRUE)
str(state.pca)
## List of 5
##  $ sdev    : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
##  $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ scale   : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
##   ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
##  $ x       : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The principal component vectors

The weight/loading matrix \(\w\) gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of \(\V\)
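A quick way to check this claim, as a sketch: with `scale.=TRUE` the features are standardized, so here we assume \(\V\) is the sample correlation matrix of `state.x77`. The names `state.eig`, `max.diff` and `orth.err` are made up for this illustration. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
# Re-fit the PCA so this chunk is self-contained
state.pca <- prcomp(state.x77, scale. = TRUE)

# Assumption: with standardized features, V is the correlation matrix
state.eig <- eigen(cor(state.x77))

# Eigenvectors are only determined up to sign, so compare absolute values
max.diff <- max(abs(abs(state.eig$vectors) - abs(state.pca$rotation)))
max.diff   # should be numerically zero

# The columns should also be orthonormal: t(w) %*% w = identity
p <- ncol(state.pca$rotation)
orth.err <- max(abs(crossprod(state.pca$rotation) - diag(p)))
orth.err   # should be numerically zero
```

The orthonormality check is one hint towards the "(why?)" above: a matrix with orthonormal columns acts on the feature vectors as a rotation (possibly combined with a reflection).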

Break for in-class exercise

  1. What kind of state would get a large positive score on the 1st PC, and what kind of state would get a large negative score?
  2. What kind of state would get a large positive score on the 2nd PC, and what kind of state would get a large negative score?

The eigenvalues / variances along each PC

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component \(=\sqrt{\lambda_i}\)
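As a sanity check (a sketch, again assuming \(\V\) is the correlation matrix because the features were scaled), the squared `$sdev` should match the eigenvalues, and the sample standard deviation of each column of scores should match `$sdev`. The names `lambda` and `score.sd` are only for illustration.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Eigenvalues of the (assumed) V, in decreasing order
lambda <- eigen(cor(state.x77))$values

# Squared standard deviations along the PCs vs. eigenvalues
all.equal(state.pca$sdev^2, lambda)

# Sample standard deviation of the scores on each PC vs. $sdev
score.sd <- unname(apply(state.pca$x, 2, sd))
all.equal(score.sd, state.pca$sdev)
```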

If we keep \(k\) components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]

(Denominator \(=\tr{\V}\) — why?)
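The \(R^2\) formula can be computed for every \(k\) at once. This is a minimal sketch; `lambda` and `r2` are illustrative names, not part of the `prcomp` object.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Variances along each PC
lambda <- state.pca$sdev^2

# R^2 for k = 1, ..., p: cumulative share of the total variance
r2 <- cumsum(lambda) / sum(lambda)
signif(r2, 2)

# With standardized features, each feature has variance 1,
# so the denominator should equal the number of features p
sum(lambda)
ncol(state.x77)
```

Because the features were standardized, the diagonal of \(\V\) is all 1s, which is why the total comes out equal to \(p\) here; without scaling, it would be the sum of the raw feature variances.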

Scree plot (1)

plot(state.pca$sdev^2, xlab="PC", ylab="Variance", type="b")

Scree plot (2)

plot(cumsum(state.pca$sdev^2), xlab="Number of PCs", ylab="Cumulative variance",
     type="b", ylim=c(0, sum(state.pca$sdev^2)))

Scores on the principal components

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns here are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)

So, for instance, \[ s_{\text{Alabama}, 1} = 0.13 x_{\text{Alabama}, \text{Population}} - 0.30 x_{\text{Alabama}, \text{Income}} + \ldots - 0.033 x_{\text{Alabama}, \text{Area}} \]

(after centering and scaling the features)
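The scores can be rebuilt by hand from the standardized features and the weights. A sketch, where `x.std`, `scores` and `alabama.pc1` are illustrative names; `scale()` with its defaults centers each column and divides by its sample standard deviation, which is what `prcomp` does when `scale.=TRUE`.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Center and scale the features
x.std <- scale(state.x77)

# Scores = standardized features times the weight matrix
scores <- x.std %*% state.pca$rotation
all.equal(scores, state.pca$x, check.attributes = FALSE)

# The single score from the formula above: Alabama on PC1
alabama.pc1 <- sum(x.std["Alabama", ] * state.pca$rotation[, "PC1"])
alabama.pc1
state.pca$x["Alabama", "PC1"]
```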

PC1 is kinda southern (1)

signif(head(sort(state.pca$x[,1])),2)
##    Minnesota North Dakota         Iowa         Utah     Nebraska     Colorado 
##         -2.4         -2.4         -2.3         -2.3         -2.2         -2.1
signif(tail(sort(state.pca$x[,1])),2)
## North Carolina        Georgia South Carolina        Alabama    Mississippi 
##            2.7            3.3            3.7            3.8            4.0 
##      Louisiana 
##            4.2
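One way to probe the "southern" reading beyond the extremes is to average the PC1 scores within Census regions. This is a sketch, not necessarily what the lecture did: it uses R's built-in `state.region` factor, which is in the same state order as the rows of `state.x77`, and `pc1.by.region` is an illustrative name. Recall that the overall sign of a PC is arbitrary, so what matters is which region stands apart from the rest.

```r
state.pca <- prcomp(state.x77, scale. = TRUE)

# Mean score on PC1 within each Census region
pc1.by.region <- tapply(state.pca$x[, 1], state.region, mean)
signif(sort(pc1.by.region), 2)
```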

PC1 is kinda southern (2)

PC1 is kinda southern (3)

PCA + regression