36-467/667
Lecture 5 (15 September 2020)
\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]
Which components?
Can we get the data to tell us what the “right” components are?
Write \(\vec{x}_i\) for row \(i\) (\(1\times p\) matrix)
We also don’t want to distort the data too much
What’s the best \(\vec{w}\)?
Do it for one vector first:
\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \|\vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]using \(\vec{w}\cdot\vec{w}=1\), since \(\vec{w}\) is a unit vector. (This is the Pythagorean theorem!)
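A quick numerical check of this identity (the vectors here are made up purely for illustration; any unit vector \(\vec{w}\) works):
x <- c(3, 4)                  # an arbitrary data vector
w <- c(1, 0)                  # a unit vector, so w . w = 1
sum((x - sum(w * x) * w)^2)   # squared residual length: 16
sum(x^2) - sum(w * x)^2       # ||x||^2 - (w . x)^2 = 25 - 9 = 16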
Add up across all the data vectors:
\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{ {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2}\\ & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]The first sum doesn't involve \(\vec{w}\), so minimizing the MSE means maximizing \(\frac{1}{n}\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\)
(Because \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), applied to the sample distribution, \[ \frac{1}{n}\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2} = {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{w}\cdot\vec{x_i}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \])
But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \]
(Why? Because the data have been centered, so each column of \(\X\) has mean 0.)
Upshot: minimizing the reconstruction error is the same as maximizing \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]
The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance
Matrix form: the vector of all \(n\) projection lengths is \(\X \w\) \([n\times 1]\)
\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w \equiv \w^T \V \w \end{eqnarray}\]where \(\V \equiv \X^T\X/n\) is the sample variance-covariance matrix (remember the data are centered)
Maximize \(\w^T \V \w\) under the constraint \(\w^T\w = 1\): with a Lagrange multiplier \(\lambda\), \[ \mathcal{L}(\w,\lambda) = \w^T \V \w - \lambda(\w^T\w - 1) \] and setting \(\partial \mathcal{L}/\partial \w = 0\) gives \[ 2\V\w - 2\lambda\w = 0 \Leftrightarrow \V\w = \lambda\w \] THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!
At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the variance is maximized by taking \(\w\) to be the leading eigenvector of \(\V\), i.e., the eigenvector with the largest eigenvalue
\(\V\) is a special matrix: symmetric and non-negative definite, so its eigenvalues are all real and \(\geq 0\), and its eigenvectors can be chosen to be orthonormal
head(demo.x)
## [,1] [,2]
## [1,] 0.17907160 -0.05045956
## [2,] -0.03782083 0.11144036
## [3,] 0.47718080 -0.72431165
## [4,] 0.30424556 -0.20026501
## [5,] 0.21790606 -0.09084620
## [6,] -0.14559262 0.16293325
var(demo.x)
## [,1] [,2]
## [1,] 0.07487205 -0.07348313
## [2,] -0.07348313 0.08571690
eigen(var(demo.x))
## eigen() decomposition
## $values
## [1] 0.153977402 0.006611554
##
## $vectors
## [,1] [,2]
## [1,] -0.6805912 -0.7326634
## [2,] 0.7326634 -0.6805912
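As a sanity check (a sketch using the same demo.x as above): prcomp should recover the same directions, up to sign, and its squared standard deviations should match the eigenvalues.
demo.pca <- prcomp(demo.x)   # centering is on by default
demo.pca$rotation            # columns match eigen(var(demo.x))$vectors, up to sign
demo.pca$sdev^2              # matches eigen(var(demo.x))$values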
prcomp is the best PCA command
Dataset pre-loaded in R:
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
state.pca <- prcomp(state.x77, scale. = TRUE)
str(state.pca)
## List of 5
## $ sdev : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
## $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
## ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## $ scale : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
## ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## $ x : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
The weight/loading matrix \(\w\) gets called $rotation (why?):
signif(state.pca$rotation[, 1:2], 2)
## PC1 PC2
## Population 0.130 0.410
## Income -0.300 0.520
## Illiteracy 0.470 0.053
## Life Exp -0.410 -0.082
## Murder 0.440 0.310
## HS Grad -0.420 0.300
## Frost -0.360 -0.150
## Area -0.033 0.590
Each column is an eigenvector of \(\V\) (here \(\V\) is the correlation matrix of the raw variables, since we set scale. = TRUE)
signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34
Standard deviations along each principal component \(=\sqrt{\lambda_i}\)
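To see the correspondence directly (a sketch; with scale. = TRUE, \(\V\) is the correlation matrix of the raw data):
signif(state.pca$sdev^2, 2)              # variances along the components
signif(eigen(cor(state.x77))$values, 2)  # eigenvalues of the correlation matrix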
If we keep \(k\) components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]
(Denominator \(=\tr{\V}\) [why?])
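In R, this \(R^2\) comes straight from the squared sdev values; summary() on a prcomp object reports the same proportions:
lambda <- state.pca$sdev^2
signif(cumsum(lambda) / sum(lambda), 2)  # R^2 from keeping k = 1, ..., p components
summary(state.pca)                       # the "Cumulative Proportion" row shows the same numbers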
signif(state.pca$x[1:10, 1:2], 2)
## PC1 PC2
## Alabama 3.80 -0.23
## Alaska -1.10 5.50
## Arizona 0.87 0.75
## Arkansas 2.40 -1.30
## California 0.24 3.50
## Colorado -2.10 0.51
## Connecticut -1.90 -0.24
## Delaware -0.42 -0.51
## Florida 1.20 1.10
## Georgia 3.30 0.11
Columns are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)
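A sketch of that claim in code: the scores in $x are just the centered-and-scaled data times the loading matrix.
Z <- scale(state.x77, center = state.pca$center, scale = state.pca$scale)
all.equal(unname(Z %*% state.pca$rotation), unname(state.pca$x))  # should be TRUE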
Use \(k\) directions in a \(p\times k\) matrix \(\w\)
Require \(\mathbf{w}^T\mathbf{w} = \mathbf{I}\), i.e., the basis vectors (the columns of \(\w\)) are orthonormal
\(\X \w =\) matrix of projection lengths \([n\times k]\)
\(\X \w \w^T =\) matrix of projected vectors \([n\times p]\)
\(\X - \X \w \w^T =\) matrix of vector residuals \([n\times p]\)
\((\X-\X\w\w^T)(\X-\X\w\w^T)^T =\) matrix of inner products of vector residuals \([n\times n]\)
\(\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =\) sum of squared errors \([1\times 1]\)
Expanding, the sum of squared errors is \(\tr{(\X\X^T)} - \tr{(\X\w\w^T\X^T)}\) (using \(\w^T\w=\mathbf{I}\)), and the first term doesn't involve \(\w\), so minimize the error by maximizing \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)
“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]
This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
Formally: introduce one Lagrange multiplier per constraint and solve \[ \max_{\w, \lambda}{\mathcal{L}(\w,\lambda)} \] just as in the one-direction case
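A sketch tying this back to the state data, keeping \(k=2\) components (note prcomp uses the \(n-1\) convention for variances, so the check divides by \(n-1\) rather than \(n\); the variable names are just for illustration):
Z <- scale(state.x77, center = state.pca$center, scale = state.pca$scale)
w <- state.pca$rotation[, 1:2]    # p x k matrix of the first two loadings
resid <- Z - Z %*% w %*% t(w)     # matrix of vector residuals, n x p
sum(resid^2) / (nrow(Z) - 1)      # mean squared reconstruction error
sum(state.pca$sdev[-(1:2)]^2)     # equals the sum of the discarded eigenvalues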