36-467/667
6 September 2018
\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]
Which components?
Can we get the data to tell us what the “right” components are?
Write \(\vec{x}_i\) for row \(i\) (\(1\times p\) matrix)
We also don’t want to distort the data too much
What’s the best \(\vec{w}\)?
\(\vec{x}_i \cdot \vec{w} =\) length of \(\vec{x}_i\)’s projection on to the direction of \(\vec{w}\)
\((\vec{x}_i \cdot \vec{w})\vec{w} =\) the actual projected vector
Do it for one vector first:
\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & {\|\vec{x_i}\|}^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]
(using \(\vec{w}\cdot\vec{w}=1\) in the last step)
Add up across all the data vectors:
\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]
The first sum doesn't involve \(\vec{w}\), so minimizing the MSE means maximizing \[ L(\vec{w}) \equiv \frac{1}{n}\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2} \] and since \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), \[ L(\vec{w}) = {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \]
But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] Why?
so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]
The direction which gives us the best approximation of the data is the direction with the greatest variance
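A minimal numerical check of the algebra above (a sketch, not from the slides; it assumes the data have been centered, which is why the sample mean of the projections is zero):

```r
# Sketch: MSE(w) = mean ||x_i||^2 - mean (w . x_i)^2, and for centered data the
# second term is exactly the sample variance of the projections
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)
X <- scale(X, center = TRUE, scale = FALSE)   # center each column
w <- c(2, -1, 1); w <- w / sqrt(sum(w^2))     # an arbitrary unit direction
proj <- X %*% w                               # projection lengths, n x 1
resid <- X - proj %*% t(w)                    # vector residuals
mean(rowSums(resid^2))                        # MSE(w)
mean(rowSums(X^2)) - mean(proj^2)             # same number, by the algebra above
mean(proj)                                    # ~ 0, because the columns are centered
mean(proj^2)                                  # hence = the (1/n) sample variance of w . x_i
```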
Matrix form: the vector of all projection lengths is \(\X\w\) \([n\times 1]\)
\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]
Fact: \(\V \equiv \frac{\X^T \X}{n} =\) the sample covariance matrix of the vectors (because the data are centered)
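Continuing the sketch above, we can check this identification of \(\V\) with the sample covariance matrix (R's cov() divides by \(n-1\) rather than \(n\), hence the correction factor):

```r
# Sketch: for centered X, t(X) %*% X / n is the (1/n) sample covariance matrix,
# and w^T V w is the mean-square projection on w
n <- nrow(X)                        # X, w, proj from the previous sketch
V <- crossprod(X) / n               # t(X) %*% X / n
max(abs(V - cov(X) * (n - 1) / n))  # ~ 0
drop(t(w) %*% V %*% w)              # w^T V w ...
mean(proj^2)                        # ... equals the (1/n) variance of the projections
```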
We need to maximize \[\begin{equation} \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w \end{equation}\] Constraint: \(\vec{w}\) has length 1 \(\Leftrightarrow\) \(\w^T \w = 1\)
Add a Lagrange multiplier \(\lambda\)
\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\]Set to zero:
\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!
At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the variance is maximized by taking \(\w\) to be the leading eigenvector of \(\V\), the one with the largest eigenvalue \(\lambda\)
\(\V\) is a special matrix: symmetric and non-negative definite, so it has \(p\) real, non-negative eigenvalues and \(p\) orthonormal eigenvectors
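A quick check of that eigen-story, using the simulated \(\V\) from the sketches above (again, not the slides' code):

```r
# Sketch: the eigenvalues of V are non-negative, and the leading eigenvector
# beats randomly chosen unit directions at maximizing w^T V w
eV <- eigen(V)
eV$values                                   # all >= 0
w1 <- eV$vectors[, 1]                       # leading eigenvector
drop(t(w1) %*% V %*% w1)                    # equals eV$values[1]
rand.vals <- replicate(1000, {
  u <- rnorm(ncol(V)); u <- u / sqrt(sum(u^2))   # a random unit vector
  drop(t(u) %*% V %*% u)
})
max(rand.vals) <= eV$values[1]              # TRUE
```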
prcomp is the best PCA command
Dataset pre-loaded in R:
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
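The values above and below match R's built-in state.x77 matrix; the call that produced the output isn't shown on the slides, but it is consistent with something like the following, where the object name state.pca and the scale. = TRUE argument are inferences (the latter from the standard deviations reported further down):

```r
# Hypothetical reconstruction of the analysis, not the slide's own code
state.pca <- prcomp(state.x77, scale. = TRUE)
# The next slides show (rounded) pieces of state.pca: $rotation, $sdev, and $x
```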
The weight/loading matrix \(\w\) gets called $rotation (why?):
## PC1 PC2
## Population 0.130 0.410
## Income -0.300 0.520
## Illiteracy 0.470 0.053
## Life Exp -0.410 -0.082
## Murder 0.440 0.310
## HS Grad -0.420 0.300
## Frost -0.360 -0.150
## Area -0.033 0.590
Each column is an eigenvector of \(\V\)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34
Standard deviations along each principal component \(=\sqrt{\lambda_i}\)
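To connect this back to the eigenproblem, a sketch using the hypothetical state.pca object from above: with scaled variables, \(\V\) is the correlation matrix, and prcomp's output matches its eigendecomposition.

```r
# Sketch: sdev = sqrt(eigenvalues), rotation columns = eigenvectors (up to sign)
eS <- eigen(cor(state.x77))
all.equal(state.pca$sdev, sqrt(eS$values))            # TRUE
max(abs(abs(state.pca$rotation) - abs(eS$vectors)))   # ~ 0; eigenvector signs are arbitrary
```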
If we keep \(k\) components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]
(Denominator \(=\tr{\V}\) — why?)
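In R, still using the hypothetical state.pca object, this is just a cumulative sum of squared standard deviations:

```r
# Sketch: R^2 from keeping k components; lambda_i = sdev_i^2, and the
# denominator is tr(V) = p = 8 here because the variables were scaled
lambda <- state.pca$sdev^2
cumsum(lambda) / sum(lambda)   # R^2 for k = 1, ..., p
```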
## PC1 PC2
## Alabama 3.80 -0.23
## Alaska -1.10 5.50
## Arizona 0.87 0.75
## Arkansas 2.40 -1.30
## California 0.24 3.50
## Colorado -2.10 0.51
## Connecticut -1.90 -0.24
## Delaware -0.42 -0.51
## Florida 1.20 1.10
## Georgia 3.30 0.11
Columns are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)
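A sketch confirming that reading of the scores (again with the hypothetical state.pca):

```r
# Sketch: the scores in $x are the projection lengths of the centered, scaled data
scores <- scale(state.x77) %*% state.pca$rotation
all.equal(scores, state.pca$x, check.attributes = FALSE)   # TRUE
signif(scores[1:10, 1:2], 2)   # compare to the table above (eigenvector signs can flip)
```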
size of state abbreviation \(\propto\) projection onto PC1
coordinates = state capitals, except for AK and HI
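The figure itself isn't reproduced here; a rough sketch of something similar, using R's built-in state.center (approximate geographic centers rather than capitals) and rescaling the PC1 scores to positive values so they can serve as text sizes:

```r
# Rough sketch of the map figure, not the slide's own code
pc1 <- state.pca$x[, 1]
sizes <- 0.5 + 1.5 * (pc1 - min(pc1)) / diff(range(pc1))   # map scores into [0.5, 2]
plot(state.center$x, state.center$y, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(state.center$x, state.center$y, labels = state.abb, cex = sizes)
```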
Use \(k\) directions in a \(p\times k\) matrix \(\w\)
Require: \(\mathbf{w}^T\mathbf{w} = \mathbf{I}\), the basis vectors are orthonormal
\(\X \w =\) matrix of projection lengths \([n\times k]\)
\(\X \w \w^T =\) matrix of projected vectors \([n\times p]\)
\(\X - \X \w \w^T =\) matrix of vector residuals \([n\times p]\)
\((\X-\X\w\w^T)(\X-\X\w\w^T)^T =\) matrix of inner products of vector residuals \([n\times n]\)
\(\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =\) sum of squared errors \([1\times 1]\)
Expanding the product and using \(\w^T\w = \mathbf{I}\), \[ \tr{\left((\X-\X\w\w^T)(\X-\X\w\w^T)^T\right)} = \tr{\left(\X\X^T\right)} - \tr{\left(\X\w\w^T\X^T\right)} \] The first term doesn't involve \(\w\), so minimizing the error means we maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)
“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]
This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
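A last sketch (not from the slides) checking that conclusion numerically: the top-\(k\) eigenvectors should reconstruct the data at least as well as any other orthonormal basis, e.g. a random one.

```r
# Sketch: compare reconstruction error for the top-k eigenvectors of V
# against a random orthonormal k-dimensional basis
k <- 2
sse <- function(W, X) sum((X - X %*% W %*% t(W))^2)        # = tr of the residual crossprod
Z <- scale(state.x77)                                      # centered, scaled data
W.pca  <- eigen(cor(state.x77))$vectors[, 1:k]
W.rand <- qr.Q(qr(matrix(rnorm(ncol(Z) * k), ncol = k)))   # random orthonormal columns
sse(W.pca, Z) <= sse(W.rand, Z)                            # TRUE
```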
\[ \max_{\w, \lambda}{\mathcal{L}(\w,\lambda)} \]