Principal Components Analysis I

36-467/667

6 September 2018

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]

In our last episode…

What the data looks like

Finding a “principal” component

Projections

How well does the projection approximate the original?

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & {\|\vec{x_i}\|}^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]

(The last step uses \(\vec{w}\cdot\vec{w} = \|\vec{w}\|^2 = 1\), since \(\vec{w}\) is a unit vector.)

How well does the projection approximate the original?

Add up across all the data vectors:

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]
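A quick numerical check of this formula, as a minimal sketch in R (it uses the built-in state.x77 data that appears later in these notes, centered as the lecture assumes; the direction is an arbitrary unit vector, nothing special):

X <- scale(state.x77, center = TRUE, scale = FALSE)  # center each variable
w <- rep(1, ncol(X)); w <- w / sqrt(sum(w^2))        # an arbitrary unit vector
lengths <- X %*% w                                   # projection lengths, n x 1
resids <- X - lengths %*% t(w)                       # x_i - (w . x_i) w, row by row
mse.direct <- mean(rowSums(resids^2))                # average squared residual length
mse.formula <- mean(rowSums(X^2)) - mean(lengths^2)  # the formula above
all.equal(mse.direct, mse.formula)                   # should be TRUE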

Minimizing MSE is maximizing variance

\[\begin{eqnarray} L(w) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] Why? (Because the data have been centered, so the average of the \(\vec{x_i}\) is the zero vector, and the average projection is the projection of the average.)

Minimizing MSE is maximizing variance

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction with the greatest variance

OK, how do we find this magic direction?

Matrix form: the vector of all the projection lengths is \(\X\w\) \([n\times 1]\)

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: \(\V \equiv \frac{\X^T \X}{n}\) is the sample covariance matrix of the data vectors (because they have been centered)
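A small check of this fact in R (a sketch, assuming centered data; note that R's built-in cov() divides by \(n-1\) rather than \(n\), so the two agree only after rescaling):

X <- scale(state.x77, center = TRUE, scale = FALSE)  # centered data matrix
n <- nrow(X)
V <- crossprod(X) / n                                # X^T X / n, as defined here
all.equal(V, cov(X) * (n - 1) / n, check.attributes = FALSE)  # should be TRUE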

OK, how do we find this magic direction?

We need to maximize \[\begin{equation} \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w \end{equation}\] Constraint: \(\vec{w}\) has length 1 \(\Leftrightarrow\) \(\w^T \w = 1\)

Add a Lagrange multiplier \(\lambda\)

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & \w^T \w -1\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\]

Set to zero:

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the variance is maximized by taking \(\w\) to be the leading eigenvector of \(\V\), the one whose eigenvalue \(\lambda\) is largest
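A sketch of this in R, using eigen() on the \(\V\) built from the centered state.x77 data (eigen() returns eigenvalues in decreasing order, so column 1 of $vectors is the leading eigenvector):

X <- scale(state.x77, center = TRUE, scale = FALSE)
V <- crossprod(X) / nrow(X)
eig <- eigen(V, symmetric = TRUE)
w1 <- eig$vectors[, 1]                           # leading eigenvector, already unit length
all.equal(mean((X %*% w1)^2), eig$values[1])     # its projection variance = largest eigenvalue
w.rand <- rnorm(ncol(X)); w.rand <- w.rand / sqrt(sum(w.rand^2))
mean((X %*% w.rand)^2) <= eig$values[1]          # no other unit vector does better: TRUE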

About the sample covariance matrix

\(\V\) is a special matrix: it is symmetric and non-negative definite, so all of its eigenvalues are real and \(\geq 0\), and its eigenvectors can be chosen to be orthonormal

\[\begin{eqnarray} \text{Lead eigenvector of } \V & = & 1^{\mathrm{st}} \text{ principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]

Multiple principal components

In R

USA, \(\approx 1977\)

Dataset pre-loaded in R:

head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

Principal components of the USA, \(\approx 1977\)

state.pca <- prcomp(state.x77, scale. = TRUE)

Principal components of the USA, \(\approx 1977\)

The weight/loading matrix \(\w\) gets called $rotation (why?):

signif(state.pca$rotation[,1:2], 2)
##               PC1    PC2
## Population  0.130  0.410
## Income     -0.300  0.520
## Illiteracy  0.470  0.053
## Life Exp   -0.410 -0.082
## Murder      0.440  0.310
## HS Grad    -0.420  0.300
## Frost      -0.360 -0.150
## Area       -0.033  0.590

Each column is an eigenvector of \(\V\) (here \(\V\) is the correlation matrix of the variables, because we set scale.=TRUE)
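We can check that directly (a sketch; eigenvectors are only determined up to sign, so individual columns may come out flipped):

eig <- eigen(cor(state.x77), symmetric = TRUE)
all.equal(abs(eig$vectors), abs(unname(state.pca$rotation)))  # TRUE, up to sign and rounding
all.equal(eig$values, state.pca$sdev^2)                       # eigenvalues = squared sdev's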

Break for in-class exercise

Principal components of the USA, \(\approx 1977\)

signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34

Standard deviations along each principal component \(=\sqrt{\lambda_i}\)

If we keep \(k\) components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]

(Denominator \(=\tr{\V}\) — why?)
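In R, the eigenvalues are the squares of $sdev, so the \(R^2\) for each choice of \(k\) is just a cumulative ratio (summary(state.pca) reports the same quantities):

lambda <- state.pca$sdev^2            # eigenvalues of V
cumsum(lambda) / sum(lambda)          # R^2 for k = 1, 2, ..., p
# summary(state.pca) shows these as the "Cumulative Proportion" row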

Principal components of the USA, \(\approx 1977\)

signif(state.pca$x[1:10, 1:2], 2)
##               PC1   PC2
## Alabama      3.80 -0.23
## Alaska      -1.10  5.50
## Arizona      0.87  0.75
## Arkansas     2.40 -1.30
## California   0.24  3.50
## Colorado    -2.10  0.51
## Connecticut -1.90 -0.24
## Delaware    -0.42 -0.51
## Florida      1.20  1.10
## Georgia      3.30  0.11

Columns are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)
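Equivalently, the scores are the standardized data matrix times the loading matrix; here is a quick sketch of the check (scale() reproduces the centering and scaling prcomp() did internally because of scale.=TRUE):

scores <- scale(state.x77) %*% state.pca$rotation  # X w: all projection lengths at once
all.equal(scores, state.pca$x)                     # should be TRUE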

PC1 is kinda southern

PC1 is kinda the legacy of slavery

Summing up

The best \(k\)-dimensional approximation to the data comes from projecting on to the \(k\) directions of greatest variance

Those directions are the leading eigenvectors of the sample covariance matrix \(\V\); the corresponding eigenvalues are the variances along them

In R: prcomp()

The gory details

Use \(k\) directions, collected as the columns of a \(p\times k\) matrix \(\w\)

Require \(\w^T\w = \mathbf{I}\), i.e., the basis vectors are orthonormal

\(\X \w =\) matrix of projection lengths \([n\times k]\)

\(\X \w \w^T =\) matrix of projected vectors \([n\times p]\)

\(\X - \X \w \w^T =\) matrix of vector residuals \([n\times p]\)

\((\X-\X\w\w^T)(\X-\X\w\w^T)^T =\) matrix of inner products of vector residuals \([n\times n]\)

\(\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =\) sum of squared errors \([1\times 1]\)
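Here is a sketch of those pieces in R, taking \(k = 2\) and re-using the loadings from state.pca above (so \(\w\) is \(p \times 2\) and \(\X\) is the standardized data matrix; the last line anticipates the simplification on the next slide):

X <- scale(state.x77)                       # n x p standardized data matrix
w <- state.pca$rotation[, 1:2]              # p x k matrix of directions
lengths <- X %*% w                          # n x k matrix of projection lengths
projections <- lengths %*% t(w)             # n x p matrix of projected vectors
resids <- X - projections                   # n x p matrix of vector residuals
mse <- sum(diag(resids %*% t(resids))) / nrow(X)   # trace of the n x n matrix, over n
all.equal(mse, mean(rowSums(X^2)) - mean(rowSums(lengths^2)))  # should be TRUE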

The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)

The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
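A last sketch of a check in R: with the first \(k\) eigenvectors as the columns of \(\w\), the maximized trace is the sum of the \(k\) largest eigenvalues (here \(\V\) is the correlation matrix, since we used scale.=TRUE, and prcomp() divides by \(n-1\) rather than \(n\)):

k <- 2
w <- state.pca$rotation[, 1:k]            # first k eigenvectors of V
V <- cor(state.x77)                       # plays the role of V for the scaled data
all.equal(sum(diag(t(w) %*% V %*% w)), sum(state.pca$sdev[1:k]^2))  # should be TRUE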

The Lagrange multiplier trick

To maximize \(f(\w)\) subject to a constraint \(g(\w) = c\), introduce a new variable \(\lambda\) (the multiplier), form \(\mathcal{L}(\w,\lambda) \equiv f(\w) - \lambda(g(\w) - c)\), and look for the stationary points:

\[ \max_{w, \lambda}{\mathcal{L}(w,\lambda)} \]

Setting \(\partial \mathcal{L}/\partial \lambda = 0\) recovers the constraint, and setting \(\partial \mathcal{L}/\partial \w = 0\) gives the optimum along the constraint surface, which is exactly the calculation we did above for maximizing \(\w^T \V \w\) with \(\w^T\w = 1\).