36-467/667
Lecture 5 (15 September 2020)
\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \]
Which components?
Can we get the data to tell us what the “right” components are?
Write \(\vec{x}_i\) for row \(i\) (\(1\times p\) matrix)
We also don’t want to distort the data too much
What’s the best \(\vec{w}\)?
Do it for one vector first:
\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \|\vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]using \(\vec{w}\cdot\vec{w}=1\), since \(\vec{w}\) is a unit vector. (This is the Pythagorean theorem!)
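A quick numerical check of this identity (the vectors here are made up purely for illustration; any unit vector \(\vec{w}\) works):
x <- c(3, 4)                  # an arbitrary data vector
w <- c(1, 0)                  # a unit vector, so w . w = 1
sum((x - sum(w * x) * w)^2)   # squared residual length: 16
sum(x^2) - sum(w * x)^2       # ||x||^2 - (w . x)^2 = 25 - 9 = 16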
Add up across all the data vectors:
\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{ {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2}\\ & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]The first sum doesn't involve \(\vec{w}\), so minimizing the MSE means maximizing \(\frac{1}{n}\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\)
(Because \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), applied to the sample distribution, \[ \frac{1}{n}\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2} = {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{w}\cdot\vec{x_i}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \])
But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \]
(Why? Because the data have been centered, so each column of \(\X\) has mean 0.)
Upshot: minimizing the reconstruction error is the same as maximizing \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]
The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance
Matrix form: the vector of all \(n\) projection lengths is \(\X \w\) \([n\times 1]\)
\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w \equiv \w^T \V \w \end{eqnarray}\]where \(\V \equiv \X^T\X/n\) is the sample variance-covariance matrix (remember the data are centered)
Maximize \(\w^T \V \w\) under the constraint \(\w^T\w = 1\): with a Lagrange multiplier \(\lambda\), \[ \mathcal{L}(\w,\lambda) = \w^T \V \w - \lambda(\w^T\w - 1) \] and setting \(\partial \mathcal{L}/\partial \w = 0\) gives \[ 2\V\w - 2\lambda\w = 0 \Leftrightarrow \V\w = \lambda\w \] THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!
At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the variance is maximized by taking \(\w\) to be the leading eigenvector of \(\V\), i.e., the eigenvector with the largest eigenvalue
\(\V\) is a special matrix: symmetric and non-negative definite, so its eigenvalues are all real and \(\geq 0\), and its eigenvectors can be chosen to be orthonormal
head(demo.x)
## [,1] [,2]
## [1,] 0.17907160 -0.05045956
## [2,] -0.03782083 0.11144036
## [3,] 0.47718080 -0.72431165
## [4,] 0.30424556 -0.20026501
## [5,] 0.21790606 -0.09084620
## [6,] -0.14559262 0.16293325
var(demo.x)
## [,1] [,2]
## [1,] 0.07487205 -0.07348313
## [2,] -0.07348313 0.08571690
eigen(var(demo.x))
## eigen() decomposition
## $values
## [1] 0.153977402 0.006611554
##
## $vectors
## [,1] [,2]
## [1,] -0.6805912 -0.7326634
## [2,] 0.7326634 -0.6805912
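As a sanity check (a sketch using the same demo.x as above): prcomp should recover the same directions, up to sign, and its squared standard deviations should match the eigenvalues.
demo.pca <- prcomp(demo.x)   # centering is on by default
demo.pca$rotation            # columns match eigen(var(demo.x))$vectors, up to sign
demo.pca$sdev^2              # matches eigen(var(demo.x))$values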
prcomp is the best PCA command
Dataset pre-loaded in R:
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
state.pca <- prcomp(state.x77, scale. = TRUE)
str(state.pca)
## List of 5
## $ sdev : num [1:8] 1.897 1.277 1.054 0.841 0.62 ...
## $ rotation: num [1:8, 1:8] 0.126 -0.299 0.468 -0.412 0.444 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:8] 4246.42 4435.8 1.17 70.88 7.38 ...
## ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## $ scale : Named num [1:8] 4464.49 614.47 0.61 1.34 3.69 ...
## ..- attr(*, "names")= chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
## $ x : num [1:50, 1:8] 3.79 -1.053 0.867 2.382 0.241 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## .. ..$ : chr [1:8] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
The weight/loading matrix \(\w\) gets called $rotation (why?):
signif(state.pca$rotation[, 1:2], 2)
## PC1 PC2
## Population 0.130 0.410
## Income -0.300 0.520
## Illiteracy 0.470 0.053
## Life Exp -0.410 -0.082
## Murder 0.440 0.310
## HS Grad -0.420 0.300
## Frost -0.360 -0.150
## Area -0.033 0.590
Each column is an eigenvector of \(\V\) (here \(\V\) is the correlation matrix of the raw variables, since we set scale. = TRUE)
signif(state.pca$sdev, 2)
## [1] 1.90 1.30 1.10 0.84 0.62 0.55 0.38 0.34
Standard deviations along each principal component \(=\sqrt{\lambda_i}\)
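To see the correspondence directly (a sketch; with scale. = TRUE, \(\V\) is the correlation matrix of the raw data):
signif(state.pca$sdev^2, 2)              # variances along the components
signif(eigen(cor(state.x77))$values, 2)  # eigenvalues of the correlation matrix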
If we keep \(k\) components, \[ R^2 = \frac{\sum_{i=1}^{k}{\lambda_i}}{\sum_{j=1}^{p}{\lambda_j}} \]
(Denominator \(=\tr{\V}\) [why?])
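In R, this \(R^2\) comes straight from the squared sdev values; summary() on a prcomp object reports the same proportions:
lambda <- state.pca$sdev^2
signif(cumsum(lambda) / sum(lambda), 2)  # R^2 from keeping k = 1, ..., p components
summary(state.pca)                       # the "Cumulative Proportion" row shows the same numbers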
signif(state.pca$x[1:10, 1:2], 2)
## PC1 PC2
## Alabama 3.80 -0.23
## Alaska -1.10 5.50
## Arizona 0.87 0.75
## Arkansas 2.40 -1.30
## California 0.24 3.50
## Colorado -2.10 0.51
## Connecticut -1.90 -0.24
## Delaware -0.42 -0.51
## Florida 1.20 1.10
## Georgia 3.30 0.11
Columns are \(\vec{x}_i \cdot \vec{w}_1\) and \(\vec{x}_i \cdot \vec{w}_2\)
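A sketch of that claim in code: the scores in $x are just the centered-and-scaled data times the loading matrix.
Z <- scale(state.x77, center = state.pca$center, scale = state.pca$scale)
all.equal(unname(Z %*% state.pca$rotation), unname(state.pca$x))  # should be TRUE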
Use \(k\) directions in a \(p\times k\) matrix \(\w\)
Require \(\mathbf{w}^T\mathbf{w} = \mathbf{I}\), i.e., the basis vectors (the columns of \(\w\)) are orthonormal
\(\X \w =\) matrix of projection lengths \([n\times k]\)
\(\X \w \w^T =\) matrix of projected vectors \([n\times p]\)
\(\X - \X \w \w^T =\) matrix of vector residuals \([n\times p]\)
\((\X-\X\w\w^T)(\X-\X\w\w^T)^T =\) matrix of inner products of vector residuals \([n\times n]\)
\(\tr{((\X-\X\w\w^T)(\X-\X\w\w^T)^T)} =\) sum of squared errors \([1\times 1]\)
Expanding, the sum of squared errors is \(\tr{(\X\X^T)} - \tr{(\X\w\w^T\X^T)}\) (using \(\w^T\w=\mathbf{I}\)), and the first term doesn't involve \(\w\), so minimize the error by maximizing \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)
“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]
This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
Formally: introduce one Lagrange multiplier per constraint and solve \[ \max_{\w, \lambda}{\mathcal{L}(\w,\lambda)} \] just as in the one-direction case
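A sketch tying this back to the state data, keeping \(k=2\) components (note prcomp uses the \(n-1\) convention for variances, so the check divides by \(n-1\) rather than \(n\); the variable names are just for illustration):
Z <- scale(state.x77, center = state.pca$center, scale = state.pca$scale)
w <- state.pca$rotation[, 1:2]    # p x k matrix of the first two loadings
resid <- Z - Z %*% w %*% t(w)     # matrix of vector residuals, n x p
sum(resid^2) / (nrow(Z) - 1)      # mean squared reconstruction error
sum(state.pca$sdev[-(1:2)]^2)     # equals the sum of the discarded eigenvalues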