Linear Dimension Reduction
36-467/667
25 February 2020
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\]
Previously…
- Represent items in our data base as vectors of numerical features
- Want to look for patterns in those features
- First, or as an example, low-dimensional structure in the vectors
What the data looks like
- Data is \(\mathbf{x} = n\times p\) matrix
- Could be multivariate (\(p\)-dimensional) data on \(n\) items in our data base
- Or multivariate data from \(n\) moments in time
- Or multivariate data from \(n\) spatial locations
- Or scalar data at \(n\) time points and \(p\) locations
- Or multivariate data from \(n_S\) locations at \(n_T\) moments (\(n=n_S n_T\))
- Etc., etc.
- Write \(\vec{x}_i\) for row \(i\) (\(1\times p\) matrix)
- First, center the data, just to reduce book-keeping
- i.e., subtract the mean of each column
- Optional: scale each column to equal variance
Finding the “principal” component
- We don’t want to keep track of \(p\) dimensions
- We want one dimension
- We also don’t want to distort the data too much
- Pick a direction in the \(p\)-dimensional space, i.e., a length-1 vector \(\vec{w}\), and project the data on to it
- What’s the best \(\vec{w}\)?
Projections
- \(\vec{x}_i \cdot \vec{w} =\) length of \(\vec{x}_i\)’s projection on to the direction of \(\vec{w}\)
- (Strictly the signed length)
- (Remember we’re requiring \(\|\vec{w}\| = 1\))
- \((\vec{x}_i \cdot \vec{w})\vec{w} =\) the actual projected vector
How well does the projection approximate the original?
Do it for one vector first:
\[\begin{eqnarray}
{\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2
& =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\
& = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\
\nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\
& = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\
& = & \| \vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2
\end{eqnarray}\]
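As a quick numeric sanity check (not part of the original derivation), the identity \(\|\vec{x} - (\vec{w}\cdot\vec{x})\vec{w}\|^2 = \|\vec{x}\|^2 - (\vec{w}\cdot\vec{x})^2\) can be verified in R; the particular vectors below are made up for illustration, and any \(\vec{x}\) paired with a unit-length \(\vec{w}\) should work.

# Numeric check of the projection-error identity (made-up example vectors)
x <- c(3, -1, 2)
w <- c(1, 2, 2)
w <- w / sqrt(sum(w^2))              # normalize w to length 1
lhs <- sum((x - sum(w * x) * w)^2)   # ||x - (w.x) w||^2
rhs <- sum(x^2) - sum(w * x)^2       # ||x||^2 - (w.x)^2
all.equal(lhs, rhs)                  # should be TRUE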
How well does the projection approximate the original?
- Average across all the data vectors:
\[\begin{eqnarray}
MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right)
\end{eqnarray}\]
- First bit doesn’t depend on \(\vec{w}\), so doesn’t matter for minimizing
- So we want to maximize \[
L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}
\]
Minimizing MSE is maximizing variance
\[\begin{eqnarray}
L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 +
\SampleVar{\vec{w}\cdot\vec{x_i}}
\end{eqnarray}\]
(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))
But \[
\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0
\] (Why?)
Minimizing MSE is maximizing variance
so \[
L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}
\]
Minimizing MSE is maximizing variance
The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance
OK, how do we find this magic direction?
Matrix form: all lengths of projections \(=\mathbf{x}\mathbf{w}\) \([n\times 1]\)
\[\begin{eqnarray}
\SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i=1}^{n}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\
& = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\
& = & \frac{1}{n} \w^T \X^T \X \w\\
& = & \w^T \frac{\X^T \X}{n} \w\\
\end{eqnarray}\]
Fact: \(\V \equiv \frac{\X^T \X}{n} =\) sample covariance matrix of the vectors
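A minimal check of that fact in R, on synthetic centered data (the data and seed below are made up): \(\X^T\X/n\) matches R's cov(), once we undo the \(n-1\) divisor that cov() uses.

# Check V = t(x) %*% x / n against cov() on synthetic centered data
set.seed(467)                              # arbitrary seed
x <- matrix(rnorm(100 * 3), nrow=100)      # fake data, n=100, p=3
x <- scale(x, center=TRUE, scale=FALSE)    # center each column
V <- t(x) %*% x / nrow(x)
# cov() divides by n-1, so rescale before comparing
all.equal(V, cov(x) * (nrow(x) - 1) / nrow(x), check.attributes=FALSE)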
OK, how do we find this magic direction?
\[\begin{eqnarray}
\mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w
\end{eqnarray}\]
Setting both derivatives to zero at the optimum gives
\[\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}\]
The magic direction is an eigenvector
\[\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}\]
THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!
At the solution, \[
\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda
\] so the maximum comes from the leading eigenvector of \(\V\)
The first principal component (PC1) and scores on the first principal component
\[\begin{eqnarray}
\text{Lead eigenvector of}~ \V & = & 1^{\mathrm{st}}~\text{principal component}\\
& = & \text{Direction of maximum variance}\\
& = & \text{Best 1D approximation to the data}
\end{eqnarray}\]
- Write the leading eigenvector as \(\vec{w}_1\), or, as a \(p\times 1\) matrix, \(\mathbf{w}_1\)
- The score of the \(i^{\mathrm{th}}\) data point on PC1 is \(\vec{x}_i \cdot \vec{w}_1\), the (signed) length of \(\vec{x}_i\)’s projection on to \(\vec{w}_1\)
- The \([n\times 1]\) matrix of all scores on PC1 is \(\mathbf{x} \mathbf{w}_1\)
- The scores are the new feature values we’d use after dimension reduction
- The \([n\times p]\) matrix of all approximations using just PC1 is \(\mathbf{x} \mathbf{w}_1 \mathbf{w}_1^T\)
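To make this concrete, here is a sketch (on synthetic data, with arbitrary seeds and sizes) that finds PC1 directly as the leading eigenvector of \(\V\) and checks it against prcomp(); the two can differ by an overall sign, since \(-\vec{w}_1\) is just as good a direction as \(\vec{w}_1\).

# PC1 by hand: leading eigenvector of the sample covariance matrix
set.seed(36)
x <- matrix(rnorm(200 * 4), nrow=200) %*% matrix(rnorm(16), 4, 4)  # correlated fake data
x <- scale(x, center=TRUE, scale=FALSE)        # center the columns
V <- t(x) %*% x / nrow(x)                      # sample covariance matrix
w1 <- eigen(V, symmetric=TRUE)$vectors[, 1]    # leading eigenvector = PC1
w1.prcomp <- prcomp(x)$rotation[, 1]           # prcomp's PC1
all.equal(abs(w1), abs(w1.prcomp), check.attributes=FALSE)  # same up to sign
scores1 <- x %*% w1                            # scores on PC1 (n x 1)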

Multiple principal components
- What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
- If we want to use \(q\) principal components, we’ll need to make \(\w\) a \([p\times q]\) matrix
- Each column of \(\w\) will be the vector in a different direction of the linear subspace we’re projecting down to
- \(\X\w\) will be the \(n\times q\) matrix of new feature values
- Intuition: take the direction of maximum variance \(\perp\) the first principal component
- Then direction of maximum variance \(\perp\) the first two principal components
- These are the eigenvectors of \(\V\), in order of decreasing \(\lambda\)
- Proof: See the “gory details” back-up slides
- But clearly we need to know something about the eigenvalues and eigenvectors
About the sample covariance matrix
- \(\V\) is a \(p\times p\) matrix, so it has \(p\) eigenvalues \(\lambda_1, \ldots \lambda_p\), and \(p\) eigenvectors, \(\vec{w}_1, \ldots \vec{w}_p\)
- Some eigenvalues might be repeated
- We can always chose eigenvectors to have length 1 (normalize them)
- We can stack the eigenvectors in to a \(p\times p\) matrix \(\w\) where each column is a different \(\vec{w}_i\)
- \(\V\) is a covariance matrix, so it has two special properties:
- It’s symmetric
- Meaning: \(\V = \V^T\)
- Since \(v_{ij} = \Cov{X_i, X_j} = \Cov{X_j, X_i} = v_{ji}\)
- It’s non-negative definite
- Meaning: \(\vec{a} \cdot (\V \vec{a}) \geq 0\), for any vector \(\vec{a}\)
- You’re showing that all covariance matrices are non-negative definite in HW 6
- Compatible with some entries in \(\V\) being negative numbers
- Because \(\V\) is symmetric:
- All its eigenvalues are real (not complex)
- Eigenvectors with distinct eigenvalues are orthogonal: If \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)
- Eigenvectors with equal eigenvalues can be chosen to be orthogonal
- So there are \(p\) orthogonal, normalized (“orthonormal”) eigenvectors
- In \(p\) dimensions, any set of \(p\) orthogonal vectors is a basis
- \(\Rightarrow\) any vector \(\vec{a}\) can be written as a sum of the eigenvectors, \(\vec{a} = \sum_{j=1}^{p}{(\vec{a} \cdot \vec{w}_j) \vec{w}_j}\)
- Because \(\V\) is non-negative definite:
- All \(\lambda_i \geq 0\)
- For any eigenvector \(\vec{w}_i\), \(\vec{w}_i \cdot (\V \vec{w}_i) = \lambda_i (\vec{w}_i \cdot \vec{w}_i)\) but this must be \(\geq 0\) because \(\V\) is non-negative definite
Some properties of the eigenvalues
- All eigenvalues \(\geq 0\)
- Again, this is because \(\V\) is a variance matrix
- In general, \(p\) non-zero eigenvalues
- If the data are exactly in a \(q\)-dimensional subspace, then exactly \(q\) non-zero eigenvalues
- If \(n < p\), at most \(n-1\) non-zero eigenvalues (once the data are centered)
- Two points define a line, three define a plane, …
Some properties of the PC vectors
- The principal components are orthonormal
- \(\vec{w}_i \cdot \vec{w}_i = 1\)
- \(\vec{w}_i \cdot \vec{w}_j = 0\) (unless \(i=j\))
- In matrix form, \(\w^T\w = \mathbf{I}\)
- Remember, \(\w =\) the \(p\times p\) matrix whose columns are the PC vectors
- Again, this is because \(\V\) is symmetric
- PC1 is the direction of maximum variance through the data
- That variance is \(\lambda_1\), biggest eigenvalue of \(\V\)
- As we saw in the proof above
- PC \((i+1)\) is the direction of maximum variance \(\perp\) PC1, PC2, \(\ldots\) PC \(i\)
- That variance is \(\lambda_{i+1}\)
- See the “gory details” slides
The eigendecomposition of \(\V\)
- We saw a little while back that \(\V\) has \(p\) orthonormal eigenvectors, so they’re a basis, and for any vector \(\vec{a}\), \[
\vec{a} = \sum_{j=1}^{p}{\vec{w}_j (\vec{w}_j \cdot \vec{a})}
\]
- This implies that for any vector \(\vec{a}\), \[
\V \vec{a} = \sum_{j=1}^{p}{(\V \vec{w}_j) (\vec{w}_j \cdot \vec{a})} = \sum_{j=1}^{p}{\vec{w}_j \lambda_j (\vec{w}_j \cdot \vec{a})}
\]
- Again, define \(\w\) as the \(p\times p\) matrix whose columns are the eigenvectors
- So \(\w^T\) is the \(p\times p\) matrix whose rows are the eigenvectors
- Define \(\mathbf{\Lambda}\) as the \(p\times p\) diagonal matrix whose diagonal is \(\lambda_1, \lambda_2, \ldots \lambda_p\)
- What we’ve just shown is that for any vector \(\vec{a}\), \[
\V \vec{a} = \w \mathbf{\Lambda} \w^T \vec{a}
\] so \[
\V = \w \mathbf{\Lambda} \w^T
\]
- This is the eigendecomposition of \(\V\)
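A quick numerical check of the eigendecomposition (again on made-up data): eigen() returns the eigenvectors as the columns of $vectors and the eigenvalues in $values, sorted in decreasing order for a symmetric matrix.

# Check V = w %*% Lambda %*% t(w) on synthetic data
set.seed(667)                              # arbitrary seed
x <- scale(matrix(rnorm(150 * 5), nrow=150), center=TRUE, scale=FALSE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V, symmetric=TRUE)
w <- eig$vectors                           # columns = eigenvectors
Lambda <- diag(eig$values)                 # diagonal matrix of eigenvalues
all.equal(V, w %*% Lambda %*% t(w))        # eigendecomposition holds
all.equal(t(w) %*% w, diag(ncol(V)))       # eigenvectors are orthonormal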
Some properties of the PC scores
Define the \([n\times p]\) matrix of the scores of all data points on all PCs: \[
\S \equiv \X \w
\]
- \(s_{ij} =\) score of the \(i^{\mathrm{th}}\) data point on PC \(j\)
- Average score on each PC \(=0\) (b/c we centered the data)
- Here, “average” is over the \(n\) data points, not over the \(p\) PCs
- That is, for each \(j \in 1:p\), \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij}} = 0\)
- This implies that \[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
\end{eqnarray}\]
EXERCISE: Show that \(\Var{\text{scores}} = \mathbf{\Lambda}\)
- \(\V \equiv \frac{\X^T \X}{n} = \w \mathbf{\Lambda} \w^T\)
- \(\w^{-1} = \w^T\) (why?)
Solution to the exercise
\[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
& = & \frac{1}{n} (\X\w)^T(\X\w)\\
& = & \frac{1}{n}\w^T \X^T \X \w\\
& = & \w^T \V\w = \w^T (\w \mathbf{\Lambda} \mathbf{w}^T) \w\\
& = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\
& = & \mathbf{\Lambda}
\end{eqnarray}\]
- Variance of score on PC \(j\) \(=\lambda_j\)
- This is variance over the \(n\) data points, each of which will have its own score on PC \(j\)
- That is, \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij}^2} = \lambda_j\)
- Covariance of score on PC \(j\) with score on PC \(k\) \(=0\)
- Again, this is covariance over the \(n\) data points, each of which gets its own score on both PCs
- That is, \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij} s_{ik}} = 0\)
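These properties are easy to confirm numerically; the sketch below (synthetic data, arbitrary seed) builds the score matrix by hand and checks that its covariance matrix is the diagonal matrix of eigenvalues.

# Check Var[scores] = Lambda on synthetic centered data
set.seed(25)                               # arbitrary seed
x <- scale(matrix(rnorm(120 * 4), nrow=120), center=TRUE, scale=FALSE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V, symmetric=TRUE)
s <- x %*% eig$vectors                     # n x p matrix of scores on all PCs
all.equal(t(s) %*% s / nrow(x), diag(eig$values))  # diagonal, with the eigenvalues on it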
Some properties of PCA as a whole
- If we use all \(p\) principal components, we have the eigendecomposition of \(\V\): \[
\V = \w \mathbf{\Lambda} \mathbf{w}^T
\]
- Once again, \(\mathbf{\Lambda}=\) diagonal matrix of eigenvalues \(\lambda_1, \ldots \lambda_p\) and \(\w =\) matrix whose columns are the PCs
- If we use all \(p\) principal components, \[
\X = \S\w^T
\]
- If we use only the top \(q\) PCs, we get:
- the best rank-\(q\) approximation to \(\V\)
- the best dimension-\(q\) approximation to \(\X\)
- the best rank-\(q\) approximation to \(\X\)
Low-rank approximation
- If we use all the scores, we just re-express the data: \[
\X = \S\w^T = (\X\w) \w^T
\]
- Remember \(\w^T = \w^{-1}\)
- Suppose we just keep the top \(q\) PCs, so \(\w_{q}\) is \(p\times q\) \[
\S_{q} = \X\w_{q} ~(\text{dimension} ~ [n\times q])
\]
- Approximation to the data: \[
\X_{q} = \S_{q} \w_{q}^T ~ ~(\text{dimension} ~ [n\times q][q\times p])
\]
- \(\X_q\) is a rank-\(q\) matrix, and it’s the closest rank-\(q\) matrix to \(\X\)
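Here is a sketch of the rank-\(q\) reconstruction using prcomp()’s output, on synthetic data (the choice \(q=2\) is arbitrary); since prcomp() centers the columns, the product \(\S_q \w_q^T\) approximates the centered data matrix.

# Rank-q approximation from the top q principal components (synthetic data)
set.seed(2)
x <- matrix(rnorm(100 * 6), nrow=100) %*% matrix(rnorm(36), 6, 6)
pca <- prcomp(x)                 # centers the columns by default
q <- 2
w.q <- pca$rotation[, 1:q]       # p x q matrix of the top q PCs
s.q <- pca$x[, 1:q]              # n x q matrix of scores
x.q <- s.q %*% t(w.q)            # n x p rank-q approximation to the centered data
mean((scale(x, center=TRUE, scale=FALSE) - x.q)^2)   # MSE of the approximation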
Another way to think about PCA
- The original coordinates are correlated
- There is always another coordinate system with uncorrelated coordinates
- We’re rotating to that coordinate system
- Rotating to new coordinates \(\Rightarrow\) multiplying by an orthogonal matrix
- That matrix is \(\mathbf{w}\)
- The new coordinates are the scores
Some data-mining-y applications
- Latent semantic indexing
- Multidimensional scaling
- Netflix / recommender systems
Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)
- Build bag-of-words representation of our documents, \(p \approx 10^4\) or even \(10^5\)
- Do PCA
- Keep \(q \ll p\) components
- These are the “latent semantic dimensions” of the documents
- Use the scores on those components as the new features for
- Classification
- Prediction
- Similarity-based search
LSI for the New York Times stories, in R
nyt.nice.frame holds all the stories, with some standard text-processing tweaks applied:
- (Up-weight counts of words which appear in few documents, “inverse document frequency” weighting)
- (Normalize all feature vectors to length 1)
dim(nyt.nice.frame)
## [1] 102 4431
- The basic command for PCA in R is prcomp():
nyt.pca <- prcomp(nyt.nice.frame)
str(nyt.pca)
## List of 5
## $ sdev : num [1:102] 0.141 0.136 0.131 0.124 0.12 ...
## $ rotation: num [1:4431, 1:102] 0.02701 0.04073 -0.00657 -0.02276 0.03522 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
## .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:4431] 0.0145 0.00612 0.00274 0.0123 0.01255 ...
## ..- attr(*, "names")= chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
## $ scale : logi FALSE
## $ x : num [1:102, 1:102] -0.231 0.0358 -0.0906 -0.0513 -0.1322 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "art.1" "art.2" "art.3" "art.4" ...
## .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
LSI for the New York Times stories
First component:
nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1],decreasing=TRUE)[1:30],2)
## music trio theater orchestra composers opera
## 0.110 0.084 0.083 0.067 0.059 0.058
## theaters m festival east program y
## 0.055 0.054 0.051 0.049 0.048 0.048
## jersey players committee sunday june concert
## 0.047 0.047 0.046 0.045 0.045 0.045
## symphony organ matinee misstated instruments p
## 0.044 0.044 0.043 0.042 0.041 0.041
## X.d april samuel jazz pianist society
## 0.041 0.040 0.040 0.039 0.038 0.038
signif(sort(nyt.latent.sem[,1],decreasing=FALSE)[1:30],2)
## she her ms i said mother cooper my
## -0.260 -0.240 -0.200 -0.150 -0.130 -0.110 -0.100 -0.094
## painting process paintings im he mrs me gagosian
## -0.088 -0.071 -0.070 -0.068 -0.065 -0.065 -0.063 -0.062
## was picasso image sculpture baby artists work photos
## -0.058 -0.057 -0.056 -0.056 -0.055 -0.055 -0.054 -0.051
## you nature studio out says like
## -0.051 -0.050 -0.050 -0.050 -0.050 -0.049
- Music vs. art
- What’s up with “p”, “m”, “y”, personal pronouns?
LSI for the New York Times stories
Second component:
signif(sort(nyt.latent.sem[,2],decreasing=TRUE)[1:30],2)
## art museum images artists donations museums
## 0.150 0.120 0.095 0.092 0.075 0.073
## painting tax paintings sculpture gallery sculptures
## 0.073 0.070 0.065 0.060 0.055 0.051
## painted white patterns artist nature service
## 0.050 0.050 0.047 0.047 0.046 0.046
## decorative feet digital statue color computer
## 0.043 0.043 0.043 0.042 0.042 0.041
## paris war collections diamond stone dealers
## 0.041 0.041 0.041 0.041 0.041 0.040
signif(sort(nyt.latent.sem[,2],decreasing=FALSE)[1:30],2)
## her she theater opera ms i
## -0.220 -0.220 -0.160 -0.130 -0.130 -0.083
## hour production sang festival music musical
## -0.081 -0.075 -0.075 -0.074 -0.070 -0.070
## songs vocal orchestra la singing matinee
## -0.068 -0.067 -0.067 -0.065 -0.065 -0.061
## performance band awards composers says my
## -0.061 -0.060 -0.058 -0.058 -0.058 -0.056
## im play broadway singer cooper performances
## -0.056 -0.056 -0.055 -0.052 -0.051 -0.051
Visualization
story.classes <- c(rep("art", times=length(art.stories)),
rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
pch=ifelse(story.classes=="music","m","a"),
col=ifelse(story.classes=="music","blue","red"))

- Notice: each document (= data point, row of \(\X\)) gets its own score on every PC (I’m just using the first two for visualization)
- What part of prcomp()’s return value contains the scores?
- Notice: really easy to separate using just 2 principal components!
Multidimensional scaling (MDS)
- Given: High-dimensional vectors with known distances
- Desired: Low-dimensional vectors with nearly the same distances
- One approach: PCA
- Because: triangle inequality
- So we’ve just done MDS
- Gory details: section 3.7 of PDM
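For comparison, base R’s classical MDS is cmdscale(); with Euclidean distances it recovers, up to column signs, the same configuration as the PCA scores. A minimal sketch on made-up data:

# Classical MDS on Euclidean distances vs. PCA scores (synthetic data)
set.seed(3)
x <- matrix(rnorm(50 * 10), nrow=50)
d <- dist(x)                          # pairwise Euclidean distances
mds2 <- cmdscale(d, k=2)              # 2-dimensional MDS configuration
pca2 <- prcomp(x)$x[, 1:2]            # scores on the first 2 PCs
all.equal(abs(mds2), abs(pca2), check.attributes=FALSE)  # agree up to column signs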
Netflix and recommender systems in general
- The data: user ratings of movies (1–5), say \(n\) users by \(p\) movies
- (Really, accounts rather than users…)
- Goal: predict how a user will rate a movie they haven’t seen before
- So Netflix can recommend movies users will like, and keep them paying
- Originally, Netflix released a data set of \(10^8\) ratings from \(4.8\times 10^5\) users of \(1.8 \times 10^4\) movies
- Challenge: improve the RMSE of ratings on a testing set by 10%
- Released in 2006; prize of $ \({10}^6\) for first team to hit the threshold by 2011 (with smaller prizes for incremental progress)
- The first person to use PCA got an immediate improvement of \(3.76\)% (!)
- Questions to decide before using PCA:
- Are users features for movies, or are movies features for users?
- What’s the best way to use the new features (= scores on PCs) for prediction?
- How do we handle missing values?
- We’ll come back to this when we look at recommender systems in more detail
Interpreting PCA results
- PCs are linear combinations of the original coordinates
- Usually a good idea to scale variables with different units of measurement first
- \(\Rightarrow\) PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
- Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA
PCA is exploratory analysis, not statistical inference
- We assumed no model
- Pro: That’s the best linear approximation, no matter what
- Con: doesn’t tell us how much our results are driven by noise
- Prediction: PCA predicts nothing
- Inference: If \(\V \rightarrow \Var{X}\) then PCs \(\rightarrow\) eigenvectors of \(\Var{X}\)
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
- There are PCA-like models (like factor models), which we may come back to later, but those models \(\neq\) PCA
PCA and distance preservation
- In PCA, we project from \(\vec{x}_i\) to \(\mathbf{u}\vec{x}_i\)
- \(\mathbf{u} =\) projection operator for the principal components
- What’s \(\mathbf{u}\) in terms of the eigenvectors?
- We reduce from \(p\) dimensions down to \(q\)
- PCA tries to make the projected points close to the original points, so \[
\frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2}
\] is minimized
- So distances between projected points track distances between original points
- (By the triangle inequality)
Distance-preserving projections
- Suppose we just want to preserve distances
- A projection matrix \(\mathbf{u}\) would be distance preserving or an isometry if \[
\|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2
\]
- Exact distance-preservation is a lot to ask
- Approximate distance preservation: we say that \(\mathbf{u}\) is “\(\epsilon\)-close to being an isometry”, or an \(\epsilon\)-isometry, when \[
(1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2
\]
- A remarkable math fact: If \(q = O(\frac{\log{n}}{\epsilon^2})\) then there is always an \(\epsilon\)-isometry
The random projection trick
- Make \(Z_{ij} \sim \mathcal{N}(0,1)\) independent standard Gaussians
- Set \[
\mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{cccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\
Z_{21} & Z_{22} & \ldots & Z_{2p}\\
\vdots & \vdots & \ldots & \vdots\\
Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right]
\]
- \(\mathbf{U}\vec{x}_i\) is like projecting on to \(q\) random \(p\)-dimensional vectors
- What is the expected squared length of those vectors?
- Now you can check that for any vector \(\vec{x}\), \[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2
\]
- Hint: \(\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}\) and \(\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}\)
The random projection trick
Expectation is linear and the \(Z_{ij}\) are uncorrelated so \[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2
\]
Random projections are distance-preserving on average
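A small simulation (all numbers here are made up) illustrates the on-average claim: build \(\mathbf{U}\) from independent \(\mathcal{N}(0,1)\) entries scaled by \(1/\sqrt{q}\), project, and compare squared distances before and after.

# Random projection: check approximate distance preservation by simulation
set.seed(7)
n <- 50; p <- 1000; q <- 200                    # arbitrary sizes
x <- matrix(rnorm(n * p), nrow=n)               # synthetic high-dimensional data
U <- matrix(rnorm(q * p), nrow=q) / sqrt(q)     # q x p random projection matrix
x.proj <- x %*% t(U)                            # n x q projected data
ratio <- as.vector(dist(x.proj)^2 / dist(x)^2)  # projected / original squared distances
summary(ratio)                                  # should cluster around 1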
Random projections are nearly distance-preserving with high probability
- Pick your favorite \(\epsilon > 0\) (approximation level) and \(\delta > 0\) (error probability)
- Johnson-Lindenstrauss theorem: With probability at least \(1-\delta\), \(\mathbf{U}\) is an \(\epsilon\)-isometry for our \(n\) data points if \[
q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}}
\]
- E.g., for 95% confidence, need \(q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + 1.498\ldots \right)\)
- For 95% confidence with \(n=205\) (like in HW 7) and \(\epsilon=0.2\), we need \(q \geq 1365\)
- If we had \(n=2\times 10^6\), we’d need \(q \geq 3202\)
- The remarkable bits: \(p\) doesn’t matter and neither do the actual data vectors
- For the HW 7 data, \(p=4.725\times 10^{5}\)
- You can also turn this around to see what \(\epsilon\) you can achieve with a fixed budget of \(q\)
- With \(q = 100\), \(n=205\), \(\delta=0.05\), we get \(\epsilon \approx 0.74\) which isn’t great but is some signal
- Again, independent of \(p\)
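The bound is easy to evaluate; the helper below (jl.dimension is a made-up name, not a standard R function) reproduces the \(q \geq 1365\) and \(q \geq 3202\) figures above.

# Minimum q from the quoted Johnson-Lindenstrauss bound
jl.dimension <- function(n, epsilon, delta) {
  ceiling((8 / epsilon^2) * log(n / sqrt(delta)))
}
jl.dimension(n=205, epsilon=0.2, delta=0.05)      # 1365
jl.dimension(n=2e6, epsilon=0.2, delta=0.05)      # 3202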
Summing up
- We have multivariate data \(\X\) (dimension = \([n\times p]\))
- We want the best \(q\)-dimensional linear approximation
- Solution: Principal components analysis
- Take the \(q\) leading eigenvectors of \(\V \equiv \frac{1}{n}\X^T \X =\) sample/empirical covariance matrix
- These eigenvectors = the principal components = directions of largest variance = biggest contrasts within the data
- Project the data on to the principal components
- Equivalently: rotate to new, uncorrelated coordinates
- Assemble the eigenvectors into \(\w_q\) (\([p\times q]\))
- Scores for the data are \(\X\w_q \equiv \S_q\) (\([n\times q]\))
- Approximations are \((\X\w_q) \w_q^T = \S_q \w_q^T\) (\([n\times p]\))
- No inference or prediction, just linear approximation
- If we just want to (approximately) preserve distances, random linear projections can work really well too
Backup: The gory details
\[\begin{eqnarray}
MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\
& = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\
& = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\
& = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)}
\end{eqnarray}\]
so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)
Backup: The gory details
“trace is cyclic” so \[
\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}
\] so we want to maximize \[
\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}
\] under the constraint \[
\w^T \w = \mathbf{I}
\]
This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
Backup: Distinct eigenvalues of a symmetric matrix have orthogonal eigenvectors
For any symmetric matrix with eigenvalues \(\lambda_1, \ldots \lambda_p\) and eigenvectors \(\vec{w}_1, \ldots \vec{w}_p\), if \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)
Proof: Remember that \(\vec{w}_i \perp \vec{w}_j\) iff \(\vec{w}_i \cdot \vec{w}_j = 0\), so let’s get at that inner product. Now, for any two vectors \(\vec{a}\), \(\vec{b}\) and square matrix \(\mathbf{c}\) (not necessarily symmetric), \[
\vec{a} \cdot (\mathbf{c} \vec{b}) = (\mathbf{c}^T \vec{a}) \cdot \vec{b}
\] (To see this, write \(\vec{a}\) and \(\vec{b}\) as \(p\times 1\) matrices, so we’ve got \(\vec{a}^T \mathbf{c} \vec{b} = \vec{a}^T (\mathbf{c}^T)^T \vec{b} = (\mathbf{c}^T \vec{a})^T \vec{b}\).) Now let’s apply this to our eigenvectors: \[\begin{eqnarray}
\vec{w}_i \cdot (\V \vec{w}_j) & = & (\V^T \vec{w}_i) \cdot \vec{w}_j\\
\vec{w}_i \cdot (\V \vec{w}_j) & = & (\V \vec{w}_i) \cdot \vec{w}_j\\
\vec{w}_i \cdot (\lambda_j \vec{w}_j) & = & (\lambda_i \vec{w}_i) \cdot \vec{w}_j\\
\lambda_j (\vec{w}_i \cdot \vec{w}_j) & = & \lambda_i (\vec{w}_i \cdot \vec{w}_j)
\end{eqnarray}\] (The second line uses the assumption that \(\V\) is symmetric, the third line uses the fact that \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\V\), the last line uses linearity of inner products.)
Since, by assumption, \(\lambda_i \neq \lambda_j\), the only way the two sides of the equation can balance is if \(\vec{w}_i \cdot \vec{w}_j = 0\), as was to be shown.
Backup: Distinct eigenvectors with equal eigenvalues can be made or chosen to be orthogonal
- Remember that if \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of some matrix \(\mathbf{c}\), and both have eigenvalue \(\lambda\), then \(a\vec{w}_i + b\vec{w}_j\) is also an eigenvector of \(\mathbf{c}\), with eigenvalue \(\lambda\): \[
\mathbf{c}\left(a\vec{w}_i + b\vec{w}_j\right) = a(\mathbf{c}\vec{w}_i) + b(\mathbf{c}\vec{w}_j) = \lambda(a\vec{w}_i + b\vec{w}_j)
\]
- Now suppose \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\mathbf{c}\) with eigenvalue \(\lambda\), and that \(\vec{w}_i \not\perp \vec{w}_j\); we’ll construct a new eigenvector, also with eigenvalue \(\lambda\), which is orthogonal to \(\vec{w}_i\)
- Without loss of generality (“wolog”), assume \(\|\vec{w}_i\|=\|\vec{w}_j\|=1\)
- Define \(\vec{u} \equiv \vec{w}_j - (\vec{w}_i \cdot \vec{w}_j) \vec{w}_i\)
- Claim 1: \(\vec{u}\) is an eigenvector with eigenvalue \(\lambda\) (because it’s a linear combination of two eigenvectors with eigenvalue \(\lambda\))
- Claim 2: \(\vec{u} \perp \vec{w}_i\): \[\begin{eqnarray}
\vec{u} \cdot \vec{w}_i & = & \left(\vec{w}_j - (\vec{w}_i \cdot \vec{w}_j) \vec{w}_i\right) \cdot \vec{w}_i\\
& = & \vec{w}_j \cdot \vec{w}_i - (\vec{w}_i \cdot \vec{w}_j) (\vec{w}_i \cdot \vec{w}_i)\\
& = & \vec{w}_j \cdot \vec{w}_i - \vec{w}_i \cdot \vec{w}_j\\
& = & 0
\end{eqnarray}\] (using \(\|\vec{w}_i\|=1\) and symmetry of inner products)
- Claim 3: \(\vec{w}_j^{\prime} = \frac{\vec{u}}{\|\vec{u}\|}\) is also an eigenvector with eigenvalue \(\lambda\), and also orthogonal to \(\vec{w}_i\)
- Because it’s proportional to \(\vec{u}\)
- So we can replace \(\vec{w}_j\) with \(\vec{w}_j^{\prime}\) in our list of eigenvectors
Backup: The Lagrange multiplier trick (1)
- Want to solve \[
\max_{w}{L(w)}
\] with constraint \(f(w) = c\)
- Option I: Learn techniques for constrained optimization
- Drawback: Who wants to major in operations research?
- Option II: Use \(f(w)=c\) to eliminate one coordinate of \(w\)
- Then \(w=g(v,c)\) for some function \(g\) and unconstrained \(v\)
- Do \(\max_{v}{L(g(v,c))}\)
- An unconstrained problem with one less variable
- Drawback: math is hard!
- Option III: Solve an unconstrained problem with more variables
- Drawback: Would only occur to a French mathematician (Joseph-Louis Lagrange, 1736–1813)
Backup: The Lagrange multiplier trick (2)
- Define a Lagrangian \[
\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)
\]
- \(=L(w)\) when the constraint holds
- Some people write \(+\lambda\); doesn’t make any difference
- \(\lambda\) is the Lagrange multiplier which enforces the constraint \(f(w)=c\)
- Now do an unconstrained optimization over \(w\) and \(\lambda\): \[
\max_{w, \lambda}{\mathcal{L}(w,\lambda)}
\]
Backup: The Lagrange multiplier trick (3)
\[
\max_{w, \lambda}{\mathcal{L}(w,\lambda)}
\]
- Take derivatives: \[\begin{eqnarray}
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\
\frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w}
\end{eqnarray}\]
- Set to 0 at the maximum: \[\begin{eqnarray}
f(w) & = & c\\
\frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w}
\end{eqnarray}\]
- We’ve automatically recovered the constraint!
- We get 1 equation per unknown \(\Rightarrow\) Solve for \(\lambda\), \(w\)
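A tiny worked example (not from the slides): maximize \(L(w) = w_1\) subject to \(w_1^2 + w_2^2 = 1\). \[\begin{eqnarray}
\mathcal{L}(w,\lambda) & = & w_1 - \lambda(w_1^2 + w_2^2 - 1)\\
\frac{\partial \mathcal{L}}{\partial w_1} & = & 1 - 2\lambda w_1 = 0\\
\frac{\partial \mathcal{L}}{\partial w_2} & = & -2\lambda w_2 = 0\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(w_1^2 + w_2^2 - 1) = 0
\end{eqnarray}\] The second equation forces \(w_2 = 0\) (since \(\lambda = 0\) would contradict the first), the constraint then gives \(w_1 = \pm 1\), and the maximum of \(L\) is at \(w_1 = 1\), with \(\lambda = 1/2\).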
Backup: The Lagrange multiplier trick (4)
- More equality constraints \(\Rightarrow\) more Lagrange multipliers
- Inequality constraints, \(g(w) \leq d\), are trickier
- The basic question with inequalities: Is the unconstrained maximum inside the feasible set?
- Yes: problem solved
- No: constrained maximum is on the boundary
- Boundary is usually \(g(w) = d\) \(\Rightarrow\) treat like an equality
- There are subtleties; sometimes we need to learn some operations research
Backup: The Lagrange multiplier trick (5)
- Why not do this instead? \[
\max_{\lambda, w}{L(w) + \lambda{(f(w)-c)}^2}
\]
Backup Example: The axis of historical complexity (1)
Turchin et al. (2018)
9 variables measuring social complexity, over time, for dozens of locations across the world
Dates from -9600 to 1900
summary(soccomp[,complexities], digits=2)
## PolPop PolTerr CapPop levels government
## Min. :1.4 Min. :-0.22 Min. :1.4 Min. :0.0 Min. :0.00
## 1st Qu.:4.2 1st Qu.: 3.65 1st Qu.:3.5 1st Qu.:1.8 1st Qu.:0.24
## Median :6.0 Median : 5.18 Median :4.3 Median :3.0 Median :0.62
## Mean :5.5 Mean : 4.78 Mean :4.2 Mean :2.9 Mean :0.55
## 3rd Qu.:6.8 3rd Qu.: 5.97 3rd Qu.:5.1 3rd Qu.:4.0 3rd Qu.:0.86
## Max. :8.5 Max. : 7.40 Max. :6.3 Max. :6.6 Max. :1.00
## infrastr writing texts money
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.0
## 1st Qu.:0.34 1st Qu.:0.26 1st Qu.:0.10 1st Qu.:1.8
## Median :0.75 Median :0.82 Median :0.93 Median :4.0
## Mean :0.64 Mean :0.65 Mean :0.63 Mean :3.4
## 3rd Qu.:0.90 3rd Qu.:0.86 3rd Qu.:0.97 3rd Qu.:5.0
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :6.0
Backup Example: The axis of historical complexity (2)
soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)
What are the parts of the return value?
str(soccomp.pca)
## List of 5
## $ sdev : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
## $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
## ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## $ scale : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
## ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## $ x : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
- sdev = standard deviations along the PCs \(=\sqrt{\lambda_i}\)
- rotation = matrix of eigenvectors \(=\w\)
- x = scores on all PCs \(=\S\)
Backup Example: The axis of historical complexity (3)
# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
xlab="Number of components used", ylab=expression(R^2),
ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)
Backup Example: The axis of historical complexity (4)
# Plot the coordinates/weights/loadings of PC1
# But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
# The las option turns the axis labels on their side (so they don't overlap
# each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")

- PC1: Nearly equal positive weight on all variables
- High scores: large population, large areas, large capital cities, much hierarchy, very differentiated, etc.
- Low scores: small and unsophisticated
- PC2: negative weight on population, area, levels of hierarchy, positive weight on gov’t, infrastructure, information, money
- High scores: small but sophisticated
- Low scores: big but dumb
Backup Example: The axis of historical complexity (5)
# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))

Backup Example: The axis of historical complexity (6)

- Illinois, North India, Italy around Rome, central China, West Africa, Persia/Iran
- Spot the collapses of civilization
Backup Example: The axis of historical complexity (7)
plot(soccomp.pca$x[,1], soccomp.pca$x[,2], xlab="Score on PC1",
ylab="Score on PC2")

- Correlation of the scores on PC1 and PC2 is \(1.54062\times 10^{-14}\)
- Uncorrelated but dependent
MOAR READING
- PCA goes back to Pearson (1901)
- This makes it clear that PCA just needs the sample variance matrix for the features (what we’ve been writing \(\V\)), and that it’s like doing linear regression without picking out one dimension as the thing to be predicted
- In addition to PCA, Pearson invented the correlation coefficient, the \(\chi^2\) test for dependence in contingency tables (and for goodness of fit), and a lot more of modern statistics. He did all this as part of a very racist and classist (but explicitly anti-sexist!) project to breed better human beings.
- In the subsequent 120 years, PCA got re-invented multiple times, most influentially by Hotelling (1933a, 1933b)
- One re-invention was actually an important extension, the Karhunen-Loève transformation or Karhunen-Loève expansion (Loève 1955):
- We used PCA to write \(p\)-dimensional vectors as combinations of \(p\) orthogonal basis vectors, so \(\vec{x}_i = \sum_{j=1}^{p}{s_{ij} \vec{w}_j}\) with \(\vec{w}_j \cdot \vec{w}_k = \delta_{jk}\) and \(n^{-1}\sum_{i=1}^{n}{s_{ij} s_{ik}} = \lambda_j \delta_{jk}\)
- Suppose instead our data was \(n\) continuous functions, \(x_i(t)\), say with \(t \in [0,T]\). We’d like to find a set of basis functions \(\phi_j(t)\) where we can write \(x_i(t) = \sum_{j}{s_{ij} \phi_j(t)}\) (notice the score \(s_{ij}\) doesn’t change over time), with \(\int_{0}^{T}{\phi_j(t) \phi_k(t) dt} = \delta_{jk}\) (basis functions are orthogonal) and \(n^{-1}\sum_{i=1}^{n}{s_{ij} s_{ik}} = \lambda_j \delta_{jk}\) (scores on different basis functions are uncorrelated)
- K&L (independently) showed how to do this, and also showed how to make it work if we imagine all of the data \(x_i(t)\) as being randomly drawn from some distribution over functions
- The same trick works for functions defined over 2, 3, many, … dimensions
- In many fields, PCA is called “empirical orthogonal functions” (see e.g. Eshel (2012)) because of the connection to the KL transformation
- The random projection trick relies on the Johnson-Lindenstrauss (JL) theorem, which goes back to Johnson and Lindenstrauss (1984)
- The theorem is often called the JL lemma, because it was just a lemma (intermediate result) in their original work (Johnson and Lindenstrauss 1984, Lemma 1, pp. 190–191)
- Johnson and Lindenstrauss were actually solving a kind of interpolation problem, about how to extend functions defined on a limited set of points to functions defined everywhere on a continuous space, and the JL theorem was just a tool for them
- The slickest proof of the Johnson-Lindenstrauss theorem I’ve seen is Dasgupta and Gupta (2002)
- The version of the JL theorem I’ve quoted (for the constants) comes from Boucheron, Lugosi, and Massart (2013)
References
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press.
Dasgupta, Sanjoy, and Anupam Gupta. 2002. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures and Algorithms 22:60–65. https://doi.org/10.1002/rsa.10073.
Eshel, Gidon. 2012. Spatiotemporal Data Analysis. Princeton, New Jersey: Princeton University Press.
Hotelling, Harold. 1933a. “Analysis of a Complex of Statistical Variables into Principal Components [Part 1 of 2].” Journal of Educational Psychology 24:417–441. https://doi.org/10.1037/h0071325.
———. 1933b. “Analysis of a Complex of Statistical Variables into Principal Components [Part 2 of 2].” Journal of Educational Psychology 24:498–520. https://doi.org/10.1037/h0070888.
Johnson, William B., and Joram Lindenstrauss. 1984. “Extensions of Lipschitz Mappings into a Hilbert Space.” In Conference on Modern Analysis and Probability, edited by Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, 26:189–206. Contemporary Mathematics. Providence, Rhode Island: American Mathematical Society.
Loève, Michel. 1955. Probability Theory. 1st ed. New York: D. Van Nostrand Company.
Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.