Linear Dimension Reduction
36-467/667
25 February 2020
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\]
Previously…
- Represent items in our data base as vectors of numerical features
- Want to look for patterns in those features
- First, or as an example, low-dimensional structure in the vectors
What the data looks like
- Data is \(\mathbf{x} = n\times p\) matrix
- Could be multivariate (\(p\)-dimensional) data on \(n\) items in our data base
- Or multivariate data from \(n\) moments in time
- Or multivariate data from \(n\) spatial locations
- Or scalar data at \(n\) time points and \(p\) locations
- Or multivariate data from \(n_S\) locations at \(n_T\) moments (\(n=n_S n_T\))
- Etc., etc.
- Write \(\vec{x}_i\) for row \(i\) (\(1\times p\) matrix)
- First, center the data, just to reduce book-keeping
- i.e., subtract the mean of each column
- Optional: scale each column to equal variance
Finding the “principal” component
- We don’t want to keep track of \(p\) dimensions
- We want one dimension
- We also don’t want to distort the data too much
- Pick a direction in the \(p\)-dimensional space, i.e., a length-1 vector \(\vec{w}\), and project the data on to it
- What’s the best \(\vec{w}\)?
Projections
- \(\vec{x}_i \cdot \vec{w} =\) length of \(\vec{x}_i\)’s projection on to the direction of \(\vec{w}\)
- (Strictly the signed length)
- (Remember we’re requiring \(\|\vec{w}\| = 1\))
- \((\vec{x}_i \cdot \vec{w})\vec{w} =\) the actual projected vector
How well does the projection approximate the original?
Do it for one vector first:
\[\begin{eqnarray}
{\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2
& =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\
& = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\
\nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\
& = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\
& = & \| \vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2
\end{eqnarray}\]
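As a quick numeric sanity check (not part of the original derivation), the identity \(\|\vec{x} - (\vec{w}\cdot\vec{x})\vec{w}\|^2 = \|\vec{x}\|^2 - (\vec{w}\cdot\vec{x})^2\) can be verified in R; the particular vectors below are made up for illustration, and any \(\vec{x}\) paired with a unit-length \(\vec{w}\) should work.

# Numeric check of the projection-error identity (made-up example vectors)
x <- c(3, -1, 2)
w <- c(1, 2, 2)
w <- w / sqrt(sum(w^2))              # normalize w to length 1
lhs <- sum((x - sum(w * x) * w)^2)   # ||x - (w.x) w||^2
rhs <- sum(x^2) - sum(w * x)^2       # ||x||^2 - (w.x)^2
all.equal(lhs, rhs)                  # should be TRUE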
How well does the projection approximate the original?
- Average across all the data vectors:
\[\begin{eqnarray}
MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right)
\end{eqnarray}\]
- First bit doesn’t depend on \(\vec{w}\), so doesn’t matter for minimizing
- So we want to maximize \[
L(\vec{w}) = \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}
\]
Minimizing MSE is maximizing variance
\[\begin{eqnarray}
L(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\
& = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 +
\SampleVar{\vec{w}\cdot\vec{x_i}}
\end{eqnarray}\]
(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))
But \[
\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0
\] (Why?)
Minimizing MSE is maximizing variance
so \[
L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}}
\]
Minimizing MSE is maximizing variance
The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance
OK, how do we find this magic direction?
Matrix form: all lengths of projections \(=\mathbf{x}\mathbf{w}\) \([n\times 1]\)
\[\begin{eqnarray}
\SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i=1}^{n}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\
& = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\
& = & \frac{1}{n} \w^T \X^T \X \w\\
& = & \w^T \frac{\X^T \X}{n} \w\\
\end{eqnarray}\]
Fact: \(\V \equiv \frac{\X^T \X}{n} =\) sample covariance matrix of the vectors
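A minimal check of that fact in R, on synthetic centered data (the data and seed below are made up): \(\X^T\X/n\) matches R's cov(), once we undo the \(n-1\) divisor that cov() uses.

# Check V = t(x) %*% x / n against cov() on synthetic centered data
set.seed(467)                              # arbitrary seed
x <- matrix(rnorm(100 * 3), nrow=100)      # fake data, n=100, p=3
x <- scale(x, center=TRUE, scale=FALSE)    # center each column
V <- t(x) %*% x / nrow(x)
# cov() divides by n-1, so rescale before comparing
all.equal(V, cov(x) * (nrow(x) - 1) / nrow(x), check.attributes=FALSE)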
OK, how do we find this magic direction?
\[\begin{eqnarray}
\mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\
\frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w
\end{eqnarray}\]
Setting both derivatives to zero at the optimum gives
\[\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}\]
The magic direction is an eigenvector
\[\begin{eqnarray}
\w^T \w & = & 1\\
\V \w & = & \lambda \w
\end{eqnarray}\]
THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!
At the solution, \[
\SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda
\] so the maximum comes from the leading eigenvector of \(\V\)
The first principal component (PC1) and scores on the first principal component
\[\begin{eqnarray}
\text{Lead eigenvector of}~ \V & = & 1^{\mathrm{st}}~\text{principal component}\\
& = & \text{Direction of maximum variance}\\
& = & \text{Best 1D approximation to the data}
\end{eqnarray}\]
- Write the leading eigenvector as \(\vec{w}_1\), or, as a \(p\times 1\) matrix, \(\mathbf{w}_1\)
- The score of the \(i^{\mathrm{th}}\) data point on PC1 is \(\vec{x}_i \cdot \vec{w}_1\), the (signed) length of \(\vec{x}_i\)’s projection on to \(\vec{w}_1\)
- The \([n\times 1]\) matrix of all scores on PC1 is \(\mathbf{x} \mathbf{w}_1\)
- The scores are the new feature values we’d use after dimension reduction
- The \([n\times p]\) matrix of all approximations using just PC1 is \(\mathbf{x} \mathbf{w}_1 \mathbf{w}_1^T\)
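To make this concrete, here is a sketch (on synthetic data, with arbitrary seeds and sizes) that finds PC1 directly as the leading eigenvector of \(\V\) and checks it against prcomp(); the two can differ by an overall sign, since \(-\vec{w}_1\) is just as good a direction as \(\vec{w}_1\).

# PC1 by hand: leading eigenvector of the sample covariance matrix
set.seed(36)
x <- matrix(rnorm(200 * 4), nrow=200) %*% matrix(rnorm(16), 4, 4)  # correlated fake data
x <- scale(x, center=TRUE, scale=FALSE)        # center the columns
V <- t(x) %*% x / nrow(x)                      # sample covariance matrix
w1 <- eigen(V, symmetric=TRUE)$vectors[, 1]    # leading eigenvector = PC1
w1.prcomp <- prcomp(x)$rotation[, 1]           # prcomp's PC1
all.equal(abs(w1), abs(w1.prcomp), check.attributes=FALSE)  # same up to sign
scores1 <- x %*% w1                            # scores on PC1 (n x 1)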

Multiple principal components
- What about approximating by a plane, hyper-plane, hyper-hyper-plane, etc.?
- If we want to use \(q\) principal components, we’ll need to make \(\w\) a \([p\times q]\) matrix
- Each column of \(\w\) will be the vector in a different direction of the linear subspace we’re projecting down to
- \(\X\w\) will be the \(n\times q\) matrix of new feature values
- Intuition: take the direction of maximum variance \(\perp\) the first principal component
- Then direction of maximum variance \(\perp\) the first two principal components
- These are the eigenvectors of \(\V\), in order of decreasing \(\lambda\)
- Proof: See the “gory details” back-up slides
- But clearly we need to know something about the eigenvalues and eigenvectors
About the sample covariance matrix
- \(\V\) is a \(p\times p\) matrix, so it has \(p\) eigenvalues \(\lambda_1, \ldots \lambda_p\), and \(p\) eigenvectors, \(\vec{w}_1, \ldots \vec{w}_p\)
- Some eigenvalues might be repeated
- We can always chose eigenvectors to have length 1 (normalize them)
- We can stack the eigenvectors in to a \(p\times p\) matrix \(\w\) where each column is a different \(\vec{w}_i\)
- \(\V\) is a covariance matrix, so it has two special properties:
- It’s symmetric
- Meaning: \(\V = \V^T\)
- Since \(v_{ij} = \Cov{X_i, X_j} = \Cov{X_j, X_i} = v_{ji}\)
- It’s non-negative definite
- Meaning: \(\vec{a} \cdot (\V \vec{a}) \geq 0\), for any vector \(\vec{a}\)
- You’re showing that all covariance matrices are non-negative definite in HW 6
- Compatible with some entries in \(\V\) being negative numbers
- Because \(\V\) is symmetric:
- All its eigenvalues are real (not complex)
- Eigenvectors with distinct eigenvalues are orthogonal: If \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)
- Eigenvectors with equal eigenvalues can be chosen to be orthogonal
- So there are \(p\) orthogonal, normalized (“orthonormal”) eigenvectors
- In \(p\) dimensions, any set of \(p\) orthogonal vectors is a basis
- \(\Rightarrow\) any vector \(\vec{a}\) can be written as a sum of the eigenvectors, \(\vec{a} = \sum_{j=1}^{p}{(\vec{a} \cdot \vec{w}_j) \vec{w}_j}\)
- Because \(\V\) is non-negative definite:
- All \(\lambda_i \geq 0\)
- For any eigenvector \(\vec{w}_i\), \(\vec{w}_i \cdot (\V \vec{w}_i) = \lambda_i (\vec{w}_i \cdot \vec{w}_i)\) but this must be \(\geq 0\) because \(\V\) is non-negative definite
Some properties of the eigenvalues
- All eigenvalues \(\geq 0\)
- Again, this is because \(\V\) is a variance matrix
- In general, \(p\) non-zero eigenvalues
- If the data are exactly in a \(q\)-dimensional subspace, then exactly \(q\) non-zero eigenvalues
- If \(n < p\), at most \(n-1\) non-zero eigenvalues (once the data are centered)
- Two points define a line, three define a plane, …
Some properties of the PC vectors
- The principal components are orthonormal
- \(\vec{w}_i \cdot \vec{w}_i = 1\)
- \(\vec{w}_i \cdot \vec{w}_j = 0\) (unless \(i=j\))
- In matrix form, \(\w^T\w = \mathbf{I}\)
- Remember, \(\w =\) the \(p\times p\) matrix whose columns are the PC vectors
- Again, this is because \(\V\) is symmetric
- PC1 is the direction of maximum variance through the data
- That variance is \(\lambda_1\), biggest eigenvalue of \(\V\)
- As we saw in the proof above
- PC \((i+1)\) is the direction of maximum variance \(\perp\) PC1, PC2, \(\ldots\) PC \(i\)
- That variance is \(\lambda_{i+1}\)
- See the “gory details” slides
The eigendecomposition of \(\V\)
- We saw a little while back that \(\V\) has \(p\) orthonormal eigenvectors, so they’re a basis, and for any vector \(\vec{a}\), \[
\vec{a} = \sum_{j=1}^{p}{\vec{w}_j (\vec{w}_j \cdot \vec{a})}
\]
- This implies that for any vector \(\vec{a}\), \[
\V \vec{a} = \sum_{j=1}^{p}{(\V \vec{w}_j) (\vec{w}_j \cdot \vec{a})} = \sum_{j=1}^{p}{\vec{w}_j \lambda_j (\vec{w}_j \cdot \vec{a})}
\]
- Again, define \(\w\) as the \(p\times p\) matrix whose columns are the eigenvectors
- So \(\w^T\) is the \(p\times p\) matrix whose rows are the eigenvectors
- Define \(\mathbf{\Lambda}\) as the \(p\times p\) diagonal matrix whose diagonal is \(\lambda_1, \lambda_2, \ldots \lambda_p\)
- What we’ve just shown is that for any vector \(\vec{a}\), \[
\V \vec{a} = \w \mathbf{\Lambda} \w^T \vec{a}
\] so \[
\V = \w \mathbf{\Lambda} \w^T
\]
- This is the eigendecomposition of \(\V\)
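A quick numerical check of the eigendecomposition (again on made-up data): eigen() returns the eigenvectors as the columns of $vectors and the eigenvalues in $values, sorted in decreasing order for a symmetric matrix.

# Check V = w %*% Lambda %*% t(w) on synthetic data
set.seed(667)                              # arbitrary seed
x <- scale(matrix(rnorm(150 * 5), nrow=150), center=TRUE, scale=FALSE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V, symmetric=TRUE)
w <- eig$vectors                           # columns = eigenvectors
Lambda <- diag(eig$values)                 # diagonal matrix of eigenvalues
all.equal(V, w %*% Lambda %*% t(w))        # eigendecomposition holds
all.equal(t(w) %*% w, diag(ncol(V)))       # eigenvectors are orthonormal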
Some properties of the PC scores
Define the \([n\times p]\) matrix of the scores of all data points on all PCs: \[
\S \equiv \X \w
\]
- \(s_{ij} =\) score of the \(i^{\mathrm{th}}\) data point on PC \(j\)
- Average score on each PC \(=0\) (b/c we centered the data)
- Here, “average” is over the \(n\) data points, not over the \(p\) PCs
- That is, for each \(j \in 1:p\), \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij}} = 0\)
- This implies that \[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
\end{eqnarray}\]
EXERCISE: Show that \(\Var{\text{scores}} = \mathbf{\Lambda}\)
- \(\V \equiv \frac{\X^T \X}{n} = \w \mathbf{\Lambda} \w^T\)
- \(\w^{-1} = \w^T\) (why?)
Solution to the exercise
\[\begin{eqnarray}
\Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\
& = & \frac{1}{n} (\X\w)^T(\X\w)\\
& = & \frac{1}{n}\w^T \X^T \X \w\\
& = & \w^T \V\w = \w^T (\w \mathbf{\Lambda} \mathbf{w}^T) \w\\
& = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\
& = & \mathbf{\Lambda}
\end{eqnarray}\]
- Variance of score on PC \(j\) \(=\lambda_j\)
- This is variance over the \(n\) data points, each of which will have its own score on PC \(j\)
- That is, \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij}^2} = \lambda_j\)
- Covariance of score on PC \(j\) with score on PC \(k\) \(=0\)
- Again, this is covariance over the \(n\) data points, each of which gets its own score on both PCs
- That is, \(\frac{1}{n}\sum_{i=1}^{n}{s_{ij} s_{ik}} = 0\)
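These properties are easy to confirm numerically; the sketch below (synthetic data, arbitrary seed) builds the score matrix by hand and checks that its covariance matrix is the diagonal matrix of eigenvalues.

# Check Var[scores] = Lambda on synthetic centered data
set.seed(25)                               # arbitrary seed
x <- scale(matrix(rnorm(120 * 4), nrow=120), center=TRUE, scale=FALSE)
V <- t(x) %*% x / nrow(x)
eig <- eigen(V, symmetric=TRUE)
s <- x %*% eig$vectors                     # n x p matrix of scores on all PCs
all.equal(t(s) %*% s / nrow(x), diag(eig$values))  # diagonal, with the eigenvalues on it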
Some properties of PCA as a whole
- If we use all \(p\) principal components, we have the eigendecomposition of \(\V\): \[
\V = \w \mathbf{\Lambda} \mathbf{w}^T
\]
- Once again, \(\mathbf{\Lambda}=\) diagonal matrix of eigenvalues \(\lambda_1, \ldots \lambda_p\) and \(\w =\) matrix whose columns are the PCs
- If we use all \(p\) principal components, \[
\X = \S\w^T
\]
- If we use only the top \(q\) PCs, we get:
- the best rank-\(q\) approximation to \(\V\)
- the best dimension-\(q\) approximation to \(\X\)
- the best rank-\(q\) approximation to \(\X\)
Low-rank approximation
- If we use all the scores, we just re-express the data: \[
\X = \S\w^T = (\X\w) \w^T
\]
- Remember \(\w^T = \w^{-1}\)
- Suppose we just keep the top \(q\) PCs, so \(\w_{q}\) is \(p\times q\) \[
\S_{q} = \X\w_{q} ~(\text{dimension} ~ [n\times q])
\]
- Approximation to the data: \[
\X_{q} = \S_{q} \w_{q}^T ~ ~(\text{dimension} ~ [n\times q][q\times p])
\]
- \(\X_q\) is a rank-\(q\) matrix, and it’s the closest rank-\(q\) matrix to \(\X\)
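Here is a sketch of the rank-\(q\) reconstruction using prcomp()’s output, on synthetic data (the choice \(q=2\) is arbitrary); since prcomp() centers the columns, the product \(\S_q \w_q^T\) approximates the centered data matrix.

# Rank-q approximation from the top q principal components (synthetic data)
set.seed(2)
x <- matrix(rnorm(100 * 6), nrow=100) %*% matrix(rnorm(36), 6, 6)
pca <- prcomp(x)                 # centers the columns by default
q <- 2
w.q <- pca$rotation[, 1:q]       # p x q matrix of the top q PCs
s.q <- pca$x[, 1:q]              # n x q matrix of scores
x.q <- s.q %*% t(w.q)            # n x p rank-q approximation to the centered data
mean((scale(x, center=TRUE, scale=FALSE) - x.q)^2)   # MSE of the approximation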
Another way to think about PCA
- The original coordinates are correlated
- There is always another coordinate system with uncorrelated coordinates
- We’re rotating to that coordinate system
- Rotating to new coordinates \(\Rightarrow\) multiplying by an orthogonal matrix
- That matrix is \(\mathbf{w}\)
- The new coordinates are the scores
Some data-mining-y applications
- Latent semantic indexing
- Multidimensional scaling
- Netflix / recommender systems
Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)
- Build bag-of-words representation of our documents, \(p \approx 10^4\) or even \(10^5\)
- Do PCA
- Keep \(q \ll p\) components
- These are the “latent semantic dimensions” of the documents
- Use the scores on those components as the new features for
- Classification
- Prediction
- Similarity-based search
LSI for the New York Times stories, in R
nyt.nice.frame holds all the stories, with some standard text-processing tweaks applied:
- (Up-weight counts of words which appear in few documents, “inverse document frequency” weighting)
- (Normalize all feature vectors to length 1)
dim(nyt.nice.frame)
## [1] 102 4431
- The basic command for PCA in R is prcomp():
nyt.pca <- prcomp(nyt.nice.frame)
str(nyt.pca)
## List of 5
## $ sdev : num [1:102] 0.141 0.136 0.131 0.124 0.12 ...
## $ rotation: num [1:4431, 1:102] 0.02701 0.04073 -0.00657 -0.02276 0.03522 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
## .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:4431] 0.0145 0.00612 0.00274 0.0123 0.01255 ...
## ..- attr(*, "names")= chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
## $ scale : logi FALSE
## $ x : num [1:102, 1:102] -0.231 0.0358 -0.0906 -0.0513 -0.1322 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "art.1" "art.2" "art.3" "art.4" ...
## .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
LSI for the New York Times stories
First component:
nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1],decreasing=TRUE)[1:30],2)
## music trio theater orchestra composers opera
## 0.110 0.084 0.083 0.067 0.059 0.058
## theaters m festival east program y
## 0.055 0.054 0.051 0.049 0.048 0.048
## jersey players committee sunday june concert
## 0.047 0.047 0.046 0.045 0.045 0.045
## symphony organ matinee misstated instruments p
## 0.044 0.044 0.043 0.042 0.041 0.041
## X.d april samuel jazz pianist society
## 0.041 0.040 0.040 0.039 0.038 0.038
signif(sort(nyt.latent.sem[,1],decreasing=FALSE)[1:30],2)
## she her ms i said mother cooper my
## -0.260 -0.240 -0.200 -0.150 -0.130 -0.110 -0.100 -0.094
## painting process paintings im he mrs me gagosian
## -0.088 -0.071 -0.070 -0.068 -0.065 -0.065 -0.063 -0.062
## was picasso image sculpture baby artists work photos
## -0.058 -0.057 -0.056 -0.056 -0.055 -0.055 -0.054 -0.051
## you nature studio out says like
## -0.051 -0.050 -0.050 -0.050 -0.050 -0.049
- Music vs. art
- What’s up with “p”, “m”, “y”, personal pronouns?
LSI for the New York Times stories
Second component:
signif(sort(nyt.latent.sem[,2],decreasing=TRUE)[1:30],2)
## art museum images artists donations museums
## 0.150 0.120 0.095 0.092 0.075 0.073
## painting tax paintings sculpture gallery sculptures
## 0.073 0.070 0.065 0.060 0.055 0.051
## painted white patterns artist nature service
## 0.050 0.050 0.047 0.047 0.046 0.046
## decorative feet digital statue color computer
## 0.043 0.043 0.043 0.042 0.042 0.041
## paris war collections diamond stone dealers
## 0.041 0.041 0.041 0.041 0.041 0.040
signif(sort(nyt.latent.sem[,2],decreasing=FALSE)[1:30],2)
## her she theater opera ms i
## -0.220 -0.220 -0.160 -0.130 -0.130 -0.083
## hour production sang festival music musical
## -0.081 -0.075 -0.075 -0.074 -0.070 -0.070
## songs vocal orchestra la singing matinee
## -0.068 -0.067 -0.067 -0.065 -0.065 -0.061
## performance band awards composers says my
## -0.061 -0.060 -0.058 -0.058 -0.058 -0.056
## im play broadway singer cooper performances
## -0.056 -0.056 -0.055 -0.052 -0.051 -0.051
Visualization
story.classes <- c(rep("art", times=length(art.stories)),
rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
pch=ifelse(story.classes=="music","m","a"),
col=ifelse(story.classes=="music","blue","red"))

- Notice: each document (= data point, row of \(\X\)) gets its own score on every PC (I’m just using the first two for visualization)
- What part of prcomp()’s return value contains the scores?
- Notice: really easy to separate using just 2 principal components!
Multidimensional scaling (MDS)
- Given: High-dimensional vectors with known distances
- Desired: Low-dimensional vectors with nearly the same distances
- One approach: PCA
- Because: triangle inequality
- So we’ve just done MDS
- Gory details: section 3.7 of PDM
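For comparison, base R’s classical MDS is cmdscale(); with Euclidean distances it recovers, up to column signs, the same configuration as the PCA scores. A minimal sketch on made-up data:

# Classical MDS on Euclidean distances vs. PCA scores (synthetic data)
set.seed(3)
x <- matrix(rnorm(50 * 10), nrow=50)
d <- dist(x)                          # pairwise Euclidean distances
mds2 <- cmdscale(d, k=2)              # 2-dimensional MDS configuration
pca2 <- prcomp(x)$x[, 1:2]            # scores on the first 2 PCs
all.equal(abs(mds2), abs(pca2), check.attributes=FALSE)  # agree up to column signs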
Netflix and recommender systems in general
- The data: user ratings of movies (1–5), say \(n\) users by \(p\) movies
- (Really, accounts rather than users…)
- Goal: predict how a user will rate a movie they haven’t seen before
- So Netflix can recommend movies users will like, and keep them paying
- Originally, Netflix released a data set of \(10^8\) ratings from \(4.8\times 10^5\) users of \(1.8 \times 10^4\) movies
- Challenge: improve the RMSE of ratings on a testing set by 10%
- Released in 2006; prize of $ \({10}^6\) for first team to hit the threshold by 2011 (with smaller prizes for incremental progress)
- The first person to use PCA got an immediate improvement of \(3.76\)% (!)
- Questions to decide before using PCA:
- Are users features for movies, or are movies features for users?
- What’s the best way to use the new features (= scores on PCs) for prediction?
- How do we handle missing values?
- We’ll come back to this when we look at recommender systems in more detail
Interpreting PCA results
- PCs are linear combinations of the original coordinates
- Usually a good idea to scale variables with different units of measurement first
- \(\Rightarrow\) PCs change as you add or remove coordinates
- Put in 1000 measures of education and PC1 is education…
- Very tempting to reify the PCs
- i.e., to “make them a thing”
- sometimes totally appropriate…
- sometimes not at all appropriate…
- Be very careful when the only evidence is the PCA
PCA is exploratory analysis, not statistical inference
- We assumed no model
- Pro: That’s the best linear approximation, no matter what
- Con: doesn’t tell us how much our results are driven by noise
- Prediction: PCA predicts nothing
- Inference: If \(\V \rightarrow \Var{X}\) then PCs \(\rightarrow\) eigenvectors of \(\Var{X}\)
- But PCA doesn’t need this assumption
- Doesn’t tell us about uncertainty
- There are PCA-like models (like factor models), which we may come back to later, but those models \(\neq\) PCA
PCA and distance preservation
- In PCA, we project from \(\vec{x}_i\) to \(\mathbf{u}\vec{x}_i\)
- \(\mathbf{u} =\) projection operator for the principal components
- What’s \(\mathbf{u}\) in terms of the eigenvectors?
- We reduce from \(p\) dimensions down to \(q\)
- PCA tries to make the projected points close to the original points, so \[
\frac{1}{n}\sum_{i=1}^{n}{\| \vec{x}_i - \mathbf{u}\vec{x}_i\|^2}
\] is minimized
- So distances between projected points track distances between original points
- (By the triangle inequality)
Distance-preserving projections
- Suppose we just want to preserve distances
- A projection matrix \(\mathbf{u}\) would be distance preserving or an isometry if \[
\|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 = \|\vec{x}_i - \vec{x}_j\|^2
\]
- Exact distance-preservation is a lot to ask
- Approximate distance preservation: we say that \(\mathbf{u}\) is “\(\epsilon\)-close to being an isometry”, or an \(\epsilon\)-isometry, when \[
(1-\epsilon) \|\vec{x}_i - \vec{x}_j\|^2 \leq \|\mathbf{u}\vec{x}_i - \mathbf{u}\vec{x}_j\|^2 \leq (1+\epsilon) \|\vec{x}_i - \vec{x}_j\|^2
\]
- A remarkable math fact: If \(q = O(\frac{\log{n}}{\epsilon^2})\) then there is always an \(\epsilon\)-isometry
The random projection trick
- Make \(Z_{ij} \sim \mathcal{N}(0,1)\) independent standard Gaussians
- Set \[
\mathbf{U} = \frac{1}{\sqrt{q}}\left[\begin{array}{cccc} Z_{11} & Z_{12} & \ldots & Z_{1p}\\
Z_{21} & Z_{22} & \ldots & Z_{2p}\\
\vdots & \vdots & \ldots & \vdots\\
Z_{q1} & Z_{q2} & \ldots & Z_{qp}\end{array}\right]
\]
- \(\mathbf{U}\vec{x}_i\) is like projecting on to \(q\) random \(p\)-dimensional vectors
- What is the expected squared length of those vectors?
- Now you can check that for any vector \(\vec{x}\), \[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \|\vec{x}\|^2
\]
- Hint: \(\Expect{\|\mathbf{U}\vec{x}\|^2} = \Expect{\sum_{i=1}^{q}{\left(\frac{1}{\sqrt{q}}\sum_{j=1}^{p}{Z_{ij} x_j}\right)^2}}\) and \(\|\vec{x}\|^2 = \sum_{j=1}^{p}{x_j^2}\)
The random projection trick
Expectation is linear and the \(Z_{ij}\) are uncorrelated so \[
\Expect{\|\mathbf{U}\vec{x}\|^2} = \frac{1}{q}\sum_{i=1}^{q}{\sum_{j=1}^{p}{x_j^2}} = \|\vec{x}\|^2
\]
Random projections are distance-preserving on average
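A small simulation (all numbers here are made up) illustrates the on-average claim: build \(\mathbf{U}\) from independent \(\mathcal{N}(0,1)\) entries scaled by \(1/\sqrt{q}\), project, and compare squared distances before and after.

# Random projection: check approximate distance preservation by simulation
set.seed(7)
n <- 50; p <- 1000; q <- 200                    # arbitrary sizes
x <- matrix(rnorm(n * p), nrow=n)               # synthetic high-dimensional data
U <- matrix(rnorm(q * p), nrow=q) / sqrt(q)     # q x p random projection matrix
x.proj <- x %*% t(U)                            # n x q projected data
ratio <- as.vector(dist(x.proj)^2 / dist(x)^2)  # projected / original squared distances
summary(ratio)                                  # should cluster around 1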
Random projections are nearly distance-preserving with high probability
- Pick your favorite \(\epsilon > 0\) (approximation level) and \(\delta > 0\) (error probability)
- Johnson-Lindenstrauss theorem: With probability at least \(1-\delta\), \(\mathbf{U}\) is an \(\epsilon\)-isometry for our \(n\) data points if \[
q \geq \frac{8}{\epsilon^2}\log{\frac{n}{\sqrt{\delta}}}
\]
- E.g., for 95% confidence, need \(q \geq \frac{8}{\epsilon^2}\left(\log{(n)} + 1.498\ldots \right)\)
- For 95% confidence with \(n=205\) (like in HW 7) and \(\epsilon=0.2\), we need \(q \geq 1365\)
- If we had \(n=2\times 10^6\), we’d need \(q \geq 3202\)
- The remarkable bits: \(p\) doesn’t matter and neither do the actual data vectors
- For the HW 7 data, \(p=4.725\times 10^{5}\)
- You can also turn this around to see what \(\epsilon\) you can achieve with a fixed budget of \(q\)
- With \(q = 100\), \(n=205\), \(\delta=0.05\), we get \(\epsilon \approx 0.74\) which isn’t great but is some signal
- Again, independent of \(p\)
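The bound is easy to evaluate; the helper below (jl.dimension is a made-up name, not a standard R function) reproduces the \(q \geq 1365\) and \(q \geq 3202\) figures above.

# Minimum q from the quoted Johnson-Lindenstrauss bound
jl.dimension <- function(n, epsilon, delta) {
  ceiling((8 / epsilon^2) * log(n / sqrt(delta)))
}
jl.dimension(n=205, epsilon=0.2, delta=0.05)      # 1365
jl.dimension(n=2e6, epsilon=0.2, delta=0.05)      # 3202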
Summing up
- We have multivariate data \(\X\) (dimension = \([n\times p]\))
- We want the best \(q\)-dimensional linear approximation
- Solution: Principal components analysis
- Take the \(q\) leading eigenvectors of \(\V \equiv \frac{1}{n}\X^T \X =\) sample/empirical covariance matrix
- These eigenvectors = the principal components = directions of largest variance = biggest contrasts within the data
- Project the data on to the principal components
- Equivalently: rotate to new, uncorrelated coordinates
- Assemble the eigenvectors into \(\w_q\) (\([p\times q]\))
- Scores for the data are \(\X\w_q \equiv \S_q\) (\([n\times q]\))
- Approximations are \((\X\w_q) \w_q^T = \S_q \w_q^T\) (\([n\times p]\))
- No inference or prediction, just linear approximation
- If we just want to (approximately) preserve distances, random linear projections can work really well too
Backup: The gory details
\[\begin{eqnarray}
MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\
& = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\
& = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\
& = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)}
\end{eqnarray}\]
so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)
Backup: The gory details
“trace is cyclic” so \[
\tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)}
\] so we want to maximize \[
\tr{\left(\w^T \frac{\X^T \X}{n}\w\right)}
\] under the constraint \[
\w^T \w = \mathbf{I}
\]
This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
Backup: Distinct eigenvalues of a symmetric matrix have orthogonal eigenvectors
For any symmetric matrix with eigenvalues \(\lambda_1, \ldots \lambda_p\) and eigenvectors \(\vec{w}_1, \ldots \vec{w}_p\), if \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)
Proof: Remember that \(\vec{w}_i \perp \vec{w}_j\) iff \(\vec{w}_i \cdot \vec{w}_j = 0\), so let’s get at that inner product. Now, for any two vectors \(\vec{a}\), \(\vec{b}\) and square matrix \(\mathbf{c}\) (not necessarily symmetric), \[
\vec{a} \cdot (\mathbf{c} \vec{b}) = (\mathbf{c}^T \vec{a}) \cdot \vec{b}
\] (To see this, write \(\vec{a}\) and \(\vec{b}\) as \(p\times 1\) matrices, so we’ve got \(\vec{a}^T \mathbf{c} \vec{b} = \vec{a}^T (\mathbf{c}^T)^T \vec{b} = (\mathbf{c}^T \vec{a})^T \vec{b}\).) Now let’s apply this to our eigenvectors: \[\begin{eqnarray}
\vec{w}_i \cdot (\V \vec{w}_j) & = & (\V^T \vec{w}_i) \cdot \vec{w}_j\\
\vec{w}_i \cdot (\V \vec{w}_j) & = & (\V \vec{w}_i) \cdot \vec{w}_j\\
\vec{w}_i \cdot (\lambda_j \vec{w}_j) & = & (\lambda_i \vec{w}_i) \cdot \vec{w}_j\\
\lambda_j (\vec{w}_i \cdot \vec{w}_j) & = & \lambda_i (\vec{w}_i \cdot \vec{w}_j)
\end{eqnarray}\] (The second line uses the assumption that \(\V\) is symmetric, the third line uses the fact that \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\V\), the last line uses linearity of inner products.)
Since, by assumption, \(\lambda_i \neq \lambda_j\), the only way the two sides of the equation can balance is if \(\vec{w}_i \cdot \vec{w}_j = 0\), as was to be shown.
Backup: Distinct eigenvectors with equal eigenvalues can be made or chosen to be orthogonal
- Remember that if \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of some matrix \(\mathbf{c}\), and both have eigenvalue \(\lambda\), then \(a\vec{w}_i + b\vec{w}_j\) is also an eigenvector of \(\mathbf{c}\), with eigenvalue \(\lambda\): \[
\mathbf{c}\left(a\vec{w}_i + b\vec{w}_j\right) = a(\mathbf{c}\vec{w}_i) + b(\mathbf{c}\vec{w}_j) = \lambda(a\vec{w}_i + b\vec{w}_j)
\]
- Now suppose \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\mathbf{c}\) with eigenvalue \(\lambda\), and that \(\vec{w}_i \not\perp \vec{w}_j\); we’ll construct a new eigenvector, also with eigenvalue \(\lambda\), which is orthogonal to \(\vec{w}_i\)
- Without loss of generality (“wolog”), assume \(\|\vec{w}_i\|=\|\vec{w}_j\|=1\)
- Define \(\vec{u} \equiv \vec{w}_j - (\vec{w}_i \cdot \vec{w}_j) \vec{w}_i\)
- Claim 1: \(\vec{u}\) is an eigenvector with eigenvalue \(\lambda\) (because it’s a linear combination of two eigenvectors with eigenvalue \(\lambda\))
- Claim 2: \(\vec{u} \perp \vec{w}_i\): \[\begin{eqnarray}
\vec{u} \cdot \vec{w}_i & = & \left(\vec{w}_j - (\vec{w}_i \cdot \vec{w}_j) \vec{w}_i\right) \cdot \vec{w}_i\\
& = & \vec{w}_j \cdot \vec{w}_i - (\vec{w}_i \cdot \vec{w}_j) (\vec{w}_i \cdot \vec{w}_i)\\
& = & \vec{w}_j \cdot \vec{w}_i - \vec{w}_i \cdot \vec{w}_j\\
& = & 0
\end{eqnarray}\] (using \(\|\vec{w}_i\|=1\) and symmetry of inner products)
- Claim 3: \(\vec{w}_j^{\prime} = \frac{\vec{u}}{\|\vec{u}\|}\) is also an eigenvector with eigenvalue \(\lambda\), and also orthogonal to \(\vec{w}_i\)
- Because it’s proportional to \(\vec{u}\)
- So we can replace \(\vec{w}_j\) with \(\vec{w}_j^{\prime}\) in our list of eigenvectors
Backup: The Lagrange multiplier trick (1)
- Want to solve \[
\max_{w}{L(w)}
\] with constraint \(f(w) = c\)
- Option I: Learn techniques for constrained optimization
- Drawback: Who wants to major in operations research?
- Option II: Use \(f(w)=c\) to eliminate one coordinate of \(w\)
- Then \(w=g(v,c)\) for some function \(g\) and unconstrained \(v\)
- Do \(\max_{v}{L(g(v,c))}\)
- An unconstrained problem with one less variable
- Drawback: math is hard!
- Option III: Solve an unconstrained problem with more variables
- Drawback: Would only occur to a French mathematician (Joseph-Louis Lagrange, 1736–1813)
Backup: The Lagrange multiplier trick (2)
- Define a Lagrangian \[
\mathcal{L}(w,\lambda) \equiv L(w) - \lambda(f(w) - c)
\]
- \(=L(w)\) when the constraint holds
- Some people write \(+\lambda\); doesn’t make any difference
- \(\lambda\) is the Lagrange multiplier which enforces the constraint \(f(w)=c\)
- Now do an unconstrained optimization over \(w\) and \(\lambda\): \[
\max_{w, \lambda}{\mathcal{L}(w,\lambda)}
\]
Backup: The Lagrange multiplier trick (3)
\[
\max_{w, \lambda}{\mathcal{L}(w,\lambda)}
\]
- Take derivatives: \[\begin{eqnarray}
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(f(w)-c)\\
\frac{\partial \mathcal{L}}{\partial w} & = & \frac{\partial L}{\partial w} - \lambda\frac{\partial f}{\partial w}
\end{eqnarray}\]
- Set to 0 at the maximum: \[\begin{eqnarray}
f(w) & = & c\\
\frac{\partial L}{\partial w} & = & \lambda\frac{\partial f}{\partial w}
\end{eqnarray}\]
- We’ve automatically recovered the constraint!
- We get 1 equation per unknown \(\Rightarrow\) Solve for \(\lambda\), \(w\)
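A tiny worked example (not from the slides): maximize \(L(w) = w_1\) subject to \(w_1^2 + w_2^2 = 1\). \[\begin{eqnarray}
\mathcal{L}(w,\lambda) & = & w_1 - \lambda(w_1^2 + w_2^2 - 1)\\
\frac{\partial \mathcal{L}}{\partial w_1} & = & 1 - 2\lambda w_1 = 0\\
\frac{\partial \mathcal{L}}{\partial w_2} & = & -2\lambda w_2 = 0\\
\frac{\partial \mathcal{L}}{\partial \lambda} & = & -(w_1^2 + w_2^2 - 1) = 0
\end{eqnarray}\] The second equation forces \(w_2 = 0\) (since \(\lambda = 0\) would contradict the first), the constraint then gives \(w_1 = \pm 1\), and the maximum of \(L\) is at \(w_1 = 1\), with \(\lambda = 1/2\).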
Backup: The Lagrange multiplier trick (4)
- More equality constraints \(\Rightarrow\) more Lagrange multipliers
- Inequality constraints, \(g(w) \leq d\), are trickier
- The basic question with inequalities: Is the unconstrained maximum inside the feasible set?
- Yes: problem solved
- No: constrained maximum is on the boundary
- Boundary is usually \(g(w) = d\) \(\Rightarrow\) treat like an equality
- There are subtleties; sometimes we need to learn some operations research
Backup: The Lagrange multiplier trick (5)
- Why not do this instead? \[
\max_{\lambda, w}{L(w) + \lambda{(f(w)-c)}^2}
\]
Backup Example: The axis of historical complexity (1)
Turchin et al. (2018)
9 variables measuring social complexity, over time, for dozens of locations across the world
Dates from -9600 to 1900
summary(soccomp[,complexities], digits=2)
## PolPop PolTerr CapPop levels government
## Min. :1.4 Min. :-0.22 Min. :1.4 Min. :0.0 Min. :0.00
## 1st Qu.:4.2 1st Qu.: 3.65 1st Qu.:3.5 1st Qu.:1.8 1st Qu.:0.24
## Median :6.0 Median : 5.18 Median :4.3 Median :3.0 Median :0.62
## Mean :5.5 Mean : 4.78 Mean :4.2 Mean :2.9 Mean :0.55
## 3rd Qu.:6.8 3rd Qu.: 5.97 3rd Qu.:5.1 3rd Qu.:4.0 3rd Qu.:0.86
## Max. :8.5 Max. : 7.40 Max. :6.3 Max. :6.6 Max. :1.00
## infrastr writing texts money
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.0
## 1st Qu.:0.34 1st Qu.:0.26 1st Qu.:0.10 1st Qu.:1.8
## Median :0.75 Median :0.82 Median :0.93 Median :4.0
## Mean :0.64 Mean :0.65 Mean :0.63 Mean :3.4
## 3rd Qu.:0.90 3rd Qu.:0.86 3rd Qu.:0.97 3rd Qu.:5.0
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :6.0
Backup Example: The axis of historical complexity (2)
soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)
What are the parts of the return value?
str(soccomp.pca)
## List of 5
## $ sdev : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
## $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
## ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## $ scale : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
## ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
## $ x : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
- sdev = standard deviations along the PCs \(=\sqrt{\lambda_i}\)
- rotation = matrix of eigenvectors \(=\w\)
- x = scores on all PCs \(=\S\)
Backup Example: The axis of historical complexity (3)
# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
xlab="Number of components used", ylab=expression(R^2),
ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)
Backup Example: The axis of historical complexity (4)
# Plot the coordinates/weights/loadings of PC1
# But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
# The las option turns the axis labels on their side (so they don't overlap
# each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")

- PC1: Nearly equal positive weight on all variables
- High scores: large population, large areas, large capital cities, much hierarchy, very differentiated, etc.
- Low scores: small and unsophisticated
- PC2: negative weight on population, area, levels of hierarchy, positive weight on gov’t, infrastructure, information, money
- High scores: small but sophisticated
- Low scores: big but dumb
Backup Example: The axis of historical complexity (5)
# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))

Backup Example: The axis of historical complexity (6)

- Illinois, North India, Italy around Rome, central China, West Africa, Persia/Iran
- Spot the collapses of civilization
Backup Example: The axis of historical complexity (7)
plot(soccomp.pca$x[,1], soccomp.pca$x[,2], xlab="Score on PC1",
ylab="Score on PC2")

- Correlation of the scores on PC1 and PC2 is \(1.54062\times 10^{-14}\)
- Uncorrelated but dependent
MOAR READING
- PCA goes back to Pearson (1901)
- This makes it clear that PCA just needs the sample variance matrix for the features (what we’ve been writing \(\V\)), and that it’s like doing linear regression without picking out one dimension as the thing to be predicted
- In addition to PCA, Pearson invented the correlation coefficient, the \(\chi^2\) test for dependence in contingency tables (and for goodness of fit), and a lot more of modern statistics. He did all this as part of a very racist and classist (but explicitly anti-sexist!) project to breed better human beings.
- In the subsequent 120 years, PCA got re-invented multiple times, most influentially by Hotelling (1933a, 1933b)
- One re-invention was actually an important extension, the Karhunen-Loève transformation or Karhunen-Loève expansion (Loève 1955):
- We used PCA to write \(p\)-dimensional vectors as combinations of \(p\) orthogonal basis vectors, so \(\vec{x}_i = \sum_{j=1}^{p}{s_{ij} \vec{w}_j}\) with \(\vec{w}_j \cdot \vec{w}_k = \delta_{jk}\) and \(n^{-1}\sum_{i=1}^{n}{s_{ij} s_{ik}} = \lambda_j \delta_{jk}\)
- Suppose instead our data was \(n\) continuous functions, \(x_i(t)\), say with \(t \in [0,T]\). We’d like to find a set of basis functions \(\phi_j(t)\) where we can write \(x_i(t) = \sum_{j}{s_{ij} \phi_j(t)}\) (notice the score \(s_{ij}\) doesn’t change over time), with \(\int_{0}^{T}{\phi_j(t) \phi_k(t) dt} = \delta_{jk}\) (basis functions are orthogonal) and \(n^{-1}\sum_{i=1}^{n}{s_{ij} s_{ik}} = \lambda_j \delta_{jk}\) (scores on different basis functions are uncorrelated)
- K&L (independently) showed how to do this, and also showed how to make it work if we imagine all of the data \(x_i(t)\) as being randomly drawn from some distribution over functions
- The same trick works for functions defined over 2, 3, many, … dimensions
- In many fields, PCA is called “empirical orthogonal functions” (see e.g. Eshel (2012)) because of the connection to the KL transformation
- The random projection trick relies on the Johnson-Lindenstrauss (JL) theorem, which goes back to Johnson and Lindenstrauss (1984)
- The theorem is often called the JL lemma, because it was just a lemma (intermediate result) in their original work (Johnson and Lindenstrauss 1984, Lemma 1, pp. 190–191)
- Johnson and Lindenstrauss were actually solving a kind of interpolation problem, about how to extend functions defined on a limited set of points to functions defined everywhere on a continuous space, and the JL theorem was just a tool for them
- The slickest proof of the Johnson-Lindenstrauss theorem I’ve seen is Dasgupta and Gupta (2002)
- The version of the JL theorem I’ve quoted (for the constants) comes from Boucheron, Lugosi, and Massart (2013)
References
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press.
Dasgupta, Sanjoy, and Anupam Gupta. 2002. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures and Algorithms 22:60–65. https://doi.org/10.1002/rsa.10073.
Eshel, Gidon. 2012. Spatiotemporal Data Analysis. Princeton, New Jersey: Princeton University Press.
Hotelling, Harold. 1933a. “Analysis of a Complex of Statistical Variables into Principal Components [Part 1 of 2].” Journal of Educational Psychology 24:417–441. https://doi.org/10.1037/h0071325.
———. 1933b. “Analysis of a Complex of Statistical Variables into Principal Components [Part 2 of 2].” Journal of Educational Psychology 24:498–520. https://doi.org/10.1037/h0070888.
Johnson, William B., and Joram Lindenstrauss. 1984. “Extensions of Lipschitz Mappings into a Hilbert Space.” In Conference on Modern Analysis and Probability, edited by Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, 26:189–206. Contemporary Mathematics. Providence, Rhode Island: American Mathematical Society.
Loève, Michel. 1955. Probability Theory. 1st ed. New York: D. Van Nostrand Company.
Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.