Linear Dimension Reduction

36-467/667

25 February 2020

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \]

Previously…

What the data looks like

Finding the “principal” component

Projections

How well does the projection approximate the original?

Do it for one vector first:

\[\begin{eqnarray} {\|\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\|}^2 & =& \left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\cdot\left(\vec{x_i} - (\vec{w}\cdot\vec{x_i})\vec{w}\right)\\ & = & \vec{x_i}\cdot\vec{x_i} -\vec{x_i}\cdot (\vec{w}\cdot\vec{x_i})\vec{w}\\ \nonumber & & - (\vec{w}\cdot\vec{x_i})\vec{w}\cdot\vec{x_i} + (\vec{w}\cdot\vec{x_i})\vec{w}\cdot(\vec{w}\cdot\vec{x_i})\vec{w}\\ & = & {\|\vec{x_i}\|}^2 -2(\vec{w}\cdot\vec{x_i})^2 + (\vec{w}\cdot\vec{x_i})^2\vec{w}\cdot\vec{w}\\ & = & \| \vec{x_i}\|^2 - (\vec{w}\cdot\vec{x_i})^2 \end{eqnarray}\]
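A minimal numeric check of this identity in R (the vectors and seed below are made up for illustration, not part of the lecture data):

# Check that ||x - (w.x)w||^2 = ||x||^2 - (w.x)^2 when w is a unit vector
set.seed(1)
x <- rnorm(5)                            # an arbitrary data vector
w <- rnorm(5); w <- w / sqrt(sum(w^2))   # an arbitrary direction, normalized to length 1
lhs <- sum((x - sum(w * x) * w)^2)       # squared length of the residual after projecting
rhs <- sum(x^2) - sum(w * x)^2           # right-hand side of the identity
all.equal(lhs, rhs)                      # TRUE, up to floating point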

How well does the projection approximate the original?

\[\begin{eqnarray} MSE(\vec{w}) & = & \frac{1}{n}\sum_{i=1}^{n}{\|\vec{x_i}\|^2 -{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & \frac{1}{n}\left(\sum_{i=1}^{n}{\|\vec{x_i}\|^2} -\sum_{i=1}^{n}{(\vec{w}\cdot\vec{x_i})^2}\right) \end{eqnarray}\]

Minimizing MSE is maximizing variance

\[\begin{eqnarray} L(w) & = & \frac{1}{n}\sum_{i=1}^{n}{{(\vec{w}\cdot\vec{x_i})}^2}\\ & = & {\left(\frac{1}{n}\sum_{i=1}^{n}{\vec{x_i}\cdot\vec{w}}\right)}^2 + \SampleVar{\vec{w}\cdot\vec{x_i}} \end{eqnarray}\]

(\(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\))

But \[ \frac{1}{n}\sum_{i=1}^{n}{\vec{x_i} \cdot \vec{w}} = 0 \] (Why?)
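A quick numeric check, on made-up data whose columns have been centered:

# Each column of x has mean zero, so any linear combination of the columns does too
set.seed(2)
x <- scale(matrix(rnorm(100 * 3), ncol = 3), center = TRUE, scale = FALSE)
w <- c(1, -2, 0.5); w <- w / sqrt(sum(w^2))
mean(x %*% w)   # numerically zero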

Minimizing MSE is maximizing variance

so \[ L(\vec{w}) = \SampleVar{\vec{w}\cdot\vec{x_i}} \]

Minimizing MSE is maximizing variance

The direction which gives us the best approximation of the data is the direction along which the projections have the greatest variance

OK, how do we find this magic direction?

Matrix form: all lengths of projections \(=\mathbf{x}\mathbf{w}\) \([n\times 1]\)

\[\begin{eqnarray} \SampleVar{\vec{w}\cdot\vec{x_i}} & = & \frac{1}{n}\sum_{i=1}^{n}{{\left(\vec{x_i} \cdot \vec{w}\right)}^2}\\ & = & \frac{1}{n}{\left(\X \w\right)}^{T} \left(\X \w\right)\\ & = & \frac{1}{n} \w^T \X^T \X \w\\ & = & \w^T \frac{\X^T \X}{n} \w\\ \end{eqnarray}\]

Fact: \(\V \equiv \frac{\X^T \X}{n} =\) sample covariance matrix of the vectors
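A small sketch checking the matrix form against the direct calculation, again on made-up centered data (note the divisor is \(n\), not the \(n-1\) used by R's cov()):

# Check that w^T (X^T X / n) w equals the average squared projection
set.seed(3)
n <- 200
x <- scale(matrix(rnorm(n * 4), ncol = 4), center = TRUE, scale = FALSE)
w <- rnorm(4); w <- w / sqrt(sum(w^2))
v <- t(x) %*% x / n                    # sample covariance matrix, dividing by n
direct <- mean((x %*% w)^2)            # (1/n) sum of squared projections
matrix.form <- drop(t(w) %*% v %*% w)  # w^T V w
all.equal(direct, matrix.form)         # TRUE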

OK, how do we find this magic direction?

\[\begin{eqnarray} \mathcal{L}(\w,\lambda) & \equiv & \w^T\V\w - \lambda(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \lambda} & = & -(\w^T \w -1)\\ \frac{\partial \mathcal{L}}{\partial \w} & = & 2\V\w - 2\lambda\w \end{eqnarray}\] \[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

The magic direction is an eigenvector

\[\begin{eqnarray} \w^T \w & = & 1\\ \V \w & = & \lambda \w \end{eqnarray}\]

THIS IS AN EIGENVALUE/EIGENVECTOR EQUATION!

At the solution, \[ \SampleVar{\vec{w}\cdot\vec{x_i}} = \w^T \V \w = \w^T \lambda \w = \lambda \] so the maximum comes from the leading eigenvector of \(\V\)
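In R, this is just an eigenproblem. Here is a minimal sketch on made-up data, comparing eigen() on the sample covariance matrix to prcomp(); prcomp() divides by \(n-1\) rather than \(n\), which rescales the eigenvalues but leaves the eigenvectors alone, and the sign of an eigenvector is arbitrary:

# Leading eigenvector of the sample covariance matrix = first principal component
set.seed(4)
n <- 500
z <- matrix(rnorm(n * 3), ncol = 3)
a <- matrix(c(2, 1, 0,  0, 1, 0.5,  0, 0, 0.2), nrow = 3)  # mixing matrix, to correlate columns
x <- scale(z %*% a, center = TRUE, scale = FALSE)          # centered data matrix
v <- t(x) %*% x / n                                        # sample covariance, dividing by n
eig <- eigen(v, symmetric = TRUE)
w1 <- eig$vectors[, 1]                 # leading eigenvector
pc1 <- prcomp(x)$rotation[, 1]         # prcomp's first PC
abs(sum(w1 * pc1))                     # essentially 1: same direction, up to an arbitrary sign
eig$values[1]                          # variance along that direction ...
mean((x %*% w1)^2)                     # ... matches the leading eigenvalue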

The first principal component (PC1) and scores on the first principal component

\[\begin{eqnarray} \text{Lead eigenvector of}~ \V & = & 1^{\mathrm{st}} \text{principal component}\\ & = & \text{Direction of maximum variance}\\ & = & \text{Best 1D approximation to the data} \end{eqnarray}\]

Multiple principal components

About the sample covariance matrix

Some properties of the eigenvalues

Some properties of the PC vectors

The eigendecomposition of \(\V\)

Some properties of the PC scores

Define the \([n\times p]\) matrix of the scores of all data points on all PCs: \[ \S \equiv \X \w \]

EXERCISE: Show that \(\Var{\text{scores}} = \mathbf{\Lambda}\)

  1. \(\V \equiv \frac{\X^T \X}{n} = \w \mathbf{\Lambda} \w^T\)
  2. \(\w^{-1} = \w^T\) (why?)

Solution to the exercise

\[\begin{eqnarray} \Var{\text{scores}} & = & \frac{1}{n} \S^T \S\\ & = & \frac{1}{n} (\X\w)^T(\X\w)\\ & = & \frac{1}{n}\w^T \X^T \X \w\\ & = & \w^T \V\w\\ & = & \w^T (\w \mathbf{\Lambda} \w^T) \w\\ & = & (\w^T \w) \mathbf{\Lambda} (\w^T\w)\\ & = & \mathbf{\Lambda} \end{eqnarray}\]
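The same check numerically, on made-up data (again forming the covariance matrix with a divisor of \(n\) to match the notation above):

# The PC scores are uncorrelated, and their variances are the eigenvalues of V
set.seed(5)
n <- 300
x <- scale(matrix(rnorm(n * 4), ncol = 4) %*% matrix(rnorm(16), 4, 4),
           center = TRUE, scale = FALSE)
v <- t(x) %*% x / n
eig <- eigen(v, symmetric = TRUE)
s <- x %*% eig$vectors                        # scores of all points on all PCs
all.equal(t(s) %*% s / n, diag(eig$values))   # TRUE: Var[scores] = Lambda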

Some properties of PCA as a whole

Low-rank approximation

Another way to think about PCA

Some data-mining-y applications

Latent semantic indexing (Deerwester et al. 1990; Landauer and Dumais 1997)

LSI for the New York Times stories, in R

dim(nyt.nice.frame)
## [1]  102 4431
nyt.pca <- prcomp(nyt.nice.frame)
str(nyt.pca)
## List of 5
##  $ sdev    : num [1:102] 0.141 0.136 0.131 0.124 0.12 ...
##  $ rotation: num [1:4431, 1:102] 0.02701 0.04073 -0.00657 -0.02276 0.03522 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
##   .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:4431] 0.0145 0.00612 0.00274 0.0123 0.01255 ...
##   ..- attr(*, "names")= chr [1:4431] "X." "X.d" "X.nd" "X.s" ...
##  $ scale   : logi FALSE
##  $ x       : num [1:102, 1:102] -0.231 0.0358 -0.0906 -0.0513 -0.1322 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:102] "art.1" "art.2" "art.3" "art.4" ...
##   .. ..$ : chr [1:102] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

LSI for the New York Times stories

First component:

nyt.latent.sem <- nyt.pca$rotation
signif(sort(nyt.latent.sem[,1],decreasing=TRUE)[1:30],2)
##       music        trio     theater   orchestra   composers       opera 
##       0.110       0.084       0.083       0.067       0.059       0.058 
##    theaters           m    festival        east     program           y 
##       0.055       0.054       0.051       0.049       0.048       0.048 
##      jersey     players   committee      sunday        june     concert 
##       0.047       0.047       0.046       0.045       0.045       0.045 
##    symphony       organ     matinee   misstated instruments           p 
##       0.044       0.044       0.043       0.042       0.041       0.041 
##         X.d       april      samuel        jazz     pianist     society 
##       0.041       0.040       0.040       0.039       0.038       0.038
signif(sort(nyt.latent.sem[,1],decreasing=FALSE)[1:30],2)
##       she       her        ms         i      said    mother    cooper        my 
##    -0.260    -0.240    -0.200    -0.150    -0.130    -0.110    -0.100    -0.094 
##  painting   process paintings        im        he       mrs        me  gagosian 
##    -0.088    -0.071    -0.070    -0.068    -0.065    -0.065    -0.063    -0.062 
##       was   picasso     image sculpture      baby   artists      work    photos 
##    -0.058    -0.057    -0.056    -0.056    -0.055    -0.055    -0.054    -0.051 
##       you    nature    studio       out      says      like 
##    -0.051    -0.050    -0.050    -0.050    -0.050    -0.049

LSI for the New York Times stories

Second component:

signif(sort(nyt.latent.sem[,2],decreasing=TRUE)[1:30],2)
##         art      museum      images     artists   donations     museums 
##       0.150       0.120       0.095       0.092       0.075       0.073 
##    painting         tax   paintings   sculpture     gallery  sculptures 
##       0.073       0.070       0.065       0.060       0.055       0.051 
##     painted       white    patterns      artist      nature     service 
##       0.050       0.050       0.047       0.047       0.046       0.046 
##  decorative        feet     digital      statue       color    computer 
##       0.043       0.043       0.043       0.042       0.042       0.041 
##       paris         war collections     diamond       stone     dealers 
##       0.041       0.041       0.041       0.041       0.041       0.040
signif(sort(nyt.latent.sem[,2],decreasing=FALSE)[1:30],2)
##          her          she      theater        opera           ms            i 
##       -0.220       -0.220       -0.160       -0.130       -0.130       -0.083 
##         hour   production         sang     festival        music      musical 
##       -0.081       -0.075       -0.075       -0.074       -0.070       -0.070 
##        songs        vocal    orchestra           la      singing      matinee 
##       -0.068       -0.067       -0.067       -0.065       -0.065       -0.061 
##  performance         band       awards    composers         says           my 
##       -0.061       -0.060       -0.058       -0.058       -0.058       -0.056 
##           im         play     broadway       singer       cooper performances 
##       -0.056       -0.056       -0.055       -0.052       -0.051       -0.051

Visualization

story.classes <- c(rep("art", times=length(art.stories)),
                   rep("music", times=length(music.stories)))
plot(nyt.pca$x[,1:2],
     pch=ifelse(story.classes=="music","m","a"),
     col=ifelse(story.classes=="music","blue","red"))

Multidimensional scaling (MDS)

Netflix and recommender systems in general

Interpreting PCA results

PCA is exploratory analysis, not statistical inference

PCA and distance preservation

Distance-preserving projections

The random projection trick

The random projection trick

Random projections are nearly distance-preserving with high probability
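A minimal illustration of the idea (not the formal Johnson-Lindenstrauss guarantee): project made-up high-dimensional points onto \(k\) random Gaussian directions, scaled by \(1/\sqrt{k}\), and compare pairwise distances before and after.

# Random projection approximately preserves pairwise distances
set.seed(6)
n <- 50; p <- 1000; k <- 100
x <- matrix(rnorm(n * p), nrow = n)               # n points in p dimensions
r <- matrix(rnorm(p * k), nrow = p) / sqrt(k)     # random projection matrix
z <- x %*% r                                      # the same points in k dimensions
ratio <- as.vector(dist(z)) / as.vector(dist(x))  # projected / original distances
summary(ratio)                                    # concentrated around 1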

Summing up

Backup: Gory details of PCA in matrix form

Backup: The gory details

\[\begin{eqnarray} MSE(\w) & = & \frac{1}{n} \tr{((\X-\X\w\w^T)(\X^T - \w\w^T \X^T))}\\ & = & \frac{1}{n} \tr{(\X \X^T - \X\w\w^T\X^T - \X\w\w^T\X^T + \X\w\w^T\w\w^T\X^T)}\\ & = & \frac{1}{n}\left(\tr{(\X\X^T)} - 2\tr{(\X\w\w^T\X^T)} + \tr{(\X\w\w^T\X^T)}\right)\\ & = & \frac{1}{n}\tr{(\X\X^T)} - \frac{1}{n}\tr{(\X\w\w^T\X^T)} \end{eqnarray}\]

so maximize \(\frac{1}{n}\tr{(\X\w\w^T\X^T)}\)

Backup: The gory details

“trace is cyclic” so \[ \tr{(\X\w\w^T\X^T)} = \tr{(\X^T\X\w\w^T)} = \tr{(\w^T\X^T\X\w)} \] so we want to maximize \[ \tr{\left(\w^T \frac{\X^T \X}{n}\w\right)} \] under the constraint \[ \w^T \w = \mathbf{I} \]

This is the same form we saw before, so it has the same sort of solution: each column of \(\w\) must be an eigenvector of \(\V\).
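A small sketch of the multi-component version, on made-up data, using prcomp()'s rotation matrix as \(\w\) and checking that the constraint holds and that the reconstruction error falls as columns are added:

# With q columns of w, the rank-q approximation of x is x %*% w[,1:q] %*% t(w[,1:q])
set.seed(7)
n <- 200; p <- 5
x <- scale(matrix(rnorm(n * p), ncol = p) %*% matrix(rnorm(p * p), p, p),
           center = TRUE, scale = FALSE)
w <- prcomp(x)$rotation                  # columns are eigenvectors of X^T X / n
round(crossprod(w), 12)                  # = identity: the constraint w^T w = I
mse <- sapply(1:p, function(q) {
  xq <- x %*% w[, 1:q, drop = FALSE] %*% t(w[, 1:q, drop = FALSE])
  mean(rowSums((x - xq)^2))              # average squared residual length
})
mse                                      # decreasing in q, and (numerically) 0 at q = p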

Backup: Distinct eigenvalues of a symmetric matrix have orthogonal eigenvectors

For any symmetric matrix with eigenvalues \(\lambda_1, \ldots \lambda_p\) and eigenvectors \(\vec{w}_1, \ldots \vec{w}_p\), if \(\lambda_i \neq \lambda_j\), then \(\vec{w}_i \perp \vec{w}_j\)

Proof: Remember that \(\vec{w}_i \perp \vec{w}_j\) iff \(\vec{w}_i \cdot \vec{w}_j = 0\), so let’s get at that inner product. Now, for any two vectors \(\vec{a}\), \(\vec{b}\) and square matrix \(\mathbf{c}\) (not necessarily symmetric), \[ \vec{a} \cdot (\mathbf{c} \vec{b}) = (\mathbf{c}^T \vec{a}) \cdot \vec{b} \] (To see this, write \(\vec{a}\) and \(\vec{b}\) as \(p\times 1\) matrices, so we’ve got \(\vec{a}^T \mathbf{c} \vec{b} = \vec{a}^T (\mathbf{c}^T)^T \vec{b} = (\mathbf{c}^T \vec{a})^T \vec{b}\).) Now let’s apply this to our eigenvectors: \[\begin{eqnarray} \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V^T \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\V \vec{w}_j) & = & (\V \vec{w}_i) \cdot \vec{w}_j\\ \vec{w}_i \cdot (\lambda_j \vec{w}_j) & = & (\lambda_i \vec{w}_i) \cdot \vec{w}_j\\ \lambda_j (\vec{w}_i \cdot \vec{w}_j) & = & \lambda_i (\vec{w}_i \cdot \vec{w}_j) \end{eqnarray}\] (The second line uses the assumption that \(\V\) is symmetric, the third line uses the fact that \(\vec{w}_i\) and \(\vec{w}_j\) are both eigenvectors of \(\V\), the last line uses linearity of inner products.)

Since, by assumption, \(\lambda_i \neq \lambda_j\), the only way the two sides of the equation can balance is if \(\vec{w}_i \cdot \vec{w}_j = 0\), as was to be shown.
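A quick numeric illustration, with a made-up symmetric matrix:

# eigen() on a symmetric matrix returns orthonormal eigenvectors
set.seed(8)
a <- matrix(rnorm(16), 4, 4)
v <- (a + t(a)) / 2        # symmetrize
w <- eigen(v)$vectors
round(crossprod(w), 12)    # the identity matrix: the eigenvectors are orthogonal unit vectors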

Backup: Distinct eigenvectors with equal eigenvalues can be made or chosen to be orthogonal

Backup: The Lagrange multiplier trick (1)

Backup: The Lagrange multiplier trick (2)

Backup: The Lagrange multiplier trick (3)

\[ \max_{w, \lambda}{\mathcal{L}(w,\lambda)} \]

Backup: The Lagrange multiplier trick (4)

Backup: The Lagrange multiplier trick (5)

Backup Example: The axis of historical complexity (1)

Turchin et al. (2018)

9 variables measuring social complexity, over time, for dozens of locations across the world

Dates from -9600 to 1900

summary(soccomp[,complexities], digits=2)
##      PolPop       PolTerr          CapPop        levels      government  
##  Min.   :1.4   Min.   :-0.22   Min.   :1.4   Min.   :0.0   Min.   :0.00  
##  1st Qu.:4.2   1st Qu.: 3.65   1st Qu.:3.5   1st Qu.:1.8   1st Qu.:0.24  
##  Median :6.0   Median : 5.18   Median :4.3   Median :3.0   Median :0.62  
##  Mean   :5.5   Mean   : 4.78   Mean   :4.2   Mean   :2.9   Mean   :0.55  
##  3rd Qu.:6.8   3rd Qu.: 5.97   3rd Qu.:5.1   3rd Qu.:4.0   3rd Qu.:0.86  
##  Max.   :8.5   Max.   : 7.40   Max.   :6.3   Max.   :6.6   Max.   :1.00  
##     infrastr       writing         texts          money    
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.34   1st Qu.:0.26   1st Qu.:0.10   1st Qu.:1.8  
##  Median :0.75   Median :0.82   Median :0.93   Median :4.0  
##  Mean   :0.64   Mean   :0.65   Mean   :0.63   Mean   :3.4  
##  3rd Qu.:0.90   3rd Qu.:0.86   3rd Qu.:0.97   3rd Qu.:5.0  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :6.0

Backup Example: The axis of historical complexity (2)

soccomp.pca <- prcomp(soccomp[,complexities], scale=TRUE)

What are the parts of the return value?

str(soccomp.pca)
## List of 5
##  $ sdev    : num [1:9] 2.634 0.733 0.646 0.581 0.481 ...
##  $ rotation: num [1:9, 1:9] 0.351 0.32 0.339 0.341 0.332 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 5.515 4.779 4.229 2.923 0.552 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ scale   : Named num [1:9] 1.59 1.561 1.112 1.449 0.325 ...
##   ..- attr(*, "names")= chr [1:9] "PolPop" "PolTerr" "CapPop" "levels" ...
##  $ x       : num [1:414, 1:9] -4.35 -4.24 -4.11 -3.67 -3.51 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Backup Example: The axis of historical complexity (3)

# R^2 for k components = sum(first k eigenvalues)/sum(all eigenvalues)
# cumsum(x) returns the cumulative sums along the vector x
plot(cumsum(soccomp.pca$sdev^2)/sum(soccomp.pca$sdev^2),
     xlab="Number of components used", ylab=expression(R^2),
     ylim=c(0,1))
# Add the lines indicating the desired levels
abline(h=0.75, lty="dashed")
abline(h=0.90, lty="dotted")

One principal component already keeps more than 75% of the variance (3 components keep just under 90%)

Backup Example: The axis of historical complexity (4)

# Plot the coordinates/weights/loadings of PC1
  # But suppress the usual labels on the horizontal axis
plot(soccomp.pca$rotation[,1], xlab="", main="PCs of the complexity measures",
     xaxt="n", ylab="PC weight", ylim=c(-1,1), type="l")
# Label the horizontal axis by the names of the variables
  # The las option turns the axis labels on their side (so they don't overlap
  # each other)
axis(side=1, at=1:9, labels=colnames(soccomp)[complexities], las=2,
     cex.axis=0.5)
# Now add PC2 and PC3, in distinct colors
lines(soccomp.pca$rotation[,2], col="red")
lines(soccomp.pca$rotation[,3], col="blue")
# Legend
legend("bottomright", legend=c("PC1", "PC2", "PC3"), lty="solid",
       col=c("black", "red", "blue"))
# Guide-to-the-eye horizontal line at 0
abline(h=0, col="grey")

Backup Example: The axis of historical complexity (5)

# Add scores to data frame
soccomp$PC1 <- soccomp.pca$x[,1]
# Plot vs. time (all areas)
with(soccomp, plot(PC1 ~ Time))

Backup Example: The axis of historical complexity (6)

Backup Example: The axis of historical complexity (7)

plot(soccomp.pca$x[,1], soccomp.pca$x[,2], xlab="Score on PC1",
     ylab="Score on PC2")

MOAR READING

References

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press.

Dasgupta, Sanjoy, and Anupam Gupta. 2002. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures and Algorithms 22:60–65. https://doi.org/10.1002/rsa.10073.

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41:391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

Eshel, Gidon. 2012. Spatiotemporal Data Analysis. Princeton, New Jersey: Princeton University Press.

Hotelling, Harold. 1933a. “Analysis of a Complex of Statistical Variables into Principal Components [Part 1 of 2].” Journal of Educational Psychology 24:417–441. https://doi.org/10.1037/h0071325.

———. 1933b. “Analysis of a Complex of Statistical Variables into Principal Components [Part 2 of 2].” Journal of Educational Psychology 24:498–520. https://doi.org/10.1037/h0070888.

Johnson, William B., and Joram Lindenstrauss. 1984. “Extensions of Lipschitz Mappings into a Hilbert Space.” In Conference on Modern Analysis and Probability, edited by Richard Beals, Anatole Beck, Alexandra Bellow, and Arshag Hajian, 26:189–206. Contemporary Mathematics. Providence, Rhode Island: American Mathematical Society.

Landauer, Thomas K., and Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.” Psychological Review 104:211–40. http://lsa.colorado.edu/papers/plato/plato.annote.html.

Loève, Michel. 1955. Probability Theory. 1st ed. New York: D. Van Nostrand Company.

Pearson, Karl. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine 2 (series 6):559–72. https://doi.org/10.1080/14786440109462720.

Turchin, Peter, Thomas E. Currie, Harvey Whitehouse, Pieter François, Kevin Feeney, Daniel Mullins, Daniel Hoyer, et al. 2018. “Quantitative Historical Analysis Uncovers a Single Dimension of Complexity That Structures Global Variation in Human Social Organization.” Proceedings of the National Academy of Sciences (USA) 115:E144–E151. https://doi.org/10.1073/pnas.1708800115.