class: center, middle, inverse, title-slide

# Unsupervised Learning
## Principal components analysis
### Ron Yurko
### 06/25/2020

---

## Principal components analysis (PCA)

We have a collection of `\(p\)` variables for each of `\(n\)` objects `\(X_1,\dots,X_n\)`, and for observation `\(i\)`:

`$$X_{i,1},...,X_{i,p} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*

Let's build up intuition for PCA by starting off with a picture with `\(p = 2\)`:

<img src="http://www.stat.cmu.edu/~pfreeman/pca.png" width="60%" style="display: block; margin: auto;" />

---

## What does the PCA algorithm do?

.pull-left[

- Moves the coordinate system origin to the __centroid of the point cloud__

- Rotates the axes so that "PC Axis 1" lies along the direction of greatest variation (the solid line)

- "PC Axis 2" lies __orthogonal__ to "PC Axis 1"

- If `\(p > 2\)`, then "PC Axis 2" would lie along the direction of greatest residual variation, and "PC Axis 3" would then be defined orthogonal to axes 1 and 2, etc.

]

.pull-right[

<img src="http://www.stat.cmu.edu/~pfreeman/pca.png" width="90%" style="display: block; margin: auto;" />

]

---

## PCA as a dimension reduction tool

Imagine that we have a `\(p\)`-dimensional point cloud that we wish to visualize

--

- Can use PCA "to find a low-dimensional representation of the data that captures as much of the [statistical] information [present] as possible."

--

For instance, if the data in our `\(p\)`-dimensional space happen to all lie on a two-dimensional __plane__ embedded within that space, __PCA would uncover this structure and we would subsequently be able to visualize the data using two-dimensional scatter plots__

- Another possibility is to perform linear regression using a subset of principal components rather than the original data frame

  - "Principal components regression" is covered in ISLR 6.3

--

The main limitation of PCA is that it is a __linear algorithm__: it projects data onto __hyperplanes__. If the data inhabit a curved surface within the `\(p\)`-dimensional space, PCA would not be the optimal algorithm by which to explore the data.

- Instead we'd use *nonlinear* techniques like diffusion map or local linear embedding

---

## PCA: Algorithm

.pull-left[

The __score__ or coordinate of the `\(i^{\rm th}\)` observation along __principal component (PC)__ `\(j\)` is

`$$Z_{ij} = \sum_{k=1}^p X_{ik} \phi_{kj}$$`

- `\(k\)` represents the `\(k^{\rm th}\)` variable

- PCA determines the __rotation (or loading)__ matrix `\(\phi\)`, which is normalized such that `\(\sum_{k=1}^p \phi_{kj}^2 = 1\)` (where `\(j\)` ranges from 1 to `\(p\)`)

Note that since we are "mixing axes" when doing PCA, it is generally best to standardize (or scale) the data frame before applying the algorithm.

A quick numerical check of the score formula follows on the next slide.

]

.pull-right[

<img src="http://www.stat.cmu.edu/~pfreeman/pc1_proj.png" width="70%" style="display: block; margin: auto;" />

<img src="http://www.stat.cmu.edu/~pfreeman/pc1.png" width="70%" style="display: block; margin: auto;" />

]
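---

## PCA: Algorithm, a quick check in R

A minimal sketch (not part of the original slides) connecting the score formula `\(Z_{ij} = \sum_k X_{ik} \phi_{kj}\)` to code, using a small simulated two-variable data set; the data and object names here are made up for illustration:

```r
# simulate a small 2-D point cloud (hypothetical example data)
set.seed(2020)
toy_x <- cbind(x1 = rnorm(100), x2 = rnorm(100))
toy_x[, 2] <- toy_x[, 1] + 0.5 * toy_x[, 2]

# PCA on the standardized data
toy_pca <- prcomp(toy_x, center = TRUE, scale. = TRUE)

# recompute the scores by hand: Z = (scaled X) %*% phi
toy_scaled <- scale(toy_x)
toy_scores <- toy_scaled %*% toy_pca$rotation

# should match prcomp's scores (up to numerical precision)
all.equal(toy_scores, toy_pca$x)
```

`prcomp()` returns the loading matrix `\(\phi\)` as `$rotation` and the score matrix `\(Z\)` as `$x`.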
---

## PCA: [__singular value decomposition (SVD)__](https://en.wikipedia.org/wiki/Singular_value_decomposition)

$$
X = U D V^T
$$

- Matrices `\(U\)` and `\(V\)` contain the left and right (respectively) __singular vectors of the scaled matrix `\(X\)`__

- `\(D\)` is the diagonal matrix of the __singular values__

--

- SVD simplifies matrix-vector multiplication as __rotate, scale, and rotate again__

--

`\(V\)` is called the __loading matrix__ for `\(X\)`, where the PC matrix is `\(Z = X \cdot V\)`

- First column of `\(Z\)` is the first principal component, second column is the second PC, etc.

--

BONUS: the __eigenvalue decomposition__

- The columns of `\(V\)` are the __eigenvectors__ of `\(X^TX\)` (proportional to the covariance matrix; `\(^T\)` means _transpose_)

- The columns of `\(U\)` are the __eigenvectors__ of `\(XX^T\)`

- The singular values (diagonal of `\(D\)`) are the square roots of the __eigenvalues__ of `\(X^TX\)` or `\(XX^T\)`

--

- Meaning that `\(Z = UD\)`, and that `\(\phi\)` from the last slide is the matrix of eigenvectors `\(V\)` (a quick check of this in R follows on the next slide)

Note that PCA will really only work if `\(n\)`, the number of rows / observations, is greater than `\(p\)`, the number of variables.
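---

## Checking the SVD relationships in R

A minimal sketch (not part of the original slides) verifying `\(Z = XV = UD\)` with base R's `svd()`; the simulated data and object names are made up for illustration:

```r
# simulate and standardize a small data matrix (hypothetical example data)
set.seed(2020)
sim_x <- matrix(rnorm(100 * 4), ncol = 4)
sim_scaled <- scale(sim_x)

# singular value decomposition: X = U D V^T
sim_svd <- svd(sim_scaled)

# PC scores two ways: Z = X V and Z = U D
z_xv <- sim_scaled %*% sim_svd$v
z_ud <- sim_svd$u %*% diag(sim_svd$d)
all.equal(z_xv, z_ud)

# prcomp() uses SVD internally, so its scores match up to column sign flips
sim_pca <- prcomp(sim_scaled)
all.equal(abs(unname(sim_pca$x)), abs(z_xv))
```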
---

## Eigenvalues solve time travel?

<img src="https://thumbs.gfycat.com/RealisticFragrantHerculesbeetle-size_restricted.gif" width="70%" style="display: block; margin: auto;" />

---

## Genetics PCA example

<img src="https://pbs.twimg.com/media/DZLcO_KX4AErSrW.jpg" width="60%" style="display: block; margin: auto;" />

---

## PCA: Choosing the Number of Dimensions to Retain

Well, that's the million-dollar question now, isn't it? And guess what: there is no simple answer to this! ("Embrace the Ambiguity" indeed...)

The ideal would be the following: choose `\(M < p\)` such that `\(x_{ij} = \sum_{m=1}^p z_{im}\phi_{jm} \approx \sum_{m=1}^M z_{im}\phi_{jm}\)`

In other words, we don't lose much ability to reconstruct the input data `\(X\)` by dropping the last `\(p-M\)` principal components, which we assume represent random variation in the data (i.e., noise).

One convention: sum up the amount of variance explained by the first `\(M\)` PCs, and adopt the smallest value of `\(M\)` such that 90% or 95% or 99%, etc., of the overall variance is "explained" (a short numerical sketch of this rule is at the end of these slides).

--

.pull-left[

Heuristic: look for an "elbow" in the __PC scree plot__,

- where the percentage of variance explained transitions from falling off rapidly to slowly

]

.pull-right[

<img src="http://www.stat.cmu.edu/~pfreeman/propvar.png" width="100%" style="display: block; margin: auto;" />

]

---

## Note from ISLR and back to an example

"In practice, we tend to look at the first few principal components in order to find interesting patterns in the data. If no interesting patterns are found in the first few principal components, then further principal components are unlikely to be of interest."

--

Back to the `Hitters` data:

```r
library(ISLR)
library(tidyverse)
data("Hitters")
Hitters <- as_tibble(Hitters) %>%
  filter(!is.na(Salary))
model_x <- model.matrix(Salary ~ ., Hitters)[, -1]
```

Use the `prcomp()` function for PCA (it uses SVD, while `princomp()` uses a spectral decomposition approach)

```r
pca_hitters <- prcomp(model_x, scale = TRUE)
```

---

## PCA analysis with `factoextra`

Visualize the proportion of variance explained by each PC with [`factoextra`](http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization)

```r
library(factoextra)
fviz_eig(pca_hitters)
```

<img src="14-pca_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Display the observations using the first two PCs

```r
fviz_pca_ind(pca_hitters)
```

<img src="14-pca_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Projection of the variables

- angles are interpreted as correlations, where negatively correlated variables point to opposite sides of the graph (compare these angles with the `pca_hitters$rotation` values...)

```r
fviz_pca_var(pca_hitters)
```

<img src="14-pca_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

A __biplot__ displays both the space of observations and the space of variables

- where the arrows represent the directions of the original variables

```r
fviz_pca_biplot(pca_hitters)
```

<img src="14-pca_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />
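---

## How many PCs to retain for `Hitters`?

Tying the scree plot back to the earlier "choosing `\(M\)`" convention: a minimal sketch (not part of the original slides) that computes the cumulative proportion of variance explained from `pca_hitters` and picks the smallest `\(M\)` reaching a 90% cut-off (one of the thresholds mentioned earlier):

```r
# proportion of variance explained by each PC
# (the PC variances are the squared standard deviations returned by prcomp)
prop_var <- pca_hitters$sdev^2 / sum(pca_hitters$sdev^2)

# cumulative proportion of variance explained
cum_var <- cumsum(prop_var)

# smallest number of PCs explaining at least 90% of the overall variance
which(cum_var >= 0.90)[1]
```

`summary(pca_hitters)` reports the same quantities in its "Proportion of Variance" and "Cumulative Proportion" rows.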