Dimension Reduction

class: center, middle, inverse, title-slide

# Dimension Reduction
## Principal components analysis (PCA)
### July 6th, 2022

---

## What is the goal of dimension reduction?

We have `\(p\)` variables (columns) for `\(n\)` observations (rows) __BUT__ which variables are __interesting__?

--

Can we find a smaller number of dimensions that captures the __interesting__ structure in the data?

- Could examine all pairwise scatterplots of each variable - tedious, manual process

- Tuesday: clustered variables based on correlation
  
--

- Can we find a combination of the original `\(p\)` variables?

--

__Dimension reduction__:

- Focus on reducing the dimensionality of the feature space (i.e., number of columns),

- While __retaining__ most of the information / __variability__ in a lower dimensional space (i.e., reducing the number of columns)

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="110%" style="display: block; margin: auto;" />

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

- PCA explores the __covariance__ between variables, and combines variables into a smaller set of __uncorrelated__ variables called __principal components (PCs)__

- PCs are __weighted__, linear combinations of the original variables

- Weights reveal how different variables are ___loaded___ into the PCs

- We want a __small number of PCs__ to explain most of the information / variance in the data

--

__First principal component__:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

- `\(\phi_{j1}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)`
  
  - Weights are normalized `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)`
  
  - `\(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\)` is the __loading vector__ for PC1

--
  
  - `\(Z_1\)` is a linear combination of the `\(p\)` variables that has the __largest variance__

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

__Second principal component__:

`$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p$$`

- `\(\phi_{j2}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)`
  
  - Weights are normalized `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)`
  
  - `\(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\)` is the __loading vector__ for PC2
  
  - `\(Z_2\)` is a linear combination of the `\(p\)` variables that has the __largest variance__
  
    - __Subject to constraint it is uncorrelated with `\(Z_1\)`__ 
    
--

We repeat this process to create `\(p\)` principal components

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="14-PCA_files/figure-html/unnamed-chunk-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="14-PCA_files/figure-html/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="14-PCA_files/figure-html/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="14-PCA_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="14-PCA_files/figure-html/unnamed-chunk-6-1.png" width="1008" style="display: block; margin: auto;" />

---

## Searching for variance in orthogonal directions

<img src="https://bradleyboehmke.github.io/HOML/15-pca_files/figure-html/create-pca-image-1.png" width="60%" style="display: block; margin: auto;" />

---

## PCA: [__singular value decomposition (SVD)__](https://en.wikipedia.org/wiki/Singular_value_decomposition)

$$
X = U D V^T
$$

- Matrices `\(U\)` and `\(V\)` contain the left and right (respectively) __singular vectors of scaled matrix `\(X\)`__

- `\(D\)` is the diagonal matrix of the __singular values__

--

- SVD simplifies matrix-vector multiplication as __rotate, scale, and rotate again__

--
`\(V\)` is called the __loading matrix__ for `\(X\)` with `\(\phi_{j}\)` as columns,

- `\(Z = X  V\)` is the PC matrix

--

BONUS __eigenvalue decomposition__ (aka spectral decomposition)

- `\(V\)` are the __eigenvectors__ of `\(X^TX\)` (covariance matrix, `\(^T\)` means _transpose_)

- `\(U\)` are the __eigenvectors__ of `\(XX^T\)`

- The singular values (diagonal of `\(D\)`) are square roots of the __eigenvalues__ of `\(X^TX\)` or `\(XX^T\)`

- Meaning that `\(Z = UD\)`

---

## Eigenvalues solve time travel?

<img src="https://thumbs.gfycat.com/RealisticFragrantHerculesbeetle-size_restricted.gif" width="70%" style="display: block; margin: auto;" />

---

## Probably not... but they guide dimension reduction

We want to choose `\(p^* < p\)` such that we are explaining variation in the data

--

Eigenvalues `\(\lambda_j\)` for `\(j \in 1, \dots, p\)` indicate __the variance explained by each component__

- `\(\sum_j^p \lambda_j = p\)`, meaning `\(\lambda_j \geq 1\)` indicates `\(\text{PC}j\)` contains at least one variable's worth in variability
  
  - `\(\lambda_j / p\)` equals proportion of variance explained by `\(\text{PC}j\)`
  
  - Arranged in descending order so that `\(\lambda_1\)` is largest eigenvalue and corresponds to PC1
  
--
  
  - Can compute the cumulative proportion of variance explained (CVE) with `\(p^*\)` components:
  
`$$\text{CVE}_{p^*} = \frac{\sum_j^{p*} \lambda_j}{p}$$`

Can use [__scree plot__](https://en.wikipedia.org/wiki/Scree_plot) to plot eigenvalues and guide choice for `\(p^* <p\)` by looking for "elbow" (rapid to slow change)

---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed, -season, -team)
```

---

## NFL PCA example

Use the `prcomp` function (uses SVD) for PCA on __centered__ and __scaled__ data

```r
model_x <- as.matrix(dplyr::select(nfl_model_data, -score_diff))
*pca_nfl <- prcomp(model_x, center = TRUE, scale = TRUE)
summary(pca_nfl)
```

```
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7     PC8
## Standard deviation     3.2060 3.1026 2.3257 2.04728 1.52301 1.40350 1.35714 1.26250
## Proportion of Variance 0.2141 0.2006 0.1127 0.08732 0.04832 0.04104 0.03837 0.03321
## Cumulative Proportion  0.2141 0.4147 0.5274 0.61468 0.66301 0.70405 0.74242 0.77562
##                            PC9    PC10    PC11    PC12    PC13    PC14    PC15    PC16
## Standard deviation     1.14773 1.09881 1.01200 0.95689 0.93513 0.85233 0.82315 0.77434
## Proportion of Variance 0.02744 0.02515 0.02134 0.01908 0.01822 0.01513 0.01412 0.01249
## Cumulative Proportion  0.80307 0.82822 0.84956 0.86863 0.88685 0.90199 0.91610 0.92859
##                           PC17    PC18    PC19   PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.65692 0.64016 0.60076 0.5796 0.56756 0.51349 0.47233 0.46768
## Proportion of Variance 0.00899 0.00854 0.00752 0.0070 0.00671 0.00549 0.00465 0.00456
## Cumulative Proportion  0.93758 0.94612 0.95364 0.9606 0.96735 0.97284 0.97749 0.98205
##                           PC25    PC26    PC27    PC28    PC29    PC30    PC31    PC32
## Standard deviation     0.41284 0.35810 0.33597 0.32018 0.30357 0.26161 0.25289 0.22149
## Proportion of Variance 0.00355 0.00267 0.00235 0.00214 0.00192 0.00143 0.00133 0.00102
## Cumulative Proportion  0.98560 0.98827 0.99062 0.99276 0.99468 0.99610 0.99744 0.99846
##                           PC33    PC34    PC35    PC36    PC37    PC38    PC39    PC40
## Standard deviation     0.13146 0.11459 0.10964 0.09672 0.08397 0.07385 0.05223 0.04814
## Proportion of Variance 0.00036 0.00027 0.00025 0.00019 0.00015 0.00011 0.00006 0.00005
## Cumulative Proportion  0.99882 0.99909 0.99934 0.99954 0.99968 0.99980 0.99985 0.99990
##                           PC41    PC42    PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.03391 0.02901 0.02562 0.02290 0.02213 0.02139 0.01718 0.01670
## Proportion of Variance 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99993 0.99994 0.99996 0.99997 0.99998 0.99999 0.99999 1.00000
```

---

## Proportion of variance explained

`prcomp$sdev` corresponds to the singular values, i.e., `\(\sqrt{\lambda_j}\)`, what is `pca_nfl$sdev^2 / ncol(model_x)`?

--

Can use the `broom` package easily tidy `prcomp` summary for plotting

.pull-left[

```r
library(broom)
pca_nfl %>%
* tidy(matrix = "eigenvalues") %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_line() + geom_point() +
* geom_hline(yintercept = 1 / ncol(model_x),
             color = "darkred", 
             linetype = "dashed") +
  theme_bw()
```

- Add reference line at `\(1/p\)`, _why_?

]
.pull-right[
<img src="14-PCA_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />

]

---

## Display data in lower dimensions

`prcomp$x` corresponds to the matrix of __principal component scores__, i.e., `\(Z = XV\)`

.pull-left[

Can `augment` dataset with PC scores for plotting

- Add team and season for context

```r
pca_nfl %>%
* augment(nfl_model_data) %>%
  bind_cols({
    nfl_teams_data %>% 
      filter(season >= 2006) %>%
      dplyr::select(season, team)
  }) %>%
* unite("team_id", team:season, sep = "-",
        remove = FALSE) %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, 
             color = season)) +
  geom_text(aes(label = team_id), alpha = 0.9) +
  scale_color_gradient(low = "purple", high = "green") +
  theme_bw() + theme(legend.position = "bottom")
```

]
.pull-right[

<img src="14-PCA_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## What are the [loadings of these dimensions](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/)?

`prcomp$rotation` corresponds to the __loading matrix__, i.e., `\(V\)`

.pull-left[

```r
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", 
  length = grid::unit(8, "pt")
)
library(ggrepel)
pca_nfl %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", 
              values_from = "value") %>%
  mutate(stat_type = ifelse(str_detect(column, "offense"),
                            "offense", "defense")) %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(aes(label = column, color = stat_type),
                  size = 3) +
  scale_color_manual(values = c("darkred", "darkblue")) +
  theme_bw() +
  theme(legend.position = "bottom")
```

]
.pull-right[

<img src="14-PCA_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## PCA analysis with `factoextra`

Visualize the proportion of variance explained by each PC with [`factoextra`](http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization)

```r
library(factoextra)
fviz_eig(pca_nfl)
```

<img src="14-PCA_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Display observations with first two PC

```r
fviz_pca_ind(pca_nfl)
```

<img src="14-PCA_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Projection of variables - angles are interpreted as correlations, where negative correlated values point to opposite sides of graph

```r
fviz_pca_var(pca_nfl)
```

<img src="14-PCA_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

__Biplot__ displays both the space of observations and the space of variables

- Arrows represent the directions of the original variables

```r
fviz_pca_biplot(pca_nfl)
```

<img src="14-PCA_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />