class: center, middle, inverse, title-slide # Dimension Reduction ## Principal components analysis (PCA) ### July 6th, 2022 --- ## What is the goal of dimension reduction? We have `\(p\)` variables (columns) for `\(n\)` observations (rows) __BUT__ which variables are __interesting__? -- Can we find a smaller number of dimensions that captures the __interesting__ structure in the data? - Could examine all pairwise scatterplots of each variable - tedious, manual process - Tuesday: clustered variables based on correlation -- - Can we find a combination of the original `\(p\)` variables? -- __Dimension reduction__: - Focus on reducing the dimensionality of the feature space (i.e., number of columns), - While __retaining__ most of the information / __variability__ in a lower dimensional space (i.e., reducing the number of columns) --- ## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) <img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="110%" style="display: block; margin: auto;" /> --- ## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) - PCA explores the __covariance__ between variables, and combines variables into a smaller set of __uncorrelated__ variables called __principal components (PCs)__ - PCs are __weighted__, linear combinations of the original variables - Weights reveal how different variables are ___loaded___ into the PCs - We want a __small number of PCs__ to explain most of the information / variance in the data -- __First principal component__: `$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$` -- - `\(\phi_{j1}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)` - Weights are normalized `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)` - `\(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\)` is the __loading vector__ for PC1 -- - `\(Z_1\)` is a linear combination of the `\(p\)` variables that has the __largest variance__ --- ## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) __Second principal component__: `$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p$$` - `\(\phi_{j2}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)` - Weights are normalized `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)` - `\(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\)` is the __loading vector__ for PC2 - `\(Z_2\)` is a linear combination of the `\(p\)` variables that has the __largest variance__ - __Subject to constraint it is uncorrelated with `\(Z_1\)`__ -- We repeat this process to create `\(p\)` principal components --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="14-PCA_files/figure-html/unnamed-chunk-2-1.png" width="50%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="14-PCA_files/figure-html/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="14-PCA_files/figure-html/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="14-PCA_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="14-PCA_files/figure-html/unnamed-chunk-6-1.png" width="1008" style="display: block; margin: auto;" /> --- ## Searching for variance in orthogonal directions <img src="https://bradleyboehmke.github.io/HOML/15-pca_files/figure-html/create-pca-image-1.png" width="60%" style="display: block; margin: auto;" /> --- ## PCA: [__singular value decomposition (SVD)__](https://en.wikipedia.org/wiki/Singular_value_decomposition) $$ X = U D V^T $$ - Matrices `\(U\)` and `\(V\)` contain the left and right (respectively) __singular vectors of scaled matrix `\(X\)`__ - `\(D\)` is the diagonal matrix of the __singular values__ -- - SVD simplifies matrix-vector multiplication as __rotate, scale, and rotate again__ -- `\(V\)` is called the __loading matrix__ for `\(X\)` with `\(\phi_{j}\)` as columns, - `\(Z = X V\)` is the PC matrix -- BONUS __eigenvalue decomposition__ (aka spectral decomposition) - `\(V\)` are the __eigenvectors__ of `\(X^TX\)` (covariance matrix, `\(^T\)` means _transpose_) - `\(U\)` are the __eigenvectors__ of `\(XX^T\)` - The singular values (diagonal of `\(D\)`) are square roots of the __eigenvalues__ of `\(X^TX\)` or `\(XX^T\)` - Meaning that `\(Z = UD\)` --- ## Eigenvalues solve time travel? <img src="https://thumbs.gfycat.com/RealisticFragrantHerculesbeetle-size_restricted.gif" width="70%" style="display: block; margin: auto;" /> --- ## Probably not... but they guide dimension reduction We want to choose `\(p^* < p\)` such that we are explaining variation in the data -- Eigenvalues `\(\lambda_j\)` for `\(j \in 1, \dots, p\)` indicate __the variance explained by each component__ - `\(\sum_j^p \lambda_j = p\)`, meaning `\(\lambda_j \geq 1\)` indicates `\(\text{PC}j\)` contains at least one variable's worth in variability - `\(\lambda_j / p\)` equals proportion of variance explained by `\(\text{PC}j\)` - Arranged in descending order so that `\(\lambda_1\)` is largest eigenvalue and corresponds to PC1 -- - Can compute the cumulative proportion of variance explained (CVE) with `\(p^*\)` components: `$$\text{CVE}_{p^*} = \frac{\sum_j^{p*} \lambda_j}{p}$$` Can use [__scree plot__](https://en.wikipedia.org/wiki/Scree_plot) to plot eigenvalues and guide choice for `\(p^* <p\)` by looking for "elbow" (rapid to slow change) --- ## Example data: NFL teams summary Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021 ```r library(tidyverse) nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv") nfl_model_data <- nfl_teams_data %>% mutate(score_diff = points_scored - points_allowed) %>% # Only use rows with air yards filter(season >= 2006) %>% dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed, -season, -team) ``` --- ## NFL PCA example Use the `prcomp` function (uses SVD) for PCA on __centered__ and __scaled__ data ```r model_x <- as.matrix(dplyr::select(nfl_model_data, -score_diff)) *pca_nfl <- prcomp(model_x, center = TRUE, scale = TRUE) summary(pca_nfl) ``` ``` ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 ## Standard deviation 3.2060 3.1026 2.3257 2.04728 1.52301 1.40350 1.35714 1.26250 ## Proportion of Variance 0.2141 0.2006 0.1127 0.08732 0.04832 0.04104 0.03837 0.03321 ## Cumulative Proportion 0.2141 0.4147 0.5274 0.61468 0.66301 0.70405 0.74242 0.77562 ## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 ## Standard deviation 1.14773 1.09881 1.01200 0.95689 0.93513 0.85233 0.82315 0.77434 ## Proportion of Variance 0.02744 0.02515 0.02134 0.01908 0.01822 0.01513 0.01412 0.01249 ## Cumulative Proportion 0.80307 0.82822 0.84956 0.86863 0.88685 0.90199 0.91610 0.92859 ## PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 ## Standard deviation 0.65692 0.64016 0.60076 0.5796 0.56756 0.51349 0.47233 0.46768 ## Proportion of Variance 0.00899 0.00854 0.00752 0.0070 0.00671 0.00549 0.00465 0.00456 ## Cumulative Proportion 0.93758 0.94612 0.95364 0.9606 0.96735 0.97284 0.97749 0.98205 ## PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 ## Standard deviation 0.41284 0.35810 0.33597 0.32018 0.30357 0.26161 0.25289 0.22149 ## Proportion of Variance 0.00355 0.00267 0.00235 0.00214 0.00192 0.00143 0.00133 0.00102 ## Cumulative Proportion 0.98560 0.98827 0.99062 0.99276 0.99468 0.99610 0.99744 0.99846 ## PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 ## Standard deviation 0.13146 0.11459 0.10964 0.09672 0.08397 0.07385 0.05223 0.04814 ## Proportion of Variance 0.00036 0.00027 0.00025 0.00019 0.00015 0.00011 0.00006 0.00005 ## Cumulative Proportion 0.99882 0.99909 0.99934 0.99954 0.99968 0.99980 0.99985 0.99990 ## PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48 ## Standard deviation 0.03391 0.02901 0.02562 0.02290 0.02213 0.02139 0.01718 0.01670 ## Proportion of Variance 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 ## Cumulative Proportion 0.99993 0.99994 0.99996 0.99997 0.99998 0.99999 0.99999 1.00000 ``` --- ## Proportion of variance explained `prcomp$sdev` corresponds to the singular values, i.e., `\(\sqrt{\lambda_j}\)`, what is `pca_nfl$sdev^2 / ncol(model_x)`? -- Can use the `broom` package easily tidy `prcomp` summary for plotting .pull-left[ ```r library(broom) pca_nfl %>% * tidy(matrix = "eigenvalues") %>% ggplot(aes(x = PC, y = percent)) + geom_line() + geom_point() + * geom_hline(yintercept = 1 / ncol(model_x), color = "darkred", linetype = "dashed") + theme_bw() ``` - Add reference line at `\(1/p\)`, _why_? ] .pull-right[ <img src="14-PCA_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Display data in lower dimensions `prcomp$x` corresponds to the matrix of __principal component scores__, i.e., `\(Z = XV\)` .pull-left[ Can `augment` dataset with PC scores for plotting - Add team and season for context ```r pca_nfl %>% * augment(nfl_model_data) %>% bind_cols({ nfl_teams_data %>% filter(season >= 2006) %>% dplyr::select(season, team) }) %>% * unite("team_id", team:season, sep = "-", remove = FALSE) %>% ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = season)) + geom_text(aes(label = team_id), alpha = 0.9) + scale_color_gradient(low = "purple", high = "green") + theme_bw() + theme(legend.position = "bottom") ``` ] .pull-right[ <img src="14-PCA_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## What are the [loadings of these dimensions](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/)? `prcomp$rotation` corresponds to the __loading matrix__, i.e., `\(V\)` .pull-left[ ```r arrow_style <- arrow( angle = 20, ends = "first", type = "closed", length = grid::unit(8, "pt") ) library(ggrepel) pca_nfl %>% tidy(matrix = "rotation") %>% pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>% mutate(stat_type = ifelse(str_detect(column, "offense"), "offense", "defense")) %>% ggplot(aes(PC1, PC2)) + geom_segment(xend = 0, yend = 0, arrow = arrow_style) + geom_text_repel(aes(label = column, color = stat_type), size = 3) + scale_color_manual(values = c("darkred", "darkblue")) + theme_bw() + theme(legend.position = "bottom") ``` ] .pull-right[ <img src="14-PCA_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ## PCA analysis with `factoextra` Visualize the proportion of variance explained by each PC with [`factoextra`](http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization) ```r library(factoextra) fviz_eig(pca_nfl) ``` <img src="14-PCA_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" /> --- ## PCA analysis with `factoextra` Display observations with first two PC ```r fviz_pca_ind(pca_nfl) ``` <img src="14-PCA_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> --- ## PCA analysis with `factoextra` Projection of variables - angles are interpreted as correlations, where negative correlated values point to opposite sides of graph ```r fviz_pca_var(pca_nfl) ``` <img src="14-PCA_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" /> --- ## PCA analysis with `factoextra` __Biplot__ displays both the space of observations and the space of variables - Arrows represent the directions of the original variables ```r fviz_pca_biplot(pca_nfl) ``` <img src="14-PCA_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />