class: center, middle, inverse, title-slide

# Dimension Reduction

## Principal components analysis (PCA)

### June 29th, 2021

---

## What is the goal of dimension reduction?

We have `\(p\)` variables (columns) for `\(n\)` observations (rows) __BUT__ which variables are __interesting__?

--

Can we find a smaller number of dimensions that captures the __interesting__ structure in the data?

- Could examine all pairwise scatterplots of each variable
  - tedious, manual process
- Last week: clustered variables based on correlation

--

- Can we find a combination of the original `\(p\)` variables?

--

__Dimension reduction__:

- Focus on reducing the dimensionality of the feature space (i.e., the number of columns),
- While __retaining__ most of the information / __variability__ in this lower dimensional space

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="110%" style="display: block; margin: auto;" />

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

- PCA explores the __covariance__ between variables, and combines variables into a smaller set of __uncorrelated__ variables called __principal components (PCs)__
- PCs are __weighted__ linear combinations of the original variables
  - Weights reveal how different variables are ___loaded___ into the PCs
- We want a __small number of PCs__ to explain most of the information / variance in the data

--

__First principal component__:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

- `\(\phi_{j1}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)`
- Weights are normalized: `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)`
- `\(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\)` is the __loading vector__ for PC1

--

- `\(Z_1\)` is a linear combination of the `\(p\)` variables
that has the __largest variance__

---

## [Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)

__Second principal component__:

`$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p$$`

- `\(\phi_{j2}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)`
- Weights are normalized: `\(\sum_{j=1}^p \phi_{j2}^2 = 1\)`
- `\(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\)` is the __loading vector__ for PC2
- `\(Z_2\)` is a linear combination of the `\(p\)` variables that has the __largest variance__
  - __Subject to the constraint that it is uncorrelated with `\(Z_1\)`__

--

We repeat this process to create `\(p\)` principal components

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="16-PCA_files/figure-html/unnamed-chunk-2-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="16-PCA_files/figure-html/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="16-PCA_files/figure-html/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="16-PCA_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" />

---

## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions

<img src="16-PCA_files/figure-html/unnamed-chunk-6-1.png" width="1008" style="display: block; margin: auto;" />

---

## Searching for variance in orthogonal directions

<img src="https://bradleyboehmke.github.io/HOML/15-pca_files/figure-html/create-pca-image-1.png" width="60%" style="display: block; margin: auto;" />

---

## PCA: [__singular value decomposition (SVD)__](https://en.wikipedia.org/wiki/Singular_value_decomposition)

$$
X = U D V^T
$$

- Matrices `\(U\)` and `\(V\)` contain the left and right (respectively) __singular vectors of the scaled matrix `\(X\)`__
- `\(D\)` is the diagonal matrix of the __singular values__

--

- SVD simplifies matrix-vector multiplication as __rotate, scale, and rotate again__

--

`\(V\)` is called the __loading matrix__ for `\(X\)`, with the `\(\phi_{j}\)` as its columns

- `\(Z = XV\)` is the PC matrix

--

BONUS: __eigenvalue decomposition__ (aka spectral decomposition)

- `\(V\)` contains the __eigenvectors__ of `\(X^TX\)` (proportional to the covariance matrix; `\(^T\)` means _transpose_)
- `\(U\)` contains the __eigenvectors__ of `\(XX^T\)`
- The singular values (diagonal of `\(D\)`) are the square roots of the __eigenvalues__ of `\(X^TX\)` or `\(XX^T\)`
- Meaning that `\(Z = UD\)`

---

## Eigenvalues solve time travel?

<img src="https://thumbs.gfycat.com/RealisticFragrantHerculesbeetle-size_restricted.gif" width="70%" style="display: block; margin: auto;" />

---

## Probably not...
but they guide dimension reduction

We want to choose `\(p^* < p\)` such that we are explaining variation in the data

--

Eigenvalues `\(\lambda_j\)` for `\(j \in 1, \dots, p\)` indicate __the variance explained by each component__

- `\(\sum_j^p \lambda_j = p\)`, meaning `\(\lambda_j \geq 1\)` indicates that `\(\text{PC}j\)` contains at least one variable's worth of variability
- `\(\lambda_j / p\)` equals the proportion of variance explained by `\(\text{PC}j\)`
- Arranged in descending order, so that `\(\lambda_1\)` is the largest eigenvalue and corresponds to PC1

--

- Can compute the cumulative proportion of variance explained (CVE) with `\(p^*\)` components:

`$$\text{CVE}_{p^*} = \frac{\sum_j^{p^*} \lambda_j}{p}$$`

Can use a [__scree plot__](https://en.wikipedia.org/wiki/Scree_plot) to plot the eigenvalues and guide the choice of `\(p^* < p\)` by looking for an "elbow" (rapid to slow change)

---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2020

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/regression_projects/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed,
                -season, -team)
```

---

## NFL PCA example

Use the `prcomp` function (uses SVD) for PCA on __centered__ and __scaled__ data

```r
model_x <- as.matrix(dplyr::select(nfl_model_data, -score_diff))
*pca_nfl <- prcomp(model_x, center = TRUE, scale = TRUE)
summary(pca_nfl)
```

```
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7     PC8     PC9
## Standard deviation     3.2119 3.1086 2.3406 2.03961 1.5384 1.41243 1.33352 1.26070 1.15021
## Proportion of Variance 0.2149 0.2013 0.1141 0.08667 0.0493 0.04156 0.03705 0.03311 0.02756
## Cumulative Proportion  0.2149 0.4162 0.5304 0.61704 0.6663 0.70791 0.74495 0.77807 0.80563
##                           PC10    PC11    PC12    PC13   PC14    PC15    PC16    PC17    PC18
## Standard deviation     1.10316 1.01999 0.95873 0.93244 0.8314 0.82639 0.77427 0.65754 0.60286
## Proportion of Variance 0.02535 0.02167 0.01915 0.01811 0.0144 0.01423 0.01249 0.00901 0.00757
## Cumulative Proportion  0.83098 0.85265 0.87180 0.88992 0.9043 0.91855 0.93104 0.94004 0.94761
##                           PC19    PC20    PC21    PC22    PC23    PC24    PC25    PC26    PC27
## Standard deviation     0.58982 0.57864 0.53934 0.49402 0.47675 0.45810 0.41473 0.36075 0.33619
## Proportion of Variance 0.00725 0.00698 0.00606 0.00508 0.00474 0.00437 0.00358 0.00271 0.00235
## Cumulative Proportion  0.95486 0.96184 0.96790 0.97298 0.97772 0.98209 0.98567 0.98838 0.99074
##                           PC28   PC29    PC30    PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.32284 0.2939 0.26156 0.25057 0.22371 0.13145 0.11477 0.10849 0.09386
## Proportion of Variance 0.00217 0.0018 0.00143 0.00131 0.00104 0.00036 0.00027 0.00025 0.00018
## Cumulative Proportion  0.99291 0.9947 0.99614 0.99744 0.99849 0.99885 0.99912 0.99937 0.99955
##                           PC37    PC38    PC39    PC40    PC41    PC42    PC43    PC44    PC45
## Standard deviation     0.08375 0.07230 0.05156 0.04715 0.03327 0.02877 0.02594 0.02201 0.02187
## Proportion of Variance 0.00015 0.00011 0.00006 0.00005 0.00002 0.00002 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99970 0.99980 0.99986 0.99991 0.99993 0.99995 0.99996 0.99997 0.99998
##                           PC46    PC47    PC48
## Standard deviation     0.02035 0.01669 0.01628
## Proportion of Variance 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99999 0.99999 1.00000
```

---

## Proportion of variance explained

`prcomp$sdev` holds the standard deviations of the PCs, i.e., `\(\sqrt{\lambda_j}\)`, so what is `pca_nfl$sdev^2 / ncol(model_x)`?
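Answering that question directly: squaring `sdev` recovers the eigenvalues `\(\lambda_j\)`, so dividing by `\(p\)` gives the proportion of variance explained by each PC. A minimal sketch of this check, using the built-in `USArrests` data as a stand-in (since the NFL matrix requires downloading the CSV above):

```r
# PCA on centered and scaled data: sdev^2 are the eigenvalues lambda_j
# of the correlation matrix (USArrests is only a stand-in dataset here)
pca_demo <- prcomp(USArrests, center = TRUE, scale = TRUE)
prop_var <- pca_demo$sdev^2 / ncol(USArrests)
prop_var       # proportion of variance explained by each PC
sum(prop_var)  # the lambda_j sum to p, so the proportions sum to 1
```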
--

Can use the `broom` package to easily tidy the `prcomp` summary for plotting

.pull-left[

```r
library(broom)
pca_nfl %>%
* tidy(matrix = "eigenvalues") %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_line() +
  geom_point() +
* geom_hline(yintercept = 1 / ncol(model_x), color = "darkred", linetype = "dashed") +
  theme_bw()
```

- Add reference line at `\(1/p\)`, _why_?

]

.pull-right[

<img src="16-PCA_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />

]

---

## Display data in lower dimensions

`prcomp$x` corresponds to the matrix of __principal component scores__, i.e., `\(Z = XV\)`

.pull-left[

Can `augment` the dataset with the PC scores for plotting

- Add team and season for context

```r
pca_nfl %>%
* augment(nfl_model_data) %>%
  bind_cols({
    nfl_teams_data %>%
      filter(season >= 2006) %>%
      dplyr::select(season, team)
  }) %>%
* unite("team_id", team:season, sep = "-", remove = FALSE) %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = season)) +
  geom_text(aes(label = team_id), alpha = 0.9) +
  scale_color_gradient(low = "purple", high = "green") +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="16-PCA_files/figure-html/unnamed-chunk-12-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## What are the [loadings of these dimensions](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/)?
`prcomp$rotation` corresponds to the __loading matrix__, i.e., `\(V\)`

.pull-left[

```r
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed",
  length = grid::unit(8, "pt")
)
library(ggrepel)
pca_nfl %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC",
              values_from = "value") %>%
  mutate(stat_type = ifelse(str_detect(column, "offense"),
                            "offense", "defense")) %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
  geom_text_repel(aes(label = column, color = stat_type),
                  size = 3) +
  scale_color_manual(values = c("darkred", "darkblue")) +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="16-PCA_files/figure-html/unnamed-chunk-13-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## PCA analysis with `factoextra`

Visualize the proportion of variance explained by each PC with [`factoextra`](http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization)

```r
library(factoextra)
fviz_eig(pca_nfl)
```

<img src="16-PCA_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Display the observations using the first two PCs

```r
fviz_pca_ind(pca_nfl)
```

<img src="16-PCA_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

Projection of the variables - angles are interpreted as correlations, so negatively correlated variables point to opposite sides of the graph

```r
fviz_pca_var(pca_nfl)
```

<img src="16-PCA_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" />

---

## PCA analysis with `factoextra`

__Biplot__ displays both the space of observations and the space of variables

- Arrows represent the directions of the original variables

```r
fviz_pca_biplot(pca_nfl)
```

<img src="16-PCA_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />
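
---

## BONUS: verifying the SVD identities in R

The SVD slide claims `\(Z = XV = UD\)` and that `prcomp` is built on the SVD. A minimal sketch checking those identities (not from the original deck; uses the built-in `USArrests` data as a stand-in for the NFL matrix):

```r
# Center and scale, as prcomp does with center = TRUE, scale = TRUE
x_scaled <- scale(USArrests)
sv <- svd(x_scaled)                     # X = U D V^T
z_xv <- x_scaled %*% sv$v               # Z = XV
z_ud <- sv$u %*% diag(sv$d)             # Z = UD
max(abs(z_xv - z_ud))                   # identical up to floating point
pca_check <- prcomp(USArrests, center = TRUE, scale = TRUE)
max(abs(abs(z_xv) - abs(pca_check$x)))  # matches prcomp scores up to sign
all.equal(pca_check$sdev, sv$d / sqrt(nrow(x_scaled) - 1))  # sdev rescales d
```

Note the sign caveat: singular vectors are only identified up to sign, so individual PCs from `svd` and `prcomp` can be flipped relative to each other.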