class: center, middle, inverse, title-slide

# Clustering

## K-means and hierarchical clustering

### Ron Yurko

### 06/09/2020

---

## Brace yourselves

<img src="https://media0.giphy.com/media/NsIwMll0rhfgpdQlzn/giphy.gif" width="60%" style="display: block; margin: auto;" />

---

## Into statistical learning with unsupervised learning

What is __statistical learning?__

--

[Preface of Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/):

> _refers to a set of tools for modeling and understanding complex datasets_

--

What is __unsupervised learning?__

--

We have `\(p\)` variables for `\(n\)` observations `\(X_1,\dots,X_n\)`, and for observation `\(i\)`:

`$$X_{i1},X_{i2},\ldots,X_{ip} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*

- _unsupervised_: none of the variables are __response__ variables, i.e., there is no labeled data

--

Think of unsupervised learning as __an extension of EDA...__

--

- `\(\Rightarrow\)` __there is no unique right answer!__

???

- Statistical learning is the process of ascertaining (discovering) associations between groups of variables

- unsupervised learning - where the goal is to discover interesting things about the data

---

## What is clustering (aka cluster analysis)?

--

[ISLR 10.3](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf):

> _very broad set of techniques for finding subgroups, or clusters, in a dataset_

--

__Goals__:

- observations __within__ clusters are __more similar__ to each other,

- observations __in different__ clusters are __more different__ from each other

--

How do we define __distance / dissimilarity__ between observations?

--

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

--

__Units matter!__

--

- one variable may _dominate_ the others when computing Euclidean distance because its range is much larger

- can standardize each variable / column of the dataset to have mean 0 and standard deviation 1 with `scale()`

- __but we may value the separation in that variable!__ (so just be careful...)

???

It is the partitioning of data into homogeneous subgroups

Goal: define clusters for which the within-cluster variation is relatively small, i.e. observations within clusters are similar to each other

---

## What's the clustering objective?

- `\(C_1, \dots, C_K\)` are _sets_ containing the indices of the observations in each of the `\(K\)` clusters

  - if observation `\(i\)` is in cluster `\(k\)`, then `\(i \in C_k\)`

--

- We want to minimize the __within-cluster variation__ `\(W(C_k)\)` for each cluster `\(C_k\)` and solve:

`$$\underset{C_1, \dots, C_K}{\text{minimize}} \Big\{ \sum_{k=1}^K W(C_k) \Big\}$$`

- Can define using the __squared Euclidean distance__ ( `\(|C_k| = n_k =\)` # observations in cluster `\(k\)`)

`$$W(C_k) = \frac{1}{|C_k|}\sum_{i,j \in C_k} d(x_i, x_j)^2$$`

--

__So how can we solve this?__

---

## K-means clustering

<img src="http://www.stat.cmu.edu/~pfreeman/alg_10-1.png" width="60%" style="display: block; margin: auto;" />

- __No universally accepted metric that leads one to conclude that a particular value of `\(K\)` is optimal__

- Results will change from run to run unless you explicitly set the random number seed

- To mitigate randomness, set the `nstart` argument in the function call to a large number (e.g., 50), and select the best result (smallest value of the __within-cluster variation__)

???

Step 1 of the algorithm randomly assigns observations to clusters.
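---

## A quick sketch: seeds and `nstart`

A minimal sketch of the two bullets on the previous slide; the simulated `toy_data` columns below are made up purely for illustration:


```r
set.seed(2020)  # fix the random number seed so the random initializations are reproducible
toy_data <- data.frame(x = rnorm(100), y = rnorm(100))  # hypothetical two-column data

# nstart = 50 runs K-means from 50 random initializations and keeps the solution
# with the smallest total within-cluster variation
toy_kmeans <- kmeans(toy_data, centers = 3, nstart = 50)

toy_kmeans$tot.withinss  # total within-cluster sum of squares of the best run
```

- With `nstart = 1` and no seed, the cluster assignments (and `tot.withinss`) can differ from run to run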
---

## K-means clustering

<img src="img/islr_kmeans.png" width="50%" style="display: block; margin: auto;" />

---

## New dataset - NBA 2020 player stats per 100 possessions

Created a dataset of NBA player statistics per 100 possessions using the [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html) package, which scrapes tables from [basketball-reference.com](https://www.basketball-reference.com/leagues/NBA_2020_per_poss.html)


```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/clustering/nba_2020_player_per_pos_stats.csv")
head(nba_pos_stats)
```

```
## # A tibble: 6 x 31
##   player pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa
##   <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1 Steve… C        26 OKC      58    58  1564   8.2  13.8     0.591   0     0.1
## 2 Bam A… PF       22 MIA      65    65  2235   8.9  15.7     0.567   0     0.3
## 3 LaMar… C        34 SAS      53    53  1754  10.7  21.7     0.493   1.7   4.3
## 4 Nicke… SG       21 NOP      41     0   501   7.1  20.9     0.339   3.7  10.8
## 5 Grays… SG       24 MEM      30     0   498   7.4  16.5     0.449   3.1   8.5
## 6 Jarre… C        21 BRK      64    58  1647   7.7  11.9     0.646   0     0.1
## # … with 19 more variables: x3ppercent <dbl>, x2p <dbl>, x2pa <dbl>, x2ppercent <dbl>,
## #   ft <dbl>, fta <dbl>, ft_2 <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>,
## #   stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>, pts <dbl>, x <lgl>, ortg <dbl>,
## #   drtg <dbl>
```

---

## Some initial cleaning...

- Each row is at the player-team level (`player`-`tm`)

- Players who switched teams during the season also have a row with `tm == "TOT"` (their totals across teams)


```r
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
```

- Create a new dataset by stacking `tot_players` with the rows for players that did not switch teams

- Use the [`bind_rows()` function](https://dplyr.tidyverse.org/reference/bind.html)


```r
nba_player_stats <- nba_pos_stats %>%
* filter(!(player %in% tot_players$player)) %>%
* bind_rows(tot_players)
```

---

## Make a cutoff with the ECDF

View the ECDF of the minutes played (`mp`) variable to make a cutoff:

.pull-left[


```r
nba_player_stats %>%
  ggplot(aes(x = mp)) +
  stat_ecdf() +
  theme_bw() +
  labs(x = "Minutes played",
       y = "Proportion of NBA players") +
* geom_vline(xintercept = 250,
*            linetype = "dashed",
*            color = "darkred")
```

- Arbitrary cutoff: at least 250 minutes played

- Keeps around 75% of players

- Use [`geom_vline`](https://ggplot2.tidyverse.org/reference/geom_abline.html) to draw a vertical line


```r
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 250)
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use the `kmeans()` function, __but must provide the number of clusters `\(K\)`__


```r
init_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         2, nstart = 30)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(init_nba_kmeans$cluster))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]
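---

## What does `kmeans()` return?

A quick sketch of a few components of the fitted object from the previous slide (these are standard pieces of the object `kmeans()` returns; nothing new is computed here):


```r
init_nba_kmeans$centers       # coordinates of the two cluster centers (means of x3pa and trb)

init_nba_kmeans$size          # number of players assigned to each cluster

init_nba_kmeans$tot.withinss  # total within-cluster sum of squares of the returned solution
```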
---

## K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use `\(K = 3\)` instead? (__cluster labels are arbitrary__)


```r
upd_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         3, nstart = 30)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(upd_nba_kmeans$cluster))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---

## Hierarchical clustering

<img src="http://www.stat.cmu.edu/~pfreeman/alg_10-2.png" width="60%" style="display: block; margin: auto;" />

- __"bottom-up"__ or __agglomerative__ clustering

- Limitation? Every cluster is nested within a larger cluster (hence the name "hierarchical")

- Advantage? Does __NOT__ require choosing the number of clusters `\(K\)` up front

---

## Hierarchical clustering example

.pull-left[

- Use the `hclust` function __with a distance matrix__, e.g. `dist()` computes Euclidean distances

- Can view the __dendrogram__

- Check out the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) package!


```r
nba_complete_hclust <- 
* hclust(dist(
*   dplyr::select(nba_filtered_stats,
*                 x3pa, trb)),
*   method = "complete")
plot(nba_complete_hclust)
```

- Dendrograms can be difficult to read with many observations
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Hierarchical clustering example

.pull-left[

- Use the `cutree` function to return cluster labels (specify either the number of clusters `k` or a height `h`)


```r
*hc_player_clusters <- 
*  cutree(nba_complete_hclust,
*         k = 3)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_hc_clusters = 
           as.factor(hc_player_clusters))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_hc_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-9-1.png" width="504" />
]

---

## Need to choose a linkage function!

- __Agglomerative__ clustering: clusters are built up piece by piece by __linking them together__

--

<img src="http://www.stat.cmu.edu/~pfreeman/linkage.png" width="60%" style="display: block; margin: auto;" />

- There is no unique algorithm for how that linking is done! Average and complete are the most popular (a quick comparison sketch follows on the next slide)
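---

## Sketch: comparing linkage functions

A rough sketch of how you might compare linkage functions on the same distance matrix; `nba_dist`, `nba_average_hclust`, and `nba_single_hclust` are new object names introduced only for this sketch:


```r
nba_dist <- dist(dplyr::select(nba_filtered_stats, x3pa, trb))

# same data, different linkage functions
nba_average_hclust <- hclust(nba_dist, method = "average")
nba_single_hclust <- hclust(nba_dist, method = "single")

# cross-tabulate three-cluster labels from average linkage against complete linkage
table("Complete" = cutree(nba_complete_hclust, k = 3),
      "Average" = cutree(nba_average_hclust, k = 3))
```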
---

## We're not done yet...

<img src="https://media1.tenor.com/images/bfb8e3e881ac4413ae12b61c4b982d60/tenor.gif?itemid=5140031" width="60%" style="display: block; margin: auto;" />

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

<img src="img/minimax_example_viz.png" width="60%" style="display: block; margin: auto;" />

- Dendrogram interpretation: each point is `\(\leq h\)` in dissimilarity to the prototype (center) of its cluster

- __Centers are chosen among the observations themselves__

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

---

## Minimax linkage example

.pull-left[

- Use the `protoclust()` function to apply the clustering


```r
library(protoclust)
*nba_minimax <- protoclust(dist(
*  dplyr::select(nba_filtered_stats,
*                x3pa, trb)))
plot(nba_minimax)
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to cut the tree (specify either the number of clusters `k` or a height `h`)


```r
*minimax_player_clusters <- 
*  protocut(nba_minimax, k = 3)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(minimax_clusters = 
           as.factor(minimax_player_clusters$cl))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = minimax_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

- Want to check out the prototypes for the three clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)


```r
minimax_player_clusters$protos
```

```
## [1] 395  21 112
```

- View these player rows:


```r
dplyr::select(nba_filtered_stats[minimax_player_clusters$protos, ], 
              player, pos, age, x3pa, trb)
```

```
## # A tibble: 3 x 5
##   player      pos     age  x3pa   trb
##   <chr>       <chr> <dbl> <dbl> <dbl>
## 1 Noah Vonleh C-PF     24   1.9  15.7
## 2 RJ Barrett  SG       19   5.6   8  
## 3 Paul George SF       29  12.8   9.3
```

---

## Obviously can improve this...

- What if we account for several variables? How do the clustering results relate to the position column in the dataset?


```r
nba_multidim_clust <- protoclust(dist(dplyr::select(nba_filtered_stats,
                                                    x2ppercent, x3pa, ast,
                                                    stl, trb)))
nba_multidim_clust_cut <- protocut(nba_multidim_clust, k = 5)
table("Positions" = nba_filtered_stats$pos,
      "Clusters" = nba_multidim_clust_cut$cl)
```

--

.pull-left[

```
##          Clusters
## Positions  1  2  3  4  5
##     C     40 32  5  1  0
##     C-PF   2  0  0  0  0
##     PF    15 38 14  9  0
##     PG     0  6 12 28 27
##     SF     0 29 16 22  1
##     SF-PF   0  0  0  1  0
##     SF-SG   0  1  0  1  0
##     SG      0 14 34 45  4
```
]

.pull-right[

- Can see positions tend to fall within particular clusters...

- But we might want to explore __soft__ assignments instead (tomorrow!)

- _What's a good way to visually compare the two sets of labels?_
]
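---

## Sketch: remember that units matter

A rough sketch tying back to the earlier __Units matter!__ point: `x2ppercent` lives on a 0-1 scale while the other variables are per-100-possession counts, so we could `scale()` the columns before computing distances (the `nba_scaled*` objects are new names used only for this sketch):


```r
# standardize each column to have mean 0 and standard deviation 1
nba_scaled <- scale(dplyr::select(nba_filtered_stats,
                                  x2ppercent, x3pa, ast, stl, trb))

# repeat the minimax clustering on the standardized variables
nba_scaled_minimax <- protoclust(dist(nba_scaled))
nba_scaled_cut <- protocut(nba_scaled_minimax, k = 5)
table("Positions" = nba_filtered_stats$pos,
      "Clusters" = nba_scaled_cut$cl)
```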
---

## And one more thing...

We can use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

.pull-left[


```r
*library(GGally)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(full_minimax_clusters = 
           as.factor(nba_multidim_clust_cut$cl))
*ggpairs(nba_filtered_stats,
*        columns =
*          c("x2ppercent", "x3pa",
*            "ast", "stl", "trb"),
*        aes(color = full_minimax_clusters))
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-16-1.png" width="504" />
]
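---

## And the prototypes for those five clusters?

As in the two-variable example, `protocut()` returns the prototype indices, so a quick sketch (using only functions already shown above) of the most representative player in each of the five clusters:


```r
# rows of the prototype players for the five-cluster minimax solution
dplyr::select(nba_filtered_stats[nba_multidim_clust_cut$protos, ],
              player, pos, x2ppercent, x3pa, ast, stl, trb)
```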