class: center, middle, inverse, title-slide

# Clustering
## Hierarchical clustering
### June 14th, 2021

---

## Prep dataset from before

Created dataset of NBA player statistics per 100 possessions using [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html)

```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/clustering/nba_2021_player_per_pos_stats.csv")
# Find rows for players indicating a full season worth of stats
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
# Stack this dataset with players that played on just one team
nba_player_stats <- nba_pos_stats %>%
  filter(!(player %in% tot_players$player)) %>%
  bind_rows(tot_players)
# Filter to only players with at least 250 minutes played
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 250)
head(nba_filtered_stats)
```

```
## # A tibble: 6 x 31
##   player   pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa x3ppercent   x2p
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 Preciou… PF       21 MIA      61     4   737   8.4  15.4     0.544   0     0.1      0       8.4
## 2 Steven … C        27 NOP      58    58  1605   5.6   9.2     0.614   0     0.1      0       5.6
## 3 Bam Ade… C        23 MIA      64    64  2143  10.6  18.5     0.57    0     0.2      0.25   10.5
## 4 Nickeil… SG       22 NOP      46    13  1007   9.1  21.8     0.419   3.6  10.4      0.347   5.5
## 5 Grayson… SG       25 MEM      50    38  1259   6.6  15.7     0.418   4.1  10.4      0.391   2.5
## 6 Kyle An… PF       27 MEM      69    69  1887   7.8  16.7     0.468   2.4   6.6      0.36    5.4
## # … with 17 more variables: x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>, ftpercent <dbl>,
## #   orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>,
## #   pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Let's work from the bottom-up...

.pull-left[

- __Review__: We have `\(p\)` variables for `\(n\)` observations `\(x_1,\dots,x_n\)`

- Compute the __distance / dissimilarity__ between observations

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

__What are the distances between these NBA players using `x3pa` and `trb`?__

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

---

## Compute the distance matrix using `dist()`

.pull-left[

- Compute pairwise Euclidean distance

```r
*player_dist <- dist(dplyr::select(nba_filtered_stats, x3pa, trb))
```

- Returns an object of `dist` class - i.e. not a matrix

- Can convert to a matrix, then set the row and column names:

```r
*player_dist_matrix <- as.matrix(player_dist)
rownames(player_dist_matrix) <- nba_filtered_stats$player
colnames(player_dist_matrix) <- nba_filtered_stats$player
head(player_dist_matrix[1:3, 1:3])
```

```
##                  Precious Achiuwa Steven Adams Bam Adebayo
## Precious Achiuwa        0.0000000      1.40000   0.7071068
## Steven Adams            1.4000000      0.00000   2.1023796
## Bam Adebayo             0.7071068      2.10238   0.0000000
```

]

.pull-right[

Can convert to a long table for plotting with `ggplot`:

```r
long_dist_matrix <- as_tibble(player_dist_matrix) %>%
  mutate(player1 = rownames(player_dist_matrix)) %>%
* pivot_longer(cols = -player1,
*              names_to = "player2",
*              values_to = "distance")
long_dist_matrix %>%
  ggplot(aes(x = player1, y = player2, fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange", high = "darkblue")
```

]
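We can sanity-check one entry of this matrix against the Euclidean distance formula by hand (a quick sketch; the `[1, 2]` entry compares the first two players):

```r
# Distance between the first two players, computed directly from the formula
two_players <- as.matrix(dplyr::select(nba_filtered_stats, x3pa, trb))[1:2, ]
sqrt(sum((two_players[1, ] - two_players[2, ])^2))  # should match player_dist_matrix[1, 2]
```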
<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" /> --- ### Code interlude: arrange your heatmap with [`seriation`](https://github.com/mhahsler/seriation) .pull-left[ ```r library(seriation) *player_dist_seriate <- seriate(player_dist) *player_order <- get_order(player_dist_seriate) player_names_order <- nba_filtered_stats$player[player_order] long_dist_matrix %>% mutate(player1 = fct_relevel(player1, player_names_order), player2 = fct_relevel(player2, player_names_order)) %>% ggplot(aes(x = player1, y = player2, fill = distance)) + geom_tile() + theme_bw() + theme(axis.text = element_blank(), axis.ticks = element_blank(), legend.position = "bottom") + scale_fill_gradient(low = "darkorange", high = "darkblue") ``` ] .pull-right[ <img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) Let's pretend all `\(n\)` observations are in their own cluster -- - Step 1: Compute the pairwise dissimilarities between each cluster - e.g., distance matrix on previous slides -- - Step 2: Identify the pair of clusters that are __least dissimilar__ -- - Step 3: Fuse these two clusters into a new cluster! -- - __Repeat Steps 1 to 3 until all observations are in the same cluster__ -- __"Bottom-up"__, agglomerative clustering that forms a __tree / hierarchy__ of merging No mention of any randomness! No mention of the number of clusters `\(K\)`! --- ## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) .pull-left[ Start with all observations in their own cluster - Step 1: Compute the pairwise dissimilarities between each cluster - Step 2: Identify the pair of clusters that are __least dissimilar__ - Step 3: Fuse these two clusters into a new cluster! - __Repeat Steps 1 to 3 until all observations are in the same cluster__ ] .pull-right[ <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Clusters.svg/250px-Clusters.svg.png" width="70%" style="display: block; margin: auto;" /> ] --- ## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) .pull-left[ Start with all observations in their own cluster - Step 1: Compute the pairwise dissimilarities between each cluster - Step 2: Identify the pair of clusters that are __least dissimilar__ - Step 3: Fuse these two clusters into a new cluster! - __Repeat Steps 1 to 3 until all observations are in the same cluster__ ] .pull-right[ <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Hierarchical_clustering_simple_diagram.svg/418px-Hierarchical_clustering_simple_diagram.svg.png" width="85%" style="display: block; margin: auto;" /> Forms a __dendrogram__ (typically displayed from bottom-up) ] --- ## How do we define dissimilarity between clusters? We know how to compute distance / dissimilarity between two observations __But how do we handle clusters?__ - Dissimilarity between a cluster and an observation, or between two clusters -- We need to choose a __linkage function__! Clusters are built up by __linking them together__ -- Compute all pairwise dissimilarities between observations in cluster 1 with observations in cluster 2 i.e. 
Compute the distance matrix between observations, `\(d(x_i, x_j)\)` for `\(i \in C_1\)` and `\(j \in C_2\)` -- - __Complete linkage__: Use the __maximum__ value of these dissimilarities: `\(\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)\)` -- - __Single linkage__: Use the __minimum__ value: `\(\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)\)` -- - __Average linkage__: Use the __average__ value: `\(\frac{1}{|C_1| \cdot |C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)\)` -- Define dissimilarity between two clusters __based on our initial dissimilarity matrix between observations__ --- ## Complete linkage example .pull-left[ - Use the `hclust` function with a `dist()` object - Uses `complete` linkage by default ```r nba_complete_hclust <- * hclust(player_dist, method = "complete") ``` - Need to use `cutree()` to return cluster labels: ```r nba_filtered_stats %>% mutate(player_clusters = * as.factor(cutree(nba_complete_hclust, * k = 4))) %>% ggplot(aes(x = x3pa, y = trb, color = player_clusters)) + geom_point() + ggthemes::scale_color_colorblind() + theme_bw() + theme(legend.position = "bottom") ``` ] .pull-right[ Returns _compact_ clusters, similar to `\(K\)`-means <img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## What are we cutting? Dendrograms .pull-left[ Use the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/index.html) package (instead of `plot()`) ```r library(ggdendro) *ggdendrogram(nba_complete_hclust, theme_dendro = FALSE, * labels = FALSE, leaf_labels = FALSE) + labs(y = "Dissimilarity between clusters") + theme_bw() + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank()) ``` - Each __leaf__ is one observation - __Height of branch indicates dissimilarity between clusters__ - (After first step) Horizontal position along x-axis means nothing ] .pull-right[ <img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## [Textbook example](https://bradleyboehmke.github.io/HOML/hierarchical.html) <img src="https://bradleyboehmke.github.io/HOML/19-hierarchical_files/figure-html/comparing-dendrogram-to-distances-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Cut dendrograms to obtain cluster labels .pull-left[ Specify the height to cut with `h` instead of `k` <img src="07-Hierarchical-clustering_files/figure-html/complete-dendro-cut-1.png" width="80%" /> ] .pull-right[ ```r *cutree(nba_complete_hclust, h = 20) ``` <img src="07-Hierarchical-clustering_files/figure-html/nba-complete-cut-plot-1.png" width="80%" /> ] --- ## Single linkage example .pull-left[ Change the `method` argument to `single` <img src="07-Hierarchical-clustering_files/figure-html/single-dendro-cut-1.png" width="80%" /> ] .pull-right[ Results in a __chaining__ effect <img src="07-Hierarchical-clustering_files/figure-html/nba-single-plot-1.png" width="80%" /> ] --- ## Average linkage example .pull-left[ Change the `method` argument to `average` <img src="07-Hierarchical-clustering_files/figure-html/average-dendro-cut-1.png" width="80%" /> ] .pull-right[ Closer to `complete` but varies in compactness <img src="07-Hierarchical-clustering_files/figure-html/nba-average-plot-1.png" width="80%" /> ] --- ## More linkage functions - __Centroid linkage__: Computes the dissimilarity between the centroid for cluster 1 and the centroid for 
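---

### Code interlude: the algorithm as a naive loop

To make Steps 1 to 3 concrete, here is a minimal sketch of the agglomerative loop with complete linkage - an illustration only, not how `hclust()` is implemented (which relies on far more efficient algorithms):

```r
# Naive agglomerative clustering sketch: `x` is a numeric matrix of observations
agglomerate_sketch <- function(x) {
  d <- as.matrix(dist(x))
  clusters <- as.list(seq_len(nrow(x)))  # every observation starts in its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    best <- c(Inf, NA, NA)
    for (i in 1:(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        # Step 1: dissimilarity between clusters i and j (max = complete linkage)
        d_ij <- max(d[clusters[[i]], clusters[[j]]])
        # Step 2: track the least dissimilar pair seen so far
        if (d_ij < best[1]) best <- c(d_ij, i, j)
      }
    }
    # Step 3: fuse the least dissimilar pair into a new cluster, then repeat
    merged <- c(clusters[[best[2]]], clusters[[best[3]]])
    clusters <- c(clusters[-c(best[2], best[3])], list(merged))
    merges <- c(merges, list(merged))
  }
  merges  # the sequence of fused clusters, i.e. the hierarchy of merges
}
```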
---

## Complete linkage example

.pull-left[

- Use the `hclust` function with a `dist()` object
  - Uses `complete` linkage by default

```r
nba_complete_hclust <- 
* hclust(player_dist, method = "complete")
```

- Need to use `cutree()` to return cluster labels:

```r
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(cutree(nba_complete_hclust,
*                           k = 4))) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

Returns _compact_ clusters, similar to `\(K\)`-means

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

]

---

## What are we cutting? Dendrograms

.pull-left[

Use the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/index.html) package (instead of `plot()`)

```r
library(ggdendro)
*ggdendrogram(nba_complete_hclust, theme_dendro = FALSE,
*             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Dissimilarity between clusters") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

- Each __leaf__ is one observation

- __Height of branch indicates dissimilarity between clusters__

- (After the first step) the horizontal position along the x-axis is meaningless

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

]

---

## [Textbook example](https://bradleyboehmke.github.io/HOML/hierarchical.html)

<img src="https://bradleyboehmke.github.io/HOML/19-hierarchical_files/figure-html/comparing-dendrogram-to-distances-1.png" width="100%" style="display: block; margin: auto;" />

---

## Cut dendrograms to obtain cluster labels

.pull-left[

Specify the height to cut with `h` instead of `k`

<img src="07-Hierarchical-clustering_files/figure-html/complete-dendro-cut-1.png" width="80%" />

]

.pull-right[

```r
*cutree(nba_complete_hclust, h = 20)
```

<img src="07-Hierarchical-clustering_files/figure-html/nba-complete-cut-plot-1.png" width="80%" />

]

---

## Single linkage example

.pull-left[

Change the `method` argument to `single`

<img src="07-Hierarchical-clustering_files/figure-html/single-dendro-cut-1.png" width="80%" />

]

.pull-right[

Results in a __chaining__ effect

<img src="07-Hierarchical-clustering_files/figure-html/nba-single-plot-1.png" width="80%" />

]

---

## Average linkage example

.pull-left[

Change the `method` argument to `average`

<img src="07-Hierarchical-clustering_files/figure-html/average-dendro-cut-1.png" width="80%" />

]

.pull-right[

Closer to `complete` but varies in compactness

<img src="07-Hierarchical-clustering_files/figure-html/nba-average-plot-1.png" width="80%" />

]

---

## More linkage functions

- __Centroid linkage__: Computes the dissimilarity between the centroid for cluster 1 and the centroid for cluster 2
  - i.e. distance between the averages of the two clusters
  - use `method = "centroid"`

--

- __Ward’s linkage__: Merges a pair of clusters to minimize the within-cluster variance
  - i.e. the aim is to minimize the objective function from `\(K\)`-means
  - can use `ward.D` or `ward.D2` (different algorithms); see the snippet below
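For example, a quick sketch reusing the `player_dist` object from earlier (the `k = 4` cut is just for illustration):

```r
# Ward's linkage on the same pairwise distance object
nba_ward_hclust <- hclust(player_dist, method = "ward.D2")
table(cutree(nba_ward_hclust, k = 4))  # how many players land in each cluster
```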
--

<img src="https://media1.tenor.com/images/bfb8e3e881ac4413ae12b61c4b982d60/tenor.gif?itemid=5140031" width="30%" style="display: block; margin: auto;" />

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

<img src="https://europepmc.org/articles/PMC4527350/bin/nihms637357f2.jpg" width="60%" style="display: block; margin: auto;" />

- Use this minimum-maximum distance as the measure of cluster dissimilarity

- Dendrogram interpretation: each point is `\(\leq h\)` in dissimilarity to the __prototype__ of its cluster

- __Cluster centers are chosen among the observations themselves - hence prototype__

To make the definition concrete, here is a minimal sketch (a hypothetical helper, not part of any package; assumes `d` is a distance matrix and `idx` holds the row indices of one cluster's members):

```r
# The prototype is the member whose farthest cluster-mate is closest
find_prototype <- function(d, idx) {
  farthest <- apply(d[idx, idx, drop = FALSE], 1, max)  # each member's max distance
  idx[which.min(farthest)]                              # minimize that maximum
}
```

---

## Minimax linkage example

.pull-left[

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

- Use the `protoclust()` function to apply the clustering to the `dist()` object

```r
library(protoclust)
*nba_minimax <- protoclust(player_dist)
ggdendrogram(nba_minimax,
             theme_dendro = FALSE,
             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Maximum dissimilarity from prototype") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to make the cut

- Then access the cluster labels via `cl`

```r
minimax_player_clusters <- 
* protocut(nba_minimax, k = 4)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(minimax_player_clusters$cl)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

- Want to check out the prototypes for the four clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)

```r
minimax_player_clusters$protos
```

```
## [1] 363 134 183 100
```

- View these player rows using `slice`:

```r
nba_filtered_stats %>%
  dplyr::select(player, pos, age, x3pa, trb) %>%
  slice(minimax_player_clusters$protos)
```

```
## # A tibble: 4 x 5
##   player            pos     age  x3pa   trb
##   <chr>             <chr> <dbl> <dbl> <dbl>
## 1 Torrey Craig      SF       30   6.4  12  
## 2 Willy Hernangómez C        26   0.6  18.9
## 3 Zach LaVine       SG       25  11.4   6.9
## 4 Trent Forrest     PG       22   4.2   7.1
```

---

## Wrapping up...

- For context, how does player position (`pos`) relate to our clustering results?

```r
table("Clusters" = minimax_player_clusters$cl,
      "Positions" = nba_filtered_stats$pos)
```

```
##         Positions
## Clusters  C C-PF PF PF-C PF-SF PG PG-SG SF SF-PF SF-SG SG SG-PG SG-SF
##        1 32    0 16    1     0  3     0  3     1     0  0     0     0
##        2 30    0  2    0     0  0     0  0     0     0  0     0     0
##        3  4    2 27    0     0 45     1 33     2     2 73     0     1
##        4 10    0 36    0     1 27     0 36     0     0 26     3     1
```

--

- Can see positions tend to fall within particular clusters...
  - _What's a good way to visually compare the two sets of labels?_

--

- __We can easily include more variables__ - just changes our distance matrix (see the sketch below)

--

- But we might want to explore __soft__ assignments instead (Wednesday!)
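Picking up the point about adding more variables, a parting sketch (the particular columns here, and standardizing with `scale()` so no single variable dominates the Euclidean distances, are illustrative assumptions):

```r
# A richer distance matrix: more per-100-possession variables, standardized first
player_dist_multi <- nba_filtered_stats %>%
  dplyr::select(x3pa, trb, ast, stl, blk) %>%
  scale() %>%  # center and rescale each column to mean 0, sd 1
  dist()
nba_minimax_multi <- protoclust(player_dist_multi)  # then cluster exactly as before
```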