class: center, middle, inverse, title-slide

# Clustering
## Hierarchical clustering
### June 15th, 2022

---

## Prep NBA player dataset

Created dataset of NBA player statistics per 100 possessions using [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html)

```r
library(tidyverse)
nba_pos_stats <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/clustering/nba_2022_player_per_pos_stats.csv")
# Find rows for players indicating a full season worth of stats
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
# Stack this dataset with players that played on just one team
nba_player_stats <- nba_pos_stats %>%
  filter(!(player %in% tot_players$player)) %>%
  bind_rows(tot_players)
# Filter to only players with at least 125 minutes played
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 125)
head(nba_filtered_stats)
```

```
## # A tibble: 6 x 31
##   player   pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa x3ppercent
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 Preciou… C        22 TOR      73    28  1725   7.7  17.5     0.439   1.6   4.5      0.359
## 2 Steven … C        28 MEM      76    75  1999   5     9.2     0.547   0     0        0
## 3 Bam Ade… C        24 MIA      56    56  1825  11.1  20       0.557   0     0.2      0
## 4 Santi A… PF       21 MEM      32     0   360   7    17.5     0.402   0.8   6.4      0.125
## 5 LaMarcu… C        36 BRK      47    12  1050  11.6  21.1     0.55    0.6   2.1      0.304
## 6 Grayson… SG       26 MIL      66    61  1805   6.8  15.1     0.448   4.2  10.4      0.409
## # … with 18 more variables: x2p <dbl>, x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>,
## #   ftpercent <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>,
## #   tov <dbl>, pf <dbl>, pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Let's work from the bottom-up...

.pull-left[

- __Review__: We have `\(p\)` variables for `\(n\)` observations `\(x_1,\dots,x_n\)`

- Compute the __distance / dissimilarity__ between observations

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

__What are the distances between these NBA players using `x3pa` and `trb`?__

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

---

## Remember to standardize!
.pull-left[

```r
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(std_x3pa = as.numeric(scale(x3pa)),
         std_trb = as.numeric(scale(trb)))
nba_filtered_stats %>%
  ggplot(aes(x = std_x3pa, y = std_trb)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  coord_fixed()
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-2-1.png" width="504" />

]

---

## Compute the distance matrix using `dist()`

.pull-left[

- Compute pairwise Euclidean distance

```r
*player_dist <- dist(dplyr::select(nba_filtered_stats, std_x3pa, std_trb))
```

- Returns an object of `dist` class - i.e., not a matrix

- Can convert to a matrix, then set the row and column names:

```r
*player_dist_matrix <- as.matrix(player_dist)
rownames(player_dist_matrix) <- nba_filtered_stats$player
colnames(player_dist_matrix) <- nba_filtered_stats$player
head(player_dist_matrix[1:3, 1:3])
```

```
##                  Precious Achiuwa Steven Adams Bam Adebayo
## Precious Achiuwa         0.000000    1.6394586   1.2387398
## Steven Adams             1.639459    0.0000000   0.6652539
## Bam Adebayo              1.238740    0.6652539   0.0000000
```

]

.pull-right[

Can convert to a long table for plotting with `ggplot`:

```r
long_dist_matrix <- as_tibble(player_dist_matrix) %>%
  mutate(player1 = rownames(player_dist_matrix)) %>%
* pivot_longer(cols = -player1,
*              names_to = "player2",
*              values_to = "distance")
long_dist_matrix %>%
  ggplot(aes(x = player1, y = player2,
             fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange",
                      high = "darkblue")
```

]

---

## This is useless...

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

### Code interlude: arrange your heatmap with [`seriation`](https://github.com/mhahsler/seriation)

.pull-left[

```r
library(seriation)
*player_dist_seriate <- seriate(player_dist)
*player_order <- get_order(player_dist_seriate)
player_names_order <- nba_filtered_stats$player[player_order]
long_dist_matrix %>%
  mutate(player1 = fct_relevel(player1, player_names_order),
         player2 = fct_relevel(player2, player_names_order)) %>%
  ggplot(aes(x = player1, y = player2,
             fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange",
                      high = "darkblue")
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />

]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

Let's pretend all `\(n\)` observations are in their own cluster

--

- Step 1: Compute the pairwise dissimilarities between each cluster

  - e.g., distance matrix on previous slides

--

- Step 2: Identify the pair of clusters that are __least dissimilar__

--

- Step 3: Fuse these two clusters into a new cluster!

--

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

--

__"Bottom-up"__, agglomerative clustering that forms a __tree / hierarchy__ of merging

No mention of any randomness!

No mention of the number of clusters `\(K\)`!
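
---

### Code interlude: one agglomeration step

A minimal sketch of Steps 1-3 in base `R`, using a made-up toy matrix (`toy_pts` is hypothetical, purely for illustration):

```r
# Four toy observations in two dimensions (hypothetical data)
toy_pts <- matrix(c(0, 0,
                    1, 0,
                    4, 3,
                    5, 4), ncol = 2, byrow = TRUE)
# Step 1: compute the pairwise dissimilarities (Euclidean here)
toy_dist <- as.matrix(dist(toy_pts))
diag(toy_dist) <- Inf  # ignore the zero self-distances
# Step 2: identify the pair of clusters that are least dissimilar
which(toy_dist == min(toy_dist), arr.ind = TRUE)[1, ]
# Step 3: observations 1 and 2 fuse into a new cluster; repeating
# Steps 1 to 3 until one cluster remains is what hclust() automates
```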
---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[

Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Clusters.svg/250px-Clusters.svg.png" width="70%" style="display: block; margin: auto;" />

]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[

Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Hierarchical_clustering_simple_diagram.svg/418px-Hierarchical_clustering_simple_diagram.svg.png" width="85%" style="display: block; margin: auto;" />

Forms a __dendrogram__ (typically displayed from the bottom up)

]

---

## How do we define dissimilarity between clusters?

We know how to compute distance / dissimilarity between two observations

__But how do we handle clusters?__

- Dissimilarity between a cluster and an observation, or between two clusters

--

We need to choose a __linkage function__! Clusters are built up by __linking them together__

--

Compute all pairwise dissimilarities between observations in cluster 1 and observations in cluster 2

i.e. compute the distance matrix between observations, `\(d(x_i, x_j)\)` for `\(i \in C_1\)` and `\(j \in C_2\)`

--

- __Complete linkage__: Use the __maximum__ value of these dissimilarities: `\(\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)\)`

--

- __Single linkage__: Use the __minimum__ value: `\(\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)\)`

--

- __Average linkage__: Use the __average__ value: `\(\frac{1}{|C_1| \cdot |C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)\)`

--

Define dissimilarity between two clusters __based on our initial dissimilarity matrix between observations__

---

## Complete linkage example

.pull-left[

- Use the `hclust` function with a `dist()` object

- Uses `complete` linkage by default

```r
nba_complete_hclust <- 
* hclust(player_dist, method = "complete")
```

- Need to use `cutree()` to return cluster labels:

```r
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(cutree(nba_complete_hclust,
*                           k = 4))) %>%
  ggplot(aes(x = std_x3pa, y = std_trb,
             color = player_clusters)) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

Returns _compact_ clusters, similar to `\(K\)`-means

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

]

---

## What are we cutting? Dendrograms
.pull-left[

Use the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/index.html) package (instead of `plot()`)

```r
library(ggdendro)
*ggdendrogram(nba_complete_hclust, theme_dendro = FALSE,
*             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Dissimilarity between clusters") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

- Each __leaf__ is one observation

- __Height of branch indicates dissimilarity between clusters__

- (After the first step) Horizontal position along the x-axis means nothing

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

]

---

## [Textbook example](https://bradleyboehmke.github.io/HOML/hierarchical.html)

<img src="https://bradleyboehmke.github.io/HOML/19-hierarchical_files/figure-html/comparing-dendrogram-to-distances-1.png" width="100%" style="display: block; margin: auto;" />

---

## Cut dendrograms to obtain cluster labels

.pull-left[

Specify the height to cut with `h` instead of `k`

<img src="07-Hierarchical-clustering_files/figure-html/complete-dendro-cut-1.png" width="80%" />

]

.pull-right[

```r
*cutree(nba_complete_hclust, h = 6)
```

<img src="07-Hierarchical-clustering_files/figure-html/nba-complete-cut-plot-1.png" width="80%" />

]

---

## Single linkage example

.pull-left[

Change the `method` argument to `"single"`

<img src="07-Hierarchical-clustering_files/figure-html/single-dendro-cut-1.png" width="80%" />

]

.pull-right[

Results in a __chaining__ effect

<img src="07-Hierarchical-clustering_files/figure-html/nba-single-plot-1.png" width="80%" />

]

---

## Average linkage example

.pull-left[

Change the `method` argument to `"average"`

<img src="07-Hierarchical-clustering_files/figure-html/average-dendro-cut-1.png" width="80%" />

]

.pull-right[

Closer to `complete` but varies in compactness

<img src="07-Hierarchical-clustering_files/figure-html/nba-average-plot-1.png" width="80%" />

]

---

## More linkage functions

- __Centroid linkage__: Computes the dissimilarity between the centroid for cluster 1 and the centroid for cluster 2

  - i.e. distance between the averages of the two clusters

  - use `method = "centroid"`

--

- __Ward's linkage__: Merges a pair of clusters to minimize the within-cluster variance
  - i.e. the aim is to minimize the __objective function__ from `\(K\)`-means

  - can use `method = "ward.D"` or `method = "ward.D2"` (different algorithms)

--

<img src="https://media1.tenor.com/images/bfb8e3e881ac4413ae12b61c4b982d60/tenor.gif?itemid=5140031" width="30%" style="display: block; margin: auto;" />

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

<img src="https://europepmc.org/articles/PMC4527350/bin/nihms637357f2.jpg" width="60%" style="display: block; margin: auto;" />

- Use this minimum-maximum distance as the measure of cluster dissimilarity

- Dendrogram interpretation: each point is `\(\leq h\)` in dissimilarity to the __prototype__ of its cluster

- __Cluster centers are chosen among the observations themselves - hence prototype__

---

## Minimax linkage example

.pull-left[

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

- Use the `protoclust()` function to apply the clustering to the `dist()` object

```r
library(protoclust)
*nba_minimax <- protoclust(player_dist)
ggdendrogram(nba_minimax, theme_dendro = FALSE,
             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Maximum dissimilarity from prototype") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to make the cut

- Then access the cluster labels via `cl`

```r
minimax_player_clusters <- 
* protocut(nba_minimax, k = 4)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(minimax_player_clusters$cl)) %>%
  ggplot(aes(x = std_x3pa, y = std_trb,
             color = player_clusters)) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

- Want to check out the prototypes for the four clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)

```r
minimax_player_clusters$protos
```

```
## [1] 468 347 103 251
```

- View these player rows using `slice`:

```r
nba_filtered_stats %>%
  dplyr::select(player, pos, age, std_x3pa, std_trb) %>%
  slice(minimax_player_clusters$protos)
```

```
## # A tibble: 4 x 5
##   player           pos     age std_x3pa std_trb
##   <chr>            <chr> <dbl>    <dbl>   <dbl>
## 1 Domantas Sabonis C-PF     25   -1.02    1.99 
## 2 Jalen Suggs      PG       20    0.161  -0.691
## 3 Luka Dončić      PG       22    1.53    0.955
## 4 Ben McLemore     SG       28    2.47   -1.28 
```

---

## Wrapping up...

- For context, how does player position (`pos`) relate to our clustering results?

```r
table("Clusters" = minimax_player_clusters$cl,
      "Positions" = nba_filtered_stats$pos)
```

```
##         Positions
## Clusters  C C-PF PF PF-SF PG PG-SG SF SF-SG SG SG-PG SG-SF
##        1 71    2 34     0  0     0  8     0  3     0     0
##        2 13    0 54     1 88     1 76     5 90     2     4
##        3  1    0  4     0  2     0  6     0  3     0     0
##        4  0    0  1     0  2     0  2     0  9     1     0
```

--

- Can see positions tend to fall within particular clusters...
- _What's a good way to visually compare the two sets of labels?_ (one option is sketched on the next slide)

--

- __We can easily include more variables__ - just changes our distance matrix

--

- But we might want to explore __soft__ assignments instead...
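
---

### Code interlude: comparing cluster labels and positions

A minimal sketch of one option, reusing the tile-plot style from earlier to display the cluster-by-position contingency table as a heatmap (base `R`'s `mosaicplot()` is another choice):

```r
# Convert the contingency table into a tibble with columns
# Clusters, Positions, and the count n, then plot the counts
as_tibble(table("Clusters" = minimax_player_clusters$cl,
                "Positions" = nba_filtered_stats$pos)) %>%
  ggplot(aes(x = Positions, y = Clusters, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "darkorange", high = "darkblue") +
  theme_bw() +
  theme(legend.position = "bottom")
```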