class: center, middle, inverse, title-slide

# Clustering
## Hierarchical clustering
### June 15th, 2022

---

## Prep NBA player dataset

Created dataset of NBA player statistics per 100 possessions using [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html)

```r
library(tidyverse)
nba_pos_stats <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/clustering/nba_2022_player_per_pos_stats.csv")
# Find rows for players indicating a full season worth of stats
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
# Stack this dataset with players that played on just one team
nba_player_stats <- nba_pos_stats %>%
  filter(!(player %in% tot_players$player)) %>%
  bind_rows(tot_players)
# Filter to only players with at least 125 minutes played
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 125)
head(nba_filtered_stats)
```

```
## # A tibble: 6 x 31
##   player   pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa x3ppercent
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 Preciou… C        22 TOR      73    28  1725   7.7  17.5     0.439   1.6   4.5      0.359
## 2 Steven … C        28 MEM      76    75  1999   5     9.2     0.547   0     0        0
## 3 Bam Ade… C        24 MIA      56    56  1825  11.1  20       0.557   0     0.2      0
## 4 Santi A… PF       21 MEM      32     0   360   7    17.5     0.402   0.8   6.4      0.125
## 5 LaMarcu… C        36 BRK      47    12  1050  11.6  21.1     0.55    0.6   2.1      0.304
## 6 Grayson… SG       26 MIL      66    61  1805   6.8  15.1     0.448   4.2  10.4      0.409
## # … with 18 more variables: x2p <dbl>, x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>,
## #   ftpercent <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>,
## #   tov <dbl>, pf <dbl>, pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Let's work from the bottom-up...

.pull-left[

- __Review__: We have `\(p\)` variables for `\(n\)` observations `\(x_1,\dots,x_n\)`

- Compute the __distance / dissimilarity__ between observations

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

__What are the distances between these NBA players using `x3pa` and `trb`?__

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-1-1.png" width="504" />

]

---

## Remember to standardize!
.pull-left[

```r
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(std_x3pa = as.numeric(scale(x3pa)),
         std_trb = as.numeric(scale(trb)))
nba_filtered_stats %>%
  ggplot(aes(x = std_x3pa, y = std_trb)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  coord_fixed()
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-2-1.png" width="504" />

]

---

## Compute the distance matrix using `dist()`

.pull-left[

- Compute pairwise Euclidean distance

```r
*player_dist <- dist(dplyr::select(nba_filtered_stats, std_x3pa, std_trb))
```

- Returns an object of `dist` class - i.e., not a matrix

- Can convert to a matrix, then set the row and column names:

```r
*player_dist_matrix <- as.matrix(player_dist)
rownames(player_dist_matrix) <- nba_filtered_stats$player
colnames(player_dist_matrix) <- nba_filtered_stats$player
head(player_dist_matrix[1:3, 1:3])
```

```
##                  Precious Achiuwa Steven Adams Bam Adebayo
## Precious Achiuwa         0.000000    1.6394586   1.2387398
## Steven Adams             1.639459    0.0000000   0.6652539
## Bam Adebayo              1.238740    0.6652539   0.0000000
```

]

.pull-right[

Can convert to a long table for plotting with `ggplot`:

```r
long_dist_matrix <- as_tibble(player_dist_matrix) %>%
  mutate(player1 = rownames(player_dist_matrix)) %>%
* pivot_longer(cols = -player1,
*              names_to = "player2",
*              values_to = "distance")
long_dist_matrix %>%
  ggplot(aes(x = player1, y = player2,
             fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange",
                      high = "darkblue")
```

]

---

## This is useless...

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

### Code interlude: arrange your heatmap with [`seriation`](https://github.com/mhahsler/seriation)

.pull-left[

```r
library(seriation)
*player_dist_seriate <- seriate(player_dist)
*player_order <- get_order(player_dist_seriate)
player_names_order <- nba_filtered_stats$player[player_order]
long_dist_matrix %>%
  mutate(player1 = fct_relevel(player1, player_names_order),
         player2 = fct_relevel(player2, player_names_order)) %>%
  ggplot(aes(x = player1, y = player2,
             fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange",
                      high = "darkblue")
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />

]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

Let's pretend all `\(n\)` observations are in their own cluster

--

- Step 1: Compute the pairwise dissimilarities between each cluster

  - e.g., distance matrix on previous slides

--

- Step 2: Identify the pair of clusters that are __least dissimilar__

--

- Step 3: Fuse these two clusters into a new cluster!

--

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

--

__"Bottom-up"__, agglomerative clustering that forms a __tree / hierarchy__ of merging

No mention of any randomness!

No mention of the number of clusters `\(K\)`!
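
---

### Code interlude: one agglomeration step

A minimal sketch of Steps 1-3 in base `R`, using a made-up toy matrix (`toy_pts` is hypothetical, purely for illustration):

```r
# Four toy observations in two dimensions (hypothetical data)
toy_pts <- matrix(c(0, 0,
                    1, 0,
                    4, 3,
                    5, 4), ncol = 2, byrow = TRUE)
# Step 1: compute the pairwise dissimilarities (Euclidean here)
toy_dist <- as.matrix(dist(toy_pts))
diag(toy_dist) <- Inf  # ignore the zero self-distances
# Step 2: identify the pair of clusters that are least dissimilar
which(toy_dist == min(toy_dist), arr.ind = TRUE)[1, ]
# Step 3: observations 1 and 2 fuse into a new cluster; repeating
# Steps 1 to 3 until one cluster remains is what hclust() automates
```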
---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[

Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Clusters.svg/250px-Clusters.svg.png" width="70%" style="display: block; margin: auto;" />

]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[

Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Hierarchical_clustering_simple_diagram.svg/418px-Hierarchical_clustering_simple_diagram.svg.png" width="85%" style="display: block; margin: auto;" />

Forms a __dendrogram__ (typically displayed from the bottom up)

]

---

## How do we define dissimilarity between clusters?

We know how to compute distance / dissimilarity between two observations

__But how do we handle clusters?__

- Dissimilarity between a cluster and an observation, or between two clusters

--

We need to choose a __linkage function__! Clusters are built up by __linking them together__

--

Compute all pairwise dissimilarities between observations in cluster 1 and observations in cluster 2

i.e. compute the distance matrix between observations, `\(d(x_i, x_j)\)` for `\(i \in C_1\)` and `\(j \in C_2\)`

--

- __Complete linkage__: Use the __maximum__ value of these dissimilarities: `\(\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)\)`

--

- __Single linkage__: Use the __minimum__ value: `\(\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)\)`

--

- __Average linkage__: Use the __average__ value: `\(\frac{1}{|C_1| \cdot |C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)\)`

--

Define dissimilarity between two clusters __based on our initial dissimilarity matrix between observations__

---

## Complete linkage example

.pull-left[

- Use the `hclust` function with a `dist()` object

- Uses `complete` linkage by default

```r
nba_complete_hclust <- 
* hclust(player_dist, method = "complete")
```

- Need to use `cutree()` to return cluster labels:

```r
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(cutree(nba_complete_hclust,
*                           k = 4))) %>%
  ggplot(aes(x = std_x3pa, y = std_trb,
             color = player_clusters)) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

Returns _compact_ clusters, similar to `\(K\)`-means

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

]

---

## What are we cutting? Dendrograms
.pull-left[

Use the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/index.html) package (instead of `plot()`)

```r
library(ggdendro)
*ggdendrogram(nba_complete_hclust, theme_dendro = FALSE,
*             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Dissimilarity between clusters") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

- Each __leaf__ is one observation

- __Height of branch indicates dissimilarity between clusters__

- (After the first step) Horizontal position along the x-axis means nothing

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

]

---

## [Textbook example](https://bradleyboehmke.github.io/HOML/hierarchical.html)

<img src="https://bradleyboehmke.github.io/HOML/19-hierarchical_files/figure-html/comparing-dendrogram-to-distances-1.png" width="100%" style="display: block; margin: auto;" />

---

## Cut dendrograms to obtain cluster labels

.pull-left[

Specify the height to cut with `h` instead of `k`

<img src="07-Hierarchical-clustering_files/figure-html/complete-dendro-cut-1.png" width="80%" />

]

.pull-right[

```r
*cutree(nba_complete_hclust, h = 6)
```

<img src="07-Hierarchical-clustering_files/figure-html/nba-complete-cut-plot-1.png" width="80%" />

]

---

## Single linkage example

.pull-left[

Change the `method` argument to `"single"`

<img src="07-Hierarchical-clustering_files/figure-html/single-dendro-cut-1.png" width="80%" />

]

.pull-right[

Results in a __chaining__ effect

<img src="07-Hierarchical-clustering_files/figure-html/nba-single-plot-1.png" width="80%" />

]

---

## Average linkage example

.pull-left[

Change the `method` argument to `"average"`

<img src="07-Hierarchical-clustering_files/figure-html/average-dendro-cut-1.png" width="80%" />

]

.pull-right[

Closer to `complete` but varies in compactness

<img src="07-Hierarchical-clustering_files/figure-html/nba-average-plot-1.png" width="80%" />

]

---

## More linkage functions

- __Centroid linkage__: Computes the dissimilarity between the centroid for cluster 1 and the centroid for cluster 2

  - i.e. distance between the averages of the two clusters

  - use `method = "centroid"`

--

- __Ward's linkage__: Merges a pair of clusters to minimize the within-cluster variance
  - i.e. the aim is to minimize the __objective function__ from `\(K\)`-means

  - can use `method = "ward.D"` or `method = "ward.D2"` (different algorithms)

--

<img src="https://media1.tenor.com/images/bfb8e3e881ac4413ae12b61c4b982d60/tenor.gif?itemid=5140031" width="30%" style="display: block; margin: auto;" />

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

<img src="https://europepmc.org/articles/PMC4527350/bin/nihms637357f2.jpg" width="60%" style="display: block; margin: auto;" />

- Use this minimum-maximum distance as the measure of cluster dissimilarity

- Dendrogram interpretation: each point is `\(\leq h\)` in dissimilarity to the __prototype__ of its cluster

- __Cluster centers are chosen among the observations themselves - hence prototype__

---

## Minimax linkage example

.pull-left[

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

- Use the `protoclust()` function to apply the clustering to the `dist()` object

```r
library(protoclust)
*nba_minimax <- protoclust(player_dist)
ggdendrogram(nba_minimax, theme_dendro = FALSE,
             labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Maximum dissimilarity from prototype") +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to make the cut

- Then access the cluster labels via `cl`

```r
minimax_player_clusters <- 
* protocut(nba_minimax, k = 4)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(minimax_player_clusters$cl)) %>%
  ggplot(aes(x = std_x3pa, y = std_trb,
             color = player_clusters)) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Minimax linkage example

- Want to check out the prototypes for the four clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)

```r
minimax_player_clusters$protos
```

```
## [1] 468 347 103 251
```

- View these player rows using `slice`:

```r
nba_filtered_stats %>%
  dplyr::select(player, pos, age, std_x3pa, std_trb) %>%
  slice(minimax_player_clusters$protos)
```

```
## # A tibble: 4 x 5
##   player           pos     age std_x3pa std_trb
##   <chr>            <chr> <dbl>    <dbl>   <dbl>
## 1 Domantas Sabonis C-PF     25   -1.02    1.99 
## 2 Jalen Suggs      PG       20    0.161  -0.691
## 3 Luka Dončić      PG       22    1.53    0.955
## 4 Ben McLemore     SG       28    2.47   -1.28 
```

---

## Wrapping up...

- For context, how does player position (`pos`) relate to our clustering results?

```r
table("Clusters" = minimax_player_clusters$cl,
      "Positions" = nba_filtered_stats$pos)
```

```
##         Positions
## Clusters  C C-PF PF PF-SF PG PG-SG SF SF-SG SG SG-PG SG-SF
##        1 71    2 34     0  0     0  8     0  3     0     0
##        2 13    0 54     1 88     1 76     5 90     2     4
##        3  1    0  4     0  2     0  6     0  3     0     0
##        4  0    0  1     0  2     0  2     0  9     1     0
```

--

- Can see positions tend to fall within particular clusters...
- _What's a good way to visually compare the two sets of labels?_ (one option is sketched on the next slide)

--

- __We can easily include more variables__ - just changes our distance matrix

--

- But we might want to explore __soft__ assignments instead...
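
---

### Code interlude: comparing cluster labels and positions

A minimal sketch of one option, reusing the tile-plot style from earlier to display the cluster-by-position contingency table as a heatmap (base `R`'s `mosaicplot()` is another choice):

```r
# Convert the contingency table into a tibble with columns
# Clusters, Positions, and the count n, then plot the counts
as_tibble(table("Clusters" = minimax_player_clusters$cl,
                "Positions" = nba_filtered_stats$pos)) %>%
  ggplot(aes(x = Positions, y = Clusters, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "darkorange", high = "darkblue") +
  theme_bw() +
  theme(legend.position = "bottom")
```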