Clustering

class: center, middle, inverse, title-slide

# Clustering
## Hierarchical clustering
### June 14th, 2021

---

## Prep dataset from before

Created dataset of NBA player statistics per 100 possessions using [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html)

```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/clustering/nba_2021_player_per_pos_stats.csv")
# Find rows for players indicating a full season worth of stats
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
# Stack this dataset with players that played on just one team
nba_player_stats <- nba_pos_stats %>% filter(!(player %in% tot_players$player)) %>% 
  bind_rows(tot_players)
# Filter to only players with at least 250 minutes played
nba_filtered_stats <- nba_player_stats %>% filter(mp >= 250)
head(nba_filtered_stats)
```

```
## # A tibble: 6 x 31
##   player   pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa x3ppercent   x2p
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 Preciou… PF       21 MIA      61     4   737   8.4  15.4     0.544   0     0.1      0       8.4
## 2 Steven … C        27 NOP      58    58  1605   5.6   9.2     0.614   0     0.1      0       5.6
## 3 Bam Ade… C        23 MIA      64    64  2143  10.6  18.5     0.57    0     0.2      0.25   10.5
## 4 Nickeil… SG       22 NOP      46    13  1007   9.1  21.8     0.419   3.6  10.4      0.347   5.5
## 5 Grayson… SG       25 MEM      50    38  1259   6.6  15.7     0.418   4.1  10.4      0.391   2.5
## 6 Kyle An… PF       27 MEM      69    69  1887   7.8  16.7     0.468   2.4   6.6      0.36    5.4
## # … with 17 more variables: x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>, ftpercent <dbl>,
## #   orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>,
## #   pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Let's work from the bottom-up...

.pull-left[

- __Review__: We have `$p$` variables for `$n$` observations `$x_1,\dots,x_n$`,

- Compute the __distance / dissimilarity__ between observations

- e.g. __Euclidean distance__ between observations `$i$` and `$j$`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

__What are the distances between these NBA players using `x3pa` and `trb`?__

]
.pull-right[

]

---

## Compute the distance matrix using `dist()`

.pull-left[

- Compute pairwise Euclidean distance

```r
*player_dist <- dist(dplyr::select(nba_filtered_stats,
                                  x3pa, trb))
```

- Returns an object of `dist` class - i.e. not a matrix

- Can convert to a matrix, then set the row and column names:

```r
*player_dist_matrix <- as.matrix(player_dist)
rownames(player_dist_matrix) <- nba_filtered_stats$player
colnames(player_dist_matrix) <- nba_filtered_stats$player
head(player_dist_matrix[1:3, 1:3])
```

```
##                  Precious Achiuwa Steven Adams Bam Adebayo
## Precious Achiuwa        0.0000000      1.40000   0.7071068
## Steven Adams            1.4000000      0.00000   2.1023796
## Bam Adebayo             0.7071068      2.10238   0.0000000
```

]
.pull-right[

Can convert to a long table for plotting with `ggplot`:

```r
long_dist_matrix <- 
  as_tibble(player_dist_matrix) %>%
  mutate(player1 = rownames(player_dist_matrix)) %>%
* pivot_longer(cols = -player1,
*              names_to = "player2",
*              values_to = "distance")
long_dist_matrix %>%
  ggplot(aes(x = player1, y = player2, 
             fill = distance)) +
  geom_tile() +
  theme_bw() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange", 
                      high = "darkblue")
```

]

---

## This is useless...

---

### Code interlude: arrange your heatmap with [`seriation`](https://github.com/mhahsler/seriation)

.pull-left[

```r
library(seriation)
*player_dist_seriate <- seriate(player_dist)
*player_order <- get_order(player_dist_seriate)
player_names_order <- 
  nba_filtered_stats$player[player_order]

long_dist_matrix %>%
  mutate(player1 = 
           fct_relevel(player1, 
                       player_names_order),
         player2 = 
           fct_relevel(player2, 
                       player_names_order)) %>%
  ggplot(aes(x = player1, y = player2, 
             fill = distance)) +
  geom_tile() + theme_bw() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(),
        legend.position = "bottom") +
  scale_fill_gradient(low = "darkorange",
                      high = "darkblue")
```

]
.pull-right[
<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />
]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

Let's pretend all `$n$` observations are in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- e.g., distance matrix on previous slides
  
--
  
- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__

__"Bottom-up"__, agglomerative clustering that forms a __tree / hierarchy__ of merging

No mention of any randomness!

No mention of the number of clusters `$K$`!

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[
Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__
]
.pull-right[
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Clusters.svg/250px-Clusters.svg.png" width="70%" style="display: block; margin: auto;" />
]

---

## (Agglomerative) [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)

.pull-left[
Start with all observations in their own cluster

- Step 1: Compute the pairwise dissimilarities between each cluster

- Step 2: Identify the pair of clusters that are __least dissimilar__

- Step 3: Fuse these two clusters into a new cluster!

- __Repeat Steps 1 to 3 until all observations are in the same cluster__
]
.pull-right[
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Hierarchical_clustering_simple_diagram.svg/418px-Hierarchical_clustering_simple_diagram.svg.png" width="85%" style="display: block; margin: auto;" />

Forms a __dendrogram__ (typically displayed from bottom-up)
]

---

## How do we define dissimilarity between clusters?

We know how to compute distance / dissimilarity between two observations

__But how do we handle clusters?__

- Dissimilarity between a cluster and an observation, or between two clusters
  
--

We need to choose a __linkage function__! Clusters are built up by __linking them together__

Compute all pairwise dissimilarities between observations in cluster 1 with observations in cluster 2

i.e. Compute the distance matrix between observations, `$d(x_i, x_j)$` for `$i \in C_1$` and `$j \in C_2$`

- __Complete linkage__: Use the __maximum__ value of these dissimilarities: `$\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)$`
  
--

- __Single linkage__: Use the __minimum__ value: `$\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)$`

- __Average linkage__: Use the __average__ value: `$\frac{1}{|C_1| \cdot |C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)$`
  
--

Define dissimilarity between two clusters __based on our initial dissimilarity matrix between observations__

---

## Complete linkage example

.pull-left[

- Use the `hclust` function with a `dist()` object

- Uses `complete` linkage by default

```r
nba_complete_hclust <- 
* hclust(player_dist, method = "complete")
```

- Need to use `cutree()` to return cluster labels:

```r
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(cutree(nba_complete_hclust,
*                           k = 4))) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]
.pull-right[

Returns _compact_ clusters, similar to `$K$`-means

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />
]

---

## What are we cutting? Dendrograms

.pull-left[
Use the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/index.html) package (instead of `plot()`)

```r
library(ggdendro)
*ggdendrogram(nba_complete_hclust, theme_dendro = FALSE,
*            labels = FALSE, leaf_labels = FALSE) +
  labs(y = "Dissimilarity between clusters") +
  theme_bw() +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

- Each __leaf__ is one observation

- __Height of branch indicates dissimilarity between clusters__

- (After first step) Horizontal position along x-axis means nothing

]
.pull-right[
<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />
]

---

## [Textbook example](https://bradleyboehmke.github.io/HOML/hierarchical.html)

---

## Cut dendrograms to obtain cluster labels

.pull-left[
Specify the height to cut with `h` instead of `k`

]
.pull-right[

```r
*cutree(nba_complete_hclust, h = 20)
```

<img src="07-Hierarchical-clustering_files/figure-html/nba-complete-cut-plot-1.png" width="80%" />
]

---

## Single linkage example

.pull-left[
Change the `method` argument to `single`

]
.pull-right[
Results in a __chaining__ effect

<img src="07-Hierarchical-clustering_files/figure-html/nba-single-plot-1.png" width="80%" />
]

---

## Average linkage example

.pull-left[
Change the `method` argument to `average`

]
.pull-right[
Closer to `complete` but varies in compactness

<img src="07-Hierarchical-clustering_files/figure-html/nba-average-plot-1.png" width="80%" />
]

---

## More linkage functions

- __Centroid linkage__: Computes the dissimilarity between the centroid for cluster 1 and the centroid for cluster 2

- i.e. distance between the averages of the two clusters
  
  - use `method = centroid`
  
--

- __Ward’s linkage__: Merges a pair of clusters to minimize the within-cluster variance

- i.e. aim is to minimize the objection function from `$K$`-means
  
  - can use `ward.D` or `ward.D2` (different algorithms)
  
--

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

- Use this minimum-maximum distance as the measure of cluster dissimilarity

- Dendogram interpretation: each point point is `$\leq h$` in dissimilarity to the __prototype__ of cluster

- __Cluster centers are chosen among the observations themselves - hence prototype__

---

## Minimax linkage example

.pull-left[

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

- Use the `protoclust()` function to apply the clustering to the `dist()` object

```r
library(protoclust)
*nba_minimax <- protoclust(player_dist)
ggdendrogram(nba_minimax, 
             theme_dendro = FALSE, 
             labels = FALSE, 
             leaf_labels = FALSE) + 
  labs(y = "Maximum dissimilarity from prototype") +
  theme_bw() +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank())
```

]
.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to make the cut

- But then access the cluster labels `cl`

```r
minimax_player_clusters <- 
* protocut(nba_minimax, k = 4)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(minimax_player_clusters$cl)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]
.pull-right[

<img src="07-Hierarchical-clustering_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

- Want to check out the prototypes for the three clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)

```r
minimax_player_clusters$protos
```

```
## [1] 363 134 183 100
```

- View these player rows using `slice`:

```r
nba_filtered_stats %>%
  dplyr::select(player, pos, age, x3pa, trb) %>%
  slice(minimax_player_clusters$protos)
```

```
## # A tibble: 4 x 5
##   player            pos     age  x3pa   trb
##   <chr>             <chr> <dbl> <dbl> <dbl>
## 1 Torrey Craig      SF       30   6.4  12  
## 2 Willy Hernangómez C        26   0.6  18.9
## 3 Zach LaVine       SG       25  11.4   6.9
## 4 Trent Forrest     PG       22   4.2   7.1
```

---

## Wrapping up...

- For context, how does player position (`pos`) relate to our clustering results?

```r
table("Clusters" = minimax_player_clusters$cl, "Positions" = nba_filtered_stats$pos)
```

```
##         Positions
## Clusters  C C-PF PF PF-C PF-SF PG PG-SG SF SF-PF SF-SG SG SG-PG SG-SF
##        1 32    0 16    1     0  3     0  3     1     0  0     0     0
##        2 30    0  2    0     0  0     0  0     0     0  0     0     0
##        3  4    2 27    0     0 45     1 33     2     2 73     0     1
##        4 10    0 36    0     1 27     0 36     0     0 26     3     1
```

- Can see positions tend to fall within particular clusters...

- _What's the way to visually compare the two labels?_

- __We can easily include more variables__ - just changes our distance matrix

- But we might want to explore __soft__ assignments instead (Wednesday!)