class: center, middle, inverse, title-slide

# Clustering

## K-means and hierarchical clustering

### Ron Yurko

### 06/09/2020

---

## Brace yourselves

<img src="https://media0.giphy.com/media/NsIwMll0rhfgpdQlzn/giphy.gif" width="60%" style="display: block; margin: auto;" />

---

## Into statistical learning with unsupervised learning

What is __statistical learning?__

--

[Preface of Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/):

> _refers to a set of tools for modeling and understanding complex datasets_

--

What is __unsupervised learning?__

--

We have `\(p\)` variables for `\(n\)` observations `\(X_1,\dots,X_n\)`, and for observation `\(i\)`:

`$$X_{i1},X_{i2},\ldots,X_{ip} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*

- _unsupervised_: none of the variables are __response__ variables, i.e., there is no labeled data

--

Think of unsupervised learning as __an extension of EDA...__

--

- `\(\Rightarrow\)` __there is no unique right answer!__

???

- Statistical learning is the process of ascertaining (discovering) associations between groups of variables

- unsupervised learning - where the goal is to discover interesting things about the data

---

## What is clustering (aka cluster analysis)?

--

[ISLR 10.3](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf):

> _very broad set of techniques for finding subgroups, or clusters, in a dataset_

--

__Goals__:

- observations __within__ clusters are __more similar__ to each other,

- observations __in different__ clusters are __more different__ from each other

--

How do we define __distance / dissimilarity__ between observations?

--

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

--

__Units matter!__

--

- one variable may _dominate_ the others when computing Euclidean distance because its range is much larger

- can standardize each variable / column of the dataset to have mean 0 and standard deviation 1 with `scale()`

- __but we may value the separation in that variable!__ (so just be careful...)

???

It is the partitioning of data into homogeneous subgroups

Goal: define clusters for which the within-cluster variation is relatively small, i.e. observations within clusters are similar to each other

---

## What's the clustering objective?

- `\(C_1, \dots, C_K\)` are _sets_ containing the indices of the observations in each of the `\(K\)` clusters

  - if observation `\(i\)` is in cluster `\(k\)`, then `\(i \in C_k\)`

--

- We want to minimize the __within-cluster variation__ `\(W(C_k)\)` for each cluster `\(C_k\)` and solve:

`$$\underset{C_1, \dots, C_K}{\text{minimize}} \Big\{ \sum_{k=1}^K W(C_k) \Big\}$$`

- Can define using the __squared Euclidean distance__ ( `\(|C_k| = n_k =\)` # observations in cluster `\(k\)`)

`$$W(C_k) = \frac{1}{|C_k|}\sum_{i,j \in C_k} d(x_i, x_j)^2$$`

--

__So how can we solve this?__

---

## K-means clustering

<img src="http://www.stat.cmu.edu/~pfreeman/alg_10-1.png" width="60%" style="display: block; margin: auto;" />

- __No universally accepted metric that leads one to conclude that a particular value of `\(K\)` is optimal__

- Results will change from run to run unless you explicitly set the random number seed

- To mitigate randomness, set the `nstart` argument in the function call to a large number (e.g., 50), and select the best result (smallest value of the __within-cluster variation__)

???

Step 1 of the algorithm randomly assigns observations to clusters.
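---

## A quick sketch: seeds and `nstart`

A minimal sketch of the two bullets on the previous slide; the simulated `toy_data` columns below are made up purely for illustration:


```r
set.seed(2020)  # fix the random number seed so the random initializations are reproducible
toy_data <- data.frame(x = rnorm(100), y = rnorm(100))  # hypothetical two-column data

# nstart = 50 runs K-means from 50 random initializations and keeps the solution
# with the smallest total within-cluster variation
toy_kmeans <- kmeans(toy_data, centers = 3, nstart = 50)

toy_kmeans$tot.withinss  # total within-cluster sum of squares of the best run
```

- With `nstart = 1` and no seed, the cluster assignments (and `tot.withinss`) can differ from run to run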
---

## K-means clustering

<img src="img/islr_kmeans.png" width="50%" style="display: block; margin: auto;" />

---

## New dataset - NBA 2020 player stats per 100 possessions

Created a dataset of NBA player statistics per 100 possessions using the [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html) package, which scrapes tables from [basketball-reference.com](https://www.basketball-reference.com/leagues/NBA_2020_per_poss.html)


```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/clustering/nba_2020_player_per_pos_stats.csv")
head(nba_pos_stats)
```

```
## # A tibble: 6 x 31
##   player pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa
##   <chr>  <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
## 1 Steve… C        26 OKC      58    58  1564   8.2  13.8     0.591   0     0.1
## 2 Bam A… PF       22 MIA      65    65  2235   8.9  15.7     0.567   0     0.3
## 3 LaMar… C        34 SAS      53    53  1754  10.7  21.7     0.493   1.7   4.3
## 4 Nicke… SG       21 NOP      41     0   501   7.1  20.9     0.339   3.7  10.8
## 5 Grays… SG       24 MEM      30     0   498   7.4  16.5     0.449   3.1   8.5
## 6 Jarre… C        21 BRK      64    58  1647   7.7  11.9     0.646   0     0.1
## # … with 19 more variables: x3ppercent <dbl>, x2p <dbl>, x2pa <dbl>, x2ppercent <dbl>,
## #   ft <dbl>, fta <dbl>, ft_2 <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>,
## #   stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>, pts <dbl>, x <lgl>, ortg <dbl>,
## #   drtg <dbl>
```

---

## Some initial cleaning...

- Each row is at the player-team level (`player`-`tm`)

- Players who switched teams during the season also have a row with `tm == "TOT"` (their totals across teams)


```r
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
```

- Create a new dataset by stacking `tot_players` with the rows for players that did not switch teams

- Use the [`bind_rows()` function](https://dplyr.tidyverse.org/reference/bind.html)


```r
nba_player_stats <- nba_pos_stats %>%
* filter(!(player %in% tot_players$player)) %>%
* bind_rows(tot_players)
```

---

## Make a cutoff with the ECDF

View the ECDF of the minutes played (`mp`) variable to make a cutoff:

.pull-left[


```r
nba_player_stats %>%
  ggplot(aes(x = mp)) +
  stat_ecdf() +
  theme_bw() +
  labs(x = "Minutes played",
       y = "Proportion of NBA players") +
* geom_vline(xintercept = 250,
*            linetype = "dashed",
*            color = "darkred")
```

- Arbitrary cutoff: at least 250 minutes played

- Keeps around 75% of players

- Use [`geom_vline`](https://ggplot2.tidyverse.org/reference/geom_abline.html) to draw a vertical line


```r
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 250)
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use the `kmeans()` function, __but must provide the number of clusters `\(K\)`__


```r
init_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         2, nstart = 30)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(init_nba_kmeans$cluster))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]
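---

## What does `kmeans()` return?

A quick sketch of a few components of the fitted object from the previous slide (these are standard pieces of the object `kmeans()` returns; nothing new is computed here):


```r
init_nba_kmeans$centers       # coordinates of the two cluster centers (means of x3pa and trb)

init_nba_kmeans$size          # number of players assigned to each cluster

init_nba_kmeans$tot.withinss  # total within-cluster sum of squares of the returned solution
```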
---

## K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use `\(K = 3\)` instead? (__cluster labels are arbitrary__)


```r
upd_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         3, nstart = 30)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(upd_nba_kmeans$cluster))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---

## Hierarchical clustering

<img src="http://www.stat.cmu.edu/~pfreeman/alg_10-2.png" width="60%" style="display: block; margin: auto;" />

- __"bottom-up"__ or __agglomerative__ clustering

- Limitation? Every cluster is nested within a larger cluster (hence the name "hierarchical")

- Advantage? Does __NOT__ require choosing the number of clusters `\(K\)` up front

---

## Hierarchical clustering example

.pull-left[

- Use the `hclust` function __with a distance matrix__, e.g. `dist()` computes Euclidean distances

- Can view the __dendrogram__

- Check out the [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) package!


```r
nba_complete_hclust <- 
* hclust(dist(
*   dplyr::select(nba_filtered_stats,
*                 x3pa, trb)),
*   method = "complete")
plot(nba_complete_hclust)
```

- Dendrograms can be difficult to read with many observations
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Hierarchical clustering example

.pull-left[

- Use the `cutree` function to return cluster labels (specify either the number of clusters `k` or a height `h`)


```r
*hc_player_clusters <- 
*  cutree(nba_complete_hclust,
*         k = 3)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(player_hc_clusters = 
           as.factor(hc_player_clusters))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_hc_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-9-1.png" width="504" />
]

---

## Need to choose a linkage function!

- __Agglomerative__ clustering: clusters are built up piece by piece by __linking them together__

--

<img src="http://www.stat.cmu.edu/~pfreeman/linkage.png" width="60%" style="display: block; margin: auto;" />

- There is no unique algorithm for how that linking is done! Average and complete are the most popular (a quick comparison sketch follows on the next slide)
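---

## Sketch: comparing linkage functions

A rough sketch of how you might compare linkage functions on the same distance matrix; `nba_dist`, `nba_average_hclust`, and `nba_single_hclust` are new object names introduced only for this sketch:


```r
nba_dist <- dist(dplyr::select(nba_filtered_stats, x3pa, trb))

# same data, different linkage functions
nba_average_hclust <- hclust(nba_dist, method = "average")
nba_single_hclust <- hclust(nba_dist, method = "single")

# cross-tabulate three-cluster labels from average linkage against complete linkage
table("Complete" = cutree(nba_complete_hclust, k = 3),
      "Average" = cutree(nba_average_hclust, k = 3))
```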
---

## We're not done yet...

<img src="https://media1.tenor.com/images/bfb8e3e881ac4413ae12b61c4b982d60/tenor.gif?itemid=5140031" width="60%" style="display: block; margin: auto;" />

---

## [Minimax linkage](http://statweb.stanford.edu/~tibs/sta306bfiles/minimax-clustering.pdf)

- Each cluster is defined __by a prototype__ observation (most representative)

- __Identify the point whose farthest point is closest__ (hence the minimax)

<img src="img/minimax_example_viz.png" width="60%" style="display: block; margin: auto;" />

- Dendrogram interpretation: each point is `\(\leq h\)` in dissimilarity to the prototype (center) of its cluster

- __Centers are chosen among the observations themselves__

- Easily done in `R` via the [`protoclust`](https://github.com/jacobbien/protoclust) package

---

## Minimax linkage example

.pull-left[

- Use the `protoclust()` function to apply the clustering


```r
library(protoclust)
*nba_minimax <- protoclust(dist(
*  dplyr::select(nba_filtered_stats,
*                x3pa, trb)))
plot(nba_minimax)
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

.pull-left[

- Use the `protocut()` function to cut the tree (specify either the number of clusters `k` or a height `h`)


```r
*minimax_player_clusters <- 
*  protocut(nba_minimax, k = 3)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(minimax_clusters = 
           as.factor(minimax_player_clusters$cl))
nba_filtered_stats %>%
  ggplot(aes(x = x3pa, y = trb,
             color = minimax_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  labs(x = "3 pt att / 100 pos",
       y = "Rebounds / 100 pos",
       color = "Cluster") +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Minimax linkage example

- Want to check out the prototypes for the three clusters

- `protocut` returns the indices of the prototypes (in order of the cluster labels)


```r
minimax_player_clusters$protos
```

```
## [1] 395  21 112
```

- View these player rows:


```r
dplyr::select(nba_filtered_stats[minimax_player_clusters$protos, ], 
              player, pos, age, x3pa, trb)
```

```
## # A tibble: 3 x 5
##   player      pos     age  x3pa   trb
##   <chr>       <chr> <dbl> <dbl> <dbl>
## 1 Noah Vonleh C-PF     24   1.9  15.7
## 2 RJ Barrett  SG       19   5.6   8  
## 3 Paul George SF       29  12.8   9.3
```

---

## Obviously can improve this...

- What if we account for several variables? How do the clustering results relate to the position column in the dataset?


```r
nba_multidim_clust <- protoclust(dist(dplyr::select(nba_filtered_stats,
                                                    x2ppercent, x3pa, ast,
                                                    stl, trb)))
nba_multidim_clust_cut <- protocut(nba_multidim_clust, k = 5)
table("Positions" = nba_filtered_stats$pos,
      "Clusters" = nba_multidim_clust_cut$cl)
```

--

.pull-left[

```
##          Clusters
## Positions  1  2  3  4  5
##     C     40 32  5  1  0
##     C-PF   2  0  0  0  0
##     PF    15 38 14  9  0
##     PG     0  6 12 28 27
##     SF     0 29 16 22  1
##     SF-PF   0  0  0  1  0
##     SF-SG   0  1  0  1  0
##     SG      0 14 34 45  4
```
]

.pull-right[

- Can see positions tend to fall within particular clusters...

- But we might want to explore __soft__ assignments instead (tomorrow!)

- _What's a good way to visually compare the two sets of labels?_
]
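---

## Sketch: remember that units matter

A rough sketch tying back to the earlier __Units matter!__ point: `x2ppercent` lives on a 0-1 scale while the other variables are per-100-possession counts, so we could `scale()` the columns before computing distances (the `nba_scaled*` objects are new names used only for this sketch):


```r
# standardize each column to have mean 0 and standard deviation 1
nba_scaled <- scale(dplyr::select(nba_filtered_stats,
                                  x2ppercent, x3pa, ast, stl, trb))

# repeat the minimax clustering on the standardized variables
nba_scaled_minimax <- protoclust(dist(nba_scaled))
nba_scaled_cut <- protocut(nba_scaled_minimax, k = 5)
table("Positions" = nba_filtered_stats$pos,
      "Clusters" = nba_scaled_cut$cl)
```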
---

## And one more thing...

We can use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

.pull-left[


```r
*library(GGally)
nba_filtered_stats <- nba_filtered_stats %>%
  mutate(full_minimax_clusters = 
           as.factor(nba_multidim_clust_cut$cl))
*ggpairs(nba_filtered_stats,
*        columns =
*          c("x2ppercent", "x3pa",
*            "ast", "stl", "trb"),
*        aes(color = full_minimax_clusters))
```
]

.pull-right[
<img src="06-intro-clustering_files/figure-html/unnamed-chunk-16-1.png" width="504" />
]
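---

## And the prototypes for those five clusters?

As in the two-variable example, `protocut()` returns the prototype indices, so a quick sketch (using only functions already shown above) of the most representative player in each of the five clusters:


```r
# rows of the prototype players for the five-cluster minimax solution
dplyr::select(nba_filtered_stats[nba_multidim_clust_cut$protos, ],
              player, pos, x2ppercent, x3pa, ast, stl, trb)
```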