class: center, middle, inverse, title-slide

# Clustering
## K-means
### June 9th, 2021

---

## Brace yourselves

<img src="http://images6.fanpop.com/image/photos/42900000/Thor-Ragnarok-2017-Loki-and-Thor-mood-compilation-thor-ragnarok-42998758-268-200.gif" width="60%" style="display: block; margin: auto;" />

---

## Into statistical learning with unsupervised learning

What is __statistical learning?__

--

[Preface of Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/):

> _refers to a set of tools for modeling and understanding complex datasets_

--

What is __unsupervised learning?__

--

We have `\(p\)` variables for `\(n\)` observations `\(x_1,\dots,x_n\)`, and for observation `\(i\)`:

`$$x_{i1},x_{i2},\ldots,x_{ip} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*.

- _unsupervised_: none of the variables are __response__ variables, i.e., there are no labeled data

--

Think of unsupervised learning as __an extension of EDA...__

--

- `\(\Rightarrow\)` __there is no unique right answer!__

???
- Statistical learning is the process of ascertaining (discovering) associations between groups of variables
- unsupervised learning - where the goal is to discover interesting things about the data

---

## What is clustering (aka cluster analysis)?

--

[ISLR 10.3](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf):

> _very broad set of techniques for finding subgroups, or clusters, in a dataset_

--

__Goals__:

- observations __within__ clusters are __more similar__ to each other,
- observations __in different__ clusters are __more different__ from each other

--

How do we define __distance / dissimilarity__ between observations?

--

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

--

__Units matter!__

--

- one variable may _dominate_ others when computing Euclidean distance because its range is much larger

- can standardize each variable / column of the dataset to have mean 0 and standard deviation 1 with `scale()` (a quick example is in the appendix)

- __but we may value the separation in that variable!__ (so just be careful...)

???
It is the partitioning of data into homogeneous subgroups

Goal: define clusters for which the within-cluster variation is relatively small, i.e., observations within clusters are similar to each other

---

## What's the clustering objective?

- `\(C_1, \dots, C_K\)` are _sets_ containing indices of observations in each of the `\(K\)` clusters

  - if observation `\(i\)` is in cluster `\(k\)`, then `\(i \in C_k\)`

--

- We want to minimize the __within-cluster variation__ `\(W(C_k)\)` for each cluster `\(C_k\)` and solve:

`$$\underset{C_1, \dots, C_K}{\text{minimize}} \Big\{ \sum_{k=1}^K W(C_k) \Big\}$$`

- Can define using the __squared Euclidean distance__ ( `\(|C_k| = n_k =\)` # observations in cluster `\(k\)`)

`$$W(C_k) = \frac{1}{|C_k|}\sum_{i,j \in C_k} d(x_i, x_j)^2$$`

  - Commonly referred to as the within-cluster sum of squares (WSS)

--

__So how can we solve this?__

---

## [Lloyd's algorithm](https://en.wikipedia.org/wiki/K-means_clustering)

.pull-left[

1) Choose `\(K\)` random centers, aka __centroids__

2) Assign each observation to the closest center (using Euclidean distance)

3) Repeat until the cluster assignments stop changing:

  - Compute new centroids as the averages of the updated groups

  - Reassign each observation to the closest center

(a from-scratch sketch of these steps is in the appendix at the end of the slides)

__Converges to a local optimum__, not the global

__Results will change from run to run__ (set the seed!)

__Takes `\(K\)` as an input!__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif" width="80%" style="display: block; margin: auto;" />

]

---

## NBA 2021 player stats per 100 possessions

Created a dataset of NBA player statistics per 100 possessions using the [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html) package

- which scrapes tables from [basketball-reference.com](https://www.basketball-reference.com/leagues/NBA_2021_per_poss.html)

```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/clustering/nba_2021_player_per_pos_stats.csv")
head(nba_pos_stats)
```

```
## # A tibble: 6 x 31
##   player   pos   age   tm    g     gs    mp    fg    fga   fgpercent x3p   x3pa  x3ppercent x2p
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 Preciou… PF    21    MIA   61    4     737   8.4   15.4  0.544     0     0.1   0          8.4
## 2 Jaylen … PG    24    MIL   7     0     18    2.6   20.9  0.125     0     5.2   0          2.6
## 3 Steven … C     27    NOP   58    58    1605  5.6   9.2   0.614     0     0.1   0          5.6
## 4 Bam Ade… C     23    MIA   64    64    2143  10.6  18.5  0.57      0     0.2   0.25       10.5
## 5 LaMarcu… C     35    TOT   26    23    674   10.1  21.3  0.473     2.2   5.8   0.388      7.8
## 6 LaMarcu… C     35    SAS   21    18    544   10.3  22.1  0.464     2.4   6.7   0.36       7.9
## # … with 17 more variables: x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>, ftpercent <dbl>,
## #   orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>,
## #   pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Some initial cleaning...

- Each row is at the player-team level (`player`-`tm`)

- Players that switched teams during the season also have a combined row with `tm == "TOT"`

```r
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
```

- Create a new dataset by stacking `tot_players` with the rows for players that did not switch teams

- Use the [`bind_rows()` function](https://dplyr.tidyverse.org/reference/bind.html)

```r
nba_player_stats <- nba_pos_stats %>%
* filter(!(player %in% tot_players$player)) %>%
* bind_rows(tot_players)
```

---

## Make a cutoff with the ECDF

View the ECDF of the minutes played (`mp`) variable to make a cutoff:

.pull-left[

```r
nba_player_stats %>%
  ggplot(aes(x = mp)) +
  stat_ecdf() +
  theme_bw() +
  labs(x = "Minutes played",
       y = "Proportion of NBA players") +
* geom_vline(xintercept = 250,
*            linetype = "dashed",
*            color = "darkred")
```

- Arbitrary cutoff: at least 250 minutes played

- Keeps around 75% of players

- Use [`geom_vline`](https://ggplot2.tidyverse.org/reference/geom_abline.html) to draw a vertical line

```r
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 250)
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" />

]

---

### K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use the `kmeans()` function, __but must provide the number of clusters `\(K\)`__

```r
init_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4, nstart = 1)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(init_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" />

]

---

### And if we run it again?

.pull-left[

We get different clustering results!

```r
another_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4, nstart = 1)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(another_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Results depend on initialization__

Keep in mind: __the labels / colors are arbitrary__

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-5-1.png" width="504" />

]

---

### Fix the randomness issue with `nstart`

.pull-left[

Run the algorithm `nstart` times, then __pick the result with the lowest total within-cluster variation__ (total WSS `\(= \sum_{k=1}^K W(C_k)\)`)

```r
nstart_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4,
*        nstart = 30)
nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(nstart_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" />

]

---

### By default `R` uses the [Hartigan and Wong algorithm](https://en.wikipedia.org/wiki/K-means_clustering#Hartigan%E2%80%93Wong_method)

.pull-left[

Updates are based on changing a single observation at a time

__Computational advantages over re-computing distances for every observation__

```r
default_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
*        algorithm = "Hartigan-Wong",
         centers = 4, nstart = 30)
nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(default_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

Very little difference for our purposes...

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

---

### Better alternative to `nstart`: __K-means++__

Pick a random observation to be the center `\(c_1\)` of the first cluster `\(C_1\)`

- This initializes a set `\(Centers = \{c_1 \}\)`

--

Then for each of the remaining clusters `\(k = 2, \dots, K\)`:

- For each observation (that is not a center), compute `\(D(x_i) = \underset{c \in Centers}{\text{min}} d(x_i, c)\)`

  - Distance between the observation and its closest center `\(c \in Centers\)`

--

- Randomly pick a point `\(x_i\)` with probability: `\(p_i = \frac{D^2(x_i)}{\sum_{j=1}^n D^2(x_j)}\)`

--

- As the distance to the closest center increases `\(\Rightarrow\)` the probability of selection increases

- Call this randomly selected observation `\(c^*\)`, update `\(Centers = Centers \cup \{c^*\}\)`

  - Same as `centers = c(centers, c_new)`

--

__Then run `\(K\)`-means using these `\(Centers\)` as the starting points__

(a from-scratch sketch of this initialization is in the appendix at the end of the slides)

---

### K-means++ in R using [`flexclust`](https://cran.r-project.org/web/packages/flexclust/flexclust.pdf)

.pull-left[

```r
library(flexclust)
nba_kmeanspp <- 
* kcca(dplyr::select(nba_filtered_stats,
*                    x3pa, trb), k = 4,
*      control = list(initcent = "kmeanspp"))
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(nba_kmeanspp@cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Note the use of `@` instead of `$`...__

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" />

]

---

### So, how do we choose the number of clusters?!

<img src="https://i.pinimg.com/originals/86/90/6c/86906c4cb23094b8bfb851031509b9f4.gif" width="60%" style="display: block; margin: auto;" />

--

__There is no universally accepted way to conclude that a particular choice of `\(K\)` is optimal!__

---

### Popular heuristic: elbow plot (use with caution)

Look at the total within-cluster variation as a function of the number of clusters

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k) {
                  kmeans_results <- kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
                                           centers = k, nstart = 30)
                  # Return the total WSS for choice of k
                  return(kmeans_results$tot.withinss)
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Popular heuristic: elbow plot (use with caution)

.pull-left[

Choose `\(K\)` where the marginal improvement is low, at the bend (hence the "elbow")

__This is just a guideline and should not dictate your choice of `\(K\)`!__

[Gap statistic](https://web.stanford.edu/~hastie/Papers/gap.pdf) is a popular choice (see the [`clusGap` function](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html) in the [`cluster` package](https://cran.r-project.org/web/packages/cluster/cluster.pdf))

__Next Tuesday__: model-based approach to choosing the number of clusters!

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />

]

---

### Appendix: elbow plot with `flexclust`

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k_choice) {
                  kmeans_results <- kcca(dplyr::select(nba_filtered_stats, x3pa, trb),
                                         k = k_choice,
                                         control = list(initcent = "kmeanspp"))
                  # Return the total WSS for choice of k
                  return(sum(kmeans_results@clusinfo$size * 
                               kmeans_results@clusinfo$av_dist))
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Appendix: elbow plot with `flexclust`

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />
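
---

### Appendix: a from-scratch sketch of Lloyd's algorithm

To make steps 1-3 from the Lloyd's algorithm slide concrete, here is a minimal sketch (illustration only, not how `kmeans()` is implemented). The function name `lloyd_kmeans`, the `max_iter` safeguard, and the assumption that no cluster ever ends up empty are ours.

```r
lloyd_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1) choose K random observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  clusters <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2) assign each observation to its closest centroid (Euclidean distance)
    dists <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_clusters <- unname(apply(dists, 1, which.min))
    # 3) stop once the cluster assignments no longer change
    if (identical(new_clusters, clusters)) break
    clusters <- new_clusters
    # otherwise recompute each centroid as the average of its assigned points
    # (assumes no cluster becomes empty, which kmeans() handles more carefully)
    centroids <- apply(x, 2, function(col) tapply(col, clusters, mean))
  }
  list(cluster = clusters, centers = centroids)
}

# example usage with the two features from the slides
set.seed(2021)
sketch_kmeans <- lloyd_kmeans(dplyr::select(nba_filtered_stats, x3pa, trb), k = 4)
table(sketch_kmeans$cluster)
```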
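
---

### Appendix: a from-scratch sketch of K-means++ initialization

The same idea that `flexclust` uses with `initcent = "kmeanspp"`, written out to mirror the K-means++ slide. This is a sketch under our own names (`kmeanspp_init`, `nba_x`), not the `flexclust` implementation; the chosen centers are then passed to `kmeans()`, which accepts a matrix of starting centers.

```r
kmeanspp_init <- function(x, k) {
  x <- as.matrix(x)
  # first center: a uniformly random observation
  center_ids <- sample(nrow(x), 1)
  for (j in 2:k) {
    m <- length(center_ids)
    # D(x_i): distance from each observation to its closest chosen center
    all_dists <- as.matrix(dist(rbind(x[center_ids, , drop = FALSE], x)))
    d_closest <- apply(all_dists[-(1:m), 1:m, drop = FALSE], 1, min)
    # sample the next center with probability proportional to D^2(x_i)
    # (already-chosen centers have D = 0, so they cannot be picked again)
    center_ids <- c(center_ids, sample(nrow(x), 1, prob = d_closest^2))
  }
  x[center_ids, , drop = FALSE]
}

# hand the K-means++ starting centers to kmeans()
set.seed(2021)
nba_x <- dplyr::select(nba_filtered_stats, x3pa, trb)
init_centers <- kmeanspp_init(nba_x, k = 4)
kmeanspp_nba <- kmeans(nba_x, centers = init_centers, algorithm = "Lloyd")
table(kmeanspp_nba$cluster)
```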
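
---

### Appendix: standardizing before clustering

A quick example of the `scale()` step mentioned on the "Units matter!" slide, using the same two features as above (object names are ours). Whether standardizing is appropriate depends on whether you value the raw separation in those variables.

```r
# standardize each column to mean 0 and standard deviation 1
std_nba_stats <- scale(dplyr::select(nba_filtered_stats, x3pa, trb))

# cluster on the standardized features so neither variable dominates the distance
set.seed(2021)
std_nba_kmeans <- kmeans(std_nba_stats, algorithm = "Hartigan-Wong",
                         centers = 4, nstart = 30)
table(std_nba_kmeans$cluster)
```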