Clustering

class: center, middle, inverse, title-slide

# Clustering
## K-means
### June 14th, 2022

---

## Brace yourselves

---

## Into statistical learning with unsupervised learning

What is __statistical learning?__
--
 [Preface of Introduction to Statistical Learning with Applications in R (ISLR)](https://www.statlearning.com/):

> _refers to a set of tools for modeling and understanding complex datasets_

--
What is __unsupervised learning?__

--
We have `$p$` variables for `$n$` observations `$x_1,\dots,x_n$`, and for observation `$i$`:

`$$x_{i1},x_{i2},\ldots,x_{ip} \sim P$$`

- `$P$` is a `$p$`-dimensional distribution that we might not know much about *a priori*.

- _unsupervised_: none of the variables are __response__ variables, i.e., there are no labeled data

--
Think of unsupervised learning as __an extension of EDA...__

--
- `$\Rightarrow$` __there is no unique right answer!__

???

- Statistical learning is the process of ascertaining (discovering) associations between groups of variables

- unsupervised learning - where the goal is to discover interesting things about the data

---

## What is clustering (aka cluster analysis)?

--
[ISLR 10.3](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf): 
> _very broad set of techniques for finding subgroups, or clusters, in a dataset_

--
__Goals__:

- observations __within__ clusters are __more similar__ to each other,

- observations __in different__ clusters are __more different__ from each other

--
How do we define __distance / dissimilarity__ between observations?

--
- e.g. __Euclidean distance__ between observations `$i$` and `$j$`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

--
__Units matter!__

--
- one variable may _dominate_ others when computing Euclidean distance because its range is much larger

- can standardize each variable / column of dataset to have mean 0 and standard divation 1 with `scale()`

- __but we may value the separation in that variable!__ (so just be careful...)

???

It is the partitioning of data into homogeneous subgroups

Goal define clusters for which the within-cluster variation is relatively small, i.e. observations within clusters are similar to each other

---

## What's the clustering objective?

- `$C_1, \dots, C_K$` are _sets_ containing indices of observations in each of the `$K$` clusters

- if observation `$i$` is in cluster `$k$`, then `$i \in C_k$`

--
- We want to minimize the __within-cluster variation__ `$W(C_k)$` for each cluster `$C_k$` and solve:

`$$\underset{C_1, \dots, C_K}{\text{minimize}} \Big\{ \sum_{k=1}^K W(C_k) \Big\}$$`

- Can define using the __squared Euclidean distance__ ( `$|C_k| = n_k =$` # observations in cluster `$k$`)

`$$W(C_k) = \frac{1}{|C_k|}\sum_{i,j \in C_k} d(x_i, x_j)^2$$`

- Commonly referred to as the within-cluster sum of squares (WSS)

__So how can we solve this?__

---

## [Lloyd's algorithm](https://en.wikipedia.org/wiki/K-means_clustering)

.pull-left[

1) Choose `$K$` random centers, aka __centroids__

2) Assign each observation closest center (using Euclidean distance)

3) Repeat until cluster assignment stop changing:

- Compute new centroids as the averages of the updated groups
  
  - Reassign each observations to closest center

__Converges to a local optimum__, not the global

__Results will change from run to run__ (set the seed!)

__Takes `$K$` as an input!__

]

.pull-right[
<img src="https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif" width="80%" style="display: block; margin: auto;" />
]

---

## Gapminder data

Health and income outcomes for 184 countries from 1960 to 2016 from the famous [Gapminder project](https://www.gapminder.org/data)

```r
library(tidyverse)
library(dslabs)
gapminder <- as_tibble(gapminder)
head(gapminder)
```

```
## # A tibble: 6 x 9
##   country  year infant_mortality life_expectancy fertility population      gdp continent
##   <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl> <fct>    
## 1 Albania  1960            115.             62.9      6.19    1636054 NA       Europe   
## 2 Algeria  1960            148.             47.5      7.65   11124892  1.38e10 Africa   
## 3 Angola   1960            208              36.0      7.32    5270844 NA       Africa   
## 4 Antigu…  1960             NA              63.0      4.43      54681 NA       Americas 
## 5 Argent…  1960             59.9            65.4      3.11   20619075  1.08e11 Americas 
## 6 Armenia  1960             NA              66.9      4.55    1867396 NA       Asia     
## # … with 1 more variable: region <fct>
```

---

## GDP is severely skewed right...

```r
gapminder %>% ggplot(aes(x = gdp)) + geom_histogram() 
```

---

## Some initial cleaning...

- Each row is at the `country`-`year` level

- Will just focus on data for 2011 where `gdp` is not missing

- Take `log()` transformation of `gdp`

```r
clean_gapminder <- gapminder %>%
  filter(year == 2011, !is.na(gdp)) %>%
  mutate(log_gdp = log(gdp))
clean_gapminder
```

```
## # A tibble: 168 x 10
##    country  year infant_mortality life_expectancy fertility population     gdp continent
##    <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>   <dbl> <fct>    
##  1 Albania  2011             14.3            77.4      1.75    2886010 6.32e 9 Europe   
##  2 Algeria  2011             22.8            76.1      2.83   36717132 8.11e10 Africa   
##  3 Angola   2011            107.             58.1      6.1    21942296 2.70e10 Africa   
##  4 Antigu…  2011              7.2            75.9      2.12      88152 8.02e 8 Americas 
##  5 Argent…  2011             12.7            76        2.2    41655616 4.73e11 Americas 
##  6 Armenia  2011             15.3            73.5      1.5     2967984 4.29e 9 Asia     
##  7 Austra…  2011              3.8            82.2      1.88   22542371 5.73e11 Oceania  
##  8 Austria  2011              3.4            80.7      1.44    8423559 2.31e11 Europe   
##  9 Azerba…  2011             32.5            70.8      1.96    9227512 2.14e10 Asia     
## 10 Bahamas  2011             11.1            72.6      1.9      366711 6.76e 9 Americas 
## # … with 158 more rows, and 2 more variables: region <fct>, log_gdp <dbl>
```

---

### K-means clustering example (`gdp` and `life_expectancy`)

.pull-left[

- Use the `kmeans()` function, __but must provide number of clusters `$K$`__

```r
init_kmeans <- 
  kmeans(dplyr::select(clean_gapminder,
                       log_gdp, life_expectancy),
         algorithm = "Lloyd", centers = 4,
         nstart = 1)

clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(init_kmeans$cluster)) %>%
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom") 
```

]
.pull-right[

]

---

## Careful with units...

.pull-left[

- Use the `coord_fixed()` so that the axes match with unit scales

```r
clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(init_kmeans$cluster)) %>%
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom") +
* coord_fixed()
```

]
.pull-right[

]

---

## Standardize the variables!

.pull-left[

- Use the `scale()` function to first __standardize the variables__, `$\frac{value - mean}{standard\ deviation}$`

```r
clean_gapminder <- clean_gapminder %>%
* mutate(std_log_gdp = as.numeric(scale(log_gdp, center = TRUE, scale = TRUE)),
*        std_life_exp = as.numeric(scale(life_expectancy, center = TRUE, scale = TRUE)))
std_kmeans <- 
  kmeans(dplyr::select(clean_gapminder, std_log_gdp, std_life_exp),
         algorithm = "Lloyd", centers = 4, nstart = 1)

clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(std_kmeans$cluster)) %>%
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom") +
  coord_fixed()
```

]
.pull-right[

]

---

## Standardize the variables!

.pull-left[

```r
clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(std_kmeans$cluster)) %>%
  ggplot(aes(x = std_log_gdp, y = std_life_exp,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom") +
* coord_fixed()
```

]
.pull-right[

]

---

### And if we run it again?

.pull-left[

We get different clustering results!

```r
another_kmeans <- 
  kmeans(dplyr::select(clean_gapminder, std_log_gdp, std_life_exp),
         algorithm = "Lloyd", centers = 4, nstart = 1)

clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(another_kmeans$cluster)) %>%
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Results depend on initialization__

Keep in mind: __the labels / colors are arbitrary__

]
.pull-right[

]

---

### Fix randomness issue with `nstart`

.pull-left[

Run the algorithm `nstart` times, then __pick the results with lowest total within-cluster variation__ (total WSS `$= \sum_k^K W(C_k)$`)

```r
nstart_kmeans <- 
  kmeans(dplyr::select(clean_gapminder,
                       std_log_gdp, std_life_exp),
         algorithm = "Lloyd", centers = 4,
*        nstart = 30)

clean_gapminder %>%
  mutate(country_clusters = 
           as.factor(nstart_kmeans$cluster)) %>% 
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]
.pull-right[

]

---

### By default `R` uses [Hartigan and Wong algorithm](https://en.wikipedia.org/wiki/K-means_clustering#Hartigan%E2%80%93Wong_method)

.pull-left[

Updates based on changing a single observation

__Computational advantages over re-computing distances for every observation__

```r
default_kmeans <- 
  kmeans(dplyr::select(clean_gapminder,
                       std_log_gdp, std_life_exp),
*        algorithm = "Hartigan-Wong",
         centers = 4, nstart = 30)

clean_gapminder %>%
  mutate(country_clusters = 
           as.factor(default_kmeans$cluster)) %>% 
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

Very little differences for our purposes...

]
.pull-right[

]

---

### Better alternative to `nstart`: __K-means++__

Pick a random observation to be the center `$c_1$` of the first cluster `$C_1$`

- This initializes a set `$Centers = \{c_1 \}$`
  
--

Then for each remaining cluster `$c^* \in 2, \dots, K$`:

- For each observation (that is not a center), compute `$D(x_i) = \underset{c \in Centers}{\text{min}} d(x_i, c)$`
  
    - Distance between observation and its closest center `$c \in Centers$`
    
--

- Randomly pick a point `$x_i$` with probability: `$p_i = \frac{D^2(x_i)}{\sum_{j=1}^n D^2(x_j)}$`

- As distance to closest center increases `$\Rightarrow$` probability of selection increases

- Call this randomly selected observation `$c^*$`, update `$Centers = Centers\ \cup c^*$`
  
    - Same as `centers = c(centers, c_new)`
    
--

__Then run `$K$`-means using these `$Centers$` as the starting points__

---

### K-means++ in R using [`flexclust`](https://cran.r-project.org/web/packages/flexclust/flexclust.pdf)

.pull-left[

```r
library(flexclust)
init_kmeanspp <- 
* kcca(dplyr::select(clean_gapminder,
*                    std_log_gdp, std_life_exp), k = 4,
*      control = list(initcent = "kmeanspp"))

clean_gapminder %>%
  mutate(country_clusters = 
*          as.factor(init_kmeanspp@cluster)) %>%
  ggplot(aes(x = log_gdp, y = life_expectancy,
             color = country_clusters)) +
  geom_point() + 
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Note the use of `@` instead of `$`...__

]
.pull-right[

]

---

### So, how do we choose the number of clusters?!

__There is no universally accepted way to conclude that a particular choice of `$K$` is optimal!__

---

### Popular heuristic: elbow plot (use with caution)

Look at the total within-cluster variation as a function of the number of clusters

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k) {
                  kmeans_results <- kmeans(dplyr::select(clean_gapminder,
                                                         std_log_gdp,
                                                         std_life_exp),
                                           centers = k, nstart = 30)
                  # Return the total WSS for choice of k
                  return(kmeans_results$tot.withinss)
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Popular heuristic: elbow plot (use with caution)

.pull-left[
Choose `$K$` where marginal improvements is low at the bend (hence the elbow)

__This is just a guideline and should not dictate your choice of `$K$`!__

[Gap statistic](https://web.stanford.edu/~hastie/Papers/gap.pdf) is a popular choice (see [`clusGap` function](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html) in [`cluster` package](https://cran.r-project.org/web/packages/cluster/cluster.pdf))

__Next Tuesday__: model-based approach to choosing the number of clusters!

]

.pull-right[
<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

]

---

### Appendix: elbow plot with `flexclust`

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k_choice) {
                  kmeans_results <- kcca(dplyr::select(clean_gapminder,
                                                         std_log_gdp,
                                                         std_life_exp),
                                         k = k_choice, 
                                         control = list(initcent = "kmeanspp"))
                  # Return the total WSS for choice of k
                  return(sum(kmeans_results@clusinfo$size * 
                               kmeans_results@clusinfo$av_dist))
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Appendix: elbow plot with `flexclust`