class: center, middle, inverse, title-slide

# Clustering
## K-means
### June 9th, 2021

---

## Brace yourselves

<img src="http://images6.fanpop.com/image/photos/42900000/Thor-Ragnarok-2017-Loki-and-Thor-mood-compilation-thor-ragnarok-42998758-268-200.gif" width="60%" style="display: block; margin: auto;" />

---

## Into statistical learning with unsupervised learning

What is __statistical learning?__

--

[Preface of Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/):

> _refers to a set of tools for modeling and understanding complex datasets_

--

What is __unsupervised learning?__

--

We have `\(p\)` variables for `\(n\)` observations `\(x_1,\dots,x_n\)`, and for observation `\(i\)`:

`$$x_{i1},x_{i2},\ldots,x_{ip} \sim P$$`

- `\(P\)` is a `\(p\)`-dimensional distribution that we might not know much about *a priori*.

- _unsupervised_: none of the variables are __response__ variables, i.e., there are no labeled data

--

Think of unsupervised learning as __an extension of EDA...__

--

- `\(\Rightarrow\)` __there is no unique right answer!__

???
- Statistical learning is the process of ascertaining (discovering) associations between groups of variables
- unsupervised learning - where the goal is to discover interesting things about the data

---

## What is clustering (aka cluster analysis)?

--

[ISLR 10.3](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf):

> _very broad set of techniques for finding subgroups, or clusters, in a dataset_

--

__Goals__:

- observations __within__ clusters are __more similar__ to each other,
- observations __in different__ clusters are __more different__ from each other

--

How do we define __distance / dissimilarity__ between observations?

--

- e.g. __Euclidean distance__ between observations `\(i\)` and `\(j\)`

`$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}$$`

--

__Units matter!__

--

- one variable may _dominate_ others when computing Euclidean distance because its range is much larger

- can standardize each variable / column of the dataset to have mean 0 and standard deviation 1 with `scale()` (a quick example is in the appendix)

- __but we may value the separation in that variable!__ (so just be careful...)

???
It is the partitioning of data into homogeneous subgroups

Goal: define clusters for which the within-cluster variation is relatively small, i.e., observations within clusters are similar to each other

---

## What's the clustering objective?

- `\(C_1, \dots, C_K\)` are _sets_ containing indices of observations in each of the `\(K\)` clusters

  - if observation `\(i\)` is in cluster `\(k\)`, then `\(i \in C_k\)`

--

- We want to minimize the __within-cluster variation__ `\(W(C_k)\)` for each cluster `\(C_k\)` and solve:

`$$\underset{C_1, \dots, C_K}{\text{minimize}} \Big\{ \sum_{k=1}^K W(C_k) \Big\}$$`

- Can define using the __squared Euclidean distance__ ( `\(|C_k| = n_k =\)` # observations in cluster `\(k\)`)

`$$W(C_k) = \frac{1}{|C_k|}\sum_{i,j \in C_k} d(x_i, x_j)^2$$`

  - Commonly referred to as the within-cluster sum of squares (WSS)

--

__So how can we solve this?__

---

## [Lloyd's algorithm](https://en.wikipedia.org/wiki/K-means_clustering)

.pull-left[

1) Choose `\(K\)` random centers, aka __centroids__

2) Assign each observation to the closest center (using Euclidean distance)

3) Repeat until the cluster assignments stop changing:

  - Compute new centroids as the averages of the updated groups

  - Reassign each observation to the closest center

(a from-scratch sketch of these steps is in the appendix at the end of the slides)

__Converges to a local optimum__, not the global

__Results will change from run to run__ (set the seed!)

__Takes `\(K\)` as an input!__

]

.pull-right[

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif" width="80%" style="display: block; margin: auto;" />

]

---

## NBA 2021 player stats per 100 possessions

Created a dataset of NBA player statistics per 100 possessions using the [`ballr`](https://cran.r-project.org/web/packages/ballr/vignettes/use-ballr.html) package

- which scrapes tables from [basketball-reference.com](https://www.basketball-reference.com/leagues/NBA_2021_per_poss.html)

```r
library(tidyverse)
nba_pos_stats <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/clustering/nba_2021_player_per_pos_stats.csv")
head(nba_pos_stats)
```

```
## # A tibble: 6 x 31
##   player   pos   age   tm    g     gs    mp    fg    fga   fgpercent x3p   x3pa  x3ppercent x2p
##   <chr>    <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>      <dbl>
## 1 Preciou… PF    21    MIA   61    4     737   8.4   15.4  0.544     0     0.1   0          8.4
## 2 Jaylen … PG    24    MIL   7     0     18    2.6   20.9  0.125     0     5.2   0          2.6
## 3 Steven … C     27    NOP   58    58    1605  5.6   9.2   0.614     0     0.1   0          5.6
## 4 Bam Ade… C     23    MIA   64    64    2143  10.6  18.5  0.57      0     0.2   0.25       10.5
## 5 LaMarcu… C     35    TOT   26    23    674   10.1  21.3  0.473     2.2   5.8   0.388      7.8
## 6 LaMarcu… C     35    SAS   21    18    544   10.3  22.1  0.464     2.4   6.7   0.36       7.9
## # … with 17 more variables: x2pa <dbl>, x2ppercent <dbl>, ft <dbl>, fta <dbl>, ftpercent <dbl>,
## #   orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>,
## #   pts <dbl>, x <lgl>, ortg <dbl>, drtg <dbl>
```

---

## Some initial cleaning...

- Each row is at the player-team level (`player`-`tm`)

- Players that switched teams during the season also have a combined row with `tm == "TOT"`

```r
tot_players <- nba_pos_stats %>% filter(tm == "TOT")
```

- Create a new dataset by stacking `tot_players` with the rows for players that did not switch teams

- Use the [`bind_rows()` function](https://dplyr.tidyverse.org/reference/bind.html)

```r
nba_player_stats <- nba_pos_stats %>%
* filter(!(player %in% tot_players$player)) %>%
* bind_rows(tot_players)
```

---

## Make a cutoff with the ECDF

View the ECDF of the minutes played (`mp`) variable to make a cutoff:

.pull-left[

```r
nba_player_stats %>%
  ggplot(aes(x = mp)) +
  stat_ecdf() +
  theme_bw() +
  labs(x = "Minutes played",
       y = "Proportion of NBA players") +
* geom_vline(xintercept = 250,
*            linetype = "dashed",
*            color = "darkred")
```

- Arbitrary cutoff: at least 250 minutes played

- Keeps around 75% of players

- Use [`geom_vline`](https://ggplot2.tidyverse.org/reference/geom_abline.html) to draw a vertical line

```r
nba_filtered_stats <- nba_player_stats %>%
  filter(mp >= 250)
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-3-1.png" width="504" />

]

---

### K-means clustering example (3 point attempts and rebounds)

.pull-left[

- Use the `kmeans()` function, __but must provide the number of clusters `\(K\)`__

```r
init_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4, nstart = 1)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(init_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-4-1.png" width="504" />

]

---

### And if we run it again?

.pull-left[

We get different clustering results!

```r
another_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4, nstart = 1)
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(another_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Results depend on initialization__

Keep in mind: __the labels / colors are arbitrary__

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-5-1.png" width="504" />

]

---

### Fix the randomness issue with `nstart`

.pull-left[

Run the algorithm `nstart` times, then __pick the result with the lowest total within-cluster variation__ (total WSS `\(= \sum_{k=1}^K W(C_k)\)`)

```r
nstart_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
         algorithm = "Lloyd", centers = 4,
*        nstart = 30)
nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(nstart_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-6-1.png" width="504" />

]

---

### By default `R` uses the [Hartigan and Wong algorithm](https://en.wikipedia.org/wiki/K-means_clustering#Hartigan%E2%80%93Wong_method)

.pull-left[

Updates are based on changing a single observation at a time

__Computational advantages over re-computing distances for every observation__

```r
default_nba_kmeans <- 
  kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
*        algorithm = "Hartigan-Wong",
         centers = 4, nstart = 30)
nba_filtered_stats %>%
  mutate(player_clusters = 
           as.factor(default_nba_kmeans$cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

Very little difference for our purposes...

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-7-1.png" width="504" />

]

---

### Better alternative to `nstart`: __K-means++__

Pick a random observation to be the center `\(c_1\)` of the first cluster `\(C_1\)`

- This initializes a set `\(Centers = \{c_1 \}\)`

--

Then for each of the remaining clusters `\(k = 2, \dots, K\)`:

- For each observation (that is not a center), compute `\(D(x_i) = \underset{c \in Centers}{\text{min}} d(x_i, c)\)`

  - Distance between the observation and its closest center `\(c \in Centers\)`

--

- Randomly pick a point `\(x_i\)` with probability: `\(p_i = \frac{D^2(x_i)}{\sum_{j=1}^n D^2(x_j)}\)`

--

- As the distance to the closest center increases `\(\Rightarrow\)` the probability of selection increases

- Call this randomly selected observation `\(c^*\)`, update `\(Centers = Centers \cup \{c^*\}\)`

  - Same as `centers = c(centers, c_new)`

--

__Then run `\(K\)`-means using these `\(Centers\)` as the starting points__

(a from-scratch sketch of this initialization is in the appendix at the end of the slides)

---

### K-means++ in R using [`flexclust`](https://cran.r-project.org/web/packages/flexclust/flexclust.pdf)

.pull-left[

```r
library(flexclust)
nba_kmeanspp <- 
* kcca(dplyr::select(nba_filtered_stats,
*                    x3pa, trb), k = 4,
*      control = list(initcent = "kmeanspp"))
nba_filtered_stats %>%
  mutate(player_clusters = 
*          as.factor(nba_kmeanspp@cluster)) %>%
  ggplot(aes(x = x3pa, y = trb,
             color = player_clusters)) +
  geom_point() +
  ggthemes::scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")
```

__Note the use of `@` instead of `$`...__

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-8-1.png" width="504" />

]

---

### So, how do we choose the number of clusters?!

<img src="https://i.pinimg.com/originals/86/90/6c/86906c4cb23094b8bfb851031509b9f4.gif" width="60%" style="display: block; margin: auto;" />

--

__There is no universally accepted way to conclude that a particular choice of `\(K\)` is optimal!__

---

### Popular heuristic: elbow plot (use with caution)

Look at the total within-cluster variation as a function of the number of clusters

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k) {
                  kmeans_results <- kmeans(dplyr::select(nba_filtered_stats, x3pa, trb),
                                           centers = k, nstart = 30)
                  # Return the total WSS for choice of k
                  return(kmeans_results$tot.withinss)
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Popular heuristic: elbow plot (use with caution)

.pull-left[

Choose `\(K\)` where the marginal improvement is low, at the bend (hence the "elbow")

__This is just a guideline and should not dictate your choice of `\(K\)`!__

[Gap statistic](https://web.stanford.edu/~hastie/Papers/gap.pdf) is a popular choice (see the [`clusGap` function](https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html) in the [`cluster` package](https://cran.r-project.org/web/packages/cluster/cluster.pdf))

__Next Tuesday__: model-based approach to choosing the number of clusters!

]

.pull-right[

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />

]

---

### Appendix: elbow plot with `flexclust`

```r
# Initialize number of clusters to search over
n_clusters_search <- 2:12
tibble(total_wss = 
         # Compute total WSS for each number by looping with sapply
         sapply(n_clusters_search,
                function(k_choice) {
                  kmeans_results <- kcca(dplyr::select(nba_filtered_stats, x3pa, trb),
                                         k = k_choice,
                                         control = list(initcent = "kmeanspp"))
                  # Return the total WSS for choice of k
                  return(sum(kmeans_results@clusinfo$size * 
                               kmeans_results@clusinfo$av_dist))
                })) %>%
  mutate(k = n_clusters_search) %>%
  ggplot(aes(x = k, y = total_wss)) +
  geom_line() + geom_point() +
  labs(x = "Number of clusters K", y = "Total WSS") +
  theme_bw()
```

---

### Appendix: elbow plot with `flexclust`

<img src="06-Intro-clustering_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />
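
---

### Appendix: a from-scratch sketch of Lloyd's algorithm

To make steps 1-3 from the Lloyd's algorithm slide concrete, here is a minimal sketch (illustration only, not how `kmeans()` is implemented). The function name `lloyd_kmeans`, the `max_iter` safeguard, and the assumption that no cluster ever ends up empty are ours.

```r
lloyd_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1) choose K random observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  clusters <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2) assign each observation to its closest centroid (Euclidean distance)
    dists <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_clusters <- unname(apply(dists, 1, which.min))
    # 3) stop once the cluster assignments no longer change
    if (identical(new_clusters, clusters)) break
    clusters <- new_clusters
    # otherwise recompute each centroid as the average of its assigned points
    # (assumes no cluster becomes empty, which kmeans() handles more carefully)
    centroids <- apply(x, 2, function(col) tapply(col, clusters, mean))
  }
  list(cluster = clusters, centers = centroids)
}

# example usage with the two features from the slides
set.seed(2021)
sketch_kmeans <- lloyd_kmeans(dplyr::select(nba_filtered_stats, x3pa, trb), k = 4)
table(sketch_kmeans$cluster)
```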
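
---

### Appendix: a from-scratch sketch of K-means++ initialization

The same idea that `flexclust` uses with `initcent = "kmeanspp"`, written out to mirror the K-means++ slide. This is a sketch under our own names (`kmeanspp_init`, `nba_x`), not the `flexclust` implementation; the chosen centers are then passed to `kmeans()`, which accepts a matrix of starting centers.

```r
kmeanspp_init <- function(x, k) {
  x <- as.matrix(x)
  # first center: a uniformly random observation
  center_ids <- sample(nrow(x), 1)
  for (j in 2:k) {
    m <- length(center_ids)
    # D(x_i): distance from each observation to its closest chosen center
    all_dists <- as.matrix(dist(rbind(x[center_ids, , drop = FALSE], x)))
    d_closest <- apply(all_dists[-(1:m), 1:m, drop = FALSE], 1, min)
    # sample the next center with probability proportional to D^2(x_i)
    # (already-chosen centers have D = 0, so they cannot be picked again)
    center_ids <- c(center_ids, sample(nrow(x), 1, prob = d_closest^2))
  }
  x[center_ids, , drop = FALSE]
}

# hand the K-means++ starting centers to kmeans()
set.seed(2021)
nba_x <- dplyr::select(nba_filtered_stats, x3pa, trb)
init_centers <- kmeanspp_init(nba_x, k = 4)
kmeanspp_nba <- kmeans(nba_x, centers = init_centers, algorithm = "Lloyd")
table(kmeanspp_nba$cluster)
```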
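
---

### Appendix: standardizing before clustering

A quick example of the `scale()` step mentioned on the "Units matter!" slide, using the same two features as above (object names are ours). Whether standardizing is appropriate depends on whether you value the raw separation in those variables.

```r
# standardize each column to mean 0 and standard deviation 1
std_nba_stats <- scale(dplyr::select(nba_filtered_stats, x3pa, trb))

# cluster on the standardized features so neither variable dominates the distance
set.seed(2021)
std_nba_kmeans <- kmeans(std_nba_stats, algorithm = "Hartigan-Wong",
                         centers = 4, nstart = 30)
table(std_nba_kmeans$cluster)
```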