In this lab you will practice generating and interpreting clustering results, following the examples from the lectures on K-means, hierarchical clustering, and Gaussian mixture models.
The dataset you will be using is all pitches thrown by Max Scherzer, Gerrit Cole, Jacob deGrom, Charlie Morton, and Walker Buehler in the 2019 season (including playoffs).
library(tidyverse)
mlb_pitch_data <-
read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/clustering/sample_2019_mlb_pitches.csv")
head(mlb_pitch_data)
## # A tibble: 6 x 6
## pitch_type release_speed release_spin_rate pfx_x pfx_z pitcher
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 CH 85.3 1291 -1.29 0.182 Max Scherzer
## 2 FF 95.3 2547 -0.628 1.30 Max Scherzer
## 3 FF 94.5 2526 -0.690 1.35 Max Scherzer
## 4 SL 82.8 2404 0.396 0.0376 Max Scherzer
## 5 FF 94.9 2509 -0.779 1.21 Max Scherzer
## 6 SL 85.3 2345 0.269 0.0672 Max Scherzer
Each row in the dataset is a single pitch, with the following six columns:
pitch_type
: two-letter abbreviation denoting the type of pitch that was thrown (see more info below)

release_speed
: release speed of the pitch in miles per hour (mph)

release_spin_rate
: spin rate of the pitch in revolutions per minute (rpm), as tracked by Statcast

pfx_x
: horizontal movement in feet from the catcher's perspective

pfx_z
: vertical movement in feet from the catcher's perspective

pitcher
: name of the pitcher throwing the pitch

The two-letter pitch_type abbreviations represent specific types of pitches that can be summarized in two groups: (1) fastballs and (2) offspeed pitches.
Note: these five pitchers do NOT all throw the same types of pitches, and do NOT throw all of the pitch_type abbreviations that appear in the data.
Spend time exploring the dataset by creating visualizations of the different continuous variables: release_speed, release_spin_rate, pfx_x, and pfx_z. Experiment with visualizations that include all pitchers together versus displaying each pitcher separately (hint: facet_wrap()). Do you observe differences between the different possible pitch_type abbreviations based on the measurements (hint: use color = pitch_type), and does it vary by pitcher?
Which two continuous variables do you think perform best at detecting clusters / subgroups within the data based on your EDA? How many clusters do you think there are? Justify your answers based on your EDA. Do you think you will need to use any scaling of the variables?
Using your two selected variables and selected number of clusters \(K\), generate clustering results using kmeans() and hclust() as in the lecture slides (feel free to try out the protoclust package as well for minimax linkage). Remember to set nstart within kmeans() due to its random initialization. Experiment with different types of linkage functions for hclust() (hint: view help(hclust) and see the method argument, with descriptions of each option in the Details section). Display your clustering results on a scatterplot with your two selected variables.
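For concreteness, a minimal sketch assuming you selected release_speed and pfx_z with \(K = 3\) (hypothetical choices; substitute your own variables and \(K\)):

# standardize the selected variables since they are on different scales
pitch_features <- mlb_pitch_data %>%
  select(release_speed, pfx_z) %>%
  scale()

# K-means with repeated random initializations via nstart
init_kmeans <- kmeans(pitch_features, centers = 3, nstart = 30)

# hierarchical clustering with complete linkage (try other method values);
# note dist() computes all pairwise distances, which can be memory-heavy
complete_hclust <- hclust(dist(pitch_features), method = "complete")
hclust_clusters <- cutree(complete_hclust, k = 3)

# display the K-means clusters on the two selected variables
mlb_pitch_data %>%
  mutate(kmeans_cluster = as.factor(init_kmeans$cluster)) %>%
  ggplot(aes(x = release_speed, y = pfx_z, color = kmeans_cluster)) +
  geom_point(alpha = 0.25) +
  theme_bw()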
How do the results change when you cluster the pitches using all pitchers together versus clustering the pitches thrown by each pitcher separately (hint: use filter() to create a separate dataset for each pitcher, and apply kmeans() and hclust() to each separately, but remember this may impact your number of clusters)?
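A sketch of the per-pitcher approach, shown for one pitcher (repeat or loop over the others, and reconsider the number of clusters for each):

# cluster only Max Scherzer's pitches, using the same hypothetical
# variable and K choices as above
scherzer_pitches <- mlb_pitch_data %>%
  filter(pitcher == "Max Scherzer")

scherzer_kmeans <- scherzer_pitches %>%
  select(release_speed, pfx_z) %>%
  scale() %>%
  kmeans(centers = 3, nstart = 30)

table(scherzer_kmeans$cluster)  # cluster sizes for this pitcher alone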
Compare your clustering results to the provided pitch_type labels. How do they compare? Remember, the provided pitch types are not necessarily correct, as we saw in Mike Pane's presentation. How do the results change when you use only the two variables you did NOT select? What happens when you use all four variables together?
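One quick way to make this comparison is a contingency table of cluster assignments against the provided labels (using the init_kmeans object from the sketch above):

# rows are cluster assignments, columns are the provided labels
table("Cluster" = init_kmeans$cluster,
      "Pitch type" = mlb_pitch_data$pitch_type)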
Next, use the mclust package and the Mclust() function to generate clustering results with Gaussian mixture models, following the lecture example. Start with your originally selected two variables: which model (covariance constraint, e.g. VVV) is selected, and with how many clusters, based on the BIC? View the hard-assignment clustering results for this selection (hint: use mclust_results$classification, replacing mclust_results with whatever you assigned the results to). How do they compare with the known pitch type labels? What happens when you use all four variables? Again, how do the results change when you cluster the pitches using all pitchers together versus clustering the pitches thrown by each pitcher separately? Which do you think is more appropriate?
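A minimal sketch, again assuming the hypothetical release_speed and pfx_z selection (by default Mclust() searches over the covariance constraints and 1 to 9 clusters, choosing by BIC):

library(mclust)

# fit Gaussian mixture models on the two selected variables
mlb_mclust <- mlb_pitch_data %>%
  select(release_speed, pfx_z) %>%
  Mclust()

summary(mlb_mclust)  # selected covariance model and number of clusters
# hard assignments against the provided labels
table("Cluster" = mlb_mclust$classification,
      "Pitch type" = mlb_pitch_data$pitch_type)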
View the distribution of cluster membership probabilities, following the lecture example. What do the distributions look like? View the uncertainty in each cluster (hint: mclust_results$uncertainty) as in the lecture example. Then view the uncertainty by the actual pitch_type label. Are there certain pitch types that display higher uncertainty values? How does this compare across the different pitchers?
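For instance, building on the hypothetical mlb_mclust fit from above:

# distribution of membership probabilities ($z is an n-by-K matrix)
hist(apply(mlb_mclust$z, 1, max),
     xlab = "Maximum membership probability", main = "")

# uncertainty by provided pitch type, separately for each pitcher
mlb_pitch_data %>%
  mutate(uncertainty = mlb_mclust$uncertainty) %>%
  ggplot(aes(x = pitch_type, y = uncertainty)) +
  geom_boxplot() +
  facet_wrap(~ pitcher) +
  theme_bw()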