In this lab you will practice generating and interpreting clustering results, following the examples from the lectures on K-means, hierarchical clustering, and Gaussian mixture models.
The dataset you will be using is all pitches thrown by Max Scherzer, Gerrit Cole, Jacob deGrom, Charlie Morton, and Walker Buehler in the 2019 season (including playoffs).
library(tidyverse)
mlb_pitch_data <-
read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/clustering/sample_2019_mlb_pitches.csv")
head(mlb_pitch_data)
## # A tibble: 6 x 6
## pitch_type release_speed release_spin_rate pfx_x pfx_z pitcher
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 CH 85.3 1291 -1.29 0.182 Max Scherzer
## 2 FF 95.3 2547 -0.628 1.30 Max Scherzer
## 3 FF 94.5 2526 -0.690 1.35 Max Scherzer
## 4 SL 82.8 2404 0.396 0.0376 Max Scherzer
## 5 FF 94.9 2509 -0.779 1.21 Max Scherzer
## 6 SL 85.3 2345 0.269 0.0672 Max Scherzer
Each row in the dataset is a single pitch, with the following six columns:
pitch_type
: two-letter abbreviation denoting the type of pitch that was thrown (see more info below)

release_speed
: release speed of the pitch in miles per hour (mph)

release_spin_rate
: spin rate of the pitch in revolutions per minute (rpm), as tracked by Statcast

pfx_x
: horizontal movement in feet from the catcher's perspective

pfx_z
: vertical movement in feet from the catcher's perspective

pitcher
: name of the pitcher throwing the pitch

The two-letter pitch_type abbreviations represent specific types of pitches that can be summarized in two groups: (1) fastballs and (2) offspeed pitches.
Note: these five pitchers do NOT all throw the same types of pitches, and do NOT throw all of the pitch_type abbreviations that appear in the data.
Spend time exploring the dataset by creating visualizations of the different continuous variables: release_speed, release_spin_rate, pfx_x, and pfx_z. Experiment with visualizations that include all pitchers together versus displaying each pitcher separately (hint: facet_wrap()). Do you observe differences between the different possible pitch_type abbreviations based on the measurements (hint: use color = pitch_type), and does it vary by pitcher?
Which two continuous variables do you think perform best at detecting clusters / subgroups within the data based on your EDA? How many clusters do you think there are? Justify your answers based on your EDA. Do you think you will need to use any scaling of the variables?
Using your two selected variables and selected number of clusters \(K\), generate clustering results using kmeans() and hclust() as in the lecture slides (feel free to try out the protoclust package as well for minimax linkage). Remember to set nstart within kmeans() due to its random initialization. Experiment with different types of linkage functions for hclust() (hint: view help(hclust) and see the method argument, with descriptions of each option in the Details section). Display your clustering results on a scatterplot with your two selected variables.
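For concreteness, a minimal sketch assuming you selected release_speed and pfx_z with \(K = 3\) (hypothetical choices; substitute your own variables and \(K\)):

# standardize the selected variables since they are on different scales
pitch_features <- mlb_pitch_data %>%
  select(release_speed, pfx_z) %>%
  scale()

# K-means with repeated random initializations via nstart
init_kmeans <- kmeans(pitch_features, centers = 3, nstart = 30)

# hierarchical clustering with complete linkage (try other method values);
# note dist() computes all pairwise distances, which can be memory-heavy
complete_hclust <- hclust(dist(pitch_features), method = "complete")
hclust_clusters <- cutree(complete_hclust, k = 3)

# display the K-means clusters on the two selected variables
mlb_pitch_data %>%
  mutate(kmeans_cluster = as.factor(init_kmeans$cluster)) %>%
  ggplot(aes(x = release_speed, y = pfx_z, color = kmeans_cluster)) +
  geom_point(alpha = 0.25) +
  theme_bw()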
How do the results change when you cluster the pitches using all pitchers together versus clustering the pitches thrown by each pitcher separately (hint: use filter() to create a separate dataset for each pitcher, and apply kmeans() and hclust() to each separately, but remember this may impact your number of clusters)?
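A sketch of the per-pitcher approach, shown for one pitcher (repeat or loop over the others, and reconsider the number of clusters for each):

# cluster only Max Scherzer's pitches, using the same hypothetical
# variable and K choices as above
scherzer_pitches <- mlb_pitch_data %>%
  filter(pitcher == "Max Scherzer")

scherzer_kmeans <- scherzer_pitches %>%
  select(release_speed, pfx_z) %>%
  scale() %>%
  kmeans(centers = 3, nstart = 30)

table(scherzer_kmeans$cluster)  # cluster sizes for this pitcher alone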
Compare your clustering results to the provided pitch_type labels. How do they compare? Remember, the provided pitch types are not necessarily correct, as we saw in Mike Pane's presentation. How do the results change when you use only the two variables you did NOT select? What happens when you use all four variables together?
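One quick way to make this comparison is a contingency table of cluster assignments against the provided labels (using the init_kmeans object from the sketch above):

# rows are cluster assignments, columns are the provided labels
table("Cluster" = init_kmeans$cluster,
      "Pitch type" = mlb_pitch_data$pitch_type)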
Next, use the mclust package and the Mclust() function to generate clustering results with Gaussian mixture models, following the lecture example. Start with your originally selected two variables: which model (covariance constraint, e.g. VVV) is selected, and with how many clusters, based on the BIC? View the hard-assignment clustering results for this selection (hint: use mclust_results$classification, replacing mclust_results with whatever you assigned the results to). How do they compare with the known pitch type labels? What happens when you use all four variables? Again, how do the results change when you cluster the pitches using all pitchers together versus clustering the pitches thrown by each pitcher separately? Which do you think is more appropriate?
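A minimal sketch, again assuming the hypothetical release_speed and pfx_z selection (by default Mclust() searches over the covariance constraints and 1 to 9 clusters, choosing by BIC):

library(mclust)

# fit Gaussian mixture models on the two selected variables
mlb_mclust <- mlb_pitch_data %>%
  select(release_speed, pfx_z) %>%
  Mclust()

summary(mlb_mclust)  # selected covariance model and number of clusters
# hard assignments against the provided labels
table("Cluster" = mlb_mclust$classification,
      "Pitch type" = mlb_pitch_data$pitch_type)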
View the distribution of cluster membership probabilities, following the lecture example. What do the distributions look like? View the uncertainty in each cluster (hint: mclust_results$uncertainty) as in the lecture example. Then view the uncertainty by the actual pitch_type label. Are there certain pitch types that display higher uncertainty values? How does this compare across the different pitchers?
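For instance, building on the hypothetical mlb_mclust fit from above:

# distribution of membership probabilities ($z is an n-by-K matrix)
hist(apply(mlb_mclust$z, 1, max),
     xlab = "Maximum membership probability", main = "")

# uncertainty by provided pitch type, separately for each pitcher
mlb_pitch_data %>%
  mutate(uncertainty = mlb_mclust$uncertainty) %>%
  ggplot(aes(x = pitch_type, y = uncertainty)) +
  geom_boxplot() +
  facet_wrap(~ pitcher) +
  theme_bw()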