Variable Selection for Consistent Clustering


A common problem in cluster analysis is that different methods yield different clusters for the same data set. We might be interested in discovering which clusters (if any) are consistent across methods. In a framework similar to the maximum clustering similarity (MCS) method of Albatineh and Niewiadomska-Bugaj (2011), this paper describes an approach that simultaneously selects the variables and the number of clusters yielding consistent clustering results. Following Raftery and Dean (2006), a greedy search algorithm finds the set of variables and number of clusters with the highest level of consistency, as measured by the adjusted Rand index (ARI) of Hubert and Arabie (1985). Additionally, we address variation and incorporate confidence in our selections through bootstrapping, where the next choice in the search is based on a distributional overlap measure. We present results for both simulated and benchmark clustering data sets.
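As a rough illustration of the consistency measure and the search strategy named above, the following sketch computes the Hubert-Arabie ARI between two labelings and runs a simple forward-only greedy search. It is a minimal sketch under stated assumptions, not the procedure from the paper: the function names, the `consistency` callable, and the stopping rule are hypothetical, and the sketch omits the joint search over the number of clusters and the bootstrap overlap step.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pair counts from the contingency table cells and its row/column margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings place every item in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def greedy_forward_selection(variables, consistency):
    """Add one variable at a time while the consistency score keeps improving.

    `consistency` is a hypothetical callable (an assumption of this sketch)
    that maps a tuple of variables to a score, for example the ARI between
    two clustering methods run on those variables.
    """
    selected, best = [], float("-inf")
    remaining = list(variables)
    while remaining:
        # Score every candidate extension and keep the best one.
        score, choice = max(
            ((consistency(tuple(selected + [v])), v) for v in remaining),
            key=lambda pair: pair[0],
        )
        if score <= best:
            break  # no candidate improves the current score
        selected.append(choice)
        remaining.remove(choice)
        best = score
    return selected, best
```

In use, `consistency` would wrap whichever clustering methods are being compared, for instance by clustering the chosen variables with two methods and returning `adjusted_rand_index` of the resulting labels. An ARI of 1 indicates identical partitions up to relabeling, and values near 0 indicate agreement no better than chance.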

Reston, VA
Ron Yurko
PhD Student in Statistics & Data Science

My research interests include developing selective inference methodology for applications in statistical genetics and genomics, as well as clustering problems and research on statistics in sports.