An Overview of Clustering: Finding and Extracting Group Structure in High-Dimensional Data
Clustering is the search for similar or homogeneous subgroups in a population, such as consumers, patients, genes, images, or text documents: anything that could contain group structure. For example, consumers might be divided into different market segments based on their preferences and spending habits; advertising models could be tailored to these segments. These ideas can be extended to personalized advertising online, including images of related products and suggested links. In public health, patients might have different responses to different interventions over time. We might be interested in how to predict which outcome group a patient is likely to be in given their symptoms, past history, and current treatment. The pixels of an image can also be segmented into similar but spatially connected groups to find objects such as a tumor in an X-ray or the position of a potentially moving object across several frames. In document clustering, the goal is to group similar pieces of text (blogs, emails, posts, letters, articles, etc.) based on the words used, their frequency, and other text features.
In all cases, the goal is to extract structure from potentially high-dimensional data. Clustering methods can uncover this structure, if it exists.
The difficulty, however, often lies in which clustering approach to adopt, particularly given that results are rarely independent of approach.
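As a small illustration of how results can depend on the approach, the sketch below runs two hierarchical linkage rules on the same toy data. It is written in plain Python rather than the R used in the tutorial, and the data set (two parallel strips of points) and the helper name `agglomerate` are assumptions made purely for this example. It is a naive implementation meant for reading, not for large data.

```python
from itertools import combinations
from math import dist


def agglomerate(points, k, linkage):
    """Naive agglomerative clustering down to k clusters.

    `linkage` reduces all pairwise distances between two clusters to one
    number: `min` gives single linkage, `max` gives complete linkage.
    Returns the partition as a set of frozensets of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(
                dist(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return {frozenset(c) for c in clusters}


# Hypothetical toy data: two horizontal strips of 10 points each,
# with spacing 1 inside a strip and a vertical gap of 2 between strips.
points = [(x, 0.0) for x in range(10)] + [(x, 2.0) for x in range(10)]
strips = {frozenset(range(10)), frozenset(range(10, 20))}

single = agglomerate(points, 2, min)
complete = agglomerate(points, 2, max)

print("single linkage recovers the strips:  ", single == strips)
print("complete linkage recovers the strips:", complete == strips)
```

On this data, single linkage follows the chain of nearest neighbors and returns the two strips, while complete linkage favors compact groups and cuts across them. Neither answer is wrong in itself; each reflects the assumption about cluster shape built into the linkage rule.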
This tutorial will give an overview of algorithmic and statistical approaches to clustering, with an emphasis on how to choose an approach and its related parameters. Throughout the tutorial, we use a broad range of examples of at least moderate dimensionality, with specific attention paid to longitudinal trajectory data. The goal of the tutorial is to better inform practitioners of the wide variety of available clustering tools, their underlying assumptions, and their advantages and disadvantages. Handouts, reference lists, and example R code will be made available.
Note that while we use the statistical software package R, it is for illustrative purposes. Many of the discussed techniques are readily available on other platforms. Our focus will be on understanding the related assumptions and consequences.
Topics and Objectives
Our primary goal is to provide the practitioner with a solid background in the variety of available clustering approaches and their related assumptions, necessary parameter choices, cluster shapes and sizes, and advantages/disadvantages. Practitioners will also gain skills in critiquing and interpreting their final cluster solution and identifying unstable or undesirable clusters. Topics may include the following, interspersed as appropriate:
- Deterministic Algorithms
  - Hierarchical Linkage Clustering
  - K-Means (including fuzzy version)
  - K-Medoids
- Statistical Approaches
  - Parametric mixture models/model-based clustering
  - Nonparametric bump hunting or mode finding
- Spectral Clustering or Image Segmentation
- Longitudinal Clustering
- Validation and Visualization
  - Uncertainty
  - Cluster Validation Strength
  - Silhouettes
  - Stripes and Neighborhoods
Instructors
Rebecca Nugent
Professor Nugent is an Associate Teaching Professor in the Department of Statistics at Carnegie Mellon University. She received her Bachelor's in Mathematics, Statistics, and Spanish from Rice University, her Master's in Statistics from Stanford University, and her PhD in Statistics from the University of Washington. Her research primarily focuses on finding and visualizing high-dimensional structure. She was the 2009 Chikio Hayashi Award recipient (a Young Promising Researcher award presented by the International Federation of Classification Societies). She has served as the President of the Classification Society (of North America) and is active in the ASA Sections on Statistical Computing and Statistical Graphics. Her publications largely focus on clustering methodology for a broad range of applications including educational data mining, psychometrics, public health, and record linkage. At Carnegie Mellon University, she has taught undergraduate and graduate classes in statistical learning, regression, document clustering, and record linkage, among others. She has also won several teaching awards, including the Elliott Dunlap Smith Award for Distinguished Teaching and Educational Service.
Sam Ventura
Samuel L. Ventura is a PhD Candidate in the Department of Statistics at Carnegie Mellon University. He received his Bachelor's in Statistics and Computational Finance and his Master's in Statistics from CMU. His research focuses on large-scale clustering and classification techniques, and he brings extensive statistical computing knowledge and experience. He has been an invited speaker at conferences focusing on clustering, classification, record linkage, and/or statistical learning. Sam has also taught several summer courses on Probability while at CMU.