Rebecca Nugent | Department of Statistics, Carnegie Mellon University

An Overview of Clustering: Finding and Extracting Group Structure in High-Dimensional Data

Clustering is the search for similar or homogeneous subgroups in a population, whether of consumers, patients, genes, images, text documents, or anything else that might contain group structure. For example, consumers might be divided into different market segments based on their preferences and spending habits; advertising models could be tailored to these segments. These ideas can be extended to personalized advertising online, including images of related products and suggested links. In public health, patients might have different responses to different interventions over time. We might be interested in how to predict which outcome group a patient is likely to be in given their symptoms, past history, and current treatment. The pixels of an image can also be segmented into groups that are both similar and spatially connected, either to find objects such as a tumor in an X-ray or to track the position of a potentially moving object across several frames. In document clustering, the goal is to group similar pieces of text (blogs, emails, posts, letters, articles, etc.) based on the words used, their frequencies, and other text features. In all cases, the goal is to extract structure from potentially high-dimensional data. Clustering methods can uncover this structure, if it exists. The difficulty, however, often lies in choosing which clustering approach to adopt, particularly given that results are rarely independent of approach. This tutorial will give an overview of algorithmic and statistical approaches to clustering, with an emphasis on how to choose an approach and its related parameters. Throughout the tutorial, we use a broad range of examples of at least moderate dimensionality, with some specific attention paid to longitudinal trajectory data. The goal of the tutorial is to better inform practitioners of the wide variety of available clustering tools, their underlying assumptions, and their advantages and disadvantages. Handouts, reference lists, and example R code will be made available.
Note that while we use the statistical software package R, it is for illustrative purposes only. Many of the discussed techniques are readily available on other platforms. Our focus will be on understanding the assumptions behind each technique and their consequences.