An Overview of Clustering: Finding Group Structure in Educational Research Data
Clustering is the search for similar or homogeneous subgroups in a population of students, items, log files, or anything else that might contain group structure. Students in a class may have different learning curves or different latent skill set profiles; subsets of test questions might share patterns of skills or errors; groups of similar users may explore interactive learning environments or educational games in different ways. Clustering methods can uncover these differences, if they exist. With the advent of educational learning systems that track student progress via keystrokes, time on task, and tasks clicked or explored, educational data sets have become richer, higher in dimensionality, and consequently more complicated. Traditional psychometric or statistical estimation procedures can have difficulty handling data sets of this size. For example, cognitive diagnosis models can typically handle up to about ten skills, whereas many intelligent tutoring systems track skills in the hundreds. Clustering offers solutions when other approaches may not be feasible; recent work, for instance, has shown consistency of some types of clusters when estimating skill set profiles. Even if another approach will eventually be the focus of the final analysis, first understanding the group structure (if present) in these high-dimensional data sets can be very beneficial.
This tutorial gives an overview of algorithmic and statistical approaches to clustering, with references to current related EDM work where applicable. Deterministic algorithms include hierarchical linkage clustering, k-means, and k-medoids. Statistical approaches can be broadly divided into parametric and nonparametric methods. Parametric methodology assumes an underlying mixture distribution in the population; each subgroup gets its own density component (most often Gaussian). We focus on model-based clustering, an approach that searches for the mixture of groups that best fits the data set. Nonparametric methodology searches for high-frequency areas in the feature space, allowing for clusters of any shape or size. These approaches can be very computationally intensive. We briefly discuss some tools to help visualize the high-frequency areas, or modes, of the density and provide references and supplemental material for more technical methods. Many clustering methods, regardless of assumptions, also require an estimate of the number of subgroups in the population; we give an overview of some commonly used choices. We also give special focus to the cluster analysis of longitudinal data and illustrate how to determine different group trajectories over time, an inherent goal in many educational research questions.
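As a rough illustration of the k-means algorithm named above, the following is a minimal sketch in plain Python rather than the tutorial's own R code. The function names (`kmeans`, `squared_distance`), the `students` data, and the two features (proportion correct and mean hints requested per problem) are hypothetical and chosen only for illustration. The sketch alternates between assigning each point to its nearest center and moving each center to the mean of its members, and it assumes the number of clusters k is given.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Alternate assignment and center updates until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct observations drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical students: (proportion correct, mean hints requested per problem).
students = [
    (0.90, 0.2), (0.85, 0.4), (0.95, 0.1),  # mostly correct, few hints
    (0.40, 2.5), (0.35, 3.0), (0.45, 2.8),  # mostly incorrect, many hints
]
labels, centers = kmeans(students, k=2)
print(labels)
print(centers)
```

On this toy data the two groups are well separated, so the first three and last three students end up in different clusters; which cluster receives which numeric label depends on the random initialization. The features here are on different scales, and in practice standardizing them and trying several random starts would usually be advisable.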
Throughout the tutorial, we use examples rooted in the educational data mining literature with a special emphasis on data sets of at least moderate dimensionality. The goal of the tutorial is to better inform practitioners of the wide variety of available clustering tools, their underlying assumptions, and their advantages and disadvantages. Handouts, reference lists, and example R code will be made available.