Visualizing and Learning the Structure in Data
What are the location and severity patterns of earthquakes off Fiji? What characteristics of the purchases you make on Amazon determine which products it recommends? Can we predict the presence of roads and infrastructure from images of the Earth's surface? Which words should be flagged in emails as indicating spam? Or in Internet chatter as indicating potential acts of terror? Is divorce contagious among friends? What about obesity? What grocery store items tend to be purchased together? Beer and diapers?
All of these questions can be answered by discerning, visualizing, and learning the structure in data. While data sets used to be limited in size by the cost of their collection, today's data analytics problems are large, complicated, and messy. Mistakes in data entry and odd anomalies can derail analyses if not identified and addressed. Often the real information in the data, the "signal," is hidden in background noise. Statistical learning methods are designed to learn, extract, and model the important information and features in a data set. These methods should always be coupled with appropriate graphical displays of the quantitative information. This course introduces students to both the most common forms of graphical displays (and their uses and misuses) and a range of supervised and unsupervised learning techniques (i.e., learning with and without labels), focusing on clustering and classification methodology. Students will also engage in projects using both graphical methods and learning models to understand data from real, interdisciplinary research problems.
Topics and Objectives
Our primary goal is to provide a background in the variety of available visualization approaches as a jumping-off point for statistical learning algorithms, both supervised and unsupervised. Students will not be expected to become experts in these areas but rather to gain exposure to the right kinds of questions to ask and a sense of where to start looking for the answers.
Our syllabus is flexible, depending on the needs of the students, and will be updated below:
- July 1: Introduction, Motivation
- July 2: Continuous Distributions, Histograms, Boxplots, Box-Percentile Plots, Kernel Density Estimates, Violin Plots, Bean Plots, Conditional Density Plots
- July 5: Scatterplots, Jitter, Sunflower Plots, 2-D Histograms, 2-D Kernel Density Estimates, Level Sets/Contours/Heat Maps, (Rotating) Perspectives
- July 6: Linear Regression, Diagnostics
- July 7: Locally Weighted Scatterplot Smoothers (LOWESS); Cubic Polynomial Splines
- July 8: Multivariate Regression (and Diagnostics); Regression Trees
- July 11: Classification Trees; General Discrimination Analysis (Posterior Probabilities)
- July 12: Linear/Quadratic Discriminant Classifiers; Icons/Glyphs, Chernoff Faces
- July 14: Clustering Approaches: Deterministic vs Statistical; Hierarchical Clustering, K-Means
- July 15: Clustering: K-Means, Spherical K-Means, Document Clustering
- July 18: Clustering: Spectral Clustering, Model-Based Clustering, Nonparametric Clustering
- July 19: Putting it all together (Research examples)
Rebecca Nugent
Professor Nugent is a Teaching Professor in the Department of Statistics at Carnegie Mellon University and Director of the nationally ranked Undergraduate Statistics program. She received her Bachelor's in Mathematics, Statistics, and Spanish from Rice University, her Master's in Statistics from Stanford University, and her PhD in Statistics from the University of Washington. Her research primarily focuses on finding and visualizing high-dimensional structure. She was the 2009 Chikio Hayashi Award recipient (a Young Promising Researcher award presented by the International Federation of Classification Societies). She has served as the President of the Classification Society (of North America) and is active in the ASA Section on Statistical Computing and Statistical Graphics and the ASA Section on Statistical Education. Her publications largely focus on clustering methodology for a broad range of applications including educational data mining, psychometrics, public health, and record linkage. At Carnegie Mellon University, she has taught undergraduate and graduate classes in statistical learning, regression, document clustering, and record linkage, among others. She has also won several teaching awards, including the national American Statistical Association 2015 Waller Education Award for innovation in statistics education.