Cosma Shalizi

Statistics 36-350: Data Mining

Fall 2008

MW 10:30--11:20 Porter Hall 226B
F 10:30--11:20 Doherty Hall 1217

Data mining is the art of extracting useful patterns from large bodies of data; finding seams of actionable knowledge in the raw ore of information. The rapid growth of computerized data, and the computer power available to analyze it, creates great opportunities for data mining in business, medicine, science, government, etc. The aim of this course is to help you take advantage of these opportunities in a responsible way. After taking the class, when you're faced with a new problem, you should be able to (1) select appropriate methods, and justify their choice, (2) use and program statistical software to implement them, and (3) critically evaluate the results and communicate them to colleagues in business, science, etc.

Data mining is related to statistics and to machine learning, but has its own aims and scope. Statistics is a mathematical science, studying how reliable inferences can be drawn from imperfect data. Machine learning is a branch of engineering, developing a technology of automated induction. We will freely use tools from statistics and from machine learning, but we will use them as tools, not things to study in their own right. We will do a lot of calculations, but will not prove many theorems, and we will do even more experiments than calculations.

Outline, Notes, Readings

This is a rough outline of the material. The first five items should take us to the middle of the semester, and (6) and (7) will definitely be covered; the others will depend on time and class interests.
  1. Searching by similarity: Searching by content (texts, images, genes, ...); attributes, representations and definitions of similarity and distance; choice of representation; multi-dimensional scaling; classifications; image search and invariants; user feedback; evaluating searches
  2. Information: information and uncertainty; classes and attributes; interactions among attributes; relative distributions
  3. Clustering: supervised and unsupervised learning; categorization; unsupervised category-learning, a.k.a. clustering; k-means clustering (a short R sketch appears after this outline); hierarchical clustering; geometry of clusters; what makes a good cluster?
  4. Data-reduction and feature-enhancement: Standardizing data; using principal components to eliminate attributes; using factor analysis to eliminate attributes; limits and pitfalls of PCA and factor analysis; nonlinear dimensionality reduction
  5. Regression: Review of linear regression; transformations to linearity; the truth about linear regression; local linear regression; polynomial regression; kernel regression; other non-parametric methods
  6. Prediction: Evaluating predictive models; over-fitting and capacity control; regression trees; classification trees; neural networks; combining predictive models; forests; how to gamble if you must
  7. Classification: Supervised categorization; linear classifiers; logistic regression; the kernel trick; base rates and multiple testing; spam, fraud, credit cards, profiling
  8. Change over time: trends and time series models; Markov models; hidden Markov models; adapting to change and failure to adapt
  9. Modeling interventions: Estimating causal impacts without experiments; matching; graphical causal models and Tetrad.
  10. Waste and Abuse: what happens when the data are bad; what happens with the wrong kinds of data; situations when data mining will fail; trying not to be evil; some failures
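
To give a concrete taste of item 3, here is a minimal R sketch of k-means clustering, run on R's built-in iris measurements; the data set and the choice of three clusters are purely illustrative and are not part of the course materials.

  data(iris)
  features <- scale(iris[, 1:4])               # standardize the four numeric attributes
  clusters <- kmeans(features, centers = 3)    # k-means with an (arbitrary) choice of 3 clusters
  table(clusters$cluster, iris$Species)        # compare the found clusters to the known species
  plot(features[, 1:2], col = clusters$cluster,
       main = "k-means clusters on the first two standardized features")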

Lecture Notes

  1. Introduction to the course (25 August)
  2. Information retrieval and similarity searching (25 August)
  3. Multidimensional scaling and a first glance at classification (27 August)
  4. A little about page-rank (29 August)
    Homework #1, due 8 September: assignment, R, newsgroups.tgz data file
    Solutions
  5. Image search, abstraction and invariance; the accompanying slides (8 September)
  6. Finding informative features (10 September)
  7. Information and interaction among features (12 September)
    Homework #2, due 22 September (extended from 19 September): assignment, solutions, solutions code
    Note: Information theory, axiomatic foundations, connections to statistics — elaboration on some points raised in lecture (12 September)
  8. Categorization: types of categorization, basic classifiers and finding simple clusters in data (15 September)
  9. Clustering: hierarchical clustering; how many clusters? (17 September)
  10. Yet more clustering (19 September; slides)
  11. Making better features: transformations; projections and principal components analysis (22 September)
  12. Mathematics of principal components analysis; interpretations and limitations of PCA (24 September)
  13. Yet more on linear dimensionality reduction: PCA + information retrieval = Latent semantic indexing. Factor analysis: motivations, historical roots, preliminaries to estimation (26 September)
    Homework #3, due 3 October: assignment
  14. More on Factor Analysis: estimation and the rotation problem (29 September)
  15. Principal Components versus Factor Analysis: worked examples, basic goodness-of-fit testing for factor analysis; R code for lecture (1 October)
  16. The truth about principal components and factor analysis: strengths, limitations, factor models as graphical models, factor models and mixture models, Thomson's sampling model; R code for Thomson's model (3 October)
    Homework #4, due Friday, 10 October: assignment, nci.kmeans, nci.pca2.kmeans; solutions
  17. Regression: predicting quantitative features: point prediction; expectations and mean-square optimality; regression functions; regression as smoothing; linear regression as linear smoothing; other kinds of linear smoothers; nearest-neighbor regression; kernel regression. R code for figures, data for running example. (6 October)
  18. The truth about linear regression: optimal linear prediction; shifting distributions and omitted variables; rights and obligations of probabilistic assumptions; abuses of linear regression; how to hurt angels (8 October)
  19. Extending linear regression: weighted least-squares, heteroskedasticity, local linear regression. R code for figures, data for running example (10 October)
  20. Review session: no handout (13 October)
  21. Midterm: exam and solutions (15 October)
  22. Evaluating predictive models: in-sample and generalization error; over-fitting and under-fitting; model selection, capacity control, cross-validation (a short R sketch of cross-validation appears after this list). R for figures. (20 October)
  23. Using cross-validation: mechanics and examples (22 October)
  24. Using non-parametric smoothing: adaptive smoothing, testing parametric forms (24 October)
    Homework #5, due Friday, 31 October: assignment; solutions, R for solutions
  25. Prediction trees 1: mostly regression trees, plus a "classification tree we can believe in" (27 October)
    Homework #6, due Friday, 7 November: assignment; solutions
  26. Prediction trees 2: classification trees (29 October and 3 November)
  27. Bootstrapping, Bagging, and Random Forests (5 November)
  28. Combining Predictive Models and the Power of Diversity (7 November)
  29. Linear Classifiers and the Perceptron Algorithm (10 November)
  30. Logistic Regression and Newton's Method (12 November)
    Homework #7, due Friday, 21 November: assignment; solutions
  31. Neural Networks: The Mathematical Reality (14 November)
  32. Neural Networks: The Biological Myth (17 November)
  33. Support Vector Machines (19 November)
  34. Support vector machines continued (21 November; same handout as previous)
    Homework #8, due Monday, 1 December: assignment; solutions
  35. The Lecture Full of Fail: the wrong data, lying data, covariate shift, low base-rates and overwhelming false positives; waste, fraud and abuse (24 November; notes forthcoming)
    Homework #9, due 15 December: assignment; solutions
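
As a supplement to lectures 22 and 23, here is a rough R sketch of k-fold cross-validation for comparing two regression models; the simulated data, the five-fold split, and the two candidate models are illustrative assumptions, not code from the class.

  set.seed(1)
  df <- data.frame(x = runif(200))
  df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)   # noisy nonlinear relationship
  folds <- sample(rep(1:5, length.out = nrow(df)))    # randomly assign each case to one of 5 folds
  cv.mse <- function(form) {
    errs <- sapply(1:5, function(k) {
      fit <- lm(form, data = df[folds != k, ])        # fit on the other four folds
      pred <- predict(fit, newdata = df[folds == k, ])
      mean((df$y[folds == k] - pred)^2)               # mean squared error on the held-out fold
    })
    mean(errs)
  }
  cv.mse(y ~ x)            # straight line: under-fits this curve
  cv.mse(y ~ poly(x, 5))   # fifth-degree polynomial: much lower cross-validated error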

Readings

David P. Feldman, "Introduction to Information Theory", ch. 1
To accompany lecture 5
Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
To accompany lecture 6
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, "Indexing by Latent Semantic Analysis" [PDF]
To accompany lecture 12 (optional)
Thomas K. Landauer and Susan T. Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge" [PDF]
To accompany lecture 12 (optional)
L. L. Thurstone, "The Vectors of Mind"
To accompany lecture 12 (optional)
David Hand, "Classifier Technology and the Illusion of Progress", arxiv:math.ST/0606441
To accompany lecture 34
Bernard E. Harcourt, "Against Prediction: Sentencing, Policing, and Punishing in an Actuarial Age", ssrn/756945
To accompany lecture 34

Course Mechanics

Details on grading, exams, etc., can be found in the full syllabus.

Blackboard

Homework, solutions, grades, class announcements, etc., will be handled through Blackboard; let me know if you cannot access the course site.

Textbook

Our textbook is Principles of Data Mining by Hand, Mannila and Smyth. It should be at the campus bookstore by 26 August, but you can also buy it online (I like Powell's), or directly from MIT Press.

R

R is a free, open-source software package/programming language for statistical computing. (It is a dialect of a language called S, whose commercial version is S-plus.) You can expect at least one assignment every week which uses R. If you do not have ready access to a computer which can run it, contact me at once.
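
If you have never seen R, the short transcript below (typed at the interactive prompt) gives some of its flavor; the toy numbers are made up for illustration, and everything here uses only base R functions.

  heights <- c(1.62, 1.75, 1.68, 1.80, 1.71)   # create a numeric vector
  mean(heights); sd(heights)                    # summary statistics
  weights <- c(55, 82, 63, 90, 70)
  fit <- lm(weights ~ heights)                  # fit a simple linear regression
  summary(fit)                                  # coefficients, R-squared, etc.
  plot(heights, weights); abline(fit)           # scatterplot with the fitted line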

Here are some resources for learning R: