Cosma Shalizi
Statistics 36-350: Data Mining, Fall 2006
This page links to copies of my lecture handouts for this class. They are
derivative works, based on the handouts Tom Minka made when he
inaugurated the class a few years ago. (The originals are viewable here.)
- August 28 (Lecture 1): Searching Documents by
Similarity. Why similarity search? Defining similarity and
distance. The bag-of-words representation. Normalizations. Some results. (A
toy sketch of bag-of-words distance follows below.)
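To make the representation concrete, here is a minimal Python sketch, mine
rather than the handout's: it tokenizes by whitespace (a deliberate
simplification) and takes distance to be one minus the cosine similarity of
the word-count vectors.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())

def cosine_distance(doc1, doc2):
    """Distance = 1 - cosine similarity of the two count vectors."""
    b1, b2 = bag_of_words(doc1), bag_of_words(doc2)
    dot = sum(b1[w] * b2[w] for w in b1)
    norm1 = math.sqrt(sum(c * c for c in b1.values()))
    norm2 = math.sqrt(sum(c * c for c in b2.values()))
    return 1.0 - dot / (norm1 * norm2)

print(cosine_distance("the cat sat on the mat", "the dog sat on the log"))
```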
- August 30 (Lecture 2): More on Similarity
Search. Stemming, linguistic issues. Picking out good features, or at
least ignoring non-discriminative ones. Inverse document frequency. Using
feedback from the searcher. Multi-dimensional scaling. (Inverse document
frequency is sketched below.)
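A small sketch of inverse document frequency, again mine rather than the
handout's; it assumes the standard log(N / n_w) weighting, where n_w is the
number of documents containing word w, so a word appearing in every document
gets weight zero.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency: words that appear in many documents
    are non-discriminative and get weights near zero."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

docs = ["the cat sat", "the dog ran", "a cat ran"]
weights = idf_weights(docs)
print(weights["the"])  # in 2 of 3 documents: low weight
print(weights["dog"])  # in 1 of 3 documents: higher weight
```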
- September 6 (Lecture 3): Searching Images by
Similarity. Representation and abstraction. How to search images without
looking at the images themselves; a failure mode. The bag-of-colors
representation. More examples. Invariance and representation. See
also: slides illustrating this lecture. (A toy bag-of-colors sketch follows
below.)
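The bag-of-colors representation can be sketched just like bag-of-words. The
code below is illustrative only: it takes an image as a plain list of RGB
tuples to avoid any image-library dependency, and the four quantization
levels per channel are an arbitrary choice.

```python
from collections import Counter

def bag_of_colors(pixels, levels=4):
    """Represent an image as counts of quantized colors, ignoring pixel
    positions: the image analogue of bag-of-words. `pixels` is a list
    of (r, g, b) tuples with values in 0..255."""
    bin_size = 256 // levels
    return Counter((r // bin_size, g // bin_size, b // bin_size)
                   for r, g, b in pixels)

# Two tiny "images" as flat pixel lists; the histograms can then be
# compared with the same similarity measures used for documents.
img1 = [(250, 10, 10), (240, 20, 5), (10, 10, 200)]
img2 = [(245, 15, 12), (0, 0, 190), (5, 5, 210)]
print(bag_of_colors(img1))
print(bag_of_colors(img2))
```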
- September 11 and September 13 (Lecture 4): Finding
Informative Features. More on finding good features. Entropy and
uncertainty. Information and entropy. Ranking features by informativeness.
Examples. (Entropy-based feature ranking is sketched below.)
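One standard way to rank features by informativeness is information gain: the
drop in the entropy of the class labels once a feature's value is known. The
sketch and its spam-flavored toy data are mine, not the handout's.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p * log2(p)): the uncertainty in the labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy from knowing the feature's value."""
    n = len(labels)
    h_after = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        h_after += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_after

labels = ["spam", "spam", "ham", "ham"]
features = {
    "contains_offer": [1, 1, 0, 0],  # perfectly informative: gain = 1 bit
    "contains_the":   [1, 1, 1, 0],  # nearly useless: gain ~ 0.31 bits
}
for name, column in sorted(features.items(),
                           key=lambda kv: -information_gain(kv[1], labels)):
    print(name, round(information_gain(column, labels), 3))
```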
- September 18 (Lecture 5): Interactions Among Features. Redundancy and enhancement of information. Information-sharing graphs.
Examples.
- September 20 and 25 (Lecture 6): Partitioning Data
into Clusters. Supervised and unsupervised learning. Social and
organizational aspects of categorization. Finding categories in data via
clustering. Characteristics of good clusters. The k-means algorithm for
clustering. Search algorithms, search landscapes, hill climbing, local minima.
Algorithms for hierarchical clustering. Avoiding spherical clusters. See
also: slides to accompany the
second half, showing clustering of images. (A minimal k-means sketch follows
below.)
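A minimal k-means sketch (Lloyd's algorithm). The random-sample
initialization and the two-blob toy data are my choices; since this is hill
climbing on the within-cluster sum of squares, a real run would restart from
several initializations to dodge bad local minima.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning each point to its nearest center and
    moving each center to the mean of its cluster, until nothing moves."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every center, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if (labels == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)  # close to (0, 0) and (5, 5)
```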
- September 27 (Lecture 7): Making Better
Features. Transforming features to enhance invariance. Transforming
features to improve their distribution. Projecting high-dimensional data into
lower dimensions. Principal component analysis: informal description and
example.
- October 2 (Lecture 8): More on Principal Component
Analysis. Mathematical basis: maximizing the variance of the projected
points. Mathematical basis: minimizing reconstruction error. Interpretation
of PCA results. (A sketch of PCA via eigendecomposition follows below.)
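A sketch of PCA in these terms: the directions of maximum variance are the
top eigenvectors of the sample covariance matrix, and projecting onto them
also minimizes the squared reconstruction error. The correlated toy data is
invented.

```python
import numpy as np

def pca(X, q):
    """Project centered data onto the q highest-variance directions:
    the top eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:q]  # eigh sorts ascending
    components = eigvecs[:, order]
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
# correlated 2-d data: most of the variance lies along one direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
scores, components, variances = pca(X, q=1)
# reconstruct from the 1-d projection; the error is the discarded variance
X_hat = scores @ components.T + X.mean(axis=0)
print("fraction of variance captured:",
      variances[0] / np.cov(X, rowvar=False).trace())
```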
- October 4: Review of course to date. (No handout.)
- October 9 (Lecture 9): Evaluating Predictive
Models. Classification and linear regression as examples of predictive
modeling. Error measures a.k.a. loss functions; examples. In-sample error.
Out-of-sample or generalization error; why it matters, relation to in-sample
error. Model selection. An example of over-fitting. Approaches to limiting
over-fitting and its ill effects. (A toy over-fitting demonstration follows
below.)
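A toy demonstration of in-sample versus out-of-sample error, fitting
polynomials of increasing degree; the degrees, noise level, and sample sizes
are arbitrary choices of mine, and the held-out test set stands in for the
generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth underlying function."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)  # held out: estimates generalization error

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    in_sample = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    out_sample = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # higher degrees keep driving in-sample error down, but past the
    # right complexity the out-of-sample error climbs: over-fitting
    print(degree, round(in_sample, 3), round(out_sample, 3))
```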
- October 11 (Lecture 10): Regression Trees.
Difficulties of fitting global models in complex systems. Recursive
partitioning and simple local models as a solution. Prediction trees in
general. Regression trees in particular. An example. Tree growing. Tree
pruning via cross-validation. (A minimal tree-growing sketch follows below.)
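A minimal regression-tree sketch: recursive partitioning on a single feature,
choosing each split to reduce the sum of squared errors around the leaf
means. For brevity it stops with a minimum-leaf-size rule rather than growing
large and pruning by cross-validation as in the handout; the step-function
toy data is invented.

```python
import numpy as np

def grow_tree(x, y, min_leaf=5):
    """Recursively split 1-d data at the threshold that most reduces
    the sum of squared errors around the two leaf means."""
    if len(y) < 2 * min_leaf:
        return ("leaf", y.mean())
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(min_leaf, len(ys) - min_leaf):
        sse = (((ys[:i] - ys[:i].mean()) ** 2).sum()
               + ((ys[i:] - ys[i:].mean()) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2)
    if best is None:
        return ("leaf", y.mean())
    threshold = best[1]
    left = x <= threshold
    return ("split", threshold,
            grow_tree(x[left], y[left], min_leaf),
            grow_tree(x[~left], y[~left], min_leaf))

def predict(tree, xi):
    """Follow splits down to a leaf and return that leaf's mean."""
    while tree[0] == "split":
        tree = tree[2] if xi <= tree[1] else tree[3]
    return tree[1]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 4.0) + rng.normal(0, 0.5, 200)
tree = grow_tree(x, y)
print(predict(tree, 2.0), predict(tree, 8.0))  # close to 1 and 4
```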