Cosma Shalizi
Statistics 36-350: Data Mining, Fall 2006
This page links to copies of my lecture handouts for this class. They are
derivative works, based on the handouts Tom Minka made when he
inaugurated the class a few years ago. (The originals are viewable here.)
- August 28 (Lecture 1): Searching Documents by
Similarity. Why similarity search? Defining similarity and
distance. The bag-of-words representation. Normalizations. Some results. (A
toy sketch of bag-of-words distance follows below.)
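To make the representation concrete, here is a minimal Python sketch, mine
rather than the handout's: it tokenizes by whitespace (a deliberate
simplification) and takes distance to be one minus the cosine similarity of
the word-count vectors.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())

def cosine_distance(doc1, doc2):
    """Distance = 1 - cosine similarity of the two count vectors."""
    b1, b2 = bag_of_words(doc1), bag_of_words(doc2)
    dot = sum(b1[w] * b2[w] for w in b1)
    norm1 = math.sqrt(sum(c * c for c in b1.values()))
    norm2 = math.sqrt(sum(c * c for c in b2.values()))
    return 1.0 - dot / (norm1 * norm2)

print(cosine_distance("the cat sat on the mat", "the dog sat on the log"))
```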
- August 30 (Lecture 2): More on Similarity
Search. Stemming, linguistic issues. Picking out good features, or at
least ignoring non-discriminative ones. Inverse document frequency. Using
feedback from the searcher. Multi-dimensional scaling. (Inverse document
frequency is sketched below.)
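A small sketch of inverse document frequency, again mine rather than the
handout's; it assumes the standard log(N / n_w) weighting, where n_w is the
number of documents containing word w, so a word appearing in every document
gets weight zero.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Inverse document frequency: words that appear in many documents
    are non-discriminative and get weights near zero."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

docs = ["the cat sat", "the dog ran", "a cat ran"]
weights = idf_weights(docs)
print(weights["the"])  # in 2 of 3 documents: low weight
print(weights["dog"])  # in 1 of 3 documents: higher weight
```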
- September 6 (Lecture 3): Searching Images by
Similarity. Representation and abstraction. How to search images without
looking at the images themselves; a failure mode. The bag-of-colors
representation. More examples. Invariance and representation. See
also: slides illustrating this lecture. (A toy bag-of-colors sketch follows
below.)
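The bag-of-colors representation can be sketched just like bag-of-words. The
code below is illustrative only: it takes an image as a plain list of RGB
tuples to avoid any image-library dependency, and the four quantization
levels per channel are an arbitrary choice.

```python
from collections import Counter

def bag_of_colors(pixels, levels=4):
    """Represent an image as counts of quantized colors, ignoring pixel
    positions: the image analogue of bag-of-words. `pixels` is a list
    of (r, g, b) tuples with values in 0..255."""
    bin_size = 256 // levels
    return Counter((r // bin_size, g // bin_size, b // bin_size)
                   for r, g, b in pixels)

# Two tiny "images" as flat pixel lists; the histograms can then be
# compared with the same similarity measures used for documents.
img1 = [(250, 10, 10), (240, 20, 5), (10, 10, 200)]
img2 = [(245, 15, 12), (0, 0, 190), (5, 5, 210)]
print(bag_of_colors(img1))
print(bag_of_colors(img2))
```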
- September 11 and September 13 (Lecture 4): Finding
Informative Features. More on finding good features. Entropy and
uncertainty. Information and entropy. Ranking features by informativeness.
Examples. (Entropy-based feature ranking is sketched below.)
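One standard way to rank features by informativeness is information gain: the
drop in the entropy of the class labels once a feature's value is known. The
sketch and its spam-flavored toy data are mine, not the handout's.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p * log2(p)): the uncertainty in the labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy from knowing the feature's value."""
    n = len(labels)
    h_after = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        h_after += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_after

labels = ["spam", "spam", "ham", "ham"]
features = {
    "contains_offer": [1, 1, 0, 0],  # perfectly informative: gain = 1 bit
    "contains_the":   [1, 1, 1, 0],  # nearly useless: gain ~ 0.31 bits
}
for name, column in sorted(features.items(),
                           key=lambda kv: -information_gain(kv[1], labels)):
    print(name, round(information_gain(column, labels), 3))
```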
- September 18 (Lecture 5): Interactions Among Features. Redundancy and enhancement of information. Information-sharing graphs.
Examples.
- September 20 and 25 (Lecture 6): Partitioning Data
into Clusters. Supervised and unsupervised learning. Social and
organizational aspects of categorization. Finding categories in data via
clustering. Characteristics of good clusters. The k-means algorithm for
clustering. Search algorithms, search landscapes, hill climbing, local minima.
Algorithms for hierarchical clustering. Avoiding spherical clusters. See
also: slides to accompany the
second half, showing clustering of images. (A minimal k-means sketch follows
below.)
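A minimal k-means sketch (Lloyd's algorithm). The random-sample
initialization and the two-blob toy data are my choices; since this is hill
climbing on the within-cluster sum of squares, a real run would restart from
several initializations to dodge bad local minima.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning each point to its nearest center and
    moving each center to the mean of its cluster, until nothing moves."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every center, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if (labels == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)  # close to (0, 0) and (5, 5)
```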
- September 27 (Lecture 7): Making Better
Features. Transforming features to enhance invariance. Transforming
features to improve their distribution. Projecting high-dimensional data into
lower dimensions. Principal component analysis: informal description and
example.
- October 2 (Lecture 8): More on Principal Component
Analysis. Mathematical basis: maximizing the variance of the projected
points. Mathematical basis: minimizing reconstruction error. Interpretation
of PCA results. (A sketch of PCA via eigendecomposition follows below.)
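A sketch of PCA in these terms: the directions of maximum variance are the
top eigenvectors of the sample covariance matrix, and projecting onto them
also minimizes the squared reconstruction error. The correlated toy data is
invented.

```python
import numpy as np

def pca(X, q):
    """Project centered data onto the q highest-variance directions:
    the top eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:q]  # eigh sorts ascending
    components = eigvecs[:, order]
    return Xc @ components, components, eigvals[order]

rng = np.random.default_rng(0)
# correlated 2-d data: most of the variance lies along one direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
scores, components, variances = pca(X, q=1)
# reconstruct from the 1-d projection; the error is the discarded variance
X_hat = scores @ components.T + X.mean(axis=0)
print("fraction of variance captured:",
      variances[0] / np.cov(X, rowvar=False).trace())
```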
- October 4: Review of course to date. (No handout.)
- October 9 (Lecture 9): Evaluating Predictive
Models. Classification and linear regression as examples of predictive
modeling. Error measures a.k.a. loss functions; examples. In-sample error.
Out-of-sample or generalization error; why it matters, relation to in-sample
error. Model selection. An example of over-fitting. Approaches to limiting
over-fitting and its ill effects. (A toy over-fitting demonstration follows
below.)
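A toy demonstration of in-sample versus out-of-sample error, fitting
polynomials of increasing degree; the degrees, noise level, and sample sizes
are arbitrary choices of mine, and the held-out test set stands in for the
generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth underlying function."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)  # held out: estimates generalization error

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    in_sample = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    out_sample = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # higher degrees keep driving in-sample error down, but past the
    # right complexity the out-of-sample error climbs: over-fitting
    print(degree, round(in_sample, 3), round(out_sample, 3))
```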
- October 11 (Lecture 10): Regression Trees.
Difficulties of fitting global models in complex systems. Recursive
partitioning and simple local models as a solution. Prediction trees in
general. Regression trees in particular. An example. Tree growing. Tree
pruning via cross-validation. (A minimal tree-growing sketch follows below.)
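A minimal regression-tree sketch: recursive partitioning on a single feature,
choosing each split to reduce the sum of squared errors around the leaf
means. For brevity it stops with a minimum-leaf-size rule rather than growing
large and pruning by cross-validation as in the handout; the step-function
toy data is invented.

```python
import numpy as np

def grow_tree(x, y, min_leaf=5):
    """Recursively split 1-d data at the threshold that most reduces
    the sum of squared errors around the two leaf means."""
    if len(y) < 2 * min_leaf:
        return ("leaf", y.mean())
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(min_leaf, len(ys) - min_leaf):
        sse = (((ys[:i] - ys[:i].mean()) ** 2).sum()
               + ((ys[i:] - ys[i:].mean()) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2)
    if best is None:
        return ("leaf", y.mean())
    threshold = best[1]
    left = x <= threshold
    return ("split", threshold,
            grow_tree(x[left], y[left], min_leaf),
            grow_tree(x[~left], y[~left], min_leaf))

def predict(tree, xi):
    """Follow splits down to a leaf and return that leaf's mean."""
    while tree[0] == "split":
        tree = tree[2] if xi <= tree[1] else tree[3]
    return tree[1]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 1.0, 4.0) + rng.normal(0, 0.5, 200)
tree = grow_tree(x, y)
print(predict(tree, 2.0), predict(tree, 8.0))  # close to 1 and 4
```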