Statistics 36-462/662: Data Mining
Spring 2020
Prof. Cosma Shalizi
Tuesdays and Thursdays 1:30--2:50 Porter Hall 100
Data mining is the art of extracting useful patterns from large bodies of data.
(Metaphorically: finding seams of actionable knowledge in the raw ore of
information.) This involves both methods for discovering possible patterns,
and methods for checking or validating candidate patterns. The on-going growth
in our ability to use computers to store and analyze data opens opportunities
for data mining in business, science, government and elsewhere. The aim of
this course is to help you take advantage of these opportunities in a
responsible way.
After taking the class, when you're faced with a new data-mining problem,
you should be able to (1) select appropriate methods of pattern discovery and
validation, and justify your choices, (2) use and program statistical software
to implement those methods, (3) critically evaluate the results and communicate
them to professional colleagues who are not also statisticians or
data-analysts, and (4) be aware of the most common ethical issues involved.
Data mining is related to statistics
and to machine learning, but has its own
aims and scope. Statistics deals with systems for creating reliable inferences
from imperfect data. Machine learning develops technologies of automated
prediction and induction. We will use tools from statistics and from machine
learning, but we will use them as tools, rather than as things to study in
their own right. We will do a lot of calculations, but will not prove many
theorems, and we will do many more experiments than calculations.
Prerequisites
For this semester, the prerequisite is
passing 36-401 (modern
regression) with a grade of C or better. The only exceptions are for
graduate students taking the graduate section (36-662; see below).
Graduate students and 36-662
This course is primarily intended as an advanced elective for undergraduate
students majoring in statistics. A limited number of graduate students are
welcome to take it as 36-662. (Graduate students who register for 36-462 will
be dropped from the roster.) Formally, 662 has no prerequisites, but students
are expected to have a background at least equivalent to 36-401, and the
prerequisites to 401: linear regression modeling, probability, mathematical
statistics, linear algebra, calculus, and competence in using R for data
analysis. This class does not have the resources to help those without this
background catch up.
Undergraduates must take the course as 36-462; those who register for 662
will be dropped from the roster.
Instructors
Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
Teaching assistants | Ms. Robin Dunn | no e-mail, please | no office visits outside office hours, please |
 | Mr. Tudor Manole | | |
 | Mr. Aleksandr Podkopaev | | |
 | Mr. Ian Waudby-Smith | | |
Topics and Techniques to be Covered
These may change during the semester.
Methods of pattern discovery (a.k.a. "heuristics")
- "Supervised" prediction methods for classification and regression
- Regression=predicting a numerical variable; classification=predicting a categorical variable
- General concepts in evaluating prediction models: the bias-variance decomposition of total error, the bias-variance trade-off. Specific error measures for classification: misclassification rate, false positive and false negative errors; ROC curves.
- Nearest neighbor methods: basic ideas (averaging nearby-cases), trade-offs in the number of neighbors, mathematical and statistical properties of nearest neighbor prediction, computational considerations
- Tree-based methods: basic ideas (averaging similar cases), trade-offs in the size of the tree, mathematical and statistical properties
- Methods for combining multiple models (especially trees): bagging, random forests, boosting, stacking
- Kernel-based methods: kernels as similarity measures; averaging similar cases; selection of kernels; support vector machines; mathematical and computational issues.
- "Unsupervised", not-always-predictive methods
- Dimension reduction: forming new features by combining old ones; linear dimension reduction by principal components; linear dimension reduction by factor models; linear dimension reduction by random projections; nonlinear dimension reduction by local linear embedding and by diffusion maps.
- Clustering: assigning data points to categories without knowing the right categories; geometric clustering and the k-means algorithm; probabilistic mixture models.
- Information-theoretic foundations
- Basic concepts of information theory (entropy, information); the informational view of prediction, dimension reduction and clustering; the information bottleneck; feature selection.
Methods of pattern validation
- Evaluation of predictive performance
- In-sample vs. out-of-sample and generalization performance; the concept of over-fitting; analytical approximations to generalization performance; effective degrees of freedom and other measures of model complexity.
- Cross-validation
- Evaluation of uncertainty
- Bootstrap
- Inference after model selection / data mining
- Issues with multiple testing and with post-selection inference; sample splitting.
Case studies and concerns
- Recommendation engines
- "You may also like"; recommender systems based on nearest neighbors; systems based on factor models; hybrid systems; consequences of recommendation engines.
- Information retrieval
- Searching by content (text, images, genes...); choosing representations of content; the bag-of-words representation of text; evaluating search; dimension reduction and clustering for text; word2vec representation of text; incorporating user feedback.
- Fairness in prediction
- Case studies: lending decisions, criminal justice. The COMPAS/ProPublica controversy. Definitions of fairness; trade-offs between different notions of fairness, and between fairness and predictive accuracy.
- Sources of technical failure in data mining
- Data quality. Measurement vs. reality. Weak predictive
relationships. Non-existent predictive relationships. The curse of dimensionality. Data-set shift.
- Should we do it / what are we doing?
- Privacy (or lack thereof). Human rights concerns. Social consequences, e.g., are we building a dystopia to
make each other click on ads?
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills
you are learning in this class. It also actually helps you learn, because some
things which seem murky clarify when you actually do them, and sometimes trying
to do something shows that you only thought you understood it.
- Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
- Homework
- Every week will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve doing or applying some
theory, analyzing real data sets on the computer, and communicating the
results.
- All homework will be submitted electronically through Canvas and
Gradescope. Most weeks, homework will be due at 10:00 pm on
Thursdays. Homework assignments will always be released by Friday of
the previous week, and sometimes before.
- There are specific formatting requirements for
homework --- see below.
- Online assignments
- Most homework assignments will be accompanied by short-answer or multiple
choice questions, to be done on Canvas, before the homework itself is
due. These will typically be due on Sunday nights; consult Canvas for the
exact schedule.
- In-class exercises
- Most lectures will have in-class exercises. These will be short (10--20
minutes) assignments, emphasizing problem solving, done in class in small
groups of no more than four people. The assignments will be given out in
class, and must be handed in on paper during the class period. On most days, a
randomly-selected group will be asked to present their solution to the class.
- The in-class exercises will either be problems from that week's homework,
or close enough that seeing how to do the exercise should tell you how to do
some of the problems.
There will be no exams.
Time expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- In-class exercises: 10%. All exercises will have equal weight. Your
lowest three exercise grades will be dropped; if you complete all
exercises with a score of at least 60%, your lowest five grades will be
dropped.
- Homework: 90%. All homework assignments will be weighted equally. Most
assignments will include online short-answer questions, due before the
main assignment (see above); these will be incorporated into the grades for the
corresponding assignments. The online questions will amount to about 10% of
your final grade.
Your lowest two homework grades will be dropped. If you complete all
homework assignments, with a minimum score of at least 60%, your
lowest three homework grades will be dropped.
Late homework will not be accepted for any reason.
The point of dropping the low grades is to let you schedule interviews,
social engagements, and other obligations, with flexibility, and without my
having to decide what is or is not important enough to change the rules. If
something --- illness, family crisis, or anything else --- is interfering with
your ability to do the work of the class, come talk to me.
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that a particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, etc., will accomplish much less than pointing to concrete
problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments which will help consolidate your
understanding of the material, and help with your homework.
Readings
There are (up to) three classes of readings associated with each lecture.
- Key readings: Usually from required textbooks,
occasionally from other sources which will be made available well in advance.
Do this reading before coming to lecture, and lecture will make a lot more
sense. (Also, the online exercises will be easy.)
- Suggested readings: Usually easier-to-read academic papers, or
portions of the non-required texts, which will deepen your understanding of what
we're doing in lecture and in the homework, but aren't critical.
- Background readings: Usually harder-to-follow or more historical papers,
which give further details about methods and applications.
TL;DR: definitely do the key readings before class; try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.
Electronics
Please don't use any electronic devices during lecture: no laptops, no
tablets, no phones, no recording devices, no watches that do
more than tell time. The only exception to this rule is for electronic
assistive devices, properly documented with CMU's Office of Disability
Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show,
pretty
clearly, that students learn more in electronics-free classrooms, not least
because your device isn't distracting your neighbors, who aren't as good at
multitasking as you are.)
Missing class and emergencies
You can miss up to three class sessions during the semester without
(grade) penalty. In cases of routine illness (for example, you have a bad cold
and a day of sleep and fluids would help), stay home, take a free
day, and let yourself rest and recover, rather than dragging
yourself to class. The grading policy gives you space for this.
In cases of extended illness or emergency, we can deal with make-ups for
missed work after the crisis. If you are dealing with an extended illness or
emergency (including a mental-health crisis), or with a personal or family
crisis, please don't wait to get help. Just let me know that you're dealing
with an emergency, and that you are already getting help.
R, R Markdown, and Reproducibility
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you will use it for
every homework. If you are not able to use R, or do not have ready,
reliable access to a computer on which you can do so, let me know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code are not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
ask first.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can re-do your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports". (Part of the first lecture will be going over
the basics of R Markdown.)
For each assignment, you should generally submit two, and only two, files:
- an R Markdown source file, integrating text, generated figures and R code, uploaded to Canvas;
- the "knitted," humanly-readable document, in PDF format, uploaded to
Gradescope. (Using PDF is
important for the grading software.)
(Using two systems is inconvenient for everyone; Gradescope is much better for efficiently sharing work across multiple graders, but will only work with PDFs.)
I will be re-running the R Markdown file
of randomly selected students; you should expect to be picked for this about
once in the semester. You will lose points if your R Markdown file does not,
in fact, generate your knitted file (making obvious allowances for random
numbers, etc.).
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for type-setting math. (It's based on
the LaTeX document-preparation system widely used in the sciences.) If you
can't get it to work, you can hand-write the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. (You
should upload these additional files to Canvas.) You will, however, lose
points for doing so, starting with no penalty for homework 1, and growing to a
90% penalty (for those problems) by the last homework.
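For concreteness, here is a minimal sketch of what an R Markdown source file can look like; the title, data, and numbers are invented purely for illustration and are not part of any actual assignment:

````
---
title: "Homework X: An Example Write-Up"
author: "Your Name (yourandrewid)"
output: pdf_document
---

Start by explaining, in words, what question you are answering and how.

```{r example-chunk}
# a made-up illustration: simulate data and fit a line
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.1)
fit <- lm(y ~ x)
coef(fit)
```

The estimated slope is `r round(coef(fit)[2], 2)`, close to the true value of 2.
Math is typeset in LaTeX notation, e.g. $\hat{\beta} = (X^T X)^{-1} X^T y$.
````

Knitting this file produces a PDF in which the code, its output, and the surrounding text appear together, which is what makes the analysis reproducible.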
Canvas, Gradescope and Piazza
Homework will be submitted electronically through Canvas and Gradescope;
Canvas will also be used as the gradebook. Some readings and course materials
will also be distributed through Canvas.
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused. As noted above, homework
will generally be due at 10 pm on Thursdays; however, you should not expect
the instructors to answer questions on Piazza (or by e-mail) after 6 pm.
Solutions
Solutions for all homework will be available, after their due date, through
Canvas. Please don't share them with anyone, even after the course has ended.
Office Hours
Mondays | 12:30--1:30 | Mr. Manole | Wean Hall 3509 |
Tuesdays | 10:30--11:30 | Mr. Waudby-Smith | Wean Hall 3715 |
Wednesdays | 12:30--1:30 | Prof. Shalizi | Baker Hall 229C |
Thursdays | 3:00--4:00 | Mr. Podkopaev | Porter Hall A19A |
If you want help with computing, please bring a laptop.
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Textbooks
The following three books are required:
- D. J. Hand, Heikki Mannila
and Padhraic Smyth,
Principles of Data
Mining (Cambridge, Massachusetts: MIT Press, 2001)
- Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016)
- Michael Kearns and Aaron Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (New York: Oxford University Press, 2019)
Hand et al. and O'Neil are (currently) available electronically through
the university library, but that's limited to one user at a time. You should
make sure you have access to copies of all three books.
The following books are recommended but not required.
- Richard A. Berk, Statistical Learning from a Regression Perspective (2nd edition, New York: Springer, 2016) [available online through the university library]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition, Berlin: Springer-Verlag, 2009) [free online]
- Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman,
Mining of Massive Datasets (second edition, Cambridge, England: Cambridge University Press, 2014) [free online]
- Paul Teetor, The R Cookbook (Sebastopol, California: O'Reilly, 2011) [R's help files are organized around "What does command X do?"; this is organized around "I want to do Y, what commands do I need to use?"]
Collaboration, Copying, and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss, in general terms, homework with other
students in the class, though not to share or compare work; such conversations
must be acknowledged in your assignments. You may not discuss the
content of assignments with anyone other than current students, the
instructors, or your teachers in other current classes at CMU, until after the
assignments are due. (Exceptions can be made, with prior permission, for
approved tutors.) You are free to complain, in general terms, about any aspect
of the course, to whomever you like.
Any use of solutions provided for any assignment in this course in previous
semesters is strictly prohibited. This prohibition applies even to students
who are re-taking the course. Do not copy the old solutions (in whole or in
part), do not "consult" them, do not read them, do not ask your friend who took
the course last year if they "happen to remember" or "can give you a hint".
Doing any of these things, or anything like these things, is cheating, it is
easily detected cheating, and those who thought they could get away with it in
the past have failed the course. Even more importantly: doing any of those
things means that the assignment doesn't give you a chance to
practice; it makes any feedback you get meaningless; and of course it makes any
evaluation based on that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your mis-deeds might be rectified.
Otherwise, violations of any sort will lead to formal disciplinary action,
under the terms of the university's
policy
on academic integrity.
You must complete "homework 0" on the content of the university's policy on
academic integrity, and on these course policies. This assignment will not
factor into your grade, but you must complete it, with a grade of at
least 80%, before you can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of
Disability Resources provides support services for both physically disabled
and learning disabled students. For individualized academic adjustment based
on a documented disability, contact Disability Resources at access [at]
andrew.cmu.edu or 412-268-6121; they will coordinate with me.
Schedule, Lecture Notes and Readings
Lecture notes and readings will be linked to here, as they become available.
Lecture topics are tentative and subject to change. Don't expect required and
suggested readings to finalize until about a week before each lecture.
PDM = Principles of Data Mining; WMD = Weapons of Math Destruction; TEA = The Ethical Algorithm
TBD = to be determined = I'll figure it out between whenever you're reading this and when the readings need to be made available.
Lecture 1 (Tuesday, 14 January): Introduction to the course
- What is data mining? how is it used? where did it come from? where does the
data come from? Some technical themes: prediction by averaging similar cases;
deciding what's a similar case; verification by predicting new data. R
Markdown. Course policy on cheating.
- Readings
- Key readings:
- Background readings:
- Homework:
- After class:
Lecture 2 (Thursday, 16 January):
Lightning review of linear regression
- Linear regression as a prediction method. Expected squared error; the
bias-variance decomposition; the conditional expectation function as the
optimal regression function. The optimal linear predictor: why it always
exists (regardless of whether anything is really linear, let alone
linear-and-Gaussian), and its general mathematical form. Ordinary least
squares as an estimator of the optimal linear predictor; OLS in matrix form.
The hat matrix and predictions. Model checking and diagnostics: when they
matter and when they don't. What linear models can and cannot do. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, sections 11.1--11.3
- Suggested readings:
- Berk, chapter 1
- Hastie et al., sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4)
- Background reading:
- Richard A. Berk, Regression Analysis: A Constructive Critique (Thousand Oaks, California: Sage, 2004)
- R. W. Farebrother, Fitting Linear Relationships: A History of the Calculus of Observations, 1750--1900 (New York: Springer-Verlag, 1999)
- CRS, The Truth About Linear Regression
- Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary
Time-Series: with Engineering Application (Cambridge, Massachusetts: The Technology Press, 1949 [but first published as a classified technical report, Washington, D.C.: National Defense Research Council, 1942])
- After class:
- Slides (typo-corrected!)
- .Rmd source file for the slides
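To make the linear-algebra part of this lecture concrete, here is a minimal base-R sketch of ordinary least squares in matrix form, including the hat matrix; the simulated data and variable names are invented for illustration, and lm() is of course what you would normally use.

```r
set.seed(36462)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.2)   # the truth is NOT linear, but the optimal
                                   # linear predictor still exists

X <- cbind(1, x)                            # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # OLS: (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
fitted.vals <- H %*% y                      # predictions = hat matrix times y

fit <- lm(y ~ x)                            # the same estimates, via lm()
all.equal(as.numeric(coef(fit)), as.numeric(beta.hat))
```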
Sunday, 19 January at 10 pm
- Homework 0 due
- Online questions for homework 1 due
Lecture 3 (Tuesday, 21 January):
Nearest neighbors I: Mostly theoretical
- k-nearest neighbors for classification, regression, and general prediction.
Performance of nearest-neighbor prediction (=1NN) for noise-free function
learning; how and why the distance to the nearest neighbor goes to zero.
Effects of noise. Why 1NN converges to a risk of twice the minimum possible
risk. Bias-variance trade-off for k-nearest-neighbors.
- Readings
- Key reading: PDM, sections 6.3.3, 10.1--10.2 and 10.6
- Suggested reading: Hastie et al., chapter 13 (skimming section 13.2)
- Background readings:
- Thomas M. Cover, "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50--55 [PDF reprint via Prof. Cover]
- Thomas M. Cover, "Rates of Convergence for Nearest Neighbor Procedures", pp. 413--415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint via Prof. Cover]
- Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21--27 [PDF reprint via Prof. Cover]
After class: Notes (.Rmd), also covering the material for lecture 4.
Lecture 4 (Thursday, 23 January):
Nearest neighbors II: Mostly statistical
- The bias-variance trade-off for k-nearest-neighbors (reprise). The problem
of selecting the number of nearest neighbors. Selecting k by
cross-validation: v-fold versus leave-one-out. Some examples. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, sections 10.10--10.11
- Homework:
- After class: Notes (.Rmd), also covering the material for lecture 3.
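Here is a minimal base-R sketch covering this lecture and the previous one: a hand-rolled k-nearest-neighbor regression, and selection of k by 5-fold cross-validation. It is written for clarity rather than speed, and the simulated data, the helper names (knn.predict, cv.mse), and the grid of k values are all invented for illustration.

```r
set.seed(36462)
n <- 300
x <- runif(n, 0, 10)
y <- cos(x) + rnorm(n, sd = 0.3)

knn.predict <- function(x0, x, y, k) {
  # predict at each point in x0 by averaging the y's of its k nearest x's
  sapply(x0, function(pt) mean(y[order(abs(x - pt))[1:k]]))
}

cv.mse <- function(k, x, y, nfolds = 5) {
  # estimate out-of-sample MSE for a given k by v-fold cross-validation
  fold <- sample(rep(1:nfolds, length.out = length(x)))
  mean(sapply(1:nfolds, function(f) {
    preds <- knn.predict(x[fold == f], x[fold != f], y[fold != f], k)
    mean((y[fold == f] - preds)^2)
  }))
}

ks <- 1:25
cv.scores <- sapply(ks, cv.mse, x = x, y = y)
ks[which.min(cv.scores)]   # small k = low bias / high variance; large k = the reverse
```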
Sunday, 26 January
Online questions for homework 2 due at 10 pm
Lecture 5 (Tuesday, 28 January):
Trees and ensembles I
- Classification and regression trees; the CART algorithm for building trees; "entropy" and "deviance"; some examples. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, sections 5.2 and 10.5
- Suggested readings:
- Berk, chapter 3
- Hastie et al., section 9.2
- After class: the chapter on trees from Advanced Data Analysis from an Elementary Point of View
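A minimal sketch of fitting and pruning a regression tree, assuming the rpart package (shipped with the standard R distribution); the data set and the particular complexity values are invented for illustration.

```r
library(rpart)
set.seed(36462)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- ifelse(df$x1 > 0.5, 3, 0) + 2 * df$x2 + rnorm(200, sd = 0.5)

tree <- rpart(y ~ x1 + x2, data = df, method = "anova")  # recursive partitioning
print(tree)                         # the splits, in text form
plotcp(tree)                        # cross-validated error vs. tree complexity
pruned <- prune(tree, cp = 0.02)    # cut the tree back to a chosen complexity
predict(pruned, newdata = data.frame(x1 = 0.7, x2 = 0.1))
```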
Lecture 6 (Thursday, 30 January):
Trees and ensembles II
- Improving on single trees by combining multiple trees. "Bagging", or
"bootstrap averaging": combining multiple trees from slightly different data
sets. Random forests: bagging plus random feature selection. Two perspectives
on why multiple trees help: averaging reduces variance (but less so if what's
averaged is positively correlated); the performance of the average is the
average performance plus the diversity. (See the short R sketch at the end of this entry.)
- Reading
- Suggested readings:
- Berk, chapters 4 and 5
- Hastie et al., section 8.7 (bagging) and chapter 15 (random forests)
- Background reading:
- Leo Breiman, "Bagging Predictors", Machine Learning 24 (1996): 123--140
- Leo Breiman, "Random Forests", Machine Learning 45 (2001): 5--32
- Hastie et al., sections 8.8 (stacking) and 9.5 (hierarchical mixtures of experts)
- Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", pp. 231--238 in Gerald Tesauro, David Tourtetsky and Todd Leen (eds.) Advances in Neural Information Processing Systems 7 [NIPS 1994] (Cambridge, Massachusetts: MIT Press, 1995)
- After class: Notes, including the algebra I short-circuited in class (.Rmd)
- Homework
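To illustrate the "averaging reduces variance" idea, here is a bare-bones sketch of bagging regression trees, assuming the rpart package; with only one predictor there is no feature sub-sampling, so this is bagging rather than a true random forest (for which add-on packages such as randomForest exist). The data and function name are invented for illustration.

```r
library(rpart)
set.seed(36462)
df <- data.frame(x = runif(300, 0, 10))
df$y <- sin(df$x) + rnorm(300, sd = 0.3)
test <- data.frame(x = seq(0, 10, length.out = 50))

bag.predict <- function(train, test, B = 100) {
  # grow B trees on bootstrap resamples of the rows, then average their predictions
  preds <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    predict(rpart(y ~ x, data = boot), newdata = test)
  })
  rowMeans(preds)
}

single.tree <- predict(rpart(y ~ x, data = df), newdata = test)
bagged <- bag.predict(df, test)
plot(test$x, single.tree, type = "l", lty = 2, ylab = "prediction")
lines(test$x, bagged)    # typically smoother, i.e., lower-variance
```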
Lecture 7 (Tuesday, 4 February):
Linear classifiers and logistic regression
- The "prototype method" for classification (= which class has the closer average vector?). Linear classifiers more broadly. The idea of a decision boundary. Logistic regression: a linear classifier plus an estimated class probability.
- Reading
- Key reading: PDM, sections 10.4 and 10.7
- Suggested reading: Hastie et al., chapter 4
- After class: Slides (.Rmd)
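A minimal base-R sketch of logistic regression as a linear classifier with class probabilities; the coefficients and simulated data are invented for illustration.

```r
set.seed(36462)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p <- plogis(-1 + 2 * x1 - 1.5 * x2)        # true class-1 probabilities
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)
coef(fit)                                  # the estimated linear decision function
phat <- predict(fit, type = "response")    # estimated class probabilities
yhat <- ifelse(phat > 0.5, 1, 0)           # classify by thresholding at 1/2
mean(yhat != y)                            # in-sample misclassification rate
# The decision boundary is the line where the linear predictor equals zero.
```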
Lecture 8 (Thursday, 6 February):
Kernel methods, support vector machines, and random kitchen sinks
- Nonlinear classification by using a linear classifier with
a nonlinear set of features. The "kernel trick" for avoiding explicitly
calculating features. Support vector machines. Random nonlinear features. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, section 10.3
- Suggested readings:
- Berk, chapter 7
- Hastie et al., chapter 12
- Background reading:
- After lecture: Notes (.Rmd source)
- Homework
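Here is a base-R sketch of the "random kitchen sinks" idea: a linear classifier fit to a batch of random cosine features behaves like a nonlinear (kernel-style) classifier in the original coordinates. Everything here (the circular class boundary, the number of features D, the noise levels) is invented for illustration; actual support vector machines live in add-on packages such as e1071, which this sketch deliberately does not use.

```r
set.seed(36462)
n <- 400
x <- matrix(runif(n * 2, -2, 2), ncol = 2)
p.true <- ifelse(x[, 1]^2 + x[, 2]^2 < 2, 0.9, 0.1)   # a circular class boundary
y <- rbinom(n, 1, p.true)

D <- 30                                      # number of random features
W <- matrix(rnorm(2 * D), nrow = 2)          # random projection directions
b <- runif(D, 0, 2 * pi)                     # random phases
Phi <- cos(sweep(x %*% W, 2, b, "+"))        # n x D matrix of nonlinear features

fit.sinks <- glm(y ~ Phi, family = binomial)    # linear in Phi, nonlinear in x
fit.linear <- glm(y ~ x, family = binomial)     # linear in the raw coordinates
mean((predict(fit.sinks, type = "response") > 0.5) != y)   # much lower error ...
mean((predict(fit.linear, type = "response") > 0.5) != y)  # ... than this
```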
Sunday, 9 February
No online reading questions this week (no online questions for homework 4).
Lecture 9 (Tuesday, 11 February):
Information measures I: Scene-setting
- Prediction needs statistical dependence; reminder of the definition
of statistical independence and how it connects to prediction. Mutual
information quantifies dependence between random variables. Alternate take: entropy measures uncertainty or spread in a random variable; conditional entropy measures uncertainty after conditioning; mutual information is the reduction in entropy due to conditioning. (See the short R sketch at the end of this entry.)
- Bonus material not covered in lecture: proofs of the "it can be shown that" claims; connection between entropy and coding;
measuring differences between distributions with divergence a.k.a. relative entropy; connections between classification and conditional entropy; connections between classification and divergence; maximum likelihood estimation is really minimum divergence estimation; estimating entropy and information from sample data.
- After class: Notes (.Rmd)
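A small base-R sketch of the basic information-theoretic quantities, computed from a made-up two-way table of counts; the numbers are invented for illustration.

```r
counts <- matrix(c(30, 10,
                   10, 50), nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
p.xy <- counts / sum(counts)                  # joint distribution
p.x <- rowSums(p.xy); p.y <- colSums(p.xy)    # marginal distributions

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # in bits

H.x <- entropy(p.x)
H.y <- entropy(p.y)
H.xy <- entropy(p.xy)
mi <- H.x + H.y - H.xy        # mutual information I[X;Y]
H.y.given.x <- H.xy - H.x     # conditional entropy H[Y|X] = H[X,Y] - H[X]
c(H.X = H.x, H.Y = H.y, H.Y.given.X = H.y.given.x, I.XY = mi)
# I[X;Y] = H[Y] - H[Y|X]: the reduction in uncertainty about Y from knowing X
```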
Lecture 10 (Thursday, 13 February):
Information measures II: The information-theoretic view of prediction, feature selection, and summarization
- Predictive information. Sufficient statistics. Statistical
relevance bases. The information-bottleneck method. Selecting features to maximize predictive information; to minimize
stored information. Creation of new features to capture information:
the information bottleneck again; word2vec.
- Readings
- Suggested reading:
- Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", pp. 368--377 in B. Hajek and R. S. Sreenivas (eds.), Proceedings of the 37th Annual Allerton Conference on
Communication, Control and Computing (Urbana, Illinois: University of Illinois Press, 1999), arxiv:physics/0004057
- After class: Notes (with extensive worked examples); R code for reproducing the notes (except for a few deliberately-omitted items);
nyt.frame.csv data frame
- Homework
Sunday, 16 February
Online questions for homework 5 due at 10 pm
NO CLASS on Tuesday, 18 February
Lecture 11 (Thursday, 20 February):
Prelude to dimension reduction
- Goals of dimension reduction. Linear algebra
review: vectors, matrices, inner products, projections. Eigen-decomposition
of matrices. (See the short R sketch at the end of this entry.)
- Readings: None
- Homework
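A quick base-R check of the eigen-decomposition facts this review leans on; the random matrix is invented for illustration.

```r
set.seed(36462)
A <- crossprod(matrix(rnorm(9), 3, 3))    # a random symmetric matrix
e <- eigen(A)
e$values                                  # eigenvalues, in decreasing order
# A = V diag(lambda) V', with orthonormal eigenvectors in the columns of V:
max(abs(A - e$vectors %*% diag(e$values) %*% t(e$vectors)))   # ~ 0
max(abs(crossprod(e$vectors) - diag(3)))                      # V'V = I, up to rounding
```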
Sunday, 23 February
Online questions for homework 6 due at 10 pm
Lecture 12 (Tuesday, 25 February):
Linear dimension reduction
- "Principal components analysis" = finding the linear surface which comes closest to the original data points. Performing and interpreting PCA.
Example with historical complexity. "Latent semantic indexing" = PCA on
bag-of-words vectors. Multidimensional scaling and PCA. Random linear
projections as an alternative to PCA. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 3, especially sections 3.6 and 3.7
- Suggested readings:
- Background readings:
- After class: Slides, with amplifications / clarifications based on class discussion (.Rmd source file)
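A minimal base-R sketch of PCA with prcomp(); the simulated data (three variables driven by one underlying factor, plus one variable of pure noise) are invented for illustration.

```r
set.seed(36462)
n <- 200
z <- rnorm(n)                                  # one underlying factor
X <- cbind(v1 = z + rnorm(n, sd = 0.3),
           v2 = -z + rnorm(n, sd = 0.3),
           v3 = 2 * z + rnorm(n, sd = 0.3),
           v4 = rnorm(n))                      # pure noise
pca <- prcomp(X, scale. = TRUE)                # center and scale, then rotate
summary(pca)                                   # proportion of variance per component
pca$rotation[, 1:2]                            # loadings: how the PCs combine the variables
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2") # the data projected on the first two PCs
```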
Lecture 13 (Thursday, 27 February):
Linear dimensionality reduction, once more with feeling
- Recapitulation of the math; worked example with real data.
- After class: Slides (.Rmd source file)
- Homework
--- The original plan was to do nonlinear dimension reduction, as follows:
- Problems with using linear methods when the data lie near a curved surface. The concept of a "manifold". Exploiting manifold structure: local linear embedding. Exploiting a pre-existing measure of similarity: diffusion maps.
- Reading
- Key readings:
- Background reading:
- Mikhail Belkin and Partha Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation",
Neural Computation 15 (2003): 1373--1396 [PDF via Prof. Belkin]
- Ann B. Lee and Larry Wasserman, "Spectral Connectivity Analysis", Journal of the American Statistical Association 105 (2010): 1241--1255, arxiv:0811.0121
- Sam T. Roweis and Laurence K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science 290 (2000): 2323--2326
- Lawrence K. Saul and Sam T. Roweis, "Think Globally, Fit Locally: Supervised Learning of Low Dimensional Manifolds", Journal of Machine Learning Research 4 (2003): 119--155
Sunday, 1 March
Online questions for homework 7 due at 10 pm
Lecture 14 (Tuesday, 3 March):
Prelude to clustering
- "Clustering" is dividing data into categories, without knowing what the
categories are, or having any examples to follow. Why this is
not totally hopeless, and some rough ideas about what good clusters
might look like. Lightning review of probability distributions: cumulative
distribution function, probability density function, median, mode, mean;
goodness-of-fit testing for distributions (chi-squared, Kolmogorov-Smirnov);
entropy and relative entropy again.
- Reading
- Key reading: PDM, sections 6.4 and 9.2.1--9.2.3, 9.2.6
Lecture 15 (Thursday, 5 March):
Clustering I: Clusters without Probability
- The most basic algorithm for clustering: k-means. Using clustering to find
(approximate) nearest neighbors ("locality-sensitive hashing"). (See the short R sketch at the end of this entry.)
- Readings
- Key readings:
- PDM, sections 9.3--9.5 (skimming 9.1 and 9.2 is recommended)
- O'Neil, Chapters 9 and 10
- Suggested readings:
- Hastie et al., sections 14.3.1--14.3.6
- Background reading:
- Aristides Gionis, Piotr Indyk and Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", pp. 518--529 in M. P. Atkinson et al. (eds.),
Proceedings of the 25th International Conference on Very Large Data Bases [VLDB '99] (San Francisco: Morgan Kaufmann, 1999)
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data (Cambridge, UK: Cambridge University Press, 2006)
- After class: Notes
- Homework
- Homework 7 due at 10 pm
Homework 8 assigned
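A minimal base-R sketch of geometric clustering with kmeans(); the two simulated "blobs" are invented for illustration.

```r
set.seed(36462)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),    # blob 1: 50 points
           matrix(rnorm(100, mean = 3), ncol = 2))    # blob 2: 50 points
km <- kmeans(X, centers = 2, nstart = 20)   # nstart = multiple random initializations
km$centers                                  # estimated cluster centers
table(km$cluster)                           # cluster sizes
km$tot.withinss                             # the k-means objective: within-cluster sum of squares
plot(X, col = km$cluster, pch = 19)
points(km$centers, pch = 8, cex = 2)
```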
10 and 12 March: NO CLASS
Enjoy spring break.
Monday, 16 March
Online questions for homework 8 due at 10 pm (note day!)
Lecture 16 (Tuesday, 17 March): NO CLASS
The university has canceled all instruction for Monday and Tuesday.
Lecture 17 (Thursday, 19 March):
Clustering II
- Probabilistic clustering with mixture models; the EM algorithm; k-means
again. Using log-likelihood to pick the number of clusters. Checking
goodness of fit of cluster models. Advanced topics in probabilistic clustering: "mixed membership"
models, and their roles in modeling text ("topic models", "latent
Dirichlet allocation") and genetics. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, sections 9.2.4--9.2.5 (again), and section 9.6
- Suggested reading:
- Background readings:
- April Galyardt, "Interpreting Mixed Membership", in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York: Chapman and Hall, 2016) [PDF]
- David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993--1022
- J. K. Pritchard, M. Stephens and P. Donnelly, "Inference of population structure using multilocus genotype data", Genetics 155 (2000): 945--959
- Homework
- Homework 8 ("graded survey") due at 10 pm
- Homework 9: Assignment
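To show what the E and M steps actually compute, here is a bare-bones EM algorithm for a two-component Gaussian mixture in one dimension, in base R; the data, starting values, and number of iterations are invented for illustration (add-on packages such as mclust do this, and much more, for real work).

```r
set.seed(36462)
x <- c(rnorm(150, mean = 0), rnorm(100, mean = 4))   # two components, mixed 60/40

p <- 0.5; mu <- c(-1, 1); sigma <- c(1, 1)           # crude starting guesses
for (iter in 1:200) {
  # E step: responsibility = posterior probability each point came from component 2
  d1 <- (1 - p) * dnorm(x, mu[1], sigma[1])
  d2 <- p * dnorm(x, mu[2], sigma[2])
  r <- d2 / (d1 + d2)
  # M step: re-estimate the parameters, weighting points by their responsibilities
  p <- mean(r)
  mu <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
             sqrt(weighted.mean((x - mu[2])^2, r)))
}
c(mixing = p, means = mu, sds = sigma)   # should land near 0.4, (0, 4), (1, 1)
```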
Lecture 18 (Tuesday, 24 March):
How big are the errors of our predictive models?
- Error measures or "loss functions": squared error for regression,
mis-labeling ("0/1 loss") for classification, log-likelihood for probabilities.
Why we should never evaluate predictive models in-sample. (Review of why
R-squared is useless and/or misleading.) Simple (too simple?) formulas for
estimating out-of-sample prediction error. The concept of effective
degrees of freedom. Other measures of model complexity. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 7
- Suggested readings:
- Berk, section 1.6 (skipping 1.6.7, "Basis Functions" and 1.6.8, "The Curse of Dimensionality")
- Hastie et al., chapter 7
- Background readings:
- Notes (.Rmd)
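A short base-R sketch of the central point of this lecture: in-sample error keeps falling as model complexity grows, while error on fresh data does not. The polynomial-regression setup is invented for illustration; the last line checks the "effective degrees of freedom = trace of the hat matrix" fact for a linear model.

```r
set.seed(36462)
n <- 100
x <- runif(n, -2, 2);      y <- sin(x) + rnorm(n, sd = 0.5)
x.new <- runif(n, -2, 2);  y.new <- sin(x.new) + rnorm(n, sd = 0.5)   # fresh data

errs <- t(sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(degree = d,
    in.sample = mean(residuals(fit)^2),
    out.of.sample = mean((y.new - predict(fit, newdata = data.frame(x = x.new)))^2))
}))
errs   # in-sample error always decreases with degree; out-of-sample error does not

sum(hatvalues(lm(y ~ poly(x, 5))))   # trace of the hat matrix = 6 = number of coefficients
```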
Lecture 19 (Thursday, 26 March):
What kinds of errors do our predictive models make?
- Regression diagnostics, focusing on prediction error: over- and under- prediction, non-constant variance ("heteroskedasticity"). Errors for classifiers: false positives versus false negatives; the trade-off between false positives and false negatives, and the "receiver operating characteristic" (ROC) curve. "Calibration" of probability models; "proper scoring rules". (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 10
- Suggested readings:
- Hastie et al., section 9.2.5
- Background readings:
- Tilmann Gneiting, Fadoua Balabdaoui and Adrian E. Raftery,
"Probabilistic Forecasts, Calibration and Sharpness", Journal of the Royal Statistical Society B 69 (2007): 243--268
- Deborah G. Mayo and Aris Spanos, "Methodology in Practice: Statistical Misspecification Testing", Philosophy of Science 71 (2004): 1007--1025
- Homework
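A base-R sketch of building an ROC curve by hand, sweeping the classification threshold across the observed scores; the labels and scores are simulated for illustration.

```r
set.seed(36462)
n <- 500
y <- rbinom(n, 1, 0.4)                 # true labels
score <- rnorm(n, mean = 1.5 * y)      # higher scores, on average, for the positive class

roc <- t(sapply(sort(unique(score)), function(thresh) {
  yhat <- as.numeric(score >= thresh)
  c(fpr = mean(yhat[y == 0] == 1),     # false positive rate at this threshold
    tpr = mean(yhat[y == 1] == 1))     # true positive rate at this threshold
}))
plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)                  # the curve of a useless (random) classifier
```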
Saturday, 28 March
Homework 9 due, along with online reading questions (an extension)
Lecture 20 (Tuesday, 31 March):
Cross-validation
- Splitting the sample to estimate generalization error. Cross-validation:
"v-fold" (or "k-fold") and "leave-one-out". Short-cut formulas for LOOCV.
- Readings
- Key reading: PDM, chapter 10, especially sections 10.6 and 10.10
- Suggested reading:
- Hastie et al., section 7.10
- Background reading:
- Notes: Slides (.Rmd source file)
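A base-R check of the leave-one-out short-cut for linear smoothers: for OLS, the LOOCV error can be computed from a single fit via the hat-matrix diagonal, and it agrees with the brute-force version that refits n times. The simulated data are invented for illustration.

```r
set.seed(36462)
n <- 100
x <- runif(n); y <- 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# short-cut: leave-one-out residual_i = residual_i / (1 - H_ii)
loocv.shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# brute force: actually refit n times, each time leaving one point out
loocv.brute <- mean(sapply(1:n, function(i) {
  f <- lm(y ~ x, subset = -i)
  (y[i] - predict(f, newdata = data.frame(x = x[i])))^2
}))

c(shortcut = loocv.shortcut, brute.force = loocv.brute)   # agree up to rounding
```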
Lecture 21 (Thursday, 2 April):
Bootstrapping
- The concept of the bootstrap (again): imitating generalizing from a sample
to a whole population by re-sampling the data. Bootstrap estimates of
prediction error. Bootstrap confidence intervals. Bootstrap model
averaging (again). (See the short R sketch at the end of this entry.)
- Reading
- Suggested reading: Hastie et al., sec. 7.11
- Homework
- Slides (.Rmd)
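A minimal base-R sketch of the nonparametric bootstrap: resample the data with replacement, recompute the statistic, and use the spread of the recomputed values for standard errors and simple confidence intervals. The sample and the choice of the median are invented for illustration.

```r
set.seed(36462)
x <- rexp(80, rate = 1/3)        # a made-up skewed sample
theta.hat <- median(x)           # the statistic we care about

B <- 2000
boot.stats <- replicate(B, median(sample(x, replace = TRUE)))
sd(boot.stats)                            # bootstrap standard error
quantile(boot.stats, c(0.025, 0.975))     # simple percentile 95% confidence interval
```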
Lecture 22 (Tuesday, 7 April):
Recommender systems I: Broad Ideas and Technical Issues
- The idea of a recommender system. Best-seller
lists. Nearest-neighbor-based approaches (whether based on users or on items).
The Netflix problem, singular value decomposition, and PCA/factor-model style
approaches. Recommendations using social graphs. Missing-data issues. (See the short R sketch at the end of this entry.)
- Readings
- Key readings:
- Suggested reading:
- Background readings:
- Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of k-nearest neighbor collaborative recommendation", Annals of Statistics
38 (2010): 1568--1592, arxiv:1010.0499
- Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019],
arxiv:1907.06902
- Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, "Collaborative Filtering and the Missing at Random Assumption", arxiv:1206.5267
- Notes (.Rmd)
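A factor-model-style toy example in base R: approximate a (fully observed, simulated) user-by-movie ratings matrix by a rank-2 SVD and treat the reconstruction as predicted ratings. Real recommender data are mostly missing, which plain svd() does not handle; the sizes and rank here are invented for illustration.

```r
set.seed(36462)
users <- matrix(rnorm(100 * 2), ncol = 2)        # latent user traits
movies <- matrix(rnorm(2 * 20), nrow = 2)        # latent movie traits
ratings <- users %*% movies + rnorm(100 * 20, sd = 0.5)   # 100 users x 20 movies

s <- svd(ratings)
k <- 2
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # best rank-k approximation
mean((ratings - approx)^2)     # how well two latent factors reproduce the matrix
approx[1, 1:5]                 # "predicted" ratings for user 1 on movies 1 through 5
```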
Lecture 23 (Thursday, 9 April):
Recommender systems II: What's Not to Love?
- Homogenization; paradoxes of diversification; rabbit holes and echo chambers; discrimination; what does the system optimize for; when inferior recommendations are more profitable; natural monopoly (or oligopoly). Also: do recommendations do anything at all?
- Readings
- Suggested reading (in this order):
- "Amazon.com Recommendations Understand Area Woman Better Than Husband", The Onion 9 January 2007
- Tom Slee, "Online Monoculture and the End of the Niche", Whimsley 15 March 2009
- Tommy Wallach, "In Which Netflix Achieves Sentience As A Result of My Terrible Decision-making", The Toast, 4 August 2014
- Zeynep Tufekci, "YouTube's Recommendation Algorithm Has a Dark Side",
Scientific American 320:4 (April 2019): 77
- Zeynep Tufekci, "YouTube, the Great Radicalizer", New York Times, 10 March 2018
- Notes (.Rmd)
- Homework
- Homework 11 due at 10 pm
- Homework 12: CANCELLED
Lecture 24 (Tuesday, 14 April):
Information retrieval
- Finding data by content. Approaches we will neglect: metadata; "coding" or
"content analysis". Textual features. Vector representations of documents:
the bag-of-words representation; word2vec. Measures of similarity and
the importance of normalization; inverse document frequency and other tweaks to
the representation. Example with
the New
York Times Annotated Corpus. The trick to searching: queries are documents and can be represented in the
same feature space. Search evaluation: precision, recall, precision-recall
curves; error rates. Incorporating user feedback. Proxies for relevance like
"engagement". Abstraction and generalizing to other
types of data. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, chapter 14
- Background reading:
- Marco Baroni, Georgiana Dinu and German Kruszewski, "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
- Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
- Kian Kenyon-Dean, "Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form", arxiv:1911.02639
- Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama, "A Mutual Information Maximization Perspective of Language Representation Learning", arxiv:1910.08350
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
- Notes (.Rmd)
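A bag-of-words retrieval sketch in base R: represent documents and the query as word-count vectors, down-weight common words by inverse document frequency, and rank documents by cosine similarity to the query. The three-"document" corpus, the query, and the helper names (tokenize, bow, cosine) are all invented for illustration.

```r
docs <- c(a = "the cat sat on the mat",
          b = "the dog chased the cat",
          c = "stock prices fell on monday")
query <- "cat on a mat"

tokenize <- function(s) strsplit(tolower(s), "[^a-z]+")[[1]]
vocab <- unique(unlist(lapply(c(docs, query), tokenize)))
bow <- function(s) as.numeric(table(factor(tokenize(s), levels = vocab)))

X <- sapply(docs, bow)                        # vocabulary x document count matrix
idf <- log(length(docs) / rowSums(X > 0))     # rarer words get larger weights
idf[!is.finite(idf)] <- 0                     # words appearing in no document
Xw <- X * idf                                 # re-weight documents and query alike
q <- bow(query) * idf

cosine <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
sort(apply(Xw, 2, cosine, v = q), decreasing = TRUE)   # documents ranked against the query
```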
Thursday, 16 April: OPTIONAL lecture on epidemic modeling
Originally, there wasn't supposed to be any class this day because of
Carnival. The university now says it's a day for instruction, but I promised
you wouldn't have to do anything, so we'll cover something that is,
strictly speaking, outside the scope of the course, but statistical
and topical, and make it completely optional.
- The basic "susceptible-infectious-removed" (SIR) epidemic model. The
probability model and its deterministic limit. The idea of the "basic
reproductive number" R0 and how it relates to the rates of transmission and
removal. Why diseases do not necessarily evolve to be less lethal to their
hosts. The epidemic threshold when R0=1.
Complications: gaps between being infected and becoming infectious; the
possibility of being infectious without showing symptoms; re-infection.
Epidemics in social networks, and how network structure affects the epidemic
threshold; why high-degree people tend to be among the first infected, and
disease-control strategies based on "destroying the hubs". Statistical issues
in connecting epidemic models to data. (See the short R sketch at the end of this entry.)
- Reading:
- Slides (.Rmd), revised after the lecture to fix typos, amplify points, incorporate things that came up in discussion, etc.
- Homework
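A base-R sketch of the deterministic SIR dynamics, stepped forward by simple Euler integration; the transmission and removal rates are arbitrary illustrations, not estimates for any real disease.

```r
sir <- function(beta, gamma, s0 = 0.999, i0 = 0.001, days = 180, dt = 0.1) {
  steps <- ceiling(days / dt)
  out <- matrix(NA, nrow = steps + 1, ncol = 3,
                dimnames = list(NULL, c("S", "I", "R")))
  out[1, ] <- c(s0, i0, 1 - s0 - i0)
  for (t in 1:steps) {
    S <- out[t, "S"]; I <- out[t, "I"]
    new.infections <- beta * S * I * dt    # dS/dt = -beta S I
    new.removals <- gamma * I * dt         # dR/dt = gamma I
    out[t + 1, ] <- out[t, ] + c(-new.infections,
                                 new.infections - new.removals,
                                 new.removals)
  }
  out
}

epi <- sir(beta = 0.5, gamma = 0.2)   # basic reproductive number R0 = beta/gamma = 2.5
max(epi[, "I"])                       # peak fraction infectious
plot(seq(0, 180, by = 0.1), epi[, "I"], type = "l",
     xlab = "day", ylab = "fraction infectious")
```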
Lecture 25 (Tuesday, 21 April):
Fairness in prediction, especially classification
- Notions of "fair" prediction: not using "protected" features; parity of
error rates across groups; calibration of prediction. "Inference" issues:
using features that carry information about protected features. Trade-offs
between different forms of fairness. Trade-offs between forms of fairness and
accuracy. Some critiques of these notions of "fairness".
- Readings
- Key readings:
- WMD, chapters 1, 5, 6 and 8
- TEA, chapter 2
- Suggested readings:
- Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, "Machine Bias", ProPublica 23 May 2016
- Alexandra Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments", arxiv:1610.07524
- Sam Corbett-Davies and Sharad Goel, "The Measure and
Mismeasure of Fairness: A Critical Review of Fair Machine
Learning", arxiv:1808.00023
- Simon DeDeo, "Wrong side of the tracks: Big Data and Protected Categories", pp. 31--42 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
- Sharad Goel, Jake M. Hofman, M. Irmak Sirer, "Who Does What on the Web: A Large-Scale Study of Browsing Behavior", Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012] [Preprint via Prof. Goel]
- Background reading:
- Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, "A Survey on Bias and Fairness in Machine Learning", arxiv:1908.09635
- Notes (.Rmd)
Lecture 26 (Thursday, 23 April):
Waste, fraud and abuse I: when is our data-mining project going to be a technical failure?
- Weak or non-existent predictive
relationships. Changing predictive relationships. Changing distributions. The curse of dimensionality.
- Readings
- Key readings:
- PDM, chapter 2, section 6.5, section 1.7
- Suggested readings:
- Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16:17 (June 2008)
- Berk, chapter 9
- Jesse Frederik and Maurits Martijn, "The new dot com bubble is here: it’s called online advertising", The Correspondent 2 November 2019 [While a popular article, this is well-sourced in serious research, which it links to]
- Andrew Gelman and David Weakliem, "Of Beauty, Sex and Power",
American Scientist 97 (2009): 310
- Fernando Pereira, comments on Anderson, Earning My Turns, 23 June 2008
- Christian Grothoff and J.M. Porup, "The NSA’s SKYNET program may be killing thousands of innocent people", Ars Technica 16 February 2016
- Notes (.Rmd)
Saturday, 25 April
Homework
Lecture 27 (Tuesday, 28 April):
Waste, Fraud, and Abuse II: More sources of technical failure
- Bad data: errors in data vs. bad measurements, measuring the wrong thing.
Not accounting for model search and/or model selection; sample splitting as a
cure for post-selection inference. The "garden of forking paths" and "researcher degrees of freedom". Prediction vs. causality; taking credit
for what would have happened anyway.
- Readings
- Key reading: TEA, chapter 4
- Suggested readings:
- Background readings
- Jorge Luis Borges, "The Garden of Forking Paths" (as "El jardín de senderos que se bifurcan", 1941; translated by Anthony Kerrigan in Ficciones [New York: Grove Press, 1962] and by Anthony Hurley in Collected Fictions [New York: Viking, 1998])
- Howard Becker, Evidence (Chicago: University of Chicago Press, 2017)
- Luke Oakden-Rayner, "Exploring large scale public medical image datasets", arxiv:1907.12720
- Amit Sharma, Jake M. Hofman and Duncan J. Watts, "Estimating the Causal Impact of Recommendation Systems from Observational Data", pp. 453--470 in Proceedings of the Sixteenth ACM Conference on Economics and Computation [EC '15], arxiv:1510.05569
Lecture 28 (Thursday, 30 April):
Waste, fraud and abuse III: what are we doing to ourselves? should we be doing this?
- Privacy and surveillance. Rights-violating applications of data mining.
Legal (probably) but still destructive applications. Is it any better if it doesn't work (well)? Some ethical check-list questions: who benefits? who is harmed? how much? how surely? are you OK with being responsible for that?
- Readings
- Key readings:
- WMD, Conclusion
- TEA, chapters 1 and 5
- Suggested readings:
- Maciej Ceglowski, "What Happens Next Will Amaze You"
- Henry Farrell, "Seeing Like a Finite-State Machine", Crooked Timber 25 November 2019
- Kashmir Hill, "The Secretive Company That Might End Privacy As We Know It", New York Times 19 January 2020 [published in the print edition as "A Facial Recognition App That Puts Privacy in Peril", p. 1]
See also:
- Betsy Swan, "Facial-Recognition Company That Works With Law Enforcement Says Entire Client List Was Stolen", Daily Beast 26 February 2020
- Stuart A. Thompson and Charlie Warzel, "Twelve Million Phones, One Dataset, Zero Privacy", New York Times 19 December 2019
- Zeynep Tufekci, "We're building a dystopia just to make people click on ads", TED
- Background readings:
- Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [About the first time brilliant and well-intentioned people tried to use data, computers and optimization to make the world a better place]
- Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)
Friday, 1 May
Homework 14 due at 10 pm