Statistics 36-462/662: Data Mining

Spring 2020

Prof. Cosma Shalizi
Tuesdays and Thursdays 1:30--2:50 Porter Hall 100

Actual mining
Data mining is the art of extracting useful patterns from large bodies of data. (Metaphorically: finding seams of actionable knowledge in the raw ore of information.) This involves both methods for discovering possible patterns, and methods for checking or validating candidate patterns. The on-going growth in our ability to use computers to store and analyze data opens opportunities for data mining in business, science, government and elsewhere. The aim of this course is to help you take advantage of these opportunities in a responsible way.

After taking the class, when you're faced with a new data-mining problem, you should be able to (1) select appropriate methods of pattern discovery and validation, and justify your choices, (2) use and program statistical software to implement those methods, (3) critically evaluate the results and communicate them to professional colleagues who are not also statisticians or data-analysts, and (4) be aware of the most common ethical issues involved.

Data mining is related to statistics and to machine learning, but has its own aims and scope. Statistics deal with systems for creating reliable inferences from imperfect data. Machine learning develops technologies of automated prediction and induction. We will use tools from statistics and from machine learning, but we will use them as tools, rather than as things to study in their own right. We will do a lot of calculations, but will not prove many theorems, and we will do many more experiments than calculations.

Prerequisites

For this semester, the prerequisite is passing 36-401 (modern regression) with a grade of C or better. The only exceptions are for graduate students taking the graduate section (36-662; see below).

Graduate students and 36-662

This course is primarily intended as an advanced elective for undergraduate students majoring in statistics. A limited number of graduate students are welcome to take it as 36-662. (Graduate students who register for 36-462 will be dropped from the roster.) Formally, 662 has no prerequisites, but students are expected to have a background at least equivalent to 36-401, and the prerequisites to 401: linear regression modeling, probability, mathematical statistics, linear algebra, calculus, and competence in using R for data analysis. This class does not have the resources to help those without this background catch up.

Undergraduates must take the course as 36-462; those who register for 662 will be dropped from the roster.

Instructors

Professor Dr. Cosma Shalizi cshalizi [at] cmu.edu
Baker Hall 229C
Teaching assistant Ms. Robin Dunn not to be bothered with e-mails not be bothered in offices either
Mr. Tudor Manole
Mr. Aleksandr Podkopaev
Mr. Ian Waudby-Smith

Topics and Techniques to be Covered

These may change during the semester.

Methods of pattern discovery (a.k.a. "heuristics")

"Supervised" prediction methods for classification and regression
Regression=predicting a numerical variable; classification=predicting a categorical variable
General concepts in evaluating prediction models: the bias-variance decomposition of total error, the bias-variance trade-off. Specific error measures for classification: misclassification rate, false positive and false negative errors; ROC curves.
Nearest neighbor methods: basic ideas (averaging nearby-cases), trade-offs in the number of neighbors, mathematical and statistical properties of nearest neighbor prediction, computational considerations
Tree-based methods: basic ideas (averaging similar cases), trade-offs in the size o the tree, mathematical and statistical properties
Methods for combining multiple models (especially trees): bagging, random forests, boosting, stacking
Kernel-based methods: kernels as similarity measures; averaging similar cases; selection of kernels; support vector machines; mathematical and computational issues.
"Unsupervised", not-always-predictive methods
Dimension reduction: forming new features by combining old ones; linear dimension reduction by principal components; linear dimension reduction by factor models; linear dimension reduction by random projections; nonlinear dimension reduction by local linear embedding and by diffusion maps.
Clustering: assigning data points to categories without knowing the right categories; geometric clustering and the k-means algorithm; probabilistic mixture models.
Information-theoretic foundations
Basic concepts of information theory (entropy, information); the informational view of prediction, dimension reduction and clustering; the information bottleneck; feature selection.

Methods of pattern validation

Evaluation of predictive performance
In-sample vs. out-of-sample and generalization performance; the concept of over-fitting; analytical approximations to generalization performance; effective degrees of freedom and other measures of model complexity.
Cross-validation
Evaluation of uncertainty
Bootstrap
Inference after model selection / data mining
Issues with multiple testing and with post-selection inference; sample splitting.

Case studies and concerns

Recommendation engines
"You may also like"; recommender systems based on nearest neighbors; systems based on factor models; hybrid systems; consequences of recommendation engines.
Information retrieval
Searching by content (text, images, genes...); choosing representations of content; the bag-of-words representation of text; evaluating search; dimension reduction and clustering for text; word2vec representation of text; incorporating user feedback.
Fairness in prediction
Case studies: lending decisions, criminal justice. The COMPAS/ProPublica controversy. Definitions of fairness; trade-offs between different notions of fairness, and between fairness and predictive accuracy.
Sources of technical failure in data mining
Data quality. Measurement vs. reality. Weak predictive relationships. Non-existent predictive relationships. The curse of dimensionality. Data-set shift.
Should we do it / what are we doing?
Privacy (or lack thereof). Human rights concerns. Social consequences, e.g., are we building a dystopia to make each other click on ads?

Course Mechanics and Grading

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows that you only thought you understood it.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be three kinds of assignment in this course.

Homework
Every week will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve doing or applying some theory, analyzing real data sets on the computer, and communicating the results.
All homework will be submitted electronically through Canvas and Gradescope. Most weeks, homework will be due at 10:00 pm on Thursdays. Homework assignments will always be released by Friday of the previous week, and sometimes before.
There are specific formatting requirements for homework --- see below.
Online assignments
Most homework assignments will be accompanied by short-answer or multiple choice questions, to be done on Canvas, before the homework itself is due. These will typically be due on Sunday nights; consult Canvas for the exact schedule.
In-class exercises
Most lectures will have in-class exercises. These will be short (10--20 minutes) assignments, emphasizing problem solving, done in class in small groups of no more than four people. The assignments will be given out in class, and must be handed in on paper during the class period. On most days, a randomly-selected group will be asked to present their solution to the class.
The in-class exercises will either be problems from that week's homework, or close enough that seeing how to do the exercise should tell you how to do some of the problems.
There will be no exams.

Time expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

The point of dropping the low grades is to let you schedule interviews, social engagements, and other obligations, with flexibility, and without my having to decide what is or is not important enough to change the rules. If something --- illness, family crisis, or anything else --- is interfering with your ability to do the work of the class, come talk to me.

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Lectures

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.
Readings

There are (up to) three classes of readings associated with each lecture.

  1. Key readings: Usually from required textbooks, occasionally from other sources which will be made available well in advance. Do this reading before coming to lecture, and lecture will make a lot more sense. (Also, the online exercises will be easy.)
  2. Suggested readings: Usually easier-to-read academic papers, or portions of the non-required texts, which will deepen your understand of what we're doing in lecture and in the homework, but aren't critical.
  3. Background readings: Usually harder-to-follow or more historical papers, which give further details about methods and applications.
TL;DR: definitely do the key readings before class; try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.
Electronics

Please don't use any electronic devices during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, properly documented with CMU's Office of Disability Resources.

(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn't distracting your neighbors, who aren't as good at multitasking as you are.)

Missing class and emergencies
You can miss up to three class sessions during the semester without (grade) penalty. In cases of routine illness (for example, you have a bad cold and a day of sleep and fluids would help), stay home, take a free day, and let yourself rest and recover, rather than dragging yourself to class. The grading policy gives you space for this.

In cases of extended illness or emergency, we can deal with make-ups for missed work after the crisis. If extended illness or emergency happens (including mental health crises), or a personal or family crisis, please don't wait to get help. Just let me know that you're dealing with an emergency, and that you are already getting help.

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you will use it for every homework. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code is not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn't notice). For help on using R Markdown, see "Using R Markdown for Class Reports". (Part of the first lecture will be going over the basics of R Markdown.)

Format Requirements for Homework

For each assignment, you should generally submit two, and only two, files:

  1. an R Markdown source file, integrating text, generated figures and R code, uploaded to Canvas;
  2. the "knitted," humanly-readable document, in PDF format, uploaded to Gradescope. (Using PDF is important for the grading software.)
(Using two systems is inconvenient for everyone; Gradescope is much better for efficiently sharing work across multiple graders, but will only work with PDFs.) I will be re-running the R Markdown file of randomly selected students; you should expect to be picked for this about once in the semester. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making obvious allowances for random numbers, etc.).

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. (You should upload these additional files to Canvas.) You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the last homework.

Canvas, Gradescope and Piazza

Homework will be submitted electronically through Canvas and Gradescope; Canvas will also be used as the gradebook. Some readings and course materials will also be distributed through Canvas.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused. As noted above, homework will generally be due at 10 pm on Wednesdays; however, you should not expect the instructors to answer questions on Piazza (or by e-mail) after 6 pm.

Solutions

Solutions for all homework will be available, after their due date, through Canvas. Please don't share them with anyone, even after the course has ended.

Office Hours

Mondays 12:30--1:30 Mr. Manole Wean Hall 3509
Tuesdays 10:30--11:30 Mr. Waudby-Smith Wean Hall 3715
Wednesdays 12:30--1:30 Prof. Shalizi Baker Hall 229C
Thursdays 3:00--4:00 Mr. Podkopaev Porter Hall A19A

If you want help with computing, please bring a laptop.

If you cannot make the regular office hours, or have concerns you'd rather discuss privately, please e-mail me about making an appointment.

Textbook

The following three books are required:
  1. D. J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining (Cambridge, Massachusetts: MIT Press, 2001)
  2. Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016)
  3. Michael Kearns and Aaron Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (New York: Oxford University Press, 2019)
Hand et al. and O'Neil are (currently) available electronically through the university library, but that's limited to one user at a time. You should make sure you have access to copies of all three books.

The following books are recommended but not required.

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss, in general terms, homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to formal disciplinary action, under the terms of the university's policy on academic integrity.

You must complete "homework 0" on the content of the university's policy on academic integrity, and on these course policies. This assignment will not factor into your grade, but you must complete it, with a grade of at least 80%, before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Disability Resources provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Disability Resources at access [at] andrew.cmu.edu or 412-268-6121; they will coordinate with me.

Schedule, Lecture Notes and Readings

Lecture notes and readings will be linked to here, as they become available. Lecture topics are tentative and subject to change. Don't expect required and suggested readings to finalize until about a week before each lecture.

PDM = Principles of Data Mining; WMD = Weapons of Math Destriction; TEA = The Ethical Algorithm
TBD = to be determined = I'll figure it out between when whenever you're reading this and when the readings need to be made available.

Lecture 1 (Tuesday, 14 January): Introduction to the course

Lecture 2 (Thursday, 16 January): Lightning review of linear regression

Sunday, 19 January at 10 pm

Lecture 3 (Tuesday, 21 January): Nearest neighbors I: Mostly theoretical

  • After class: Notes (.Rmd), also covering the material for lecture 4.

    Lecture 4 (Thursday, 23 January): Nearest neighbors II: Mostly statistical

    Sunday, 26 January

    Online questions for homework 2 due at 10 pm

    Lecture 5 (Tuesday, 28 January): Trees and ensembles I

    Lecture 6 (Thursday, 30 January): Trees and ensembles II

    Lecture 7 (Tuesday, 4 February): Linear classifiers and logistic regression

    Lecture 8 (Thursday, 6 February): Kernel methods, support vector machines, and random kitchen sinks

    Sunday, 9 February

    Online questions for homework 4 due at 10 pm No online reading questions this week.

    Lecture 9 (Tuesday, 11 February): Information measures I: Scene-setting

    Lecture 10 (Thursday, 13 February): Information measures II: The information-theoretic view of prediction, feature selection, and summarization

    Sunday, 16 February

    Online questions for homework 5 due at 10 pm

    NO CLASS on Tuesday, 18 February

    Lecture 11 (Thursday, 20 February): Prelude to dimension reduction

    Sunday, 23 February

    Online questions for homework 6 due at 10 pm

    Lecture 12 (Tuesday, 25 February): Linear dimension reduction

    Lecture 13 (Thursday, 27 February): Linear dimensionality reduction, once more with feeling

    --- The original plan was to do nonlinear dimension reduction, as follows:

    Sunday, 1 March

    Online questions for homework 7 due at 10 pm

    Lecture 14 (Tuesday, 3 March): Prelude to clustering

    Lecture 15 (Thursday, 5 March): Clustering I: Clusters without Probability

    10 and 12 March: NO CLASS

    Enjoy spring break.

    Monday, 16 March

    Online questions for homework 8 due at 10 pm (note day!)

    Lecture 16 (Tuesday, 17 March): NO CLASS

    The university has canceled all instruction for Monday and Tuesday.

    Lecture 17 (Thursday, 19 March): Clustering II

    Lecture 18 (Tuesday, 24 March): How big are the errors of our predictive models?

    Lecture 19 (Thursday, 26 March): What kinds of errors do our predictive models make?

    Saturday, 28 March

    Homework 9 due, along with online reading questions (an extension)

    Lecture 20 (Tuesday, 31 March): Cross-validation

    Lecture 21 (Thursday, 2 April): Bootstrapping

    Lecture 22 (Tuesday, 7 April): Recommender systems I: Broad Ideas and Technical Issues

    Lecture 23 (Thursday, 9 April): Recommender systems II: What's Not to Love?

    Lecture 24 (Tuesday, 14 April): Information retrieval

    Thursday, 16 April: OPTIONAL lecture on epidemic modeling

    Originally, there wasn't supposed to be any class this day because of Carnival. The university now says it's a day for instruction, but I promised you wouldn't have to do anything, so we'll cover something that is, strictly speaking, outside the scope of the course, but statistical and topical, and make it completely optional.

    Lecture 25 (Tuesday, 21 April): Fairness in prediction, especially classification

    Lecture 26 (Thursday, 23 April): Waste, fraud and abuse I: when is our data-mining project going to be a technical failure?

    Saturday, 25 April

  • Homework

    Lecture 27 (Tuesday, 28 April): Waste, Fraud, and Abuse II: More sources of technical failure

    Lecture 28 (Thursday, 30 April): Waste, fraud and abuse III: what are we doing to ourselves? should we be doing this?

    Friday, 1 May

    Homework 14 due at 10 pm