Statistics 36-462/662: Data Mining
Spring 2020
Prof. Cosma Shalizi
Tuesdays and Thursdays 1:30--2:50 Porter Hall 100
Data mining is the art of extracting useful patterns from large bodies of data.
(Metaphorically: finding seams of actionable knowledge in the raw ore of
information.) This involves both methods for discovering possible patterns,
and methods for checking or validating candidate patterns. The on-going growth
in our ability to use computers to store and analyze data opens opportunities
for data mining in business, science, government and elsewhere. The aim of
this course is to help you take advantage of these opportunities in a
responsible way.
After taking the class, when you're faced with a new data-mining problem,
you should be able to (1) select appropriate methods of pattern discovery and
validation, and justify your choices, (2) use and program statistical software
to implement those methods, (3) critically evaluate the results and communicate
them to professional colleagues who are not also statisticians or
data-analysts, and (4) be aware of the most common ethical issues involved.
Data mining is related to statistics
and to machine learning, but has its own
aims and scope. Statistics deals with systems for creating reliable inferences
from imperfect data. Machine learning develops technologies of automated
prediction and induction. We will use tools from statistics and from machine
learning, but we will use them as tools, rather than as things to study in
their own right. We will do a lot of calculations, but will not prove many
theorems, and we will do many more experiments than calculations.
Prerequisites
For this semester, the prerequisite is
passing 36-401 (modern
regression) with a grade of C or better. The only exceptions are for
graduate students taking the graduate section (36-662; see below).
Graduate students and 36-662
This course is primarily intended as an advanced elective for undergraduate
students majoring in statistics. A limited number of graduate students are
welcome to take it as 36-662. (Graduate students who register for 36-462 will
be dropped from the roster.) Formally, 662 has no prerequisites, but students
are expected to have a background at least equivalent to 36-401, and the
prerequisites to 401: linear regression modeling, probability, mathematical
statistics, linear algebra, calculus, and competence in using R for data
analysis. This class does not have the resources to help those without this
background catch up.
Undergraduates must take the course as 36-462; those who register for 662
will be dropped from the roster.
Instructors
Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
Teaching assistants | Ms. Robin Dunn | no e-mail, please | no office visits outside office hours, please |
 | Mr. Tudor Manole | | |
 | Mr. Aleksandr Podkopaev | | |
 | Mr. Ian Waudby-Smith | | |
Topics and Techniques to be Covered
These may change during the semester.
Methods of pattern discovery (a.k.a. "heuristics")
- "Supervised" prediction methods for classification and regression
- Regression=predicting a numerical variable; classification=predicting a categorical variable
- General concepts in evaluating prediction models: the bias-variance decomposition of total error, the bias-variance trade-off. Specific error measures for classification: misclassification rate, false positive and false negative errors; ROC curves.
- Nearest neighbor methods: basic ideas (averaging nearby-cases), trade-offs in the number of neighbors, mathematical and statistical properties of nearest neighbor prediction, computational considerations
- Tree-based methods: basic ideas (averaging similar cases), trade-offs in the size of the tree, mathematical and statistical properties
- Methods for combining multiple models (especially trees): bagging, random forests, boosting, stacking
- Kernel-based methods: kernels as similarity measures; averaging similar cases; selection of kernels; support vector machines; mathematical and computational issues.
- "Unsupervised", not-always-predictive methods
- Dimension reduction: forming new features by combining old ones; linear dimension reduction by principal components; linear dimension reduction by factor models; linear dimension reduction by random projections; nonlinear dimension reduction by local linear embedding and by diffusion maps.
- Clustering: assigning data points to categories without knowing the right categories; geometric clustering and the k-means algorithm; probabilistic mixture models.
- Information-theoretic foundations
- Basic concepts of information theory (entropy, information); the informational view of prediction, dimension reduction and clustering; the information bottleneck; feature selection.
Methods of pattern validation
- Evaluation of predictive performance
- In-sample vs. out-of-sample and generalization performance; the concept of over-fitting; analytical approximations to generalization performance; effective degrees of freedom and other measures of model complexity.
- Cross-validation
- Evaluation of uncertainty
- Bootstrap
- Inference after model selection / data mining
- Issues with multiple testing and with post-selection inference; sample splitting.
Case studies and concerns
- Recommendation engines
- "You may also like"; recommender systems based on nearest neighbors; systems based on factor models; hybrid systems; consequences of recommendation engines.
- Information retrieval
- Searching by content (text, images, genes...); choosing representations of content; the bag-of-words representation of text; evaluating search; dimension reduction and clustering for text; word2vec representation of text; incorporating user feedback.
- Fairness in prediction
- Case studies: lending decisions, criminal justice. The COMPAS/ProPublica controversy. Definitions of fairness; trade-offs between different notions of fairness, and between fairness and predictive accuracy.
- Sources of technical failure in data mining
- Data quality. Measurement vs. reality. Weak predictive
relationships. Non-existent predictive relationships. The curse of dimensionality. Data-set shift.
- Should we do it / what are we doing?
- Privacy (or lack thereof). Human rights concerns. Social consequences, e.g., are we building a dystopia to
make each other click on ads?
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills
you are learning in this class. It also actually helps you learn, because some
things which seem murky clarify when you actually do them, and sometimes trying
to do something shows that you only thought you understood it.
- Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
- Homework
- Every week will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve doing or applying some
theory, analyzing real data sets on the computer, and communicating the
results.
- All homework will be submitted electronically through Canvas and
Gradescope. Most weeks, homework will be due at 10:00 pm on
Thursdays. Homework assignments will always be released by Friday of
the previous week, and sometimes before.
- There are specific formatting requirements for
homework --- see below.
- Online assignments
- Most homework assignments will be accompanied by short-answer or multiple
choice questions, to be done on Canvas, before the homework itself is
due. These will typically be due on Sunday nights; consult Canvas for the
exact schedule.
- In-class exercises
- Most lectures will have in-class exercises. These will be short (10--20
minutes) assignments, emphasizing problem solving, done in class in small
groups of no more than four people. The assignments will be given out in
class, and must be handed in on paper during the class period. On most days, a
randomly-selected group will be asked to present their solution to the class.
- The in-class exercises will either be problems from that week's homework,
or close enough that seeing how to do the exercise should tell you how to do
some of the problems.
There will be no exams.
Time expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- In-class exercises: 10%. All exercises will have equal weight. Your
lowest three exercise grades will be dropped; if you complete all
exercises with a score of at least 60%, your lowest five grades will be
dropped.
- Homework: 90%. All homework assignments will be weighted equally. Most
assignments will include online short-answer questions, due before the
main assignment (see above); these will be incorporated into the grades for the
corresponding assignments. The online questions will amount to about 10% of
your final grade.
Your lowest two homework grades will be dropped. If you complete all
homework assignments, with a minimum score of at least 60%, your
lowest three homework grades will be dropped.
Late homework will not be accepted for any reason.
The point of dropping the low grades is to let you schedule interviews,
social engagements, and other obligations, with flexibility, and without my
having to decide what is or is not important enough to change the rules. If
something --- illness, family crisis, or anything else --- is interfering with
your ability to do the work of the class, come talk to me.
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that a particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, etc., will accomplish much less than pointing to concrete
problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments which will help consolidate your
understanding of the material, and help with your homework.
Readings
There are (up to) three classes of readings associated with each lecture.
- Key readings: Usually from required textbooks,
occasionally from other sources which will be made available well in advance.
Do this reading before coming to lecture, and lecture will make a lot more
sense. (Also, the online exercises will be easy.)
- Suggested readings: Usually easier-to-read academic papers, or
portions of the non-required texts, which will deepen your understanding of what
we're doing in lecture and in the homework, but aren't critical.
- Background readings: Usually harder-to-follow or more historical papers,
which give further details about methods and applications.
TL;DR: definitely do the key readings before class; try to do the suggested readings too, they're easy; and the background readings are if you want to get into the weeds.
Electronics
Please don't use any electronic devices during lecture: no laptops, no
tablets, no phones, no recording devices, no watches that do
more than tell time. The only exception to this rule is for electronic
assistive devices, properly documented with CMU's Office of Disability
Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show,
pretty
clearly, that students learn more in electronics-free classrooms, not least
because your device isn't distracting your neighbors, who aren't as good at
multitasking as you are.)
Missing class and emergencies
You can miss up to three class sessions during the semester without
(grade) penalty. In cases of routine illness (for example, you have a bad cold
and a day of sleep and fluids would help), stay home, take a free
day, and let yourself rest and recover, rather than dragging
yourself to class. The grading policy gives you space for this.
In cases of extended illness or emergency, we can deal with make-ups for
missed work after the crisis. If you are dealing with an extended illness or
emergency (including a mental-health crisis), or with a personal or family
crisis, please don't wait to get help. Just let me know that you're dealing
with an emergency, and that you are already getting help.
R, R Markdown, and Reproducibility
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you will use it for
every homework. If you are not able to use R, or do not have ready,
reliable access to a computer on which you can do so, let me know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code are not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
ask first.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can re-do your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports". (Part of the first lecture will be going over
the basics of R Markdown.)
For each assignment, you should generally submit two, and only two, files:
- an R Markdown source file, integrating text, generated figures and R code, uploaded to Canvas;
- the "knitted," humanly-readable document, in PDF format, uploaded to
Gradescope. (Using PDF is
important for the grading software.)
(Using two systems is inconvenient for everyone; Gradescope is much better for efficiently sharing work across multiple graders, but will only work with PDFs.)
I will be re-running the R Markdown file
of randomly selected students; you should expect to be picked for this about
once in the semester. You will lose points if your R Markdown file does not,
in fact, generate your knitted file (making obvious allowances for random
numbers, etc.).
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for type-setting math. (It's based on
the LaTeX document-preparation system widely used in the sciences.) If you
can't get it to work, you can hand-write the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. (You
should upload these additional files to Canvas.) You will, however, lose
points for doing so, starting with no penalty for homework 1, and growing to a
90% penalty (for those problems) by the last homework.
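For concreteness, here is a minimal sketch of what an R Markdown source file can look like; the title, data, and numbers are invented purely for illustration and are not part of any actual assignment:

````
---
title: "Homework X: An Example Write-Up"
author: "Your Name (yourandrewid)"
output: pdf_document
---

Start by explaining, in words, what question you are answering and how.

```{r example-chunk}
# a made-up illustration: simulate data and fit a line
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.1)
fit <- lm(y ~ x)
coef(fit)
```

The estimated slope is `r round(coef(fit)[2], 2)`, close to the true value of 2.
Math is typeset in LaTeX notation, e.g. $\hat{\beta} = (X^T X)^{-1} X^T y$.
````

Knitting this file produces a PDF in which the code, its output, and the surrounding text appear together, which is what makes the analysis reproducible.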
Canvas, Gradescope and Piazza
Homework will be submitted electronically through Canvas and Gradescope;
Canvas will also be used as the gradebook. Some readings and course materials
will also be distributed through Canvas.
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused. As noted above, homework
will generally be due at 10 pm on Thursdays; however, you should not expect
the instructors to answer questions on Piazza (or by e-mail) after 6 pm.
Solutions
Solutions for all homework will be available, after their due date, through
Canvas. Please don't share them with anyone, even after the course has ended.
Office Hours
Mondays | 12:30--1:30 | Mr. Manole | Wean Hall 3509 |
Tuesdays | 10:30--11:30 | Mr. Waudby-Smith | Wean Hall 3715 |
Wednesdays | 12:30--1:30 | Prof. Shalizi | Baker Hall 229C |
Thursdays | 3:00--4:00 | Mr. Podkopaev | Porter Hall A19A |
If you want help with computing, please bring a laptop.
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Textbooks
The following three books are required:
- D. J. Hand, Heikki Mannila
and Padhraic Smyth,
Principles of Data
Mining (Cambridge, Massachusetts: MIT Press, 2001)
- Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown Books, 2016)
- Michael Kearns and Aaron Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (New York: Oxford University Press, 2019)
Hand et al. and O'Neil are (currently) available electronically through
the university library, but that's limited to one user at a time. You should
make sure you have access to copies of all three books.
The following books are recommended but not required.
- Richard A. Berk, Statistical Learning from a Regression Perspective (2nd edition, New York: Springer, 2016) [available online through the university library]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition, Berlin: Springer-Verlag, 2009) [free online]
- Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman,
Mining of Massive Datasets (second edition, Cambridge, England: Cambridge University Press, 2014) [free online]
- Paul Teetor, The R Cookbook (Sebastopol, California: O'Reilly, 2011) [R's help files are organized around "What does command X do?"; this is organized around "I want to do Y, what commands do I need to use?"]
Collaboration, Copying, and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss, in general terms, homework with other
students in the class, though not to share or compare work; such conversations
must be acknowledged in your assignments. You may not discuss the
content of assignments with anyone other than current students, the
instructors, or your teachers in other current classes at CMU, until after the
assignments are due. (Exceptions can be made, with prior permission, for
approved tutors.) You are free to complain, in general terms, about any aspect
of the course, to whomever you like.
Any use of solutions provided for any assignment in this course in previous
semesters is strictly prohibited. This prohibition applies even to students
who are re-taking the course. Do not copy the old solutions (in whole or in
part), do not "consult" them, do not read them, do not ask your friend who took
the course last year if they "happen to remember" or "can give you a hint".
Doing any of these things, or anything like these things, is cheating, it is
easily detected cheating, and those who thought they could get away with it in
the past have failed the course. Even more importantly: doing any of those
things means that the assignment doesn't give you a chance to
practice; it makes any feedback you get meaningless; and of course it makes any
evaluation based on that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your mis-deeds might be rectified.
Otherwise, violations of any sort will lead to formal disciplinary action,
under the terms of the university's
policy
on academic integrity.
You must complete "homework 0" on the content of the university's policy on
academic integrity, and on these course policies. This assignment will not
factor into your grade, but you must complete it, with a grade of at
least 80%, before you can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of
Disability Resources provides support services for both physically disabled
and learning disabled students. For individualized academic adjustment based
on a documented disability, contact Disability Resources at access [at]
andrew.cmu.edu or 412-268-6121; they will coordinate with me.
Schedule, Lecture Notes and Readings
Lecture notes and readings will be linked to here, as they become available.
Lecture topics are tentative and subject to change. Don't expect required and
suggested readings to finalize until about a week before each lecture.
PDM = Principles of Data Mining; WMD = Weapons of Math Destruction; TEA = The Ethical Algorithm
TBD = to be determined = I'll figure it out between whenever you're reading this and when the readings need to be made available.
Lecture 1 (Tuesday, 14 January): Introduction to the course
- What is data mining? how is it used? where did it come from? where does the
data come from? Some technical themes: prediction by averaging similar cases;
deciding what's a similar case; verification by predicting new data. R
Markdown. Course policy on cheating.
- Readings
- Key readings:
- Background readings:
- Homework:
- After class:
Lecture 2 (Thursday, 16 January):
Lightning review of linear regression
- Linear regression as a prediction method. Expected squared error; the
bias-variance decomposition; the conditional expectation function as the
optimal regression function. The optimal linear predictor: why it always
exists (regardless of whether anything is really linear, let alone
linear-and-Gaussian), and its general mathematical form. Ordinary least
squares as an estimator of the optimal linear predictor; OLS in matrix form.
The hat matrix and predictions. Model checking and diagnostics: when they
matter and when they don't. What linear models can and cannot do. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, sections 11.1--11.3
- Suggested readings:
- Berk, chapter 1
- Hastie et al., sections 2.3.1 and 2.6, and chapter 3 (especially through section 3.4)
- Background reading:
- Richard A. Berk, Regression Analysis: A Constructive Critique (Thousand Oaks, California: Sage, 2004)
- R. W. Farebrother, Fitting Linear Relationships: A History of the Calculus of Observations, 1750--1900 (New York: Springer-Verlag, 1999)
- CRS, The Truth About Linear Regression
- Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary
Time-Series: with Engineering Application (Cambridge, Massachusetts: The Technology Press, 1949 [but first published as a classified technical report, Washington, D.C.: National Defense Research Council, 1942])
- After class:
- Slides (typo-corrected!)
- .Rmd source file for the slides
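To make the linear-algebra part of this lecture concrete, here is a minimal base-R sketch of ordinary least squares in matrix form, including the hat matrix; the simulated data and variable names are invented for illustration, and lm() is of course what you would normally use.

```r
set.seed(36462)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.2)   # the truth is NOT linear, but the optimal
                                   # linear predictor still exists

X <- cbind(1, x)                            # design matrix with an intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # OLS: (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
fitted.vals <- H %*% y                      # predictions = hat matrix times y

fit <- lm(y ~ x)                            # the same estimates, via lm()
all.equal(as.numeric(coef(fit)), as.numeric(beta.hat))
```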
Sunday, 19 January at 10 pm
- Homework 0 due
- Online questions for homework 1 due
Lecture 3 (Tuesday, 21 January):
Nearest neighbors I: Mostly theoretical
- k-nearest neighbors for classification, regression, and general prediction.
Performance of nearest-neighbor prediction (=1NN) for noise-free function
learning; how and why the distance to the nearest neighbor goes to zero.
Effects of noise. Why 1NN converges to a risk of twice the minimum possible
risk. Bias-variance trade-off for k-nearest-neighbors.
- Readings
- Key reading: PDM, sections 6.3.3, 10.1--10.2 and 10.6
- Suggested reading: Hastie et al., chapter 13 (skimming section 13.2)
- Background readings:
- Thomas M. Cover, "Estimation by the Nearest Neighbor Rule", IEEE Transactions on Information Theory 14 (1968): 50--55 [PDF reprint via Prof. Cover]
- Thomas M. Cover, "Rates of Convergence for Nearest Neighbor Procedures", pp. 413--415 in B. K. Kinariwala and F. F. Kuo (eds.), Proceedings of the Hawaii International Conference on Systems Sciences (Honolulu: University of Hawaii Press, 1968) [PDF reprint via Prof. Cover]
- Thomas M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory 13 (1967): 21--27 [PDF reprint via Prof. Cover]
After class: Notes (.Rmd), also covering the material for lecture 4.
Lecture 4 (Thursday, 23 January):
Nearest neighbors II: Mostly statistical
- The bias-variance trade-off for k-nearest-neighbors (reprise). The problem
of selecting the number of nearest neighbors. Selecting k by
cross-validation: v-fold versus leave-one-out. Some examples. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, sections 10.10--10.11
- Homework:
- After class: Notes (.Rmd), also covering the material for lecture 3.
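Here is a minimal base-R sketch covering this lecture and the previous one: a hand-rolled k-nearest-neighbor regression, and selection of k by 5-fold cross-validation. It is written for clarity rather than speed, and the simulated data, the helper names (knn.predict, cv.mse), and the grid of k values are all invented for illustration.

```r
set.seed(36462)
n <- 300
x <- runif(n, 0, 10)
y <- cos(x) + rnorm(n, sd = 0.3)

knn.predict <- function(x0, x, y, k) {
  # predict at each point in x0 by averaging the y's of its k nearest x's
  sapply(x0, function(pt) mean(y[order(abs(x - pt))[1:k]]))
}

cv.mse <- function(k, x, y, nfolds = 5) {
  # estimate out-of-sample MSE for a given k by v-fold cross-validation
  fold <- sample(rep(1:nfolds, length.out = length(x)))
  mean(sapply(1:nfolds, function(f) {
    preds <- knn.predict(x[fold == f], x[fold != f], y[fold != f], k)
    mean((y[fold == f] - preds)^2)
  }))
}

ks <- 1:25
cv.scores <- sapply(ks, cv.mse, x = x, y = y)
ks[which.min(cv.scores)]   # small k = low bias / high variance; large k = the reverse
```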
Sunday, 26 January
Online questions for homework 2 due at 10 pm
Lecture 5 (Tuesday, 28 January):
Trees and ensembles I
- Classification and regression trees; the CART algorithm for building trees; "entropy" and "deviance"; some examples. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, sections 5.2 and 10.5
- Suggested readings:
- Berk, chapter 3
- Hastie et al., section 9.2
- After class: the chapter on trees from Advanced Data Analysis from an Elementary Point of View
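A minimal sketch of fitting and pruning a regression tree, assuming the rpart package (shipped with the standard R distribution); the data set and the particular complexity values are invented for illustration.

```r
library(rpart)
set.seed(36462)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- ifelse(df$x1 > 0.5, 3, 0) + 2 * df$x2 + rnorm(200, sd = 0.5)

tree <- rpart(y ~ x1 + x2, data = df, method = "anova")  # recursive partitioning
print(tree)                         # the splits, in text form
plotcp(tree)                        # cross-validated error vs. tree complexity
pruned <- prune(tree, cp = 0.02)    # cut the tree back to a chosen complexity
predict(pruned, newdata = data.frame(x1 = 0.7, x2 = 0.1))
```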
Lecture 6 (Thursday, 30 January):
Trees and ensembles II
- Improving on single trees by combining multiple trees. "Bagging", or
"bootstrap averaging": combining multiple trees from slightly different data
sets. Random forests: bagging plus random feature selection. Two perspectives
on why multiple trees help: averaging reduces variance (but less so if what's
averaged is positively correlated); the performance of the average is the
average performance plus the diversity. (See the short R sketch at the end of this entry.)
- Reading
- Suggested readings:
- Berk, chapters 4 and 5
- Hastie et al., section 8.7 (bagging) and chapter 15 (random forests)
- Background reading:
- Leo Breiman, "Bagging Predictors", Machine Learning 24 (1996): 123--140
- Leo Breiman, "Random Forests", Machine Learning 45 (2001): 5--32
- Hastie et al., sections 8.8 (stacking) and 9.5 (hierarchical mixtures of experts)
- Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", pp. 231--238 in Gerald Tesauro, David Tourtetsky and Todd Leen (eds.) Advances in Neural Information Processing Systems 7 [NIPS 1994] (Cambridge, Massachusetts: MIT Press, 1995)
- After class: Notes, including the algebra I short-circuited in class (.Rmd)
- Homework
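To illustrate the "averaging reduces variance" idea, here is a bare-bones sketch of bagging regression trees, assuming the rpart package; with only one predictor there is no feature sub-sampling, so this is bagging rather than a true random forest (for which add-on packages such as randomForest exist). The data and function name are invented for illustration.

```r
library(rpart)
set.seed(36462)
df <- data.frame(x = runif(300, 0, 10))
df$y <- sin(df$x) + rnorm(300, sd = 0.3)
test <- data.frame(x = seq(0, 10, length.out = 50))

bag.predict <- function(train, test, B = 100) {
  # grow B trees on bootstrap resamples of the rows, then average their predictions
  preds <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    predict(rpart(y ~ x, data = boot), newdata = test)
  })
  rowMeans(preds)
}

single.tree <- predict(rpart(y ~ x, data = df), newdata = test)
bagged <- bag.predict(df, test)
plot(test$x, single.tree, type = "l", lty = 2, ylab = "prediction")
lines(test$x, bagged)    # typically smoother, i.e., lower-variance
```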
Lecture 7 (Tuesday, 4 February):
Linear classifiers and logistic regression
- The "prototype method" for classification (= which class has the closer average vector?). Linear classifiers more broadly. The idea of a decision boundary. Logistic regression: a linear classifier plus an estimated class probability.
- Reading
- Key reading: PDM, sections 10.4 and 10.7
- Suggested reading: Hastie et al., chapter 4
- After class: Slides (.Rmd)
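A minimal base-R sketch of logistic regression as a linear classifier with class probabilities; the coefficients and simulated data are invented for illustration.

```r
set.seed(36462)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p <- plogis(-1 + 2 * x1 - 1.5 * x2)        # true class-1 probabilities
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)
coef(fit)                                  # the estimated linear decision function
phat <- predict(fit, type = "response")    # estimated class probabilities
yhat <- ifelse(phat > 0.5, 1, 0)           # classify by thresholding at 1/2
mean(yhat != y)                            # in-sample misclassification rate
# The decision boundary is the line where the linear predictor equals zero.
```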
Lecture 8 (Thursday, 6 February):
Kernel methods, support vector machines, and random kitchen sinks
- Nonlinear classification by using a linear classifier with
a nonlinear set of features. The "kernel trick" for avoiding explicitly
calculating features. Support vector machines. Random nonlinear features. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, section 10.3
- Suggested readings:
- Berk, chapter 7
- Hastie et al., chapter 12
- Background reading:
- After lecture: Notes (.Rmd source)
- Homework
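Here is a base-R sketch of the "random kitchen sinks" idea: a linear classifier fit to a batch of random cosine features behaves like a nonlinear (kernel-style) classifier in the original coordinates. Everything here (the circular class boundary, the number of features D, the noise levels) is invented for illustration; actual support vector machines live in add-on packages such as e1071, which this sketch deliberately does not use.

```r
set.seed(36462)
n <- 400
x <- matrix(runif(n * 2, -2, 2), ncol = 2)
p.true <- ifelse(x[, 1]^2 + x[, 2]^2 < 2, 0.9, 0.1)   # a circular class boundary
y <- rbinom(n, 1, p.true)

D <- 30                                      # number of random features
W <- matrix(rnorm(2 * D), nrow = 2)          # random projection directions
b <- runif(D, 0, 2 * pi)                     # random phases
Phi <- cos(sweep(x %*% W, 2, b, "+"))        # n x D matrix of nonlinear features

fit.sinks <- glm(y ~ Phi, family = binomial)    # linear in Phi, nonlinear in x
fit.linear <- glm(y ~ x, family = binomial)     # linear in the raw coordinates
mean((predict(fit.sinks, type = "response") > 0.5) != y)   # much lower error ...
mean((predict(fit.linear, type = "response") > 0.5) != y)  # ... than this
```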
Sunday, 9 February
No online reading questions this week (no online questions for homework 4).
Lecture 9 (Tuesday, 11 February):
Information measures I: Scene-setting
- Prediction needs statistical dependence; reminder of the definition
of statistical independence and how it connects to prediction. Mutual
information quantifies dependence between random variables. Alternate take: entropy measures uncertainty or spread in a random variable; conditional entropy measures uncertainty after conditioning; mutual information is the reduction in entropy due to conditioning. (See the short R sketch at the end of this entry.)
- Bonus material not covered in lecture: proofs of the "it can be shown that" claims; connection between entropy and coding;
measuring differences between distributions with divergence a.k.a. relative entropy; connections between classification and conditional entropy; connections between classification and divergence; maximum likelihood estimation is really minimum divergence estimation; estimating entropy and information from sample data.
- After class: Notes (.Rmd)
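A small base-R sketch of the basic information-theoretic quantities, computed from a made-up two-way table of counts; the numbers are invented for illustration.

```r
counts <- matrix(c(30, 10,
                   10, 50), nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
p.xy <- counts / sum(counts)                  # joint distribution
p.x <- rowSums(p.xy); p.y <- colSums(p.xy)    # marginal distributions

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # in bits

H.x <- entropy(p.x)
H.y <- entropy(p.y)
H.xy <- entropy(p.xy)
mi <- H.x + H.y - H.xy        # mutual information I[X;Y]
H.y.given.x <- H.xy - H.x     # conditional entropy H[Y|X] = H[X,Y] - H[X]
c(H.X = H.x, H.Y = H.y, H.Y.given.X = H.y.given.x, I.XY = mi)
# I[X;Y] = H[Y] - H[Y|X]: the reduction in uncertainty about Y from knowing X
```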
Lecture 10 (Thursday, 13 February):
Information measures II: The information-theoretic view of prediction, feature selection, and summarization
- Predictive information. Sufficient statistics. Statistical
relevance bases. The information-bottleneck method. Selecting features to maximize predictive information; to minimize
stored information. Creation of new features to capture information:
the information bottleneck again; word2vec.
- Readings
- Suggested reading:
- Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", pp. 368--377 in B. Hajek and R. S. Sreenivas (eds.), Proceedings of the 37th Annual Allerton Conference on
Communication, Control and Computing (Urbana, Illinois: University of Illinois Press, 1999), arxiv:physics/0004057
- After class: Notes (with extensive worked examples); R code for reproducing the notes (except for a few deliberately-omitted items);
nyt.frame.csv data frame
- Homework
Sunday, 16 February
Online questions for homework 5 due at 10 pm
NO CLASS on Tuesday, 18 February
Lecture 11 (Thursday, 20 February):
Prelude to dimension reduction
- Goals of dimension reduction. Linear algebra
review: vectors, matrices, inner products, projections. Eigen-decomposition
of matrices. (See the short R sketch at the end of this entry.)
- Readings: None
- Homework
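A quick base-R check of the eigen-decomposition facts this review leans on; the random matrix is invented for illustration.

```r
set.seed(36462)
A <- crossprod(matrix(rnorm(9), 3, 3))    # a random symmetric matrix
e <- eigen(A)
e$values                                  # eigenvalues, in decreasing order
# A = V diag(lambda) V', with orthonormal eigenvectors in the columns of V:
max(abs(A - e$vectors %*% diag(e$values) %*% t(e$vectors)))   # ~ 0
max(abs(crossprod(e$vectors) - diag(3)))                      # V'V = I, up to rounding
```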
Sunday, 23 February
Online questions for homework 6 due at 10 pm
Lecture 12 (Tuesday, 25 February):
Linear dimension reduction
- "Principal components analysis" = finding the linear surface which comes closest to the original data points. Performing and interpreting PCA.
Example with historical complexity. "Latent semantic indexing" = PCA on
bag-of-words vectors. Multidimensional scaling and PCA. Random linear
projections as an alternative to PCA. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 3, especially sections 3.6 and 3.7
- Suggested readings:
- Background readings:
- After class: Slides, with amplifications / clarifications based on class discussion (.Rmd source file)
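A minimal base-R sketch of PCA with prcomp(); the simulated data (three variables driven by one underlying factor, plus one variable of pure noise) are invented for illustration.

```r
set.seed(36462)
n <- 200
z <- rnorm(n)                                  # one underlying factor
X <- cbind(v1 = z + rnorm(n, sd = 0.3),
           v2 = -z + rnorm(n, sd = 0.3),
           v3 = 2 * z + rnorm(n, sd = 0.3),
           v4 = rnorm(n))                      # pure noise
pca <- prcomp(X, scale. = TRUE)                # center and scale, then rotate
summary(pca)                                   # proportion of variance per component
pca$rotation[, 1:2]                            # loadings: how the PCs combine the variables
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2") # the data projected on the first two PCs
```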
Lecture 13 (Thursday, 27 February):
Linear dimensionality reduction, once more with feeling
- Recapitulation of the math; worked example with real data.
- After class: Slides (.Rmd source file)
- Homework
--- The original plan was to do nonlinear dimension reduction, as follows:
- Problems with using linear methods when the data lie near a curved surface. The concept of a "manifold". Exploiting manifold structure: local linear embedding. Exploiting a pre-existing measure of similarity: diffusion maps.
- Reading
- Key readings:
- Background reading:
- Mikhail Belkin and Partha Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation",
Neural Computation 15 (2003): 1373--1396 [PDF via Prof. Belkin]
- Ann B. Lee and Larry Wasserman, "Spectral Connectivity Analysis", Journal of the American Statistical Association 105 (2010): 1241--1255, arxiv:0811.0121
- Sam T. Roweis and Laurence K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science 290 (2000): 2323--2326
- Lawrence K. Saul and Sam T. Roweis, "Think Globally, Fit Locally: Supervised Learning of Low Dimensional Manifolds", Journal of Machine Learning Research 4 (2003): 119--155
Sunday, 1 March
Online questions for homework 7 due at 10 pm
Lecture 14 (Tuesday, 3 March):
Prelude to clustering
- "Clustering" is dividing data into categories, without knowing what the
categories are, or having any examples to follow. Why this is
not totally hopeless, and some rough ideas about what good clusters
might look like. Lightning review of probability distributions: cumulative
distribution function, probability density function, median, mode, mean;
goodness-of-fit testing for distributions (chi-squared, Kolmogorov-Smirnov);
entropy and relative entropy again.
- Reading
- Key reading: PDM, sections 6.4 and 9.2.1--9.2.3, 9.2.6
Lecture 15 (Thursday, 5 March):
Clustering I: Clusters without Probability
- The most basic algorithm for clustering: k-means. Using clustering to find
(approximate) nearest neighbors ("locality-sensitive hashing"). (See the short R sketch at the end of this entry.)
- Readings
- Key readings:
- PDM, sections 9.3--9.5 (skimming 9.1 and 9.2 is recommended)
- O'Neil, Chapters 9 and 10
- Suggested readings:
- Hastie et al., sections 14.3.1--14.3.6
- Background reading:
- Aristides Gionis, Piotr Indyk and Rajeev Motwani, "Similarity Search in High Dimensions via Hashing", pp. 518--529 in M. P. Atkinson et al. (eds.),
Proceedings of the 25th International Conference on Very Large Data Bases [VLDB '99] (San Francisco: Morgan Kaufmann, 1999)
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data (Cambridge, UK: Cambridge University Press, 2006)
- After class: Notes
- Homework
- Homework 7 due at 10 pm
Homework 8 assigned
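A minimal base-R sketch of geometric clustering with kmeans(); the two simulated "blobs" are invented for illustration.

```r
set.seed(36462)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),    # blob 1: 50 points
           matrix(rnorm(100, mean = 3), ncol = 2))    # blob 2: 50 points
km <- kmeans(X, centers = 2, nstart = 20)   # nstart = multiple random initializations
km$centers                                  # estimated cluster centers
table(km$cluster)                           # cluster sizes
km$tot.withinss                             # the k-means objective: within-cluster sum of squares
plot(X, col = km$cluster, pch = 19)
points(km$centers, pch = 8, cex = 2)
```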
10 and 12 March: NO CLASS
Enjoy spring break.
Monday, 16 March
Online questions for homework 8 due at 10 pm (note day!)
Lecture 16 (Tuesday, 17 March): NO CLASS
The university has canceled all instruction for Monday and Tuesday.
Lecture 17 (Thursday, 19 March):
Clustering II
- Probabilistic clustering with mixture models; the EM algorithm; k-means
again. Using log-likelihood to pick the number of clusters. Checking
goodness of fit of cluster models. Advanced topics in probabilistic clustering: "mixed membership"
models, and their roles in modeling text ("topic models", "latent
Dirichlet allocation") and genetics. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, sections 9.2.4--9.2.5 (again), and section 9.6
- Suggested reading:
- Background readings:
- April Galyardt, "Interpreting Mixed Membership", in Edoardo M. Airoldi, David Blei, Elena A. Erosheva, Stephen E. Fienberg (eds.), Handbook of Mixed Membership Models and Their Applications (New York: Chapman and Hall, 2016) [PDF]
- David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993--1022
- J. K. Pritchard, M. Stephens and P. Donnelly, "Inference of population structure using multilocus genotype data", Genetics 155 (2000): 945--959
- Homework
- Homework 8 ("graded survey") due at 10 pm
- Homework 9: Assignment
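To show what the E and M steps actually compute, here is a bare-bones EM algorithm for a two-component Gaussian mixture in one dimension, in base R; the data, starting values, and number of iterations are invented for illustration (add-on packages such as mclust do this, and much more, for real work).

```r
set.seed(36462)
x <- c(rnorm(150, mean = 0), rnorm(100, mean = 4))   # two components, mixed 60/40

p <- 0.5; mu <- c(-1, 1); sigma <- c(1, 1)           # crude starting guesses
for (iter in 1:200) {
  # E step: responsibility = posterior probability each point came from component 2
  d1 <- (1 - p) * dnorm(x, mu[1], sigma[1])
  d2 <- p * dnorm(x, mu[2], sigma[2])
  r <- d2 / (d1 + d2)
  # M step: re-estimate the parameters, weighting points by their responsibilities
  p <- mean(r)
  mu <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
             sqrt(weighted.mean((x - mu[2])^2, r)))
}
c(mixing = p, means = mu, sds = sigma)   # should land near 0.4, (0, 4), (1, 1)
```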
Lecture 18 (Tuesday, 24 March):
How big are the errors of our predictive models?
- Error measures or "loss functions": squared error for regression,
mis-labeling ("0/1 loss") for classification, log-likelihood for probabilities.
Why we should never evaluate predictive models in-sample. (Review of why
R-squared is useless and/or misleading.) Simple (too simple?) formulas for
estimating out-of-sample prediction error. The concept of effective
degrees of freedom. Other measures of model complexity. (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 7
- Suggested readings:
- Berk, section 1.6 (skipping 1.6.7, "Basis Functions" and 1.6.8, "The Curse of Dimensionality")
- Hastie et al., chapter 7
- Background readings:
- Notes (.Rmd)
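A short base-R sketch of the central point of this lecture: in-sample error keeps falling as model complexity grows, while error on fresh data does not. The polynomial-regression setup is invented for illustration; the last line checks the "effective degrees of freedom = trace of the hat matrix" fact for a linear model.

```r
set.seed(36462)
n <- 100
x <- runif(n, -2, 2);      y <- sin(x) + rnorm(n, sd = 0.5)
x.new <- runif(n, -2, 2);  y.new <- sin(x.new) + rnorm(n, sd = 0.5)   # fresh data

errs <- t(sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(degree = d,
    in.sample = mean(residuals(fit)^2),
    out.of.sample = mean((y.new - predict(fit, newdata = data.frame(x = x.new)))^2))
}))
errs   # in-sample error always decreases with degree; out-of-sample error does not

sum(hatvalues(lm(y ~ poly(x, 5))))   # trace of the hat matrix = 6 = number of coefficients
```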
Lecture 19 (Thursday, 26 March):
What kinds of errors do our predictive models make?
- Regression diagnostics, focusing on prediction error: over- and under- prediction, non-constant variance ("heteroskedasticity"). Errors for classifiers: false positives versus false negatives; the trade-off between false positives and false negatives, and the "receiver operating characteristic" (ROC) curve. "Calibration" of probability models; "proper scoring rules". (See the short R sketch at the end of this entry.)
- Reading
- Key reading: PDM, chapter 10
- Suggested readings:
- Hastie et al., section 9.2.5
- Background readings:
- Tilmann Gneiting, Fadoua Balabdaoui and Adrian E. Raftery,
"Probabilistic Forecasts, Calibration and Sharpness", Journal of the Royal Statistical Society B 69 (2007): 243--268
- Deborah G. Mayo and Aris Spanos, "Methodology in Practice: Statistical Misspecification Testing", Philosophy of Science 71 (2004): 1007--1025
- Homework
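A base-R sketch of building an ROC curve by hand, sweeping the classification threshold across the observed scores; the labels and scores are simulated for illustration.

```r
set.seed(36462)
n <- 500
y <- rbinom(n, 1, 0.4)                 # true labels
score <- rnorm(n, mean = 1.5 * y)      # higher scores, on average, for the positive class

roc <- t(sapply(sort(unique(score)), function(thresh) {
  yhat <- as.numeric(score >= thresh)
  c(fpr = mean(yhat[y == 0] == 1),     # false positive rate at this threshold
    tpr = mean(yhat[y == 1] == 1))     # true positive rate at this threshold
}))
plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)                  # the curve of a useless (random) classifier
```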
Saturday, 28 March
Homework 9 due, along with online reading questions (an extension)
Lecture 20 (Tuesday, 31 March):
Cross-validation
- Splitting the sample to estimate generalization error. Cross-validation:
"v-fold" (or "k-fold") and "leave-one-out". Short-cut formulas for LOOCV.
- Readings
- Key reading: PDM, chapter 10, especially sections 10.6 and 10.10
- Suggested reading:
- Hastie et al., section 7.10
- Background reading:
- Notes: Slides (.Rmd source file)
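A base-R check of the leave-one-out short-cut for linear smoothers: for OLS, the LOOCV error can be computed from a single fit via the hat-matrix diagonal, and it agrees with the brute-force version that refits n times. The simulated data are invented for illustration.

```r
set.seed(36462)
n <- 100
x <- runif(n); y <- 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# short-cut: leave-one-out residual_i = residual_i / (1 - H_ii)
loocv.shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# brute force: actually refit n times, each time leaving one point out
loocv.brute <- mean(sapply(1:n, function(i) {
  f <- lm(y ~ x, subset = -i)
  (y[i] - predict(f, newdata = data.frame(x = x[i])))^2
}))

c(shortcut = loocv.shortcut, brute.force = loocv.brute)   # agree up to rounding
```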
Lecture 21 (Thursday, 2 April):
Bootstrapping
- The concept of the bootstrap (again): imitating generalizing from a sample
to a whole population by re-sampling the data. Bootstrap estimates of
prediction error. Bootstrap confidence intervals. Bootstrap model
averaging (again). (See the short R sketch at the end of this entry.)
- Reading
- Suggested reading: Hastie et al., sec. 7.11
- Homework
- Slides (.Rmd)
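A minimal base-R sketch of the nonparametric bootstrap: resample the data with replacement, recompute the statistic, and use the spread of the recomputed values for standard errors and simple confidence intervals. The sample and the choice of the median are invented for illustration.

```r
set.seed(36462)
x <- rexp(80, rate = 1/3)        # a made-up skewed sample
theta.hat <- median(x)           # the statistic we care about

B <- 2000
boot.stats <- replicate(B, median(sample(x, replace = TRUE)))
sd(boot.stats)                            # bootstrap standard error
quantile(boot.stats, c(0.025, 0.975))     # simple percentile 95% confidence interval
```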
Lecture 22 (Tuesday, 7 April):
Recommender systems I: Broad Ideas and Technical Issues
- The idea of a recommender system. Best-seller
lists. Nearest-neighbor-based approaches (whether based on users or on items).
The Netflix problem, singular value decomposition, and PCA/factor-model style
approaches. Recommendations using social graphs. Missing-data issues. (See the short R sketch at the end of this entry.)
- Readings
- Key readings:
- Suggested reading:
- Background readings:
- Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of k-nearest neighbor collaborative recommendation", Annals of Statistics
38 (2010): 1568--1592, arxiv:1010.0499
- Maurizio Ferrari Dacrema, Paolo Cremonesi and Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", in Proceedings of the 13th ACM Conference on Recommender Systems [RecSys 2019],
arxiv:1907.06902
- Benjamin Marlin, Richard S. Zemel, Sam Roweis, Malcolm Slaney, "Collaborative Filtering and the Missing at Random Assumption", arxiv:1206.5267
- Notes (.Rmd)
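A factor-model-style toy example in base R: approximate a (fully observed, simulated) user-by-movie ratings matrix by a rank-2 SVD and treat the reconstruction as predicted ratings. Real recommender data are mostly missing, which plain svd() does not handle; the sizes and rank here are invented for illustration.

```r
set.seed(36462)
users <- matrix(rnorm(100 * 2), ncol = 2)        # latent user traits
movies <- matrix(rnorm(2 * 20), nrow = 2)        # latent movie traits
ratings <- users %*% movies + rnorm(100 * 20, sd = 0.5)   # 100 users x 20 movies

s <- svd(ratings)
k <- 2
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # best rank-k approximation
mean((ratings - approx)^2)     # how well two latent factors reproduce the matrix
approx[1, 1:5]                 # "predicted" ratings for user 1 on movies 1 through 5
```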
Lecture 23 (Thursday, 9 April):
Recommender systems II: What's Not to Love?
- Homogenization; paradoxes of diversification; rabbit holes and echo chambers; discrimination; what does the system optimize for; when inferior recommendations are more profitable; natural monopoly (or oligopoly). Also: do recommendations do anything at all?
- Readings
- Suggested reading (in this order):
- "Amazon.com Recommendations Understand Area Woman Better Than Husband", The Onion 9 January 2007
- Tom Slee, "Online Monoculture and the End of the Niche", Whimsley 15 March 2009
- Tommy Wallach, "In Which Netflix Achieves Sentience As A Result of My Terrible Decision-making", The Toast, 4 August 2014
- Zeynep Tufekci, "YouTube's Recommendation Algorithm Has a Dark Side",
Scientific American 320:4 (April 2019): 77
- Zeynep Tufekci, "YouTube, the Great Radicalizer", New York Times, 10 March 2018
- Notes (.Rmd)
- Homework
- Homework 11 due at 10 pm
- Homework 12: CANCELLED
Lecture 24 (Tuesday, 14 April):
Information retrieval
- Finding data by content. Approaches we will neglect: metadata; "coding" or
"content analysis". Textual features. Vector representations of documents:
the bag-of-words representation; word2vec. Measures of similarity and
the importance of normalization; inverse document frequency and other tweaks to
the representation. Example with
the New
York Times Annotated Corpus. The trick to searching: queries are documents and can be represented in the
same feature space. Search evaluation: precision, recall, precision-recall
curves; error rates. Incorporating user feedback. Proxies for relevance like
"engagement". Abstraction and generalizing to other
types of data. (See the short R sketch at the end of this entry.)
- Readings
- Key reading: PDM, chapter 14
- Background reading:
- Marco Baroni, Georgiana Dinu and German Kruszewski, "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
- Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
- Kian Kenyon-Dean, "Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form", arxiv:1911.02639
- Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama, "A Mutual Information Maximization Perspective of Language Representation Learning", arxiv:1910.08350
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
- Notes (.Rmd)
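A bag-of-words retrieval sketch in base R: represent documents and the query as word-count vectors, down-weight common words by inverse document frequency, and rank documents by cosine similarity to the query. The three-"document" corpus, the query, and the helper names (tokenize, bow, cosine) are all invented for illustration.

```r
docs <- c(a = "the cat sat on the mat",
          b = "the dog chased the cat",
          c = "stock prices fell on monday")
query <- "cat on a mat"

tokenize <- function(s) strsplit(tolower(s), "[^a-z]+")[[1]]
vocab <- unique(unlist(lapply(c(docs, query), tokenize)))
bow <- function(s) as.numeric(table(factor(tokenize(s), levels = vocab)))

X <- sapply(docs, bow)                        # vocabulary x document count matrix
idf <- log(length(docs) / rowSums(X > 0))     # rarer words get larger weights
idf[!is.finite(idf)] <- 0                     # words appearing in no document
Xw <- X * idf                                 # re-weight documents and query alike
q <- bow(query) * idf

cosine <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
sort(apply(Xw, 2, cosine, v = q), decreasing = TRUE)   # documents ranked against the query
```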
Thursday, 16 April: OPTIONAL lecture on epidemic modeling
Originally, there wasn't supposed to be any class this day because of
Carnival. The university now says it's a day for instruction, but I promised
you wouldn't have to do anything, so we'll cover something that is,
strictly speaking, outside the scope of the course, but statistical
and topical, and make it completely optional.
- The basic "susceptible-infectious-removed" (SIR) epidemic model. The
probability model and its deterministic limit. The idea of the "basic
reproductive number" R0 and how it relates to the rates of transmission and
removal. Why diseases do not necessarily evolve to be less lethal to their
hosts. The epidemic threshold when R0=1.
Complications: gaps between being infected and becoming infectious; the
possibility of being infectious without showing symptoms; re-infection.
Epidemics in social networks, and how network structure affects the epidemic
threshold; why high-degree people tend to be among the first infected, and
disease-control strategies based on "destroying the hubs". Statistical issues
in connecting epidemic models to data. (See the short R sketch at the end of this entry.)
- Reading:
- Slides (.Rmd), revised after the lecture to fix typos, amplify points, incorporate things that came up in discussion, etc.
- Homework
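A base-R sketch of the deterministic SIR dynamics, stepped forward by simple Euler integration; the transmission and removal rates are arbitrary illustrations, not estimates for any real disease.

```r
sir <- function(beta, gamma, s0 = 0.999, i0 = 0.001, days = 180, dt = 0.1) {
  steps <- ceiling(days / dt)
  out <- matrix(NA, nrow = steps + 1, ncol = 3,
                dimnames = list(NULL, c("S", "I", "R")))
  out[1, ] <- c(s0, i0, 1 - s0 - i0)
  for (t in 1:steps) {
    S <- out[t, "S"]; I <- out[t, "I"]
    new.infections <- beta * S * I * dt    # dS/dt = -beta S I
    new.removals <- gamma * I * dt         # dR/dt = gamma I
    out[t + 1, ] <- out[t, ] + c(-new.infections,
                                 new.infections - new.removals,
                                 new.removals)
  }
  out
}

epi <- sir(beta = 0.5, gamma = 0.2)   # basic reproductive number R0 = beta/gamma = 2.5
max(epi[, "I"])                       # peak fraction infectious
plot(seq(0, 180, by = 0.1), epi[, "I"], type = "l",
     xlab = "day", ylab = "fraction infectious")
```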
Lecture 25 (Tuesday, 21 April):
Fairness in prediction, especially classification
- Notions of "fair" prediction: not using "protected" features; parity of
error rates across groups; calibration of prediction. "Inference" issues:
using features that carry information about protected features. Trade-offs
between different forms of fairness. Trade-offs between forms of fairness and
accuracy. Some critiques of these notions of "fairness".
- Readings
- Key readings:
- WMD, chapters 1, 5, 6 and 8
- TEA, chapter 2
- Suggested readings:
- Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, "Machine Bias", ProPublica 23 May 2016
- Alexandra Chouldechova, "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments", arxiv:1610.07524
- Sam Corbett-Davies and Sharad Goel, "The Measure and
Mismeasure of Fairness: A Critical Review of Fair Machine
Learning", arxiv:1808.00023
- Simon DeDeo, "Wrong side of the tracks: Big Data and Protected Categories", pp. 31--42 in Cassidy R. Sugimoto, Hamid R. Ekbia and Michael Mattioli (eds.), Big Data Is Not a Monolith (Cambridge, Mass.: MIT Press, 2016), arxiv:1412.4643
- Sharad Goel, Jake M. Hofman, M. Irmak Sirer, "Who Does What on the Web: A Large-Scale Study of Browsing Behavior", Sixth International AAAI Conference on Weblogs and Social Media [ICWSM 2012] [Preprint via Prof. Goel]
- Background reading:
- Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan, "A Survey on Bias and Fairness in Machine Learning", arxiv:1908.09635
- Notes (.Rmd)
Lecture 26 (Thursday, 23 April):
Waste, fraud and abuse I: when is our data-mining project going to be a technical failure?
- Weak or non-existent predictive
relationships. Changing predictive relationships. Changing distributions. The curse of dimensionality.
- Readings
- Key readings:
- PDM, chapter 2, section 6.5, section 1.7
- Suggested readings:
- Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired 16:17 (June 2008)
- Berk, chapter 9
- Jesse Frederik and Maurits Martijn, "The new dot com bubble is here: it’s called online advertising", The Correspondent 2 November 2019 [While a popular article, this is well-sourced in serious research, which it links to]
- Andrew Gelman and David Weakliem, "Of Beauty, Sex and Power",
American Scientist 97 (2009): 310
- Fernando Pereira, comments on Anderson, Earning My Turns, 23 June 2008
- Christian Grothoff and J.M. Porup, "The NSA’s SKYNET program may be killing thousands of innocent people", Ars Technica 16 February 2016
- Notes (.Rmd)
Saturday, 25 April
Homework
Lecture 27 (Tuesday, 28 April):
Waste, Fraud, and Abuse II: More sources of technical failure
- Bad data: errors in data vs. bad measurements, measuring the wrong thing.
Not accounting for model search and/or model selection; sample splitting as a
cure for post-selection inference. The "garden of forking paths" and "researcher degrees of freedom". Prediction vs. causality; taking credit
for what would have happened anyway.
- Readings
- Key reading: TEA, chapter 4
- Suggested readings:
- Background readings
- Jorge Luis Borges, "The Garden of Forking Paths" (as "El jardín de senderos que se bifurcan", 1941; translated by Anthony Kerrigan in Ficciones [New York: Grove Press, 1962] and by Anthony Hurley in Collected Fictions [New York: Viking, 1998])
- Howard Becker, Evidence (Chicago: University of Chicago Press, 2017)
- Luke Oakden-Rayner, "Exploring large scale public medical image datasets", arxiv:1907.12720
- Amit Sharma, Jake M. Hofman and Duncan J. Watts, "Estimating the Causal Impact of Recommendation Systems from Observational Data", pp. 453--470 in Proceedings of the Sixteenth ACM Conference on Economics and Computation [EC '15], arxiv:1510.05569
Lecture 28 (Thursday, 30 April):
Waste, fraud and abuse III: what are we doing to ourselves? should we be doing this?
- Privacy and surveillance. Rights-violating applications of data mining.
Legal (probably) but still destructive applications. Is it any better if it doesn't work (well)? Some ethical check-list questions: who benefits? who is harmed? how much? how surely? are you OK with being responsible for that?
- Readings
- Key readings:
- WMD, Conclusion
- TEA, chapters 1 and 5
- Suggested readings:
- Maciej Ceglowski, "What Happens Next Will Amaze You"
- Henry Farrell, "Seeing Like a Finite-State Machine", Crooked Timber 25 November 2019
- Kashmir Hill, "The Secretive Company That Might End Privacy As We Know It", New York Times 19 January 2020 [published in the print edition as "A Facial Recognition App That Puts Privacy in Peril", p. 1]
See also:
- Betsy Swan, "Facial-Recognition Company That Works With Law Enforcement Says Entire Client List Was Stolen", Daily Beast 26 February 2020
- Stuart A. Thompson and Charlie Warzel, "Twelve Million Phones, One Dataset, Zero Privacy", New York Times 19 December 2019
- Zeynep Tufekci, "We're building a dystopia just to make people click on ads", TED
- Background readings:
- Francis Spufford, Red Plenty (London: Faber and Faber, 2010) [About the first time brilliant and well-intentioned people tried to use data, computers and optimization to make the world a better place]
- Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society (Boston: Houghton Mifflin, 1950)
Friday, 1 May
Homework 14 due at 10 pm