36-402, Undergraduate Advanced Data Analysis

Spring 2025

Tuesdays and Thursdays, 9:30--10:50 am, Doherty Hall 2315

The goal of this class is to train you in using statistical models to analyze data --- as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

36-602 A small number of well-prepared graduate students from other departments can take this course by registering for it as 36-602. (Graduate students enrolling in 36-402, or undergraduates enrolling in 36-602, will be automatically dropped from the roster.) If you want to do so, please contact me, to discuss whether you have the necessary preparation.

Prerequisites

36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-602.

Instructors

Professors	Cosma Shalizi	cshalizi [at] cmu.edu
		Baker Hall 229C
Teaching assistants	Kyle Schindl (head)	Not to be bothered by e-mail
	Julian Braganza
	Woonyoung Chang
	Prina Doshi
	Yinan Guo
	Amy Ke
	Lena Liang
	Noelani Phillips
	Akshay Prasadan
	Roochi Shah
	Yuntian (Jack) Shen
	Riley Vetere-Jones
	Ryan Zhang
	Tianyi (Calum) Zhang

Topics

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks

Yet More Regression: what is regression, really?; what ordinary linear regression actually does; what it cannot do; regression by kernel smoothing; regression by spline smoothing; additive models; regression by trees and/or nearest neighbors (time permitting)

Generalized linear and additive models: logistic regression; generalized linear models; generalized additive models.

Distributions, Structure in Distributions, and Latent Variables: Multivariate distributions; factor analysis and latent variables; cluster/mixture models and latent variables; graphical models in general

Causality: graphical causal models; causal inference from randomized experiments; identification of causal effects from observations; estimation of causal effects; discovering causal structure

Dependent data: dependence over time; dependence over space and over space-time; dependence over networks (time and/or space permitting)

See the end of this syllabus for the current lecture schedule, subject to revision. Lecture handouts, slides, etc., will be linked there, as available.

Course Mechanics and Grading

Grades will not go away if you avert your eyes (photo by laurent KB on Flickr)

There are three reasons you will get assignments in this course. In order of decreasing importance:

Practice. Practice is essential to developing the skills you are learning in this class. It also helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows that you only thought you understood it.
Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be two kinds of assignment in this course.

Homework: Almost every week will have a homework assignment. Most homeworks will be divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve doing or applying some theory, analyzing real data sets on the computer, and communicating the results.; Four of the homework assignments will require you to write reports, along the lines of the data-analysis exams you did in 36-401. For those assignments, you will analyze a real-world data set, answering questions about the world (not about statistics) on the basis of your analysis, and write up your findings in the form of a scientific report. These homework assignments will provide the data set, the specific questions, and a rubric for your report. Each of these, however, will receive the same weight in your grade as any other weekly homework.; All homework will be submitted electronically. Most weeks, homework will be due at 6:00 pm on Thursdays. Homework assignments will always be released by Friday of the previous week, and sometimes before. The week of Carnival (April 3rd and 4th), the homework will be due at 6 pm on Wednesday the 2nd, and will be shorter than usual (but count just as much).; There are specific formatting requirements for homework --- see below.
In-class exercises: Most lectures will have in-class exercises. These will be short (10--15 minutes) assignments, emphasizing problem solving, connected to the theme of the lecture and to the current homework. You will do them in class in small groups of at most four people. The assignments will be given out in class, and must be handed in electronically by 6 pm that day. On most days, a randomly-selected group will be asked to present their solution to the class.
Exams: There are no exams in this class.

Time expectations

You should expect to spend 5--7 hours on assignments every week, outside of class, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

Exercises: 10%. All exercises will have equal weight. Your lowest five exercises grades will be replaced by scores of 80/100, no questions asked. (If your lowest scores are above 80, they will not be changed.)
Late exercises will not be accepted for any reason.
Homework: 90%. There will be 13 homeworks, all of equal weight. Your three lowest homework grades will be replaced by scores of 80/100, no questions asked. (If your lowest scores are above 80, they will not be changed!) As mentioned above, four of the weekly homework assignments will ask you to write reports; these will have the same weight as any other weeks.
Late homework will not be accepted for any reason.

Dropping your lowest grades lets you schedule interviews, social engagements, and other non-academic uses of your time flexibly, and without my having to decide what is or is not important enough to change the rules. If something --- illness, family crisis, or anything else --- is a continuing obstacle to your ability to do the work of the class, come talk to me.

Letter-grade thresholds

Grade boundaries will be as follows:

A	[90, 100]
B	[80, 90)
C	[70, 80)
D	[60, 70)
R	< 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve or need a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Lectures

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the graded in-class assignments which will help consolidate your understanding of the material, and help with your homework.

You will generally find it helpful to do the readings before coming to class.

Electronics: Don't. Please don't use electronics during lecture: no laptops, no tablets, no phones, no recording devices, no watches that do more than tell time. The only exception to this rule is for electronic assistive devices, authorized by a written request for accommodations through CMU's Office of Disability Resources.

(The no-electronics rule is not arbitrary meanness on my part. Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn't distracting your neighbors, who aren't as good at multitasking as you are.)

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). In this class, you'll be using it for every homework assignment. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code is not acceptable; your document must be humanly readable.

All homework assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, go ahead.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code (and not some brilliant trick that you then forget, or obscure bug that you didn't notice). For help on using R Markdown, see "Using R Markdown for Class Reports".

Canvas, Gradescope and Piazza

You will submit your work electronically through Gradescope. This includes both your homework assignments and the class exercises. We will use Canvas as a gradebook, to distribute solutions, and to distribute some readings that can't be publicly posted here.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. (Postings will not be anonymous to instructors.) Anonymity will go away for everyone if it is abused. You should not expect the instructors to answer questions on Piazza (or by e-mail) outside normal working hours. (We may, but you shouldn't expect it.)

Format Requirements for Homework

For each assignment, you should write your homework in R Markdown, knit it to a humanly readable document in PDF format, and upload the PDF to Gradescope. This is, ordinarily, what you will be graded on. However, it is important that you keep the R Markdown file around, because every week I will randomly select a few students and ask them to send me your R Markdown so that I can re-run it, and check that it does, in fact, produce the file you turned in. (You should expect to be picked about once a semester.) If they don't match, I will have questions, and it will hurt your grade.

(Gradescope makes it much easier for multiple graders to collaborate, but it doesn't understand R Markdown files, just PDFs.)

You'll need to do math for some homework problems. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the last homework.

--- For the class exercises, scans/photos of hand-written math are acceptable, but you will lose points if they are hard to read. (Dark ink on unlined paper tends to work best.)

Solutions

Solutions for all homework and in-class exercise will be available, after their due date, through Canvas. Please don't share them with anyone, even after the course has ended. This very much includes uploading them to websites.

Office Hours

Mondays	10:00--11:00	Akshay Prasadan	Piazza
Mondays	2:00--3:00	Lena Liang	Piazza
Mondays	3:00--4:00	Amy Ke	Wean Hall 3509
	—	—
Tuesdays	11:00--12:00	Akshay Prasada	Wean Hall 3509
Tuesdays	12:00--1:00	Roochi Shah	Piazza
Tuesdays	2:00--3:00	Woonyoung Chang	Piazza
Tuesdays	4:00--5:00	Yinan Guo	Wean Hall 3509
	—	—
Wednesdays	9:00--10:00	Kyle Schindl	Piazza
Wednesdays	2:00--3:00	Lena Liang	Wean Hall 3509
Wednesdays	3:00--4:00	Roochi Shah	Wean Hall 3509
Wednesdays	3:00--4:00	Amy Ke	Piazza
	—	—
Thursdays	11:30--12:30	Kyle Schindl	Wean Hall 3509
Thursdays	2:00--3:00	Woonyoung Chang	Wean Hall 3509
Thursdays	3:15--4:15	Julian Braganza	Piazza
	—	—
Fridays	3:00--4:00	Julian Braganza	Wean Hall 3509
Fridays	4:00--5:00	Yinan Guo	Piazza

Piazza office hours aren't (necessarily) the only time we'll answer questions on Piazza, but they are when someone is online to clear any backlog of accumulated questions, and for rapid follow-ups if there are further questions.

If you want help with computing at in-person office hours, please bring a laptop.

If you cannot make the regular office hours, or have concerns you'd rather discuss privately, please e-mail me about making an appointment.

Textbook

The primary textbook for the course will be the draft Advanced Data Analysis from an Elementary Point of View. Chapters will be linked to here as they become needed. Reading these chapters will very greatly help your ability to do the assignments.

In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7) is strongly suggested as a reference.

Cox and Donnelly, Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8); Faraway, Extending the Linear Model with R (Chapman Hall/CRC Press, 2006, ISBN 978-1-58488-424-8; errata); and Venables and Ripley, Modern Applied Statistics with S (Springer, 2003; ISBN 9780387954578) will be optional.

Collaboration, Cheating and Plagiarism

Cheating leads to desolation and ruin (photo by paddyjoe on Flickr)

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. Websites which compile course materials are not legitimate online resources. Neither are large language models (ChatGPT, etc.).

In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students or the instructors until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course in previous years is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

You must complete "homework 0" on the content of the university's policy on academic integrity, and on these course policies. This assignment will not factor into your grade, but you must complete it, with a grade of at least 80%, before you can get any credit for any other assignment.

No AI

As stated above, you cannot use any generative AI tools for any assignment in this class: not for drafting or brainstorming, not for writing, not for coding, not for getting started or for getting un-stuck, nothing. Using one will be treated like plagiarism. Spelling and grammar checkers are OK.

Rationale: Generative tools, such as ChatGPT or CoPilot, are sometimes useful, if you already know enough about a topic to check and correct their output. You do not know enough about the topics in this class to reliably use them here. Worse, if you rely on them, instead of doing the assignments on your own, they will actively keep you from learning.

Accommodations for Students with Disabilities

If you need accommodations for physical and/or learning disabilities, please contact the Office of Disability Resources, via their website http://www.cmu.edu/disability-resources. They will help you work out an official written accommodation plan, and help coordinate with me.

Other Iterations of the Class

Some material is available from versions of this class taught in other years. As stated above, any use of solutions provided in earlier years is not only cheating, it is very easily detected cheating.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

Identifying significant features from background (photo by Gord McKenna on Flickr)

Current revision of the complete textbook

January 14 (Tuesday): Introduction to the class; regression

Regression is about guessing the value of a quantitative, numerical random variable $ Y $ from one or more other variables $ X $ (which may or may not be numerical). Because "guess" sounds undignified, we say that we are making a point prediction (as opposed to an interval prediction or distributional prediction). This means that our guess is a numerically-valued function of $ X $, say $ m(X) $. Traditionally, we measure the quality of the prediction by the expected squared error, $ \Expect{(Y-m(X))^2} $. Doing so leads to a unique choice for the optimal or true regression function: $ \mu(x) \equiv \Expect{Y|X=x} $, the conditional expectation function. We can always say that $ Y = \mu(X) + \epsilon $, where the noise $ \epsilon $ around the regression function has expectation 0, $ \Expect{\epsilon|X=x} = 0 $. Calculating the true regression function would require knowing the true conditional distribution of $ Y $ given $ X $, $ p(y|x) $. Instead of having that distribution, though, as statisticians we just have data, $ (x_1, y_1), (x_2, y_2), \ldots (x_n, y_n) $. We use that data to come up with an estimate of the regression function, $ \hat{\mu} $. Because our data are random, our estimate is also random, so $ \hat{\mu} $ has some distribution. This distribution matters for the error of our estimate, via the bias-variance decomposition: $ \Expect{(Y-\hat{\mu}(X))^2|X=x} = \Var{\epsilon|X=x} + (\mu(x) - \Expect{\hat{\mu}(x)})^2 + \Var{\hat{\mu}(x)} $.
Reading:: Chapter 1 of the textbook (R code from that chapter); CMU's policy on academic integrity; This course's policy on collaboration, cheating and plagiarism (above); Excerpt from Turabian's A Manual for Writers (on Canvas)
Optional reading:: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
Homework 0 (on collaboration and plagiarism) assigned

January 16 (Thursday): Linear smoothers, and the truth about linear regression

In practice, almost all ways of estimating the regression function are examples of linear smoothers, where $ \hat{\mu}(x) = \sum_{j=1}^{n}{w(x, x_j) y_j} $. Here the weights $ w(x, x_j) $ are some way of saying "How similar is the data point $x_j $ to the place where we're trying to make a prediction $ x $?", where we (usually) want to give more weight to values $ y_j $ from places $ x_j $ which are similar to $ x $. Examples of this scheme include the nearest neighbor method, the $ k $-nearest-neighbor method, kernel smoothing, and linear regression. (The weights in linear regression are very weird and only make sense if we insist on smoothing the data on to a straight line, no matter what.) For all linear smoothers, many properties of the fitted values can be read off from the weight, influence or hat matrix $ \mathbf{w} $, defined by $ w_{ij} = w(x_i, x_j) $, just as we learned to use the hat matrix in linear regression.
If we decide to predict $ Y $ as a linear function of a scalar $ X $, so $m (X) = b_0 + X b_1 $, there is a unique optimal linear predictor: $ m(X) = \Expect{Y} + (X-\Expect{X}) \frac{\Cov{X,Y}}{\Var{X}} $. That is, the optimal slope of the simple linear regression is $ \beta_1 = \frac{\Cov{X,Y}}{\Var{X}} $, and the optimal intercept makes sure the regression line goes through the means, $ \beta_0 = \Expect{Y} - \Expect{X} \beta_1 $. This generalizes to higher dimensions: if we decide to linearly predict $ Y $ from a vector $ \vec{X} $, the vector of optimal slopes is $ \vec{\beta} = \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} $, and the intercept is $ \beta_0 = \Expect{Y} - \Expect{\vec{X}} \cdot \vec{\beta} $, so the optimal linear predictor again has the form $ \Expect{Y} + (\vec{X} - \Expect{\vec{X}}) \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} $. This is where the strange-looking form of the weights in the hat matrix come from.
All this is true whether or not the real relationships are linear, whether there are any Gaussian distributions anywhere, etc., etc. Ordinary least squares will give consistent estimates for $ \beta_0 $ and $ \vec{\beta} $ under a very wide range of circumstances, where none of the usual assumptions from a linear-regression course apply. The place those assumptions become important is in the typical calculations of statistical significance, confidence intervals, prediction intervals, etc. However, it is important to understand that "this coefficient is statistically significant" really means "a linear model where there's a slope on this variable fits better than one which is flat on that variable, and the difference is too big to come from noise unless we're really unlucky". Statistical significance thus runs together actual effect magnitude, precision of measurement, and sample size. Lots of the other stuff people would like linear regression to do, and which many non-statisticians think it can do, can also be seen to be myths.
Reading:: Chapter 2 (R code from that chapter); Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
Optional reading: Faraway, rest of chapter 1
Homework 1 assigned

January 21 (Tuesday): Evaluation of Models: Error and inference

When we use statistical models to make predictions, we don't really care about their performance on our data. Rather, we want them to fit well to new cases, at least if those new data points followed the same distribution as the old. Expected error (or "loss") on new data is sometimes called risk; we want low risk. The difficulty is that we adjust our models to fit the training data, so in-sample performance on that training data is a biased estimate of risk. For regression models, when we use squared error as our loss function, we can relate this optimism bias to the covariance between data points and fitted values, $ \Cov{Y_i, \hat{\mu}(X_i)} $, but the issue is much more general. This becomes particularly important when we're comparing different models to select one of them: more flexible models will seem to fit the data better, but are they actually learning subtle-yet-systematic patterns, or just memorizing noise?
The cleanest way to get a good estimate of the risk would be to evaluate all our models on a new, independent testing set. If we don't have an independent testing set, we can try splitting our data (randomly) into training and testing sets. This introduces some extra randomness, which we can partially average away by using each half of the data in turn as the testing set, and averaging. We can often do even better by $ k $-fold cross-validation, dividing the data into $ k $ "folds", where each fold is used in turn as the testing set (and the other $ k-1 $ folds together are the training set). For some purposes, $ n $-fold or leave-one-out cross-validation is even better, but it comes at a high computational cost. There are however short-cuts for leave-one-out cross-validation for linear smoothers. The famous Akaike information criterion (AIC) is another approximate hack for leave-one-out CV.
Model selection, whether by cross-validation or some other means, makes our choice of model dependent on the data, hence (partially) random. The usual formulas for inferential statistics (p-values, confidence intervals, etc.) do not include this extra randomness. Statistical inference after model selection therefore requires extra work; the most straightforward approach is just data splitting again.
Reading: Chapter 3 (R code from that chapter; R code for re-usable functions from that chapter, including cv.lm(), without examples/demos)
Optional reading: Cox and Donnelly, ch. 6

January 23 (Thursday): Smoothing methods for regression

Kernel smoothing, a.k.a. Nadaraya-Watson smoothing, is yet another linear smoother. For one-dimensional $ x $, We make predictions using $ \widehat{\mu}(x) = \sum_{j=1}^{n}{y_j \frac{K\left( \frac{x - x_j}{h} \right)}{\sum_{k=1}^{n}{K\left( \frac{x - x_k}{h}\right)}}} $. Here $ K(u) $ is the kernel function, which is non-negative and maximized when $ u = 0 $; when we do $ K(x - x_j $, we're using the kernel function to measure how similar the data point $ x_j $ is to the place $ x $ where we're making a prediction. We usually also require that $ K(u) $ be a probability density with expectation zero and finite variance ( $ \int_{-\infty}^{\infty}{K(u) du} = 1 $, $ \int_{-\infty}^{\infty}{u K(u) du} = 0 $, $ \int_{-\infty}^{\infty}{u^2 K(u) du} < \infty $ ). These restrictions ensure that when $ x_j $ is very far away from our point of interest $ x $, $ y_j $ gets comparatively little weight in the prediction. Dividing by the sum of the kernel factors makes sure that our prediction is always a weighted average. Finally, we need some sense of what counts as "near" or "far", a distance scale --- this is provided by the $ h $ in the denominator inside the kernel functions, called (for historical reasons) the bandwidth. Very roughly speaking, $ y_j $ should get a substantial weight in the prediction of $ \widehat{\mu}(x) $ if, and only if, $ x_j $ is within a few factors of $ h $ of $ x $.
Intuitively, a big bandwidth means every prediction averages over a lot of data points which are widely spread. It should thus give a very smooth function, estimated with low variance,but (potentially) high bias. On the other hand, using a very small bandwidth makes all the predictions into very local averages, based only a few data points each --- they will be less smooth, and have higher variance, but also less bias. Because kernel smoothing gives us a weighted average, we can analyze its bias and variance comparatively easily. The variance comes from the fact that we are only averaging a limited number of $ y_j $ values in each prediction, where the corresponding $ x_j $ are close to the operating point $ x $. The number of such points should be about $ 2 n h p(x) $, $ p $ being the pdf of $ X $, and the variance should be inversely proportional to that. The bias comes from the fact that $ \mu(x_j) \neq \mu(x) $. Assuming the real regression function $ \mu $ is sufficiently smooth, we can do some Taylor expansions to get that the bias is $ \propto h^2 $. (One part of the bias comes from $ \mu(x) $ having a non-zero slope and $ p(x) $ also having a non-zero slope, so the $ x_j $ tend to be on one side or the other of $ x $, pulling the average up or down. The other part of the bias comes from $ \mu(x) $ having non-zero curvature, pulling the average up or down regardless of the distribution of the $ x_j $. Both contributions end up being $ \propto h^2 $.) This lets us conclude that the optimal bandwidth $ h_{opt} \propto n^{-1/5} $, and that the expected squared error of kernel regression $ = \Var{\epsilon} + O(n^{-4/5}) $. That is, the excess expected squared error, compared to the Oracle who knows the true regression function, is only $ O(n^{-4/5}) $. Notice that the optimal bandwidth $ h $ changes with the sample size, and goes to zero. Bandwidths are not parameters.
In practice, we find the best bandwidth by cross-validation, which is why this lecture comes after the previous one.
All of the above analysis is for $ X $ one dimensional. If $ X $ has multiple dimensions, the usual approach is to multiply kernel functions together, each with their own bandwidth. This changes the error analysis in ways we will come back to later in the course. (The bias stays the same, but the variance blows up.) We can also incorporate qualitative variables (factors in R) by appropriate kernels.
Reading: Chapter 4 (R code from that chapter)
Optional reading:: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
Homework 0 due (at 6 pm)
Homework 1 due (at 6 pm)
Homework 2 assigned (uval.csv data file)

January 28 (Tuesday): Smoothing continued

Previous lecture continued; same readings.

January 30 (Thursday): Simulation and the Bootstrap

When we have complicated probabilistic models, it is often very hard, if not impossible, to come up with nice, short formulas describing their behavior. A good probabilistic model is, however, a systematic, step-by-step account of how to generate a data set, or something shaped like a data set. (There is some dice-rolling and coin-tossing involved, because the model is probabilistic, but the model is very precise about which dice to toss when, and what to do with the results.) This gives us an alternative to analytical calculations: simulate data from the model, and see what it looks like. We usually need to run the simulations multiple times, to get an idea of the distribution of outcomes the model can generate, but once we can figure out how to do one simulation, we can get the computer to do it for us over and over. Analytical probability formulas should in fact be seen as short-cuts which summarize what exhaustive simulations would also tell us. Those short-cuts are useful when we can find them, but we do not have to rely on them.
Data-splitting and cross-validation can be seen as examples of simulation, specifically simulating the process of generalizing to new data.
Our data are random; consequently everything we calculate from those data are random; consequently any conclusions we draw based on analysis of our data are (at least somewhat) random. If we could repeat the experiment (or survey, or re-run the observational study), we would get more or less different data, make more or less different calculations, draw more or less different inferences. The distribution of calculations or inferences we would see under repetitions of the experiment is called the sampling distribution, and is the source of our knowledge about uncertainty --- about standard errors, biases, confidence intervals, etc. For a few models, under very strong probabilistic assumptions, we can work out sampling distributions analytically. Away from those special cases, we need other techniques. By far the most important technique is called the bootstrap: come up with a good estimate of the data-generating process; simulate new data sets from the estimate; repeat the analysis on the simulation output; use the simulation distribution as an approximation to the true, unknown sampling distribution. This simulation can be done in two ways. One is to just estimate a model, and then simulate from the estimated model. The other uses the empirical distribution of the data points as an approximation to the data-generating distribution, and simulates from that, which amounts to "resampling" data points with replacement. Hybrids between these two approaches are possible for regression models (e.g., estimating a model of the shape of the regression, but then resampling the residuals around that curve). While more bootstrap simulations are always better, all else being equal, they are subject to diminishing returns, and we can think about how few we really need for particular applications.
Bootstrapping is a fundamental technique for quantifying uncertainty in modern statistics, and you will get lots of practice with it. There are nonetheless some things which bootstrapping does poorly. These are, unsurprisingly, situations where even a small discrepancy in the distribution we simulate from leads to a large error in our conclusions, or a large change in the sampling distribution.
Reading: Chapters 5 and 6: R for chapter 5; R for chapter 6; R for chapter 6, limited to reusable functions (i.e., omitting examples/demos)
Optional readings:: Appendix on writing R code; Handout on approximating probabilities and expectations by repeated simulation ("Monte Carlo"): PDF, .Rmd; Cox and Donnelly, chapter 8
Homework 2 due (at 6:00 pm)
Homework 3 assigned (stock_history.csv data file); help files for particular questions: Q6b, Q6c, and Q9a. Using these will lead to partial credit for those questions (but not affect your score on other questions).

February 4 (Tuesday): Splines

A "spline" was originally a tool used by craftsmen to draw curves: pin a thin, flexible board or strip of material down in a few points and let it bend to give a smooth curve joining those points; the stiffer the material, the less it curved. Today, the smoothing spline problem is find the one-dimensional function which balances between coming close to given data points, versus having low average curvature. Specifically, we seek the function which minimizes $ \frac{1}{n}\sum_{i=1}^{n}{(y_i - m(x_i))^2} + \lambda \int_{-\infty}^{\infty}{\left( \frac{d^2m}{dx^2}(x)\right)^2 dx} $. Here $ \lambda $ is the penalty factor which tells us how much mean-squared-error we are prepared to trade for a given amount of curvature (corresponding to the stiffness of the spline tool). The function which minimizes this criterion is called the spline. Splines are, it turns out, always piecewise cubic polynomials, with the boundaries between pieces, or knots, located at the $ x_i $; splines are always continuous, with continuous first and second derivatives. All the coefficients can be found efficiently by solving a system of $ n $ linear equations in $ n $ unknowns, and prediction can then be done very rapidly. As $ \lambda \rightarrow 0 $, we get functions that veer around wildly to interpolate between the data points; as $ \lambda \rightarrow \infty $, we get back towards doing ordinary least squares to find the straight line that minimizes the mean-squared error. $ \lambda $ thus directly controls how much we smooth, by penalizing un-smoothness. Low values of $ \lambda $ lead to estimates of the regression function with low bias but high variance; high values of $ \lambda $ have high bias but low variance. If $ \lambda \rightarrow 0 $ as $ n \rightarrow 0 $, at the right rate, the smoothing spline will converge on the true regression function. Because penalized estimation generally corresponds to constrained estimation, and vice versa ("a fine is a price"), we can think of splines found with high values of $ \lambda $ as answering the question "how close can we come to the data, if our function can only have at most so-much curvature?", with the constraint weakening (allowing for more curvature) as $ \lambda $ gets smaller. Consistent estimation requires that the constraint eventually go away, but not too fast.
In higher dimensions, we either generalize the spline problem to account for curvature in multiple dimensions ("thin plate" splines), or piece together the functions we would get from solving multiple one-dimensional spline problems ("tensor product" splines). Or we turn to additive models (next chapter).
Reading: Chapter 7: R for chapter 7

February 6 (Thursday): Multivariate Smoothing and Additive Models

If we estimate a smooth regression function at $ x $ by averaging all the data points found within a distance $ h $ of $ x $, we get a bias that is $ O(h^2) $. The variance of this estimate, meanwhile, is $ O(n^{-1} h^{-d}) $, where $ d $ is the dimension of $ x $. The bias and variance of kernel regression scales the same way, because kernel regression essentially is a kind of local averaging. When we minimize the sum of bias squared and variance, we conclude that the best bandwidth $ h = O(n^{-1/(4+d)}) $. The total error (bias squared plus variance) of the estimate therefore goes to zero like $ O(n^{-4/(4+d)}) $. Local averaging and kernel smoothing are therefore consistent in any number of dimensions --- they converge on the true regression function ---- but they slow down drastically as the number of dimensions grows, the "curse of dimensionality".
This is not just a fact about local averaging and kernel smoothing. Splines, nearest neighbors, and every other universally consistent method will converge at the same rate. The only way to guarantee a faster rate of convergence is to use a method which correctly assumes that the true regression function $ \mu $ has some special, nice structure which it can use. For instance, if the true regression function is linear, the total estimation error of linear regression will be $ O(d/n) $; the general rate of convergence for any parametric model will be $ O(1/n) $. However, if that assumption is false, then a method relying on that assumption will converge rapidly to a systematically-wrong answer, so its error will be $ O(1) + O(1/n) $.
Additive models are an important compromise, for dealing with many variables in a not-fully-nonparametric manner. The model is $ \mu(x) = \alpha + \sum_{j=1}^{d}{f_j(x_j)} $, so each coordinate $ x_j $ of $ x $ makes a separate contribution to the regression function via its partial response function $ f_j $, and these just add up. The partial response functions can be arbitrary smooth functions, so this set-up includes all linear models, but is far more general. (Every linear model is additive, but not vice versa.) The partial response functions are also basically as easily interpretable as the slopes in a linear model. We can estimate an additive model by doing a series of one-dimensional non-parametric regressions, which we now understand how to do quite well. If the true regression function is additive, then the total estimation error shrinks at the rate $ O(d n^{-4/5}) $, almost as good as $ O(d n^{-1}) $ for a linear model. If the true regression function is not additive, this is the rate at which we converge to the best additive approximation to $ \mu $, which is always at least as good as the best linear approximation. We can also extend additive models to jointly smooth some combinations of variables together, allowing them to interact (and to interact more sensibly than the product terms in conventional linear models).
Reading: Chapter 8: R for chapter 8
Homework 3 due at 6:00 pm
Homework 4 assigned; gmp-2006.csv data file

February 11 (Tuesday): Testing Regression Specifications

As mentioned last time, the total estimation error for a correctly-specified parametric model is (typically) $ O(d n^{-1}) $, while that for a mis-specified parametric model is $ O(1) + O(d n^{-1}) $. The error for a non-parametric model, which cannot really be mis-specified, is $ O(n^{-4/(4+d)}) $. If a parametric model is properly specified, therefore, it will (eventually) have smaller error than a non-parametric model, while if a parametric model is mis-specified, the non-parametric method will (eventually) predict better. This leads us to a set of tests of regression specifications, where we compare the fit of parametric and non-parametric models, and assess $ p $-values by simulating from the estimated parametric model. Alternatively, we use the fact $\Expect{Y-m{X}|X=x}=0$ when, but only when, $m$ is the true regression function, so we apply a non-parametric smoother to the residuals from our parametric model, and look at how close the resulting function is to zero.
Reading: Chapter 9: In-class slides and the Rmd file that generated them; R code from chapter 9
Optional reading:: Cox and Donnelly, chapter 7

February 13 (Thursday): Heteroskedasticity, Weighted Least Squares, and Variance Estimation

A random process with constant variance is called homoskedastic (or homoscedastic), while one with changing variance is called heteroskedastic. If we know that a process is heteroskedastic, it makes sense to give less weight to observations which we know have high variance, and more weight to observations known to have low variance. This leads to weighted-least-squares problems for estimating means, linear regressions, and non-parametric regressions, many of which can be solved in closed form, if the variances are known. Sometimes those variances can be worked out by studying our measurement process, but often they are themselves unknown. Fortunately, non-parametric regression actually gives us a way of estimating conditional variances: basically, do an initial regression of $ Y $ on $ X $, find the residuals, and then smooth the squared residuals against $ X $. (There are some bells-and-whistles.) We can then use this variance function to get a better estimate of the regression, by trying harder to fit observations with low variance and being more relaxed about fitting high-variance observations, then improve the variance function estimate, etc.

Reading: Chapter 10, skipping section 10.5

Optional reading:

Faraway, section 11.3

Homework 4 due at 6:00 pm

Homework 5 assigned, with nampd.csv and MoM.txt data sets

February 18 (Tuesday): Logistic Regression

Classification is very similar to regression, except that the variable $ Y $ we are trying to predict is qualitative or categorical. For some classification tasks, it is enough to have a sheer guess, but usually we want a probability distribution for $ Y $ given $ X $. If there are only two ("binary") categories, we arbitrarily code one of them as 0 and the other as 1. The advantage of this coding is that then $ p(x) \equiv \mathbb{P}\left( Y=1|X=x \right) = \Expect{Y|X=x} $. In principle, then, binary classification to regression is just a special case of regression.
The snag is that probabilities have to be between 0 and 1, but not all regression methods always give predictions between 0 and 1, even when all the observed $ y_i $ values obey those constraints. Linear regression, in particular, always gives non-sensical probabilities if $ x $ moves far enough from the center of the data. To combat this, we often apply some transformation to the output of a regression method, transforming a number in $ (-\infty, \infty) $ into a number in $ (0,1) $. That is, the idea is to use a model of the form $ p(x) = g(m(x)) $, where $ m $ is some sort of regression model estimated from data, and $ g $ is a transformation, fixed in advance, which ensures we always have a legitimate probability.
Among all the many transformations people have tried, a special place is held by the transformation $ g(u) = \frac{e^{u}}{1+e^{u}} $. The reason has to do with the likelihood. The likelihood of our model assigns to the data set $ (x_1, y_1), (x_2, y_2), \ldots (x_n, y_n) $ is $ \prod_{i=1}^{n}{p(x_i)^{y_i} (1-p(x_i))^{1-y_i}} $. The log-likelihood is then $ \sum_{i=1}^{n}{\log{(1-p(x_i))} + y_i \log{\frac{p(x_i)}{1-p(x_i)}}} $. The log-likelihood thus depends on the observed $ y_i $ only via the "log odds ratios" $ \log{\frac{p}{1-p}} $. Even if we used a different transformation to turn real numbers into probabilities, therefore, we would still end up having to deal with this transformation when we tried to fit to data. The map $ p \mapsto \log{\frac{p}{1-p}} $ from probabilities in $ [0,1] $ to real numbers in $ (-\infty, \infty) $ has come to be called the "logistic transformation", or "logit". (The inverse transformation is the "inverse logistic" or "inverse logit" or "ilogit".) What is called logistic regression is the modeling assumption that the log-odds-ratio is a linear function of $ x $, i.e., $ \mathrm{logit}(p(x)) = \beta_0 + \beta \cdot x $. (Turned around, $ p(x) = \frac{e^{\beta_0 + \beta \cdot x}}{1+ e^{\beta_0 + \beta \cdot x}} $. While common, there is usually no scientific or mathematical reason to think it is correct; it does, however, often work well if the right features go into $ x $. A natural generalization is to make the log-odds-ratio an additive function of $ x $, as in additive models for ordinary regression.
Once we have a model for $ p(x) $, whether logistic or otherwise, we can ask whether the probabilities it gives us for $ Y $ match the actual frequencies. (Does it, in fact, rain on half the days when weather forecast predicts a 50% chance of rain?) This is known as checking calibration, and while there are some advanced methods for doing so, simple graphical checks are usually informative.
Reading: chapter 11
Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)

February 20 (Thursday): Generalized linear models and generalized additive models


Reading: Chapter 12
Optional reading: Faraway, section 3.1 and chapter 6
Homework 5 due at 6:00 pm
Homework 6 assigned; ch.csv data file

February 25 (Tuesday): Canceled because the professor was sick

Was to have been: trees
Reading: Chapter 13

February 27 (Thursday): Canceled because the professor was sick

Homework 6 due at 6:00 pm

Homework 7 assigned (compas_violence.csv data file, testing_set file with row numbers for your testing set). --- Not due until Thursday, 13 March, the week after spring break.

March 4 and 6: NO CLASS (spring break)

March 11 (Tuesday): Multivariate distributions

Reading: Appendix on multivariate distributions

March 13 (Thursday): Density Estimation

Reading: Chapter 14
Slides, .Rmd file generating the slides
Homework 7 due at 6:00 pm (note date!)
Homework 8 assigned; n90_pol.csv data file

March 18 (Tuesday): Independence, Dependence, Conditional Independence

Two random variables $X$ and $Y$ are statistically independent (or probabilistically independent, or just "independent") when their joint distribution is the product of their marginal distributions, $p_{XY}(x,y) = p_X(x) p_Y(y)$. (If $X$ and $Y$ are not independent, then (naturally) they are dependent.) This holds if and only the marginal distribution of each variable is equal to its conditional distribution given the other. That is, "$X$ and $Y$ are independent" is equivalent to $p_{X|Y}(x|y) = p_X(x)$ and to $p_{Y|X}(y|x)=p_{Y}(y)$. Informally, this means that $X$ and $Y$ are irrelevant to each other: knowing the value one of these random variables tells us absolutely nothing about the other, and conditioning does not change the distribution in any way.
We can make the talk about relevance and irrelevance somewhat more formal with the notion of mutual information: $I[X;Y] \equiv \int{\log{\left( \frac{p_{XY}(x,y)}{p_{X}(x) p_{Y}(y)} \right)} p_{XY}(x,y) dx dy}$. If $X$ and $Y$ are independent, the ratio inside the logarithm will be 1 and this integral will be 0. Less obviously (see ch. 14), $I[X;Y] \geq 0$, and $I[X;Y]=0$ if and only if $I[X;Y]=0$. Zero mutual information is thus one way of characterizing probabilistic independence. (There are others.)
Two random variables $X$ and $Y$ are conditionally independent given a third random variable $Z$ when the conditional distributions factor: $p_{XY|Z}(x, y| z) = p_{X|Z}(x|z) p_{Y|Z}(y|z)$, which is true if and only if $p_{XYZ}(x, y, z) = p_{X|Z}(x|z) p_{Y|Z}(y|z) p_{Z}(z)$. This is in turn equivalent to both $p_{X|YZ}(x|y,z) = p_{X|Z}(x|z)$ and $p_{Y|XZ}(y|x,z) = p_{Y|Z}(y|z)$. Informally: $X$ is irrelevant to $Y$ given $Z$, and vice versa. Once we know $Z$, $Y$ has nothing to tell us about $X$. This can be formalized with conditional mutual information, $I[X;Y|Z] \equiv \int{p_Z(z) \left( \int{\log{\left( \frac{p_{XY|Z}(x,y|z)}{p_{X|z}(x) p_{Y|Z}(y|z)} \right)} p_{XY|Z}(x,y|z) dx dy} \right) dz}$. As before, $I[X;Y|Z]=0$ if and only if $X$ and $Y$ are conditionally independent given $Z$.
Conditional independence does not imply unconditional or marginal independence. Perhaps more remarkably, marginal independence does not imply conditional independence.
Both independence and conditional independence can greatly doing inference for high-dimensional distributions. If all our variables are independence of each other, we can learn a bunch of one-dimensional distributions and multiply. (This is similar to how additivity simplifies learning regression functions.) If most of our variables are conditionally independent given one variable, then we need to learn a bunch of two-dimensional (conditional) distributions. The next lecture will explore some variations on that theme.
Reading: Chapter 14 (still)

March 20 (Thursday): Factor Models and Mixture Models

Reading: Chapters 16 and 17
Homework 8 due at 6:00 pm
Homework 9 assigned; stockData.RData

March 25 (Tuesday): Graphical Models

Reading: Chapter 18

March 27 (Thursday): Graphical Causal Models

Reading: Chapter 19
Optional reading:: Cox and Donnelly, chapters 5, 6 and 9; Pearl, "Causal Inference in Statistics", section 1, 2, and 3 through 3.2
Homework 9 due at 6:00 pm
Homework 10 assigned, due on Wednesday, 2 April; sim-smoke.csv

April 1 (Tuesday): Identifying Causal Effects from Observations I

A quantity (or vector, curve, function, etc.) in a statistical model is identified when it can be expressed as a function of the joint distribution of observables; otherwise it is unidentified. Causal quantities, like $ \Prob{Y=y|do(X=x)} $, are often not identified. One common reason is that $ X $ and $ Y $ have common causal ancestors, so the actual effect of $ X $ on $ Y $ is confounded with the information $ X $ gives us about the common ancestor(s), and so about $ Y $. Thus R. A. Fisher, probably the greatest statistician who ever lived, did not deny that smoking predicted lung cancer; he just thought that there was a common genetic cause of both lung cancer and of taking up smoking.
If $ X $ has no causal ancestors, i.e., is exogenous ("born outside"), then its causal effects are always identified, and $ \Pr{Y|do(X)} = \Pr{Y|X} $. If we experiment on $ X $, particularly if we do randomized experiments on $ X $, we ensure that it is exogenous, and can identify its effects. If we cannot experiment, we might be able to observe all the parents of $ X $. We can then show that $ \Prob{Y|do(X=x)} = \sum_{t}{\Prob{Y|X=x, \mathrm{Parents}(X)=t}\Prob{\mathrm{Parents}(X)=t}} $, so that the effects of $ X $ are identified. That is, we condition on, or control for, the parents. (This rule implies the one about the effects of exogenous variables being identified.)
If controlling for all the parents of $ X $ is not feasible, we can try to find a set of variables $ S $ with two properties: (i) $ S $ blocks all "back-door" paths between $ X $ and $ Y $, those paths beginning with an arrow into $ X $; and (ii) $ S $ contains no descendants of $ X $. Any set of variables $ S $ which meets both (i) and (ii) satisfies the back-door criterion. We can then identify the effects of $ X $ in basically the same way: $ Prob{Y|do(X=x)} = \sum_{s}{\Prob{Y|X=x, S=s}\Prob{S=s}} $. There may be multiple sets of variables which all satisfy the back door criterion. These will all converge on same estimate of $ \Prob{Y|do(X)} $ in the limit of infinite data, but some of them may be more practical, reliable or simply cheap in a pre-asymptotic world.
Reading: Chapter 20
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1

April 2 (Wednesday)

Homework 10 due at 6 pm
Homework 11 assigned; sesame.csv data file

April 3 (Thursday): Carnival, NO CLASS

April 8 (Tuesday): Identifying Causal Effects from Observations II

Suppose we are unable, or unwilling, to use the back-door criterion. We can still identify the effect of $ X $ on $ Y $ if we can find a set of mediators or mediating variables where (i) $ M $ blocks all directed paths from $ X $ to $ Y $, (ii) there are no unblocked back-door paths from $ X $ to $ M $, and (iii) $ X $ blocks all back-door paths from $ M $ to $ Y $. Then $ \Prob{Y|do(X=x)} = \sum_{m}{\Prob{Y|do(M=m)}\Prob{M=m|X=x}} $, and we can identify $ \Prob{Y|do(M=m)} $ using $ X $ to block back-door paths, $ \Prob{Y|do(M=m)} = \sum_{x^{\prime}}{\Prob{Y|X=x^{\prime},M=m}\Prob{X=x^{\prime}}} $. These mediators meet the front-door criterion.: The final major way of getting identification is to use an instrumental variable, an $ I $ which has a directed path to $ X $ and to $ Y $, but where the only unblocked, directed paths to $ Y $ go through $ X $. We can identify $ \Prob{Y|do(I)} $, and $ \Prob{X|do(I)} $, and we can often (though not quite always) "back out" $ \Prob{Y|do(X)} $ from these, by solving an inverse problem.
Reading: Chapter 20

April 10 (Thursday): Estimating Causal Effects from Observations

Once we know that the causal effect we are interested in is identified, we still have to estimate it. If we are using the back-door criterion, the obvious approach to estimating $ \Prob{Y|do(X)} $ is to estimate $ \Prob{Y|X,S} $ and $ \Prob{S} $ and calculate. Similarly, to estimate $ \Expect{Y|do(X)} $, we can estimate $ \Expect{Y|X,S} $ and $ \Prob{S} $. If we have IID data, however, we can avoid the step of estimating $ \Prob{S} $. This is because, under the back-door criterion, $ \Prob{Y=y|do(X=x)} = \Expect{\Prob{Y=y|X=x, S}} $, and, by the law of large numbers, $ \frac{1}{n}\sum_{i=1}^{n}{\Prob{Y=y|X=x, S=s_i}} \rightarrow \Expect{\Prob{Y=y|X=x, S}} $. (Sample averages converge on expectations.) Similarly, $ \Expect{Y|do(X=x)} = \Expect{\Expect{Y|X=x, S}} $, and $ \frac{1}{n}\sum_{i=1}^{n}{\Expect{Y|X=x, S=s_i}} \rightarrow \Expect{\Expect{Y|X=x, S}} $. Substituting in estimated functions for $ \Prob{Y=y|X=x, S=s} $ or $ \Expect{Y|X=x, S=s} $ adds some more approximation error and uncertainty, but it is a feasible procedure.
If $ X $ is binary, 0 or 1, the propensity score is the probability that $ X=1 $ given the control variables in $ S $, $ R \equiv \Prob{X=1|S} $. When conditioning on $ S $ satisfies the back-door criterion, then conditioning on $ R $ alone also satisfies the back-door criterion, and $ \Expect{Y|do(X=x)} = \Expect{\Expect{Y|X=x, R}} $. So if we can get good estimates of the propensity scores, we can do a much simpler regression model for $ Y $ (or a much simpler conditional probability model).
If $ X $ is binary and all we care about is the average treatment effect of switching from $ X=0 $ to $ X = 1 $, that is $ ATE = \sum_{s}{\left(\Expect{Y|X=1, S=s} - \Expect{Y|X=0, S=s}\right)\Prob{S=s}} $. By the same sort of law-of-large-numbers argument, the ATE is $ \approx \frac{1}{n}\sum_{i=1}^{n}{\Expect{Y|X=1, S=s_i} - \Expect{Y|X=0, S=s_i}} $. The actual, observed $ Y_i = \Expect{Y|X=x_i, S=s_i} + \epsilon_i $, where $ \Expect{\epsilon_i|X,S}=0 $. So $ Y_i $ is an unbiased estimate of either $ \Expect{Y|X=1, S=s_i} $, or of $ \Expect{Y|X=0, S=s_i} $, depending on what $ x_i $ is. For each data point $ i $, match it with another data point, $ m(i) $, where the covariates are the same, $ S=s_i = s_{m(i)} $, but the cause is flipped, $ X_{m(i)} = 1-X_i $. Then $ (Y_i - Y_{m(i)})(2 X_i - 1) $ is an unbiased estimate of $ \Expect{Y|X=1, S=s_i} - \Expect{Y|X=0, S=s_i} $, and we can average this over all data points to get an estimate of the ATE. This sort of matching estimate is basically one-nearest-neighbors; it can be adapted to be more like kNN, including allowing for imperfect matching.
Matching on the propensity score, a.k.a. propensity score matching, is just what it sounds like.
Estimation using the front door criterion is built on top of estimation using the back-door criterion.
Reading: Chapter 21
Homework 11 due at 6:00 pm
Homework 12 assigned

April 15 (Tuesday): Discovering Causal Structure from Observations

Reading: Chapter 22

April 17 (Thursday): Dependent Data I

Almost everything so far, in this course and in your earlier statistics courses, has been based on independent data points. There might be complicated dependencies between variables within a data point, but each data point was treated as an independent draw from the same underlying distribution. Lots of real data is not like this, in particular, data from processes that evolve over time or are spread out over space are not like this. There are nonetheless underlying laws of large numbers for dependent data, essentially saying that if correlations become sufficiently weak over long intervals of time (or long distances in space), sample averages still converge on expectation values, and large samples become representative of the whole process. This lets us port over huge chunks of the technology we have built to modeling dependence over time: in particular, we can look at autoregressions, where expected future values are functions of past values. The conditional-distribution versions of these are Markov models.
Reading: Chapter 23
Homework 12 due at 6:00 pm
Homework 13 assigned; debt.csv data file

April 22 (Tuesday): Dependence Data II

Reading: Chapter 23, continued

April 24 (Thursday): Summary of the course

What have we learned?
Homework 13 due at 6:00 pm