Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis
Spring 2024
Tuesdays and Thursdays, 9:30--10:50 am, Doherty Hall 2315
The goal of this class is to train you in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401, extending it
to more general functional forms, and more general kinds of data, emphasizing
the computation-intensive methods introduced since the 1980s. After taking the
class, when you're faced with a new data-analysis problem, you should be able
to (1) select appropriate methods, (2) use statistical software to implement
them, (3) critically evaluate the resulting statistical models, and (4)
communicate the results of your analyses to collaborators and to
non-statisticians.
During the class, you will do data analyses with existing software, and
write your own simple programs to implement and extend key techniques. You
will also have to write reports about your analyses.
36-602
A small number of well-prepared graduate students
from other departments can take this course by registering for it as 36-602.
(Graduate students enrolling in 36-402, or undergraduates enrolling in 36-602,
will be automatically dropped from the roster.) If you want to do so, please
contact me, to discuss whether you have the necessary preparation.
Prerequisites
36-401, with a grade
of C or better. Exceptions are only granted for graduate students in other
departments taking 36-602.
Instructors
Professor | Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
Teaching assistants | TBD |
Topics
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Regression: what is regression, really?;
what ordinary linear regression actually does; what it cannot do; regression by kernel smoothing; regression by spline smoothing; additive models; regression by trees and/or nearest neighbors (time permitting)
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models.
- Distributions, Structure in Distributions, and Latent Variables: Multivariate distributions; factor analysis and latent variables; cluster/mixture models and latent variables; graphical models in general
- Causality: graphical causal models; causal
inference from randomized experiments; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
- Dependent data: dependence over time; dependence over space and over space-time; dependence over networks (time and/or space permitting)
See the end of this syllabus for the current lecture
schedule, subject to revision. Lecture handouts, slides, etc., will be
linked there, as available.
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills
you are learning in this class. It also helps you learn, because some things
which seem murky become clear when you actually do them, and sometimes trying to do
something shows that you only thought you understood it.
- Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
- Homework
- Almost every week will have a homework assignment, divided into a series
of questions or problems. These will have a common theme, and will usually
build on each other, but different problems may involve doing or applying some
theory, analyzing real data sets on the computer, and communicating the
results.
- All homework will be submitted electronically. Most weeks,
homework will be due at 6:00 pm on Thursdays. Homework
assignments will always be released by Friday of the previous week, and sometimes before. The week of Carnival (April 11th and 12th), the homework will be due at 6 pm on Wednesday the 10th, and will be shorter than usual (but count just as much).
- There are specific formatting requirements for homework --- see below.
- In-class exercises
- Most lectures will have in-class exercises. These will be short (10--15
minutes) assignments, emphasizing problem solving, connected to the theme
of the lecture and to the current homework. You will do them in class in small
groups of at most four people. The assignments will be given out in class, and
must be handed in electronically by 6 pm that day. On most days, a
randomly-selected group will be asked to present their solution to the class.
- Exams
- There will be two data analysis exams, one in the middle
and one at the end of the semester. In each, you will analyze a real-world
data set, answering questions about the world (not statistics) on the basis of
your analysis, and write up your findings in the form of a scientific report.
The exam assignments will provide the data set, the specific questions, and a
rubric for your report.
- Both exams will be take-home, and you will have at least one week to
work on each, without homework (from this class anyway). Both exams will
be cumulative.
- Exams are to be submitted electronically, and follow the
same formatting requirements as the homework --- see
below.
The exams will be due on February 29th and April 25th. If these dates
present difficulties for you, please contact me as soon as possible.
Time expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- Exercises: 10%. All exercises will have equal weight. Your lowest five
exercise grades will be dropped. If you complete all the exercises,
with a score of at least 60%, your lowest six grades will be dropped.
Late exercises will not be accepted for any reason.
- Homework: 60%. There will be 11 homeworks, all of equal weight. Your
lowest two homework grades will be dropped, no questions asked. If you turn in
all homework assignments (on time), with a score of at least 60%, your lowest
two homework grades will be dropped, and you will get 50 points of
extra credit added to your third-lowest homework grade.
Late homework will not be accepted for any reason.
- Data analysis exams: 30%, i.e., 15% each.
Dropping your lowest grades lets you schedule interviews, social
engagements, and other non-academic uses of your time flexibly, and without my
having to decide what is or is not important enough to change the rules.
(Giving you extra credit for making a serious effort at every assignment
rewards practice, which is very important.) If something --- illness, family
crisis, or anything else --- is a continuing obstacle to your ability
to do the work of the class, come talk to me.
Letter-grade thresholds
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments which will help consolidate your
understanding of the material, and help with your homework.
You will generally find it helpful to do the readings before coming
to class.
Electronics: Don't. Please don't use electronics during
lecture: no laptops, no tablets, no phones, no recording devices, no
watches that do more than tell time. The only exception to this rule is for
electronic assistive devices, properly documented with CMU's Office of
Disability Resources.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show,
pretty
clearly, that students learn more in electronics-free classrooms, not least
because your device isn't distracting your neighbors, who aren't as good at
multitasking as you are.)
Exams
The only exams will be the take-home data analysis exams (each 15% of your
final grade). You will have one week to work on each, without homework from
this class. The exams may require you to use any material already covered in
the readings, lectures or assignments.
Exams must also be submitted electronically, under the same rules about file
formats as homework.
The due dates for exams will be fixed by the first day of classes, and will
not change thereafter. Please try to schedule your obligations around them.
If you know that the dates will be a problem for you, please contact me as soon
as possible.
There will be no in-class exams, and nothing due during the final period.
R, R Markdown, and Reproducibility
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you'll be using it for
every homework and exam assignment. If you are not able to use R, or
do not have ready, reliable access to a computer on which you can do so, let me
know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code are not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
go ahead.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can re-do your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports".
Canvas, Gradescope and Piazza
You will submit your work electronically through Gradescope. This includes
your homework assignments, your take-home exams and the class exercises. We
will use Canvas as a gradebook, to distribute solutions, and to distribute some
readings that can't be publicly posted here.
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused. You should not expect the
instructors to answer questions on Piazza (or by e-mail) outside normal working
hours. (We may, but you shouldn't expect it.)
For each assignment, you should write your homework in R Markdown,
knit it to a humanly readable document in PDF format, and upload the PDF to
Gradescope. This is, ordinarily, what you will be graded on. However, it is
important that you keep the R Markdown file around, because every week I will
randomly select a few students and ask them to send me their R Markdown files, so
that I can re-run them and check that they do, in fact, produce the files that were
turned in. (You should expect to be picked about once a semester.) If they don't
match, I will have questions, and it will hurt your grade.
(Gradescope makes it much easier for multiple graders to collaborate, but it
doesn't understand R Markdown files, just PDFs.)
You'll need to do math for some homework problems. R Markdown provides a
simple but powerful system for type-setting math. (It's based on the LaTeX
document-preparation system widely used in the sciences.) If you can't get it
to work, you can hand-write the math and include scans or photos of your
writing in the appropriate places in your R Markdown document. You will,
however, lose points for doing so, starting with no penalty for homework 1, and
growing to a 90% penalty (for those problems) by the last homework.
For the class exercises, scans/photos of hand-written math are
acceptable, but you will lose points if they are hard to read. (Dark
ink on unlined paper tends to work best.)
Solutions
Solutions for all homework and in-class exercises will be available, after their
due date, through Canvas. Please don't share them with anyone, even after the
course has ended. This very much includes uploading them to websites.
Office Hours
Mondays | 9:00--10:00 | Ergan Shang | Piazza |
Mondays | 9:30--10:30 | Ergan Shang | Zoom |
Mondays | 1:30--2:30 | Cosma Shalizi | Piazza |
Mondays | 4:00--5:00 | Steffi Chern | Piazza |
Tuesdays | 8:30--9:30 | Kay Nam | Piazza |
Tuesdays | 12:30--1:30 | Soheun Yi | In person, Wean Hall 3713 |
Tuesdays | 1:00--2:00 | Neil Xu | Piazza |
Tuesdays | 1:30--2:30 | Cosma Shalizi | Piazza |
Tuesdays | 2:00--3:00 | Odalys Barrientos | Zoom |
Tuesdays | 4:00--5:00 | Michael Wieck-Sosa | Piazza |
Wednesdays | 9:00--10:00 | Vinay Maruri | Piazza |
Wednesdays | 10:00--11:00 | Neil Xu | Zoom |
Wednesdays | 12:00--1:00 | Julia Walchessen | In person, Wean Hall 3711 |
Wednesdays | 1:00--2:00 | Lawrence Jang | Piazza |
Wednesdays | 1:30--2:30 | Cosma Shalizi | Piazza |
Wednesdays | 3:30--4:30 | Eric Bensen | In person, Wean Hall 3715 |
Wednesdays | 4:00--5:00 | Tianyou Zheng | Piazza |
Thursdays | 8:30--9:30 | Eric Bensen | Piazza |
Thursdays | 12:30--1:30 | Michael Wieck-Sosa | Zoom |
Thursdays | 1:00--2:00 | Anni Hong | Piazza |
Thursdays | 1:30--2:30 | Cosma Shalizi | Piazza |
Thursdays | 4:00--5:00 | Anni Hong | In person, Gates Hall 4211 |
Thursdays | 4:00--5:00 | Soheun Yi | Piazza |
Fridays | 9:00--10:00 | Odalys Barrientos | Piazza |
Fridays | 11:00--12:00 | Aoran Zhan | Piazza |
Fridays | 2:00--3:00 | Cosma Shalizi | Piazza |
Fridays | 4:00--5:00 | Gabriel Krotkov | Piazza |
Piazza office hours aren't (necessarily) the only time we'll answer
questions on Piazza, but they are when someone is online to clear any backlog
of accumulated questions, and for rapid follow-ups if there are further
questions.
If you want help with computing at in-person office hours, please bring a laptop.
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Textbook
The primary textbook for the course will be the
draft Advanced
Data Analysis from an Elementary Point of View. Chapters will be
linked to here as they become needed. Reading these chapters will greatly
help you do the assignments.
In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
is strongly suggested as a reference.
Cox and Donnelly, Principles of Applied Statistics (Cambridge
University Press, 2011,
ISBN 978-1-107-64445-8); Faraway, Extending
the Linear Model with R (Chapman Hall/CRC Press, 2006,
ISBN 978-1-58488-424-8; errata);
and Venables and Ripley, Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will be optional.
Collaboration, Cheating and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. Websites which compile course materials
are not legitimate online resources. Neither are large language
models (ChatGPT, etc.).
In general, you are free to discuss homework with other students in the
class, though not to share work; such conversations must be acknowledged in
your assignments. You may not discuss the content of assignments with
anyone other than current students or the instructors until after the
assignments are due. (Exceptions can be made, with prior permission, for
approved tutors.) You are, naturally, free to complain, in general terms,
about any aspect of the course, to whomever you like.
During the take-home exams, you are not allowed to discuss the content of
the exams with anyone other than the instructors; in particular, you may not
discuss the content of the exam with other students in the course.
Any use of solutions provided for any assignment in this course in previous
years is strictly prohibited, both for homework and for exams. This
prohibition applies even to students who are re-taking the course. Do not copy
the old solutions (in whole or in part), do not "consult" them, do not read
them, do not ask your friend who took the course last year if they "happen to
remember" or "can give you a hint". Doing any of these things, or anything
like these things, is cheating, it is easily detected cheating, and those who
thought they could get away with it in the past have failed the course. Even
more importantly: doing any of those things means that the
assignment doesn't give you a chance to practice; it makes any
feedback you get meaningless; and of course it makes any evaluation based on
that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your mis-deeds might be rectified.
Otherwise, violations of any sort will lead to severe, formal disciplinary
action, under the terms of the university's
policy
on academic integrity.
You must complete "homework 0" on the content of the university's policy on
academic integrity, and on these course policies. This assignment will not
factor into your grade, but you must complete it, with a grade of at
least 90%, before you can get any credit for any other assignment.
Accommodations for Students with Disabilities
If you need accommodations for physical and/or learning disabilities, please
contact the Office of Disability Resources, via their
website http://www.cmu.edu/disability-resources.
They will help you work out an official written accommodation plan, and help
coordinate with me.
Other Iterations of the Class
Some material is available from versions of this class taught in
other years. As stated above, any use of solutions provided in earlier
years is not only cheating, it is very easily detected cheating.
Schedule
Subject to revision. Lecture notes, assignments, and solutions
will all be linked here as they become available.
Current revision of the complete textbook
\[
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]
January 16 (Tuesday): Introduction to the class; regression
- Regression is about guessing the value of a quantitative, numerical random variable \( Y \) from one or more other variables \( X \) (which may or
may not be numerical). Because "guess" sounds undignified, we say that we are making a point prediction (as opposed to an interval prediction or distributional prediction). This means that our guess is a numerically-valued function of \( X \), say \( m(X) \). Traditionally, we measure the quality of the prediction by the expected squared error, \( \Expect{(Y-m(X))^2} \). Doing so leads to a unique choice for the optimal or true regression function: \( \mu(x) \equiv \Expect{Y|X=x} \), the conditional expectation function. We can always say that \( Y = \mu(X) + \epsilon \), where the noise \( \epsilon \) around the regression function has expectation 0, \( \Expect{\epsilon|X=x} = 0 \). Calculating the true regression function would require knowing the true conditional distribution of \( Y \) given \( X \), \( p(y|x) \). Instead of having that distribution, though, as statisticians we just have data, \( (x_1, y_1), (x_2, y_2), \ldots (x_n, y_n) \). We use that data to come up with an estimate of the regression function, \( \hat{\mu} \). Because our data are random, our estimate is also random, so \( \hat{\mu} \) has some distribution. This distribution matters for the error of our estimate, via the bias-variance decomposition: \( \Expect{(Y-\hat{\mu}(X))^2|X=x} = \Var{\epsilon|X=x} + (\mu(x) - \Expect{\hat{\mu}(x)})^2 + \Var{\hat{\mu}(x)} \).
- In practice, almost all ways of estimating the regression function are examples of linear smoothers, where \( \hat{\mu}(x) = \sum_{j=1}^{n}{w(x, x_j) y_j} \). Here the weights \( w(x, x_j) \) are some way of saying "How similar is the data point \(x_j \) to the place where we're trying to make a prediction \( x \)?", where we (usually) want to give more weight to values \( y_j \) from places \( x_j \) which are similar to \( x \). Examples of this scheme include the nearest neighbor method, the \( k \)-nearest-neighbor method, kernel smoothing, and linear regression. (The weights in
linear regression are very weird and only make sense if we insist on smoothing the data on to a straight line, no matter what.) For all linear smoothers, many
properties of the fitted values can be read off from the weight, influence or hat matrix \( \mathbf{w} \), defined by \( w_{ij} = w(x_i, x_j) \), just as we learned to use the hat matrix in linear regression.
- Reading:
- Chapter 1 of the textbook; R for all examples
- CMU's policy on academic integrity
- This course's policy on collaboration, cheating and plagiarism (above)
- Excerpt from Turabian's A Manual for Writers (on Canvas)
- Optional reading:
- Cox and Donnelly, chapter 1
- Faraway, chapter 1 (especially up to p. 17).
- Homework 0 (on collaboration and plagiarism): assignment
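To make the idea of a linear smoother concrete, here is a minimal R sketch (not part of the course materials) of k-nearest-neighbor regression written as a weight matrix acting on the observed responses; the data and the helper function knn.weights are made up for illustration.

    # k-nearest-neighbor regression as a linear smoother: fitted values are
    # weighted averages of the y_j, with the weights collected in a hat-like matrix.
    set.seed(1)
    n <- 100
    x <- runif(n, 0, 10)
    y <- sin(x) + rnorm(n, sd = 0.3)
    # Weight function: w(x0, xj) = 1/k if xj is among the k points nearest x0, else 0
    knn.weights <- function(x0, x, k = 5) {
      w <- numeric(length(x))
      w[order(abs(x - x0))[1:k]] <- 1 / k
      w
    }
    # Row i of W holds the weights w(x_i, x_j); compare the hat matrix of linear regression
    W <- t(sapply(x, knn.weights, x = x, k = 5))
    fitted <- W %*% y   # every fitted value is a weighted average of the y_j
    plot(x, y); points(x, fitted, col = "blue", pch = 16)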
January 18 (Thursday): The truth about linear regression
- If we decide to predict \( Y \) as a linear function of a scalar \( X \), so \( m(X) = b_0 + b_1 X \), there is a unique optimal linear predictor: \( m(X) = \Expect{Y} + (X-\Expect{X}) \frac{\Cov{X,Y}}{\Var{X}} \). That is, the optimal slope of the simple linear regression is \( \beta_1 = \frac{\Cov{X,Y}}{\Var{X}} \), and the optimal intercept makes sure the regression line goes through the means, \( \beta_0 = \Expect{Y} - \Expect{X} \beta_1 \). This generalizes to higher dimensions: if we decide to linearly predict \( Y \) from a vector \( \vec{X} \), the vector of optimal slopes is \( \vec{\beta} = \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \), and the intercept is \( \beta_0 = \Expect{Y} - \Expect{\vec{X}} \cdot \vec{\beta} \), so the optimal linear predictor again has the form \( \Expect{Y} + (\vec{X} - \Expect{\vec{X}}) \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \). This is where the strange-looking form of the weights in the hat matrix comes from. (A short R check of these formulas appears at the end of this entry.)
- All this is true whether or not the real relationships are linear, whether there are any Gaussian distributions anywhere, etc., etc. Ordinary least squares will give consistent estimates for \( \beta_0 \) and \( \vec{\beta} \) under a very wide range of circumstances, where none of the usual assumptions from a linear-regression course apply. The place those assumptions become important is in the typical calculations of statistical significance, confidence intervals, prediction intervals, etc. However, it is important to understand that "this coefficient is statistically significant" really means "a linear model where there's a slope on this variable fits better than one which is flat on that variable, and the difference is too big to come from noise unless we're really unlucky". Statistical significance thus runs together actual effect magnitude, precision of measurement, and sample size. Lots of the other stuff people would like linear regression to do, and which many non-statisticians think it can do, can also be seen to be myths.
- Reading:
- Chapter 2 of the textbook; revision with a better treatment of why ordinary least squares is unbiased; R for examples
- Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
- Optional reading: Faraway, rest of chapter 1
- Homework 1: Assignment
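The R sketch promised above: a quick numerical check of the covariance formulas on simulated (and deliberately non-linear, non-Gaussian) data. Nothing here comes from the course materials.

    # The optimal linear coefficients are Var(X)^{-1} Cov(X, Y), whether or not
    # the true relationship is linear; lm() recovers the same numbers.
    set.seed(2)
    n <- 1e4
    X <- cbind(x1 = runif(n, -2, 2), x2 = rexp(n))
    Y <- exp(X[, 1]) - X[, 2]^2 + rnorm(n)
    beta <- solve(var(X), cov(X, Y))            # optimal slopes
    beta0 <- mean(Y) - colMeans(X) %*% beta     # optimal intercept
    c(beta0, beta)
    coef(lm(Y ~ X))                             # ordinary least squares agrees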
January 23 (Tuesday): Evaluation of Models: Error and inference
- When we use statistical models to make predictions, we
don't really care about their performance on our data. Rather, we
want them to fit well to new cases, at least if those new data points
followed the same distribution as the old. Expected error (or "loss") on new
data is sometimes called risk; we want low risk. The
difficulty is that we adjust our models to fit the training data, so
in-sample performance on that training data is a biased estimate of risk.
For regression models, when we use squared error as our loss function, we can
relate this optimism bias to the covariance between data points and fitted
values, \( \Cov{Y_i, \hat{\mu}(X_i)} \), but the issue is much more general.
This becomes particularly important when we're comparing different models to
select one of them: more flexible models will seem to fit the data
better, but are they actually learning subtle-yet-systematic patterns, or just
memorizing noise?
- The cleanest way to get a good estimate of the risk would be to evaluate all our models on a new, independent testing set. If we don't have an independent testing set, we can try splitting our data (randomly) into training and testing sets. This introduces some extra randomness, which we can partially average away by using each half of the data in turn as the testing set, and averaging. We can often do even better by \( k \)-fold cross-validation, dividing the data into \( k \) "folds", where each fold is used in turn as the testing set (and the other \( k-1 \) folds together are the training set). For some purposes, \( n \)-fold or leave-one-out cross-validation is even better, but it comes at a high computational cost. There are however short-cuts for leave-one-out cross-validation for linear smoothers. The famous Akaike information criterion (AIC) is another approximate hack for leave-one-out CV.
- Model selection, whether by cross-validation or some other means, makes our choice of model dependent on the data, hence (partially) random. The usual formulas for inferential statistics (p-values, confidence intervals, etc.) do not include this extra randomness. Statistical inference after model selection therefore requires extra work; the most straightforward approach is just data splitting again.
- Reading: Chapter 3; R for examples
- Optional reading: Cox and Donnelly, ch. 6
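A minimal k-fold cross-validation sketch in R, comparing two regression specifications by their estimated risk; the data frame, the formulas, and the helper cv.mse are all made up for illustration.

    # Estimate out-of-sample mean squared error by k-fold cross-validation.
    cv.mse <- function(formula, data, k = 5) {
      folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
      fold.mse <- sapply(1:k, function(f) {
        fit <- lm(formula, data = data[folds != f, ])     # train on the other k-1 folds
        preds <- predict(fit, newdata = data[folds == f, ])
        mean((data[folds == f, all.vars(formula)[1]] - preds)^2)
      })
      mean(fold.mse)
    }
    set.seed(3)
    df <- data.frame(x = runif(200, -3, 3))
    df$y <- sin(df$x) + rnorm(200, sd = 0.5)
    cv.mse(y ~ x, df)              # rigid, mis-specified linear model
    cv.mse(y ~ poly(x, 5), df)     # more flexible specification, usually lower risk here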
January 25 (Thursday): Smoothing methods for regression
- Kernel smoothing, a.k.a. Nadaraya-Watson smoothing, is yet another linear smoother. For one-dimensional \( x \), we make predictions using \( \widehat{\mu}(x) = \sum_{j=1}^{n}{y_j \frac{K\left( \frac{x - x_j}{h} \right)}{\sum_{k=1}^{n}{K\left( \frac{x - x_k}{h}\right)}}} \). Here \( K(u) \) is the kernel function, which is non-negative and maximized when \( u = 0 \); when we compute \( K\left( \frac{x - x_j}{h} \right) \), we're using the kernel function to measure how similar the data point \( x_j \) is to the place \( x \) where we're making a prediction. We usually also require that \( K(u) \) be a probability density with expectation zero and finite variance ( \( \int_{-\infty}^{\infty}{K(u) du} = 1 \), \( \int_{-\infty}^{\infty}{u K(u) du} = 0 \), \( \int_{-\infty}^{\infty}{u^2 K(u) du} < \infty \) ). These restrictions ensure that when \( x_j \) is very far away from our point of interest \( x \), \( y_j \) gets comparatively little weight in the prediction. Dividing by the sum of the kernel factors makes sure that our prediction is always a weighted average. Finally, we need some sense of what counts as "near" or "far", a distance scale --- this is provided by the \( h \) in the denominator inside the kernel functions, called (for historical reasons) the bandwidth. Very roughly speaking, \( y_j \) should get a substantial weight in the prediction of \( \widehat{\mu}(x) \) if, and only if, \( x_j \) is within a few factors of \( h \) of \( x \). (A hand-rolled implementation appears at the end of this entry.)
- Intuitively, a big bandwidth means every prediction averages over a lot of data points which are widely spread. It should thus give a very smooth function, estimated with low variance, but (potentially) high bias. On the other hand, using a very small bandwidth makes all the predictions into very local averages, based on only a few data points each --- they will be less smooth, and have higher variance, but also less bias. Because kernel smoothing gives us a weighted average, we can analyze its bias and variance comparatively easily. The variance comes from the fact that we are only averaging a limited number of \( y_j \) values in each prediction, where the corresponding \( x_j \) are close to the operating point \( x \). The number of such points should be about \( 2 n h p(x) \), \( p \) being the pdf of \( X \), and the variance should be inversely proportional to that. The bias comes from the fact that \( \mu(x_j) \neq \mu(x) \). Assuming the real regression function \( \mu \) is sufficiently smooth, we can do some Taylor expansions to get that the bias is \( \propto h^2 \). (One part of the bias comes from \( \mu(x) \) having a non-zero slope and \( p(x) \) also having a non-zero slope, so the \( x_j \) tend to be on one side or the other of \( x \), pulling the average up or down. The other part of the bias comes from \( \mu(x) \) having non-zero curvature, pulling the average up or down regardless of the distribution of the \( x_j \). Both contributions end up being \( \propto h^2 \).) This lets us conclude that the optimal bandwidth \( h_{opt} \propto n^{-1/5} \), and that the expected squared error of kernel regression \( = \Var{\epsilon} + O(n^{-4/5}) \). That is, the excess expected squared error, compared to the Oracle who knows the true regression function, is only \( O(n^{-4/5}) \). Notice that the optimal bandwidth \( h \) changes with the sample size, and goes to zero. Bandwidths are not parameters.
- In practice, we find the best bandwidth by cross-validation, which is why this lecture comes after the previous one.
- All of the above analysis is for \( X \) one dimensional. If \( X \) has multiple dimensions, the usual approach is to multiply kernel functions together, each with their own bandwidth. This changes the error analysis in ways we will come back to later in the course. (The bias stays the same, but the variance blows up.) We can also incorporate qualitative variables (factors in R) by appropriate kernels.
- Reading: Chapter 4; R for examples
- Optional reading:
- Faraway, section 11.1
- Hayfield and Racine, "Nonparametric Econometrics: The np Package"
- Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- Homework 0 due (at 6 pm)
- Homework 1 due (at 6 pm)
- Homework 2: assignment, uval.csv data file
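The hand-rolled Nadaraya-Watson smoother promised above, as a sketch with a Gaussian kernel and made-up data; in practice one would use a package such as np, and pick the bandwidth by cross-validation.

    # Nadaraya-Watson kernel regression at a single point x0: a weighted average
    # of the y_j, with weights K((x0 - x_j)/h) from a Gaussian kernel.
    nw.smooth <- function(x0, x, y, h) {
      w <- dnorm((x0 - x) / h)
      sum(w * y) / sum(w)
    }
    set.seed(4)
    x <- runif(150, 0, 10)
    y <- sin(x) + rnorm(150, sd = 0.3)
    grid <- seq(0, 10, length.out = 200)
    plot(x, y)
    lines(grid, sapply(grid, nw.smooth, x = x, y = y, h = 0.3), col = "blue")  # small h: wiggly, high variance
    lines(grid, sapply(grid, nw.smooth, x = x, y = y, h = 2), col = "red")     # large h: smooth, high bias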
January 30 (Tuesday): Simulation
- When we have complicated probabilistic models, it is often very hard, if not impossible, to come up with nice, short formulas describing their behavior. A good probabilistic model is, however, a systematic, step-by-step account of how to generate a data set, or something shaped like a data set. (There is some dice-rolling and coin-tossing involved, because the model is probabilistic, but the model is very precise about which dice to toss when, and what to do with the results.) This gives us an alternative to analytical calculations: simulate data from the model, and see what it looks like. We usually need to run the simulations multiple times, to get an idea of the distribution of outcomes the model can generate, but once we can figure out how to do one simulation, we can get the computer to do it for us over and over. Analytical probability formulas should in fact be seen as short-cuts which summarize what exhaustive simulations would also tell us. Those short-cuts are useful when we can find them, but we do not have to rely on them.
- Data-splitting and cross-validation can be seen as examples of simulation, specifically simulating the process of generalizing to new data.
- Reading: Chapter 5; R for all examples
- Optional reading:
- Appendix on writing R code; R for all examples, including some deliberately bad code (to make points about common mistakes and about debugging)
- Handout on approximating probabilities and expectations by repeated simulation ("Monte Carlo"); .Rmd for the handout
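A tiny illustration of the Monte Carlo idea, using nothing beyond base R; the quantities approximated here are chosen only because they have known closed forms to compare against.

    # Approximate a probability and an expectation by repeated simulation.
    set.seed(5)
    B <- 1e5                      # number of simulation runs
    z <- rnorm(B)
    mean(z > 1.5)                 # approximates P(Z > 1.5); compare pnorm(1.5, lower.tail = FALSE)
    mean(exp(z))                  # approximates E[exp(Z)] = exp(1/2); compare exp(0.5)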
February 1 (Thursday): NO CLASS
- Professor out sick
- Homework 2 due (at 6:00 pm)
- Homework 3: assignment, stock_history.csv data file
February 6 (Tuesday): NO CLASS
- Professor out sick
February 8 (Thursday): The Bootstrap
- Our data are random; consequently everything we calculate from those
data is random; consequently any conclusions we draw based on analysis of our
data are (at least somewhat) random. If we could repeat the experiment (or
survey, or re-run the observational study), we would get more or less different
data, make more or less different calculations, draw more or less different
inferences. The distribution of calculations or inferences we would see under
repetitions of the experiment is called the sampling
distribution, and is the source of our knowledge about uncertainty ---
about standard errors, biases, confidence intervals, etc. For a few models,
under very strong probabilistic assumptions, we can work out sampling
distributions analytically. Away from those special cases, we need other
techniques. By far the most important technique is called the
bootstrap: come up with a good estimate of the data-generating
process; simulate new data sets from the estimate; repeat the analysis on the
simulation output; use the simulation distribution as an approximation to the
true, unknown sampling distribution. This simulation can be done in two ways.
One is to just estimate a model, and then simulate from the estimated model.
The other uses the empirical distribution of the data points
as an approximation to the data-generating distribution, and simulates from
that, which amounts to "resampling" data points with replacement. Hybrids
between these two approaches are possible for regression models (e.g.,
estimating a model of the shape of the regression, but then resampling the
residuals around that curve). While more bootstrap simulations are always
better, all else being equal, they are subject to diminishing returns,
and we can think about how few we really need for particular applications.
- Bootstrapping is a fundamental technique for quantifying uncertainty in
modern statistics, and you will get lots of practice with it. There
are nonetheless some things which bootstrapping does poorly. These are,
unsurprisingly, situations where even a small discrepancy in the distribution
we simulate from leads to a large error in our conclusions, or a large change
in the sampling distribution.
- Reading: Chapter 6; R for examples; R for examples without the demos (= just definitions of bootstrapping-related functions)
- Slides (.Rmd source)
- Optional reading: Cox and Donnelly, chapter 8
- Homework 3 due at 6:00 pm
- Homework 4: assignment, gmp-2006.csv data file
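A minimal resampling-bootstrap sketch (made-up data, base R only): approximate the sampling distribution of the median by resampling the data points with replacement.

    # Nonparametric bootstrap for the standard error and a crude confidence
    # interval of the sample median.
    set.seed(6)
    x <- rexp(100, rate = 1/3)                     # made-up data set
    boot.medians <- replicate(2000, median(sample(x, replace = TRUE)))
    sd(boot.medians)                               # bootstrap standard error
    quantile(boot.medians, c(0.025, 0.975))        # simple percentile interval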
February 13 (Tuesday): Splines
- A "spline" was original a tool used by craftsmen to draw curves: pin a thin, flexible board or strip of material down in a few points and let it bend to give a smooth curve joining those points; the stiffer the material, the less it curved. Today, the smoothing spline problem is find the one-dimensional function which balances between coming close to given data points, versus having low average curvature. Specifically, we seek the function which minimizes \( \frac{1}{n}\sum_{i=1}^{n}{(y_i - m(x_i))^2} + \lambda \int_{-\infty}^{\infty}{\left( \frac{d^2m}{dx^2}(x)\right)^2 dx} \). Here \( \lambda \) is the penalty factor which tells us how much mean-squared-error we are prepared to trade for a given amount of curvature (corresponding to the stiffness of the spline tool). The
function which minimizes this criterion is called the spline. Splines are, it turns out, always piecewise cubic polynomials, with the boundaries between pieces, or knots, located at the \( x_i \); splines are always continuous, with continuous
first and second derivatives. All the coefficients can be found efficiently
by solving a system of \( n \) linear equations in \( n \) unknowns, and prediction can then be done very rapidly.
As \( \lambda \rightarrow 0 \), we get functions that veer around wildly to interpolate between the data points; as \( \lambda \rightarrow \infty \), we get back towards doing ordinary least squares to find the straight line that minimizes the mean-squared error. \( \lambda \) thus directly controls how much we smooth, by penalizing un-smoothness. Low values of \( \lambda \) lead to estimates of the regression function with low bias but high variance; high values of \( \lambda \) have high bias but low variance. If \( \lambda \rightarrow 0 \) as \( n \rightarrow \infty \), at the right rate, the smoothing spline will converge on the true regression function. Because penalized estimation generally
corresponds to constrained estimation, and vice versa ("a fine is a price"), we can think of splines found with high values of \( \lambda \) as answering the question "how close can we come to the data, if our function can only have at most so-much curvature?", with the constraint weakening (allowing for
more curvature) as \( \lambda \) gets smaller. Consistent estimation requires
that the constraint eventually go away, but not too fast.
- In higher dimensions, we either generalize the spline problem to account for curvature in multiple dimensions ("thin plate" splines), or piece together the functions we would get from solving multiple one-dimensional spline problems ("tensor product" splines). Or we turn to additive models (next chapter).
- Reading: Chapter 7; R for examples
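A short sketch of fitting a smoothing spline in base R; smooth.spline() picks the penalty \( \lambda \) by cross-validation unless one is supplied. The data here are made up.

    # Smoothing spline with the penalty chosen by leave-one-out cross-validation.
    set.seed(7)
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.4)
    fit <- smooth.spline(x, y, cv = TRUE)   # cv = FALSE would use generalized CV instead
    plot(x, y)
    lines(fit, col = "blue")
    predict(fit, x = c(2.5, 7.5))           # spline predictions at new points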
February 15 (Thursday): Multivariate Smoothing and Additive Models
- If we estimate a smooth regression function at \( x \) by averaging all the data points found within a distance \( h \) of \( x \), we get a bias that is \( O(h^2) \). The variance of this estimate, meanwhile, is \( O(n^{-1} h^{-d}) \), where \( d \) is the dimension of \( x \). The bias and variance of kernel regression scale the same way, because kernel regression essentially is a kind of local averaging. When we minimize the sum of bias squared and variance, we conclude that the best bandwidth \( h = O(n^{-1/(4+d)}) \). The total error (bias squared plus variance) of the estimate therefore goes to zero like \( O(n^{-4/(4+d)}) \). Local averaging and kernel smoothing are therefore consistent in any number of dimensions --- they converge on the true regression function --- but they slow down drastically as the number of dimensions grows, the "curse of dimensionality".
- This is not just a fact about local averaging and kernel smoothing. Splines, nearest neighbors, and every other universally consistent method will converge at the same rate. The only way to guarantee a faster rate of convergence is to use a method which correctly assumes that the true regression function \( \mu \)
has some special, nice structure which it can use. For instance, if the true regression function is linear, the total estimation error of linear regression will be \( O(d/n) \); the general rate of convergence for any parametric model
will be \( O(1/n) \). However, if that assumption is false, then a
method relying on that assumption will converge rapidly to a systematically-wrong
answer, so its error will be \( O(1) + O(1/n) \).
- Additive models are an important compromise, for dealing with many
variables in a not-fully-nonparametric manner. The model is \( \mu(x) = \alpha + \sum_{j=1}^{d}{f_j(x_j)} \), so each coordinate \( x_j \) of \( x \) makes a separate contribution to the regression function via its partial response function \( f_j \), and these just add up. The partial response functions can be arbitrary smooth functions, so this set-up includes all linear models, but is far more general. (Every linear model is additive, but not vice
versa.) The partial response functions are also basically
as easily interpretable as the slopes in a linear model. We can estimate
an additive model by doing a series of one-dimensional non-parametric regressions, which we now understand how to do quite well. If the true
regression function is additive, then the total estimation error shrinks at the
rate \( O(d n^{-4/5}) \), almost as good as \( O(d n^{-1}) \) for a linear
model. If the true regression function is not additive, this is the rate
at which we converge to the best additive approximation to \( \mu \), which is
always at least as good as the best linear approximation. We can also extend
additive models to jointly smooth some combinations of variables together,
allowing them to interact (and to interact more sensibly than the product
terms in conventional linear models).
- Reading: Chapter 8 (R for examples)
- Homework 4 due at 6:00 pm
- Homework 5: Assignment
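A sketch of fitting an additive model with the mgcv package (one common choice, not necessarily the tool used in the textbook's examples); each partial response function is estimated by a penalized spline. The data are simulated for illustration.

    # Additive model: mu(x) = alpha + f1(x1) + f2(x2), with smooth partial
    # response functions estimated from the data.
    library(mgcv)
    set.seed(8)
    n <- 500
    df <- data.frame(x1 = runif(n, -2, 2), x2 = runif(n, -2, 2))
    df$y <- sin(3 * df$x1) + df$x2^2 + rnorm(n, sd = 0.3)
    fit <- gam(y ~ s(x1) + s(x2), data = df)
    plot(fit, pages = 1)     # estimated partial response functions f1 and f2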
February 20 (Tuesday): Testing Regression Specifications; Heteroskedasticity, Weighted Least Squares, and Variance Estimation
- Specification testing: As mentioned last time, the total estimation error for a correctly-specified parametric model is (typically) \( O(d n^{-1}) \), while that for a mis-specified parametric model is \( O(1) + O(d n^{-1}) \). The error for a non-parametric model, which really cannot be mis-specified, is \( O(n^{-4/(4+d)}) \). If a parametric model is properly specified, therefore, it will (eventually) have smaller error than a non-parametric model, while if a parametric model is mis-specified, the non-parametric method will (eventually) predict better. This leads us to a set of tests of regression specifications, where we compare the fit
of parametric and non-parametric models, and assess \( p \)-values by simulating
from the estimated parametric model.
- Heteroskedasticity: A random process with constant variance is called homoskedastic (or homoscedastic), while one with changing variance is called heteroskedastic. If we know that a process is heteroskedastic, it makes sense to give less weight to observations which we know have high variance, and more weight to observations known to have low variance. This leads to weighted-least-squares problems for estimating means, linear regressions, and non-parametric regressions, many of which can
be solved in closed form, if the variances are known. Sometimes those variances can be worked out by studying our measurement process, but often they are themselves unknown. Fortunately, non-parametric regression actually gives us a way of estimating conditional variances: basically, do an initial regression of \( Y \) on \( X \), find the residuals, and then smooth the squared residuals against \( X \). (There are some bells-and-whistles.) We can then use this variance function to get a better estimate of the regression, by trying harder to fit observations with low variance and being more relaxed about fitting high-variance observations, then improve the variance function estimate, etc.
- Reading: Chapter 9 (R) and Chapter 10 (R), skipping section 10.5.
- Optional reading:
- Cox and Donnelly, chapter 7
- Faraway, section 11.3
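A rough sketch of the variance-estimation recipe above: fit an initial regression, smooth the (log) squared residuals against \( x \) to estimate the conditional variance, and re-fit by weighted least squares. All data and variable names here are invented.

    # Estimate a conditional variance function from residuals, then do WLS.
    set.seed(9)
    n <- 500
    x <- runif(n, 0, 10)
    y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)      # heteroskedastic noise
    fit1 <- lm(y ~ x)                                   # initial, unweighted fit
    varfit <- smooth.spline(x, log(residuals(fit1)^2))  # smooth log squared residuals against x
    sigma2.hat <- exp(predict(varfit, x = x)$y)         # estimated conditional variances
    fit2 <- lm(y ~ x, weights = 1 / sigma2.hat)         # weighted least squares re-fit
    summary(fit2)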
February 22 (Thursday): Logistic Regression
- Classification is very similar to regression, except that the variable \( Y \) we are trying to predict is qualitative or categorical. For some classification tasks, it is enough to make a single best guess at the category, but usually we want a probability distribution for \( Y \) given \( X \). If there are only two ("binary") categories, we arbitrarily code one of them as 0 and the other as 1. The advantage of this coding is that then \( p(x) \equiv \mathbb{P}\left( Y=1|X=x \right) = \Expect{Y|X=x} \). In principle, then, binary classification is just a special case of regression.
- The snag is that probabilities have to be between 0 and 1, but not all regression methods always give predictions between 0 and 1, even when all the observed \( y_i \) values obey those constraints. Linear regression, in particular, always gives non-sensical probabilities if \( x \) moves far enough from the center of the data. To combat this, we often apply some transformation to the output of a regression method, transforming a number in \( (-\infty, \infty) \) into a number in \( [0,1] \). That is, the idea is to use a model of the form \( p(x) = g(m(x)) \), where \( m \) is some sort of regression model estimated from data, and \( g \) is a transformation, fixed in advance, which ensures we always have a legitimate probability.
- Among all the many transformations people have tried, a special place is held by the transformation \( g(u) = \frac{e^{u}}{1+e^{u}} \). The reason has to do with the likelihood. The likelihood our model assigns to the data set \( (x_1, y_1), (x_2, y_2), \ldots (x_n, y_n) \) is \( \prod_{i=1}^{n}{p(x_i)^{y_i} (1-p(x_i))^{1-y_i}} \). The log-likelihood is then \( \sum_{i=1}^{n}{\log{(1-p(x_i))} + y_i \log{\frac{p(x_i)}{1-p(x_i)}}} \). The log-likelihood thus depends on the observed \( y_i \) only via the "log odds ratios" \( \log{\frac{p}{1-p}} \). Even if we used a different transformation to turn real numbers into probabilities, therefore, we would still end up having to deal with this
transformation when we tried to fit to data. The map \( p \mapsto \log{\frac{p}{1-p}} \) from probabilities in \( [0,1] \) to real numbers in \( (-\infty, \infty) \) has come to be called the "logistic transformation", or "logit". (The inverse transformation is the "inverse logistic" or "inverse logit" or "ilogit".)
What is called logistic regression is the modeling assumption that the log-odds-ratio is a linear function of \( x \), i.e., \( \mathrm{logit}(p(x)) = \beta_0 + \beta \cdot x \). (Turned around, \( p(x) = \frac{e^{\beta_0 + \beta \cdot x}}{1+ e^{\beta_0 + \beta \cdot x}} \).) While common, there is
usually no scientific or mathematical reason to think it is correct; it can,
however, often work well if the right features go into \( x \). A natural generalization is to make the log-odds-ratio an additive function of \( x \), as in additive models for ordinary regression.
- Once we have a model for \( p(x) \), whether logistic or otherwise, we can ask whether the probabilities
it gives us for \( Y \) match the actual frequencies. (Does it, in fact, rain on
half the days when the weather forecast predicts a 50% chance of rain?) This is
known as checking calibration, and while there are some
advanced methods for doing so, simple graphical checks are usually informative.
- Reading: Chapter 11 (R for examples)
- Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- Homework 5 due at 6:00 pm
- Data analysis exam 1: assignment, ch.csv data file
- (See also the old DAE assignment, the model report/solutions for that DAE, and the discussion of other approaches / alternatives for that DAE, all available on Canvas)
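A small sketch of logistic regression with glm(), followed by a crude tabular calibration check: bin the predicted probabilities and compare each bin's average prediction with the observed frequency of \( Y = 1 \). Everything here is simulated for illustration.

    # Logistic regression and a simple calibration check.
    set.seed(10)
    n <- 2000
    x <- rnorm(n)
    p <- plogis(-1 + 2 * x)                    # true P(Y = 1 | X = x)
    y <- rbinom(n, size = 1, prob = p)
    fit <- glm(y ~ x, family = binomial)
    phat <- predict(fit, type = "response")    # fitted probabilities, not log-odds
    bins <- cut(phat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
    cbind(predicted = tapply(phat, bins, mean),
          observed  = tapply(y, bins, mean))   # the two columns should roughly agree if calibrated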
February 27 (Tuesday): Mid-semester review
- Going over the most important topics so far, to consolidate.
February 29 (Thursday): Generalized linear models and generalized additive models
- Reading: Chapter 12 (there are no R examples)
- Optional reading: Faraway, section 3.1 and chapter 6
- Data analysis exam 1 due at 6:00 pm
- Homework 6: assignment. Not due until Thursday the week after spring break (14 March)
March 5 and 7: NO CLASS (spring break)
March 12 (Tuesday): Multivariate distributions
- Reading: Appendix on multivariate distributions
March 14 (Thursday): Density Estimation
- Reading: Chapter 14 (R for examples)
- Slides (.Rmd, including comments on the code)
- Homework 6 due at 6:00 pm (note date!)
- Homework 7: assignment, n90_pol.csv data file
March 19 (Tuesday): Factor Models
- Reading: Chapter 16
March 21 (Thursday): Graphical Models
- Reading: Chapter 18
- Homework 7 due at 6:00 pm
- Homework 8: Assignment, stockData.RData data file
March 26 (Tuesday): Graphical Causal Models
- Reading: Chapter 19
- Optional reading:
- Cox and Donnelly, chapters 5, 6 and 9
- Pearl, "Causal
Inference in Statistics", section 1, 2, and 3 through 3.2
March 28 (Thursday): Identifying Causal Effects from Observations I
- A quantity (or vector, curve, function, etc.) in a statistical model
is identified when it can be expressed as a function of the
joint distribution of observables; otherwise it
is unidentified. Causal quantities, like \(
\Prob{Y=y|do(X=x)} \), are often not identified. One common reason is that \(
X \) and \( Y \) have common causal ancestors, so the actual effect of \( X \)
on \( Y \) is confounded with the information \( X \) gives us
about the common ancestor(s), and so about \( Y \). Thus R. A. Fisher,
probably the greatest statistician who ever lived, did not deny that smoking
predicted lung cancer; he just thought that there might be a common genetic
cause of both lung cancer and of taking up smoking.
- If \( X \) has no causal ancestors, i.e., is exogenous ("born outside"), then its causal
effects are always identified, and \( \Prob{Y|do(X)} = \Prob{Y|X} \). If we experiment on \( X \), particularly if we do randomized experiments on \( X \), we ensure that it is exogenous, and can identify its effects. If we cannot experiment, we might be able to observe all the parents of \( X \). We can then show that \( \Prob{Y|do(X=x)} = \sum_{t}{\Prob{Y|X=x, \mathrm{Parents}(X)=t}\Prob{\mathrm{Parents}(X)=t}} \), so that the effects of \( X \) are identified. That is, we condition on, or control for, the parents. (This rule implies the one about the effects of exogenous variables being identified.)
- If controlling for all the parents of \( X \) is not feasible, we can try to find a set of variables \( S \) with two properties: (i) \( S \) blocks all "back-door" paths between \( X \) and \( Y \), those paths beginning with an arrow into \( X \); and (ii) \( S \) contains no descendants of \( X \). Any set of variables \( S \) which meets both (i) and (ii) satisfies the back-door criterion. We can then identify the effects of \( X \) in basically the same way: \( \Prob{Y|do(X=x)} = \sum_{s}{\Prob{Y|X=x, S=s}\Prob{S=s}} \).
There may be multiple sets of variables which all satisfy the back-door criterion. These will all converge on the same estimate of \( \Prob{Y|do(X)} \) in the limit
of infinite data, but some of them may be more practical, reliable or
simply cheap in a pre-asymptotic world.
- Reading: Chapter 20
- Optional reading:
Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- Homework 8 due at 6:00 pm
- Homework 9: assignment, sim-smoke.csv
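A small numerical illustration of the back-door adjustment formula with everything discrete; the data-generating process is invented, with \( S \) a common cause of \( X \) and \( Y \).

    # Back-door adjustment: P(Y=1|do(X=x)) = sum_s P(Y=1|X=x, S=s) P(S=s).
    set.seed(11)
    n <- 1e5
    S <- rbinom(n, 1, 0.4)                        # common cause
    X <- rbinom(n, 1, ifelse(S == 1, 0.8, 0.2))   # "treatment", influenced by S
    Y <- rbinom(n, 1, 0.1 + 0.3 * X + 0.4 * S)    # outcome; the true effect of X is 0.3
    p.do <- function(xval) {
      sum(sapply(0:1, function(s) mean(Y[X == xval & S == s]) * mean(S == s)))
    }
    p.do(1) - p.do(0)                   # adjusted contrast, close to 0.3
    mean(Y[X == 1]) - mean(Y[X == 0])   # naive contrast, inflated by confounding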
April 2 (Tuesday): Identifying Causal Effects from Observations II
- Suppose we are unable, or unwilling, to use the back-door criterion. We can still identify the effect of \( X \) on \( Y \) if we can find a set \( M \) of mediators or mediating variables where
(i) \( M \)
blocks all directed paths from \( X \) to \( Y \), (ii) there are no unblocked
back-door paths from \( X \) to \( M \), and (iii) \( X \) blocks all back-door paths from
\( M \) to \( Y \). Then
\( \Prob{Y|do(X=x)} = \sum_{m}{\Prob{Y|do(M=m)}\Prob{M=m|X=x}} \), and
we can identify \( \Prob{Y|do(M=m)} \) using \( X \) to block back-door paths,
\( \Prob{Y|do(M=m)} = \sum_{x^{\prime}}{\Prob{Y|X=x^{\prime},M=m}\Prob{X=x^{\prime}}} \). These mediators meet the front-door criterion.
- The final major way of getting identification is to use an instrumental variable, an \( I \) which has a directed path to \( X \) and to \( Y \), but where the only unblocked, directed paths from \( I \) to \( Y \) go through \( X \). We can identify \( \Prob{Y|do(I)} \), and \( \Prob{X|do(I)} \),
and we can often (though not quite always) "back out" \( \Prob{Y|do(X)} \) from these, by solving an inverse problem.
- Reading: Chapter 20
April 4 (Thursday): Estimating Causal Effects from Observations
- Once we know that the causal effect we are interested in is identified, we still have to estimate it. If we are using the back-door criterion, the obvious approach to estimating \( \Prob{Y|do(X)} \) is to estimate \( \Prob{Y|X,S} \) and \( \Prob{S} \) and calculate. Similarly, to estimate \( \Expect{Y|do(X)} \), we can estimate \( \Expect{Y|X,S} \) and \( \Prob{S} \). If we have IID data, however, we can avoid the step of estimating \( \Prob{S} \). This is because, under the back-door criterion, \( \Prob{Y=y|do(X=x)} = \Expect{\Prob{Y=y|X=x, S}} \), and, by the law of large numbers, \( \frac{1}{n}\sum_{i=1}^{n}{\Prob{Y=y|X=x, S=s_i}} \rightarrow \Expect{\Prob{Y=y|X=x, S}} \). (Sample averages converge on expectations.) Similarly, \( \Expect{Y|do(X=x)} = \Expect{\Expect{Y|X=x, S}} \), and
\( \frac{1}{n}\sum_{i=1}^{n}{\Expect{Y|X=x, S=s_i}} \rightarrow \Expect{\Expect{Y|X=x, S}} \). Substituting in estimated functions for \( \Prob{Y=y|X=x, S=s} \) or \( \Expect{Y|X=x, S=s} \) adds some more approximation error and uncertainty, but it is a feasible procedure.
- If \( X \) is binary, 0 or 1, the propensity score is the probability that \( X=1 \) given the control variables in \( S \), \( R \equiv \Prob{X=1|S} \). When conditioning on \( S \) satisfies the back-door criterion, then conditioning on \( R \) alone also satisfies the back-door criterion, and \( \Expect{Y|do(X)} = \Expect{\Expect{Y|X, R}} \). So if we can get good estimates of the propensity scores, we can do a much simpler regression model for \( Y \) (or a much simpler conditional probability model).
- If \( X \) is binary and all we care about is the average treatment effect of switching from \( X=0 \) to \( X = 1 \), that is
\( ATE = \sum_{s}{\left(\Expect{Y|X=1, S=s} - \Expect{Y|X=0, S=s}\right)\Prob{S=s}} \). By the same sort of law-of-large-numbers argument, the ATE is \( \approx \frac{1}{n}\sum_{i=1}^{n}{\Expect{Y|X=1, S=s_i} - \Expect{Y|X=0, S=s_i}} \). The actual, observed \( Y_i = \Expect{Y|X=x_i, S=s_i} + \epsilon_i \), where \( \Expect{\epsilon_i|X,S}=0 \). So \( Y_i \) is an unbiased estimate of either \( \Expect{Y|X=1, S=s_i} \), or of \( \Expect{Y|X=0, S=s_i} \), depending on what \( x_i \) is. For each data point \( i \), match it with another data point, \( m(i) \), where the covariates are the same, \( S=s_i = s_{m(i)} \), but the cause is flipped, \( X_{m(i)} = 1-X_i \). Then \( (Y_i - Y_{m(i)})(2 X_i - 1) \) is an unbiased estimate of \( \Expect{Y|X=1, S=s_i} - \Expect{Y|X=0, S=s_i} \), and we can average this over all data points to get an estimate of the ATE. This sort of matching estimate is basically one-nearest-neighbors; it can be adapted to be more like kNN, including allowing for imperfect matching.
- Matching on the propensity score, a.k.a. propensity score matching, is just what it sounds like.
- Estimation using the front door criterion is built on top of estimation using the back-door criterion.
- Reading: Chapter 21
- Homework 9 due at 6:00 pm
- Homework 10: assignment, sesame.csv data file
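A sketch of the estimation strategy described above, under the back-door criterion with a single continuous control \( S \): regress \( Y \) on \( X \) and \( S \), then average the fitted \( \Expect{Y|X=x, S=s_i} \) over the observed \( s_i \). Both the simulation and the choice of a linear regression as the estimator are made up for illustration.

    # Estimate E[Y|do(X=x)] by averaging regression predictions over the sample S values.
    set.seed(12)
    n <- 5000
    S <- rnorm(n)
    X <- rbinom(n, 1, plogis(S))        # treatment probability depends on the confounder S
    Y <- 1 + 2 * X + S + rnorm(n)       # true average treatment effect is 2
    df <- data.frame(Y, X, S)
    fit <- lm(Y ~ X * S, data = df)     # any consistent regression estimator would do
    mu1 <- mean(predict(fit, newdata = transform(df, X = 1)))
    mu0 <- mean(predict(fit, newdata = transform(df, X = 0)))
    mu1 - mu0                           # estimated ATE, close to 2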
April 9 (Tuesday): NO CLASS
April 11 (Thursday): Carnival, no class
- Homework 10 due at 6:00 pm
on Wednesday, April 10
- Homework 11: Assignment
April 16 (Tuesday): Discovering Causal Structure from Observations
- Reading: Chapter 22
April 18 (Thursday): Dependent Data I: Dependence over Time
- Reading: Chapter 23
- Homework 11 due at 6:00 pm
- Data analysis exam 2: assignment
April 23 (Tuesday): Dependent Data II: Dependence over Space and over Space-and-Time
- Reading: Chapter 24 (to come)
April 25 (Thursday): Summary of the course
- What have we learned?
- Data analysis exam 2 due at 6:00 pm