Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis, Section A
Spring 2019
Section A
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
The goal of this class is to train you in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401, extending it
to more general functional forms, and more general kinds of data, emphasizing
the computation-intensive methods introduced since the 1980s. After taking the
class, when you're faced with a new data-analysis problem, you should be able
to (1) select appropriate methods, (2) use statistical software to implement
them, (3) critically evaluate the resulting statistical models, and (4)
communicate the results of your analyses to collaborators and to
non-statisticians.
During the class, you will do data analyses with existing software, and
write your own simple programs to implement and extend key techniques. You
will also have to write reports about your analyses.
36-608
A small number of well-prepared graduate students from other departments can
take this course by registering for it as 36-608. (Graduate students enrolling
in 36-402, or undergraduates enrolling in 36-608, will be automatically dropped
from the roster.) If you want to do so, please contact me to discuss whether
you have the necessary preparation.
Section B
This year, there are two sections of 36-402. This syllabus is for Section A,
taught by Prof. Shalizi; section B is taught by Prof. Lee. The two
sections are completely independent.
Prerequisites
36-401, with a grade
of C or better. Exceptions are only granted for graduate students in other
departments taking 36-608.
Instructors
Professor | Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
Teaching assistants | TBD |
Topics
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Linear Regression: what is regression, really?;
what ordinary linear regression actually does; what it cannot do; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; kernel density estimation
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models.
- Latent variables and structured data: principal
components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- Causality: graphical causal models; causal
inference from randomized experiments; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
See the end of this syllabus for the current lecture schedule, subject to
revision. Lecture handouts, slides, etc., will be linked there, as available.
Course Mechanics and Grading
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills
you are learning in this class. It also helps you learn, because some things
which seem murky become clear when you actually do them, and sometimes trying
to do something shows that you only thought you understood it.
- Feedback. By seeing what you can and cannot do, and what
comes easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to
stake its reputation (and that of its faculty) on assuring the world that you
have mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be three kinds of assignment in this
course.
- In-class exercises
- Most lectures will have in-class exercises. These will be short (10--20
minutes) assignments, emphasizing problem solving, done in class in small
groups of two to five people. The assignments will be given out in class, and
must be handed in on paper by the end of class. On most days, a
randomly-selected group will be asked to present their solution to the class.
- Homework
- Most weeks will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve statistical theory, analyzing
real data sets on the computer, and communicating the results. The in-class
exercises will either be problems from that week's homework, or close enough
that seeing how to do the exercise should tell you how to do some of the
problems.
- All homework will be submitted electronically through Canvas. Most weeks,
homework will be due at 6:00 pm on Wednesday. There will be a
few weeks, clearly noted on the syllabus and on the assignments, when Thursday
lecture will be canceled and homework will be due at noon on Thursday, i.e.,
the end of the lecture period. (When this means that there are only six days
for the next homework, it will be shortened accordingly.)
- There are specific formatting requirements for
homework --- see below.
- Exams
- There will be both a midterm and a final
exam. Each of these will require you to analyze a real-world data
set, answering questions posed about it in the exam, and to write up your
analysis in the form of a scientific report. The exam assignments will provide
the data set, the specific questions, and a rubric for your report.
- Both exams will be take-home, and you will have at least one week to
work on each, without homework (from this class anyway). Both exams will
be cumulative.
- Exams are to be submitted through Canvas, and follow the
same formatting requirements as the homework --- see
below.
The mid-term and final exam due dates will be set by the first day of classes,
and will not change thereafter. If they present difficulties for you, please
contact me as soon as possible.
Time expectations: You should expect to spend 5--7 hours on
assignments every week, averaging over the semester. (This follows from the
university's rules about how course credits translate into hours of student
time.) If you find yourself spending significantly more time than that on the
class, please come to talk to me.
Grading
Grades will be broken down as follows:
- Exercises: 10%. All exercises will have equal weight.
- Homework: 50%. There will be 12 homeworks, all of equal weight. Your
lowest two homework grades will be dropped, no questions asked. If you turn in
all homework assignments (on time), your lowest three homework grades
will be dropped. Late homework will not be accepted for any
reason.
(The point of dropping the low grades is to let you schedule
interviews, social engagements, etc., with flexibility, and without my having
to decide what is or is not important enough to change the rules.)
- Midterm: 15%
- Final: 25%
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".
Lectures
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. They are also when you
will do the graded in-class assignments which will help consolidate your
understanding of the material, and help with your homework.
You will generally find it helpful to do the readings before coming
to class.
Please don't use any electronic devices during lecture: no laptops, no
tablets, no phones, no recording devices, no watches that do more than tell
time. The only exception to this rule is for electronic assistive devices,
properly documented with CMU's Office of Equal Opportunity Services.
(The no-electronics rule is not arbitrary meanness on my part. Experiments
show,
pretty
clearly, that students learn more in electronics-free classrooms, not least
because your device isn't distracting your neighbors, who aren't as good at
multitasking as you are.)
Exams
There will be a take-home mid-term exam (15% of your final grade), which will be due before spring break. You will have one week to work on the midterm, and there will be no homework that week. There will also be a take-home final exam (25%), due during exam week. The exams may require you to use any material already covered in the readings, lectures or assignments; the final will be cumulative.
Exams must also be submitted through Canvas, under the same rules about file
formats as homework.
The due dates for exams will be fixed by the first day of classes, and will
not change thereafter. Please try to schedule your obligations around them.
If you know that the dates will be a problem for you, please contact me as soon
as possible.
R, R Markdown, and Reproducibility
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before). In this class, you'll be using it for
every homework and exam assignment. If you are not able to use R, or
do not have ready, reliable access to a computer on which you can do so, let me
know at once.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. Raw computer output and R code are not
acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up
in R Markdown. (If you know
what knitr is and would rather use it,
ask first.) R Markdown is a system that lets you embed R code, and its output,
into a single document. This helps ensure that your work
is reproducible, meaning that other people can re-do your
analysis and get the same results. It also helps ensure that what you report
in your text and figures really is the result of your code (and not some
brilliant trick that you then forget, or obscure bug that you didn't notice).
For help on using R Markdown,
see "Using R Markdown
for Class Reports".
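To make this concrete, here is a minimal sketch of what an R Markdown source
file looks like: prose written in Markdown, with R code in "chunks" whose
output (including figures) appears in the knitted document. The file name,
data set, and chunk contents are made up for illustration, not taken from any
assignment.

    ---
    title: "Homework 1"
    author: "Your Name (your Andrew ID)"
    output: pdf_document
    ---

    Your write-up goes here, as ordinary prose.

    ```{r load-data}
    # A hypothetical data set, just to show the chunk syntax
    hw <- read.csv("hw1-data.csv")
    summary(hw$y)
    ```

    ```{r scatterplot, echo=FALSE}
    # echo=FALSE puts the figure in the knitted document without showing the code
    plot(y ~ x, data = hw)
    ```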
For each assignment, you should generally submit two, and only two, files: an R
Markdown source file, integrating text, generated figures and R code, and the
"knitted", humanly-readable document, in either PDF (preferred) or HTML format.
(I cannot read Word files, and you will lose points if you submit them.) I
will be re-running the R Markdown files of randomly selected students; you
should expect to be picked for this about once in the semester. You will lose
points if your R Markdown file does not, in fact, generate your knitted file
(making obvious allowances for random numbers, etc.).
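If you want to check ahead of time that your source file really does
regenerate your knitted document, re-run it yourself from a clean R session.
The sketch below shows one common way to do so, using the rmarkdown package
(which RStudio's "Knit" button also uses); the file name is hypothetical.
Fixing the random-number seed near the top of your file also removes most of
the need for "allowances for random numbers".

    # Only needed once, if the rmarkdown package isn't already installed
    install.packages("rmarkdown")

    # Re-generate the knitted document from the source file; re-knitting like
    # this is, in effect, the reproducibility check described above
    rmarkdown::render("homework-1.Rmd")

    # Inside your .Rmd, fixing the seed makes simulation results reproducible
    set.seed(36402)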
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for type-setting math. (It's based on
the LaTeX document-preparation system widely used in the sciences.) If you
can't get it to work, you can hand-write the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. (You
should upload these additional files to Canvas.) You will, however, lose
points for doing so, starting with no penalty for homework 1, and growing to a
90% penalty (for those problems) by homework 12.
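For reference, the math syntax in R Markdown is the usual LaTeX one: single
dollar signs for inline math, double dollar signs for a displayed equation.
A small illustration, using a formula you already know from 36-401 rather
than anything from an assignment:

    The least-squares estimate of the coefficient vector is
    $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$,
    or, displayed on its own line,
    $$
    \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} .
    $$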
Canvas and Piazza
Homework and exams will be submitted electronically through Canvas, which will
also be used as the gradebook. Some readings and course materials will
also be distributed through Canvas.
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class.
Anonymous-to-other-students posting of questions and replies will be allowed,
at least initially. (Postings will not be anonymous to instructors.)
Anonymity will go away for everyone if it is abused.
Solutions
Solutions for all homework will be available, after their due date, through
Canvas. Please don't share them with anyone, even after the course has ended.
Office Hours
If you want help with computing, please bring a laptop.
Mondays | 1:00--2:00 | Ms. Dunn | Porter Hall 223B |
Tuesdays | 3:00--4:00 | Ms. Dunn | Porter Hall 223B |
Wednesdays | 2:00--3:30 | Prof. Shalizi | Baker Hall 229C |
Fridays | 3:00--4:00 | Ms. Dunn | Porter Hall 223B |
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Textbook
The primary textbook for the course will be the
draft Advanced
Data Analysis from an Elementary Point of View. Chapters will be
linked to here as they become needed. Reading these chapters will greatly
help you do the assignments.
In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
is strongly suggested as a reference.
Cox and Donnelly, Principles of Applied Statistics (Cambridge
University Press, 2011,
ISBN 978-1-107-64445-8); Faraway, Extending
the Linear Model with R (Chapman Hall/CRC Press, 2006,
ISBN 978-1-58488-424-8; errata);
and Venables and Ripley, Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will be optional. The campus bookstore should have copies of
all of these.
Collaboration and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss homework with other students in the
class, though not to share work; such conversations must be acknowledged in
your assignments. You may not discuss the content of assignments with
anyone other than current students or the instructors until after the
assignments are due. (Exceptions can be made, with prior permission, for
approved tutors.) You are, naturally, free to complain, in general terms,
about any aspect of the course, to whomever you like.
During the take-home exams, you are not allowed to discuss the content of
the exams with anyone other than the instructors; in particular, you may
not discuss the content of the exam with other students in the course.
Any use of solutions provided for any assignment in this course in previous
years is strictly prohibited, both for homework and for exams. This
prohibition applies even to students who are re-taking the course. Do not copy
the old solutions (in whole or in part), do not "consult" them, do not read
them, do not ask your friend who took the course last year if they "happen to
remember" or "can give you a hint". Doing any of these things, or anything
like these things, is cheating, it is easily detected cheating, and those who
thought they could get away with it in the past have failed the course. Even
more importantly: doing any of those things means that the
assignment doesn't give you a chance to practice; it makes any
feedback you get meaningless; and of course it makes any evaluation based on
that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your mis-deeds might be rectified.
Otherwise, violations of any sort will lead to severe, formal disciplinary
action, under the terms of the university's
policy
on academic integrity.
On the first day of class, every student will receive a written copy of the
university's policy on academic integrity, a written copy of these course
policies, and a "homework 0" on the content of these policies. This assignment
will not factor into your grade, but you must complete it before you
can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate
with me.
Other Iterations of the Class
Some material is available from versions of this class taught in
other years. As stated above, any use of solutions provided in earlier
years is not only cheating, it is very easily detected cheating.
Schedule
Subject to revision. Lecture notes, assignments and solutions
will all be linked here, as they are available.
Current revision of the complete textbook
- January 15 (Tuesday): Lecture 1, Introduction to the class; regression
- Reading: Chapter 1, plus materials for homework 0
- Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
- Homework 0 (on collaboration and plagiarism): assignment. Readings for homework 0:
- January 17 (Thursday): Lecture 2, The truth about linear regression
- Reading: Chapter 2
- R for what was to have been the in-class examples, had the projector worked
- Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
- Optional reading: Faraway, rest of chapter 1
- Homework 1: Assignment
- January 22 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
- Reading: Chapter 3 (R code for that chapter)
- R for in-class demo
- Optional reading: Cox and Donnelly, ch. 6
- January 24 (Thursday): Lecture 4, Smoothing methods in regression
- Reading: Chapter 4 (R code for that chapter)
- R for in-class demos
- Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- Homework 0 due (at 6 pm the night before)
- Homework 1 due (at 6 pm the night before)
- Homework 2: assignment, uval.csv data set
- January 29 (Tuesday): Lecture 5, Writing R Code
- Slides from class, including solution to the in-class exercise (.Rmd)
- Reading: Appendix on writing R code
- January 31 (Thursday): Lecture 6, Simulation
- Reading: Chapter 5 (R code for examples)
- In-class examples (if class hadn't been canceled by the weather)
- Homework 2 due (at 6:00 pm the night before)
- Homework 3: assignment,
stock_history.csv data set
- February 5 (Tuesday): Lecture 7, The Bootstrap I
- Slides (.Rmd)
- Reading: Chapter 6 (R for selected examples in that chapter)
- Optional reading: Cox and Donnelly, chapter 8
- February 7 (Thursday): Lecture 8, The Bootstrap II: Hypothesis Testing
- Slides (.Rmd)
- Homework 3 due (at 6:00 pm the night before)
- Homework 4: assignment,
gmp-2006.csv data set
- February 12 (Tuesday): Lecture 9, More Bootstrap Examples
- Slides (.Rmd)
- Reading: Chapter 6
- February 14 (Thursday): Lecture 10, Multidimensional smoothing and the curse of dimensionality
- Reading: Chapter 7 (R for selected examples in that chapter) and Chapter 8 (R for selected examples)
- Optional reading: Faraway, section 11.2; Faraway, chapter 12
- Homework 4 due (at 6:00 pm the night before)
- Homework 5: assignment
- February 19 (Tuesday): Lecture 11, Splines and Additive models
- Slides (.Rmd for the slides)
- Reading: As for Lecture 10
- February 21 (Thursday): Lecture 12, Testing Regression Specifications
- Slides (.Rmd)
- Reading: Chapter 9
- Optional reading: Cox and Donnelly, chapter 7
- Homework 5 due (at 6:00 pm the night before)
- Homework 6: Assignment
- February 26 (Tuesday): Lecture 13, Heteroskedasticity, weighted least
squares, and variance estimation
- Reading: Chapter 10
- Optional reading: Faraway, section 11.3
- February 28 (Thursday): Lecture 14, Logistic Regression
- Reading: Chapter 11
- Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- Homework 6 due (at 6:00 pm the night before)
- Midterm exam: assignment,
ch.csv data set
- March 5 (Tuesday): Lecture 15, Midterm review
- March 7 (Thursday): Lecture 16, Generalized linear models and
generalized additive models
- Slides (.Rmd)
- Reading: Chapter 12
- Optional reading: Faraway, section 3.1 and chapter 6
- Midterm exam due (at 6:00 pm the night before)
- No homework this week --- enjoy spring break
- March 12 and 14: Spring break, no class
- March 19 (Tuesday): Lecture 17, Multivariate distributions
- Reading: Appendix on multivariate distributions
- March 21 (Thursday): Lecture 18, Density Estimation
- Slides (.Rmd)
- Reading: Chapter 14
- Homework 7: assignment,
n90_pol.csv data set
- March 26 (Tuesday): Lecture 19, Factor Models
- Reading: Chapter 17
- March 28 (Thursday): Lecture 20, Factor Models II
- Reading: Chapter 17
- Homework 7 due (at 6:00 pm the night before)
- Homework 8: Assignment;
stockData.RData data file
- April 2 (Tuesday): Lecture 21, Graphical Models
- Reading: Chapter 20
- April 4 (Thursday): Lecture 22, Graphical Causal Models I
- Reading: Chapter 21
- Optional reading: Cox and Donnelly, chapter 5
- Homework 8 due (at 6:00 pm the night before)
- Homework 9: Assignment; smoke.csv data file
- April 9 (Tuesday): Lecture 23, Graphical Causal Models II
- Reading: Chapter 21
- Optional reading: Cox and Donnelly, chapters 6 and 9;
Pearl, "Causal
Inference in Statistics", sections 1, 2, and 3 through 3.2
- April 11 (Thursday): Carnival, no class
- Homework 9 due (at 6:00 pm the night before)
- Homework 10: assignment,
sesame.csv data set
- April 16 (Tuesday): Lecture 24, Identifying Causal Effects from Observations I
- Reading: Chapter 22
- Optional reading:
Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- April 18 (Thursday): Lecture 25, Estimating Causal Effects from Observations
- Reading: Chapter 23
- Homework 10 due (at 6 pm the night before)
- Homework 11: Assignment
- April 23 (Tuesday): Lecture 26, Discovering Causal Structure from Observations
- Reading: Chapter 24
- April 25 (Thursday): Lecture 27, Limitations of Causal Inference
- Reading: Chapter 24
- Homework 11 due (at 6:00 pm the night before)
- Homework 12: assignment
- April 30 (Tuesday): Lecture 28, Summary of the course
- May 2 (Thursday): No lecture (extra time to work on final)
- Homework 12 due (at 6:00 pm the night before)
- Final exam assigned
- May 9 (Thursday)
- Final exam due at 6 pm