Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis, Section A
Spring 2017
Tuesdays and Thursdays, 10:30--11:50, Wean Hall 7500
The goal of this class is to train you in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced in 36-401, extending it
to more general functional forms, and more general kinds of data, emphasizing
the computation-intensive methods introduced since the 1980s. After taking the
class, when you're faced with a new data-analysis problem, you should be able
to (1) select appropriate methods, (2) use statistical software to implement
them, (3) critically evaluate the resulting statistical models, and (4)
communicate the results of your analyses to collaborators and to
non-statisticians.
During the class, you will do data analyses with existing software, and
write your own simple programs to implement and extend key techniques. You
will also have to write reports about your analyses.
36-608
In previous years, a small number of well-prepared
graduate students from other departments have been allowed to take this course,
by registering for it as 36-608. (Graduate students enrolling in 36-402 will
be dropped automatically from the roster.) This year, because of the number of
undergraduate students needing to take 402, we have no resources to accommodate
students wishing to take 608 for a grade. If space is available in the
classroom, a few may be allowed to audit the course.
Section B
This year, there are two sections of 36-402. This syllabus is for Section A,
taught by Prof. Shalizi; Section B is taught by Prof. Lee. The two
sections are completely independent.
Prerequisites
36-401, with a grade of C or better. Exceptions are only granted for graduate
students in other departments taking 36-608.
Instructors
Professor | Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
Teaching assistants | Mr. Niccolo Dalmasso |
 | Mr. Alan Mishler |
 | Mr. Michael Spece Ibanez |
 | Mr. Lee Richardson |
Topics
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Linear Regression: what is regression, really?;
what ordinary linear regression actually does; what it cannot do; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; kernel density estimation
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models
- Latent variables and structured data: principal
components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- Causality: graphical causal models; causal
inference from randomized experiments; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
- Dependent data: Markov models for time
series without latent variables; hidden Markov models for time series with
latent variables; smoothing and modeling for spatial and network data
See the end of this syllabus for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Homework will be 50% of the grade, the midterm exam 20%, and the final
exam 30%.
Lectures
You are responsible for all material covered in lecture, whether or not it is
in the textbook. If you are unable to attend a particular lecture, arrange to
get notes from a classmate. If you have problems coming to lecture, see me.
Homework
The homework will give you practice in using the techniques you are learning
to analyze data, and to interpret the analyses. There will be 12 homework
assignments, nearly one every week; they will all be due on Wednesdays
at 11:59 pm (i.e., the night before Thursday classes), through Blackboard. All
homeworks count equally, totaling 50% of your grade. The lowest three homework
grades will be dropped; consequently, no late homework will be accepted for any
reason whatsoever.
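As a rough illustration of the arithmetic implied by these weights, the sketch below drops the three lowest homework scores, averages the rest equally, and combines that average with the two exams at 50/20/30. It is written in Python purely to show the calculation (coursework itself must still be done in R). The function name `course_score`, the 0-100 scale, and the sample scores are all hypothetical, and the sketch ignores the curving of exams described under Grading, so it should not be read as the official grade computation.

```python
def course_score(homework, midterm, final, n_dropped=3):
    """Weighted course score on a 0-100 scale (uncurved; illustrative only)."""
    if len(homework) <= n_dropped:
        raise ValueError("need more homework scores than the number dropped")
    kept = sorted(homework)[n_dropped:]   # drop the n_dropped lowest scores
    hw_avg = sum(kept) / len(kept)        # remaining homeworks count equally
    # Weights from the syllabus: homework 50%, midterm 20%, final 30%
    return 0.50 * hw_avg + 0.20 * midterm + 0.30 * final


# Hypothetical scores: 12 homeworks, three of them missed (scored as 0)
hw = [90, 85, 0, 100, 95, 0, 80, 75, 0, 100, 90, 85]
score = course_score(hw, midterm=80, final=90)
print(round(score, 2))
```

In this made-up example the three zeros are exactly the scores that get dropped, which is why missing up to three assignments does not by itself lower the homework average.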
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it; this writing is part of
the assignment and will be graded. As always, raw computer output and R code
are not acceptable; your document must be humanly readable. You should submit
an R Markdown or knitr file, integrating text, figures and R code; submit both
your knitted file and the source. If that is not feasible, contact me as soon
as possible. Microsoft Word files get an automatic grade of 0, with no
feedback*.
For help on using R Markdown, see "Using R Markdown for Class Reports".
Exams
There will be a take-home midterm exam (20% of your final grade), due at
11:59 pm on Wednesday, 8 March. You will have one week to work on the midterm,
and there will be no homework that week. There will also be a take-home final
exam (30%), due at 10:30 am on Monday, 8 May. These due dates will not be moved
once the semester begins; please schedule job interviews and other
extra-curricular activities around them.
The exams may require you to use any material already covered in the
readings, lectures or assignments. All exams will be cumulative.
Exams must also be submitted through Blackboard, under the same rules about
file formats as homework.
Grading
The purpose of this course is to help you learn data analysis. The purpose
of the assignments is to help you learn by giving you structured opportunities
for practice. The purpose of grading is primarily to give you feedback,
distinguishing what you did well on from what you should work on improving.
The exams will each be curved separately to ensure that they are comparable
in scale to the homework before calculating your final grade. You
should not presume that an un-curved average of 90 guarantees you an
A.
If you believe that a particular assignment has been incorrectly graded, tell
me as soon as possible. Direct any questions or complaints about your grades
to me; the teaching assistants have no authority to make changes. (This also
goes for your final letter grade.) Complaints that the thresholds for letter
grades are unfair, that you deserve a higher grade, etc., will accomplish much
less than pointing to concrete problems in the grading of specific assignments.
As a final word of advice, "what is the least amount of work I need
to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from
my teachers?".
Solutions
Solutions for all homework and exams will be available, after their due date,
through Blackboard. Do not share them with anyone, even after the course
has ended.
Interviews
To help the instructors get a better sense of how the class is going, every
week (after the first week of classes), six students will be selected at
random, and will meet with me for 10--15 minutes each, to explain their work
and to answer questions about it. You may be selected on multiple weeks, if
that's how the random numbers come up. This is not a punishment, but
a way to see whether the problem sets are really serving their goal of helping
you learn the course material; being selected will not hurt your grade in any
way (and might even help).
Office Hours
If you want help with computing, please bring a laptop.
Monday | 6:00--7:00 | Mr. Richardson | Porter Hall 117 |
Tuesday | 4:00--5:00 | Prof. Shalizi | Wean Hall 4625 |
Tuesday | 6:00--7:00 | Mr. Richardson | Porter Hall 117 |
Wednesday | 11:00--12:00 | Prof. Shalizi | Doherty Hall 1211 |
Wednesday | 6:00--7:00 | Mr. Richardson | Porter Hall 117 |
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately, please e-mail me about making an appointment.
Piazza
We will be using the Piazza website for question-answering. You will receive
an invitation within the first week of class. Anonymous posting of questions
and replies will be allowed, at least initially; if this is abused,
anonymity will go away.
Blackboard
Blackboard will be used for submitting assignments electronically, and as a
gradebook. All properly enrolled students should have access to the Blackboard
site before the first assignment is due.
Textbook
The primary textbook for the course will be the draft Advanced Data Analysis
from an Elementary Point of View. Chapters will be linked to here as they
become needed. You are expected to read these chapters, and are unlikely to be
able to do the assignments without doing so. (There will be a prize for the
student who identifies the most errors by the next-to-last class, presented at
the last class meeting.) In addition, Paul Teetor, The R Cookbook (O'Reilly
Media, 2011, ISBN 978-0-596-80915-7) is required as a reference.
Cox and Donnelly, Principles of Applied Statistics (Cambridge University
Press, 2011, ISBN 978-1-107-64445-8); Faraway, Extending the Linear Model with
R (Chapman & Hall/CRC Press, 2006, ISBN 978-1-58488-424-8; errata); and
Venables and Ripley, Modern Applied Statistics with S (Springer, 2003,
ISBN 9780387954578) will be optional. The campus bookstore should have copies
of all of these.
Collaboration, Cheating and Plagiarism
Everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbook and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or online
resources, though again, all use of these sources must be acknowledged in your
work.
In general, you are free to discuss homework with other students in the
class, though not to share work; such conversations must be acknowledged in
your assignments. You may not discuss the content of assignments with
anyone other than current students or the instructors until after the
assignments are due. (Exceptions may be made, with prior permission, for
approved tutors.) You are, naturally, free to complain, in general terms,
about any aspect of the course, to whomever you like.
During the take-home exams, you are not allowed to discuss the content of
the exams with anyone other than the instructors; in particular, you may
not discuss the content of the exam with other students in the course.
Any use of solutions provided for any assignment in this course in previous
years is strictly prohibited, both for homework and for exams. This
prohibition applies even to students who are re-taking the course. Do not copy
the old solutions (in whole or in part), do not "consult" them, do not read
them, do not ask your friend who took the course last year if they "happen to
remember" or "can give you a hint". Doing any of these things, or anything
like these things, is cheating, it is easily detected cheating, and those who
thought they could get away with it in the past have failed the course.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will be no penalty for asking. If you do violate
these policies but then think better of it, it is your responsibility to tell
me as soon as possible to discuss how your misdeeds might be rectified.
Otherwise, violations of any sort will lead to severe, formal disciplinary
action, under the terms of the university's policy on academic integrity.
On the first day of class, every student will receive a written copy of the
university's policy on academic integrity, a written copy of these course
policies, and a "homework 0" on the content of these policies. This assignment
will not factor into your grade, but you must complete it before you
can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate
with me.
R
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Every assignment will require you to use it. No other form of computational
work will be accepted. If you are not able to use R, or do not have
ready, reliable access to a computer on which you can do so, let me know at
once.
There is a separate page of resources for learning R.
Other Iterations of the Class
Some material is available from versions of this class taught in
other years. As stated above, any use of solutions provided in earlier
years is not only cheating, it is very easily detected cheating.
Schedule
Subject to revision. Lecture notes, assignments and solutions
will all be linked here, as they are available.
Current revision of the complete textbook
- January 17 (Tuesday): Lecture 1, Introduction to the class; regression
- Reading: Chapter 1
- Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
- Homework 0 (on collaboration and plagiarism) assigned (relevant policies were handed out in class, with links in the text of the assignment; relevant excerpt from Turabian's A Manual for Writers is on Blackboard)
- Homework 1: assignment, RAJ.csv data set
- January 19 (Thursday): Lecture 2, The truth about linear regression
- Reading: Chapter 2
- Optional reading: Faraway, rest of chapter 1
- Homework 0 due (at start of class)
- January 24 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
- Reading: Notes, chapter 3
- Optional reading: Cox and Donnelly, ch. 6
- Demo comparing in-sample error, out-of-sample error and cross-validation
- Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
- January 26 (Thursday): Lecture 4, Smoothing methods in regression
- Reading: Chapter 4
- Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- Homework 1 due (at 11:59 pm the night before)
- Homework 2: assignment, data set, starter code for the last problem
- January 31 (Tuesday): Lecture 5, More on smoothing (replacing the originally scheduled Writing R Code)
- Reading: Appendix on writing R code
- February 2 (Thursday): Lecture 6, Simulation (canceled due to instructor illness)
- Reading: Chapter 5
- Homework 2 due (at 11:59 pm the night before)
- Homework 3: assignment, data set
- February 7 (Tuesday): Lecture 7, The Bootstrap
- Reading: Chapter 6 (commented R code from chapter 6)
- What were to have been the in-class demos
- Optional reading: Cox and Donnelly, chapter 8
- February 9 (Thursday): Lecture 8, More Bootstrap (replacing the originally scheduled Splines)
- Lecture demos: HTML, Rmd
- Homework 3 due (at 11:59 pm the night before)
- Homework 4: assignment, nampd.csv data file, MoM.txt data file
- February 14 (Tuesday): Lecture 9, Splines
- In-class examples: HTML, Rmd source file
- Reading: Chapter 7
- Optional reading: Faraway, section 11.2
- February 16 (Thursday): Lecture 10, Additive models
- Reading: Chapter 8
- Optional reading: Faraway, chapter 12
- Homework 4 due (at 11:59 pm the night before)
- Homework 5: assignment, gmp-2006.csv data file
- February 21 (Tuesday): Lecture 11, Testing Regression Specifications
- Reading: Chapter 9
- Optional reading: Cox and Donnelly, chapter 7
- February 23 (Thursday): Lecture 12, Heteroskedasticity, weighted least squares, and variance estimation
- Reading: Chapter 10
- Optional reading: Faraway, section 11.3
- Homework 5 due (at 11:59 pm the night before)
- Homework 6: assignment
- February 28 (Tuesday): Lecture 13, Logistic Regression
- Reading: Chapter 11
- Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- March 2 (Thursday): Lecture 14, Generalized linear models and generalized additive models
- Reading: Chapter 12
- Optional reading: Faraway, section 3.1 and chapter 6
- Homework 6 due (at 11:59 pm the night before)
- Midterm exam: Assignment, ch.csv data set
- March 7 (Tuesday): Lecture 15, Multivariate Distributions
- Reading: Appendix on multivariate distributions
- March 9 (Thursday): Lecture 16, Density Estimation
- Reading: Chapter 14
- Midterm exam due (at 11:59 pm the night before)
- No homework: enjoy spring break
- March 14 and 16: Spring break
- March 21 (Tuesday): NO CLASS
- March 23 (Thursday): Lecture 17, More Density Estimation
- Reading: Chapter 14 (again)
- Homework 7: assignment, n90_pol.csv data file
- March 28 (Tuesday): Lecture 18, Mixture Models
- Reading: Chapter 19
- March 30 (Thursday): Lecture 19, Missing Data
- Reading: TBD
- Optional reading: Cox and Donnelly, chapter 5
- Homework 7 due (at 11:59 pm the night before)
- Homework 8: assignment
- April 4 (Tuesday): Lecture 20, Graphical Models
- Reading: Chapter 20
- April 6 (Thursday): Lecture 21, Graphical Causal Models
- Reading: Chapter 24
- Optional reading: Cox and Donnelly, chapters 6 and 9; Pearl, "Causal Inference in Statistics", sections 1, 2, and 3 through 3.2
- Homework 8 due (at 11:59 pm the night before)
- Homework 9: assignment, smoke.csv data file
- April 11 (Tuesday): Lecture 22, Identifying Causal Effects from Observations I
- Reading: Chapter 25
- Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- April 13 (Thursday): Lecture 23, Identifying Causal Effects from Observations II
- Reading: Chapter 25
- Homework 9 due (at 11:59 pm the night before)
- Homework 10: assignment, sesame.csv data file
- April 18 (Tuesday): Lecture 24, Estimating Causal Effects from Observations
- Reading: Chapter 27
- April 20 (Thursday): Carnival, no class
- April 25 (Tuesday): Lecture 25, Discovering Causal Structure from Observations
- Reading: Chapter 28
- April 27 (Thursday): Lecture 26, Limitations of Causal Inference
- Reading: Chapter 28
- Homework 10 due (at 6 pm the night before)
- Homework 11: assignment
- May 2 (Tuesday): Lecture 27, Time Series I
- Reading: Chapter 21
- May 4 (Thursday): Lecture 28, Time Series II
- Reading: Chapter 21
- Homework 11 due (at 6:00 pm the night before)
- Final exam: Assignment
- May 8 (Monday)
- Final exam due at 10:30 am
*: Unlike PDF or plain text, Word files do not display
consistently across different machines, different versions of the program on
the same machine, etc., so not using them eliminates any risk that what we
grade differs from what you think you wrote. Word files are also much more of
a security hole than PDF or (especially) plain text. Finally, it is obnoxious
to force people to buy commercial, closed-source software just to read what you
write. (It would be obnoxious even if Microsoft paid you to push its wares
that way, but it doesn't.)