Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis
Spring 2016
Tuesdays and Thursdays, 10:30--11:50, Wean Hall 7500
The goal of this class is to train you in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401, extending it
to more general functional forms, and more general kinds of data, emphasizing
the computation-intensive methods introduced since the 1980s. After taking the
class, when you're faced with a new data-analysis problem, you should be able
to (1) select appropriate methods, (2) use statistical software to implement
them, (3) critically evaluate the resulting statistical models, and (4)
communicate the results of your analyses to collaborators and to
non-statisticians.
During the class, you will do data analyses with existing software, and
write your own simple programs to implement and extend key techniques. You
will also have to write reports about your analyses.
36-608
Graduate students from other departments wishing to
take this course should register for it under the number "36-608". Enrollment
for 36-608 is very limited, and by permission of the professors only.
Prerequisites
36-401, with a grade
of C or better. Exceptions are only granted for graduate students in other
departments taking 36-608.
Instructors
Professors | Cosma Shalizi | cshalizi [at] cmu.edu | Baker Hall 229C |
| Max G'Sell | mgsell [at] stat.cmu.edu | Baker Hall 132B |
Teaching assistants | Ms. Purvasha Chakravarti |
| Mr. Jaehyeok Shin |
| Mr. Michael Spece |
| Mr. Michael Stanley |
Topics, Notes, Readings
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; the bootstrap;
penalized fitting; mis-specification checks
- Yet More Linear Regression: what is regression, really?;
what ordinary linear regression actually does; what it cannot do; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; kernel density estimation
- Generalized linear and additive models: logistic
regression; generalized linear models; generalized additive models
- Latent variables and structured data: principal
components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
- Causality: graphical causal models; causal
inference from randomized experiments; identification of
causal effects from observations; estimation of causal effects;
discovering causal structure
- Dependent data: Markov models for time
series without latent variables; hidden Markov models for time series with
latent variables; smoothing and modeling for spatial and network data
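Out-of-sample evaluation runs through many of these topics. As a taste of the kind of computation the course involves, here is a minimal sketch of k-fold cross-validation in R; the simulated data set and the model formulas are made-up illustrations, not taken from any assignment.

```r
# Minimal sketch: compare two regression specifications by 5-fold
# cross-validated mean squared error. The data frame and formulas
# below are hypothetical examples.
set.seed(1)
df <- data.frame(x = runif(100))
df$y <- sin(2 * pi * df$x) + rnorm(100, sd = 0.3)

cv.mse <- function(formula, data, nfolds = 5) {
  # Randomly assign each row to one of nfolds folds
  folds <- sample(rep(1:nfolds, length.out = nrow(data)))
  errs <- sapply(1:nfolds, function(k) {
    # Fit on everything except fold k, predict on fold k
    fit <- lm(formula, data = data[folds != k, ])
    held.out <- data[folds == k, ]
    # Assumes the response column is named "y"
    mean((held.out$y - predict(fit, newdata = held.out))^2)
  })
  mean(errs)
}

cv.mse(y ~ x, df)            # rigid linear specification
cv.mse(y ~ poly(x, 5), df)   # more flexible polynomial specification
```

On data with a smooth nonlinear signal like this, the flexible specification should achieve a noticeably lower cross-validated error than the straight line; the course develops when and why such comparisons are trustworthy.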
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Homework will be 60% of the grade, two midterms 10% each, and the final
20%.
Homework
The homework will give you practice in using the techniques you are learning
to analyze data, and to interpret the analyses. There will be 11 weekly
homework assignments, nearly one every week; they will all be due
on Wednesdays Thursdays at 11:59 pm (i.e., the
night before after Thursday classes), through Blackboard. All
homeworks count equally, totaling 60% of your grade. The lowest three homework
grades will be dropped; consequently, no late homework will be accepted for any
reason whatsoever.
Communicating your results to others is as important as getting good results
in the first place. Every homework assignment will require you to write about
that week's data analysis and what you learned from it. This portion of the
assignment will be graded, along with the other questions. As always, raw
computer output and R code are not acceptable; your document must be humanly
readable. You should submit an R
Markdown or knitr file, integrating
text, figures and R code; submit both your knitted file and the
source. If that is not feasible, contact the professors as soon as possible.
Microsoft Word files will not be graded.
For help on using R Markdown,
see "Using R Markdown
for Class Reports".
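For concreteness, a submission source file might look something like this minimal R Markdown skeleton; the title and chunk contents are placeholders, not an actual assignment.

````markdown
---
title: "36-402 Homework 1"
author: "Your Name"
output: html_document
---

Describe the problem and summarize your conclusions in prose here.

```{r}
# R code for the analysis goes in chunks like this one; knitting
# interleaves the code, its printed results, and any figures.
summary(cars)
```
````

Knitting this file (for instance, with the Knit button in RStudio) produces the HTML report; submit both the `.Rmd` source and the knitted output.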
Unlike PDF or plain text, Word files do not display
consistently across different machines, or even across different versions of
the program on the same machine, so avoiding them removes any doubt about
whether what we grade matches what you think you wrote. Word files are also much more of
a security hole than PDF or (especially) plain text. Finally, it is obnoxious
to force people to buy commercial, closed-source software just to read what you
write. (It would be obnoxious even if Microsoft paid you for marketing its
wares that way, but it doesn't.)
Exams
There will be two take-home mid-term exams (10% each), due at 11:59 pm on
March 3rd and April 21st. You will have one week to work on each midterm.
There will be no homework in those weeks. These due dates will not be moved;
please schedule job interviews and other extra-curricular activities around
them. There will also be a take-home final exam (20%), due at 10:30 am on May
9th.
Exams must also be submitted through Blackboard, under the same rules about
file formats as homework.
Solutions
We will provide solutions for all homework and exams after their due date,
through Blackboard. Do not share them with anyone.
Interviews
To help give more informative feedback about the progress of the class, every
week (after the first week of classes), six students will be selected at
random, and will meet with one of the professors for 10--15 minutes each, to
explain their work and to answer questions about it. You may be selected on
multiple weeks, if that's how the random numbers come up. This is not
a punishment, but a way to see whether the problem sets are really measuring
learning of the course material; being selected will not hurt your grade in any
way (and might even help).
Grading Issues
Direct any questions or complaints about your grades to the
professors; the teaching assistants have no authority to make changes.
Office Hours
If you want help with computing, please bring your laptop.
Monday | 3:00--4:00 | Mr. Shin | Porter Hall 117 |
Tuesday | 2:30--3:30 | Mr. Spece | Porter Hall 117 |
Wednesday | 1:00--2:00 | Prof. Shalizi | Baker Hall 229A |
Wednesday | 4:30--5:30 | Mr. Stanley | Porter Hall 117 |
Thursday | 12:30--1:30 | Prof. G'Sell | Doherty Hall 2122 |
Thursday | 2:00--3:00 | Ms. Chakravarti | Porter Hall 117 |
Thursday | 3:00--4:00 | Prof. Shalizi | Baker Hall 229A |
If you cannot make office hours, please e-mail the professors about making an appointment.
Piazza
We will be using the Piazza website for question-answering. You will
receive an invitation within the first week of class. Anonymous posting
of questions and replies will be allowed, at least initially; if this
leads to problems it may go away.
Blackboard
Blackboard will be used for submitting assignments electronically, and as a
gradebook. All properly enrolled students should have access to the Blackboard
site by the beginning of classes.
Textbook
The primary textbook for the course will be the
draft Advanced Data Analysis from an
Elementary Point of View. Chapters will be linked here as they
become needed. You are expected to read these chapters, and are unlikely to be
able to do the assignments without doing so. (There will be a prize for the
student who identifies the most errors by 27 April, presented at the last class
meeting.) In addition, Paul Teetor, The R Cookbook (O'Reilly
Media, 2011,
ISBN 978-0-596-80915-7)
is required as a reference.
Cox and Donnelly, Principles of Applied Statistics (Cambridge
University Press, 2011,
ISBN 978-1-107-64445-8); Faraway, Extending
the Linear Model with R (Chapman Hall/CRC Press, 2006,
ISBN 978-1-58488-424-8; errata);
and Venables and Ripley, Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will be optional. The campus bookstore should have copies of
all of these.
Collaboration, Cheating and Plagiarism
In general, you are free to discuss homework with each other,
though all the work you turn in must be your own; you must not copy
mathematical derivations, computer output and input, figures or writing from
anyone or anywhere else, without reporting the source within your work. (This
includes copying from solutions to previous assignments in this class.) You
may not refer to solutions provided in previous semesters of the
course. You cannot discuss take-home exams with anyone except the professors
and teaching assistants. Unacknowledged copying or unauthorized collaboration
will lead to severe disciplinary action. Please read the
CMU Policy
on Academic Integrity, and don't plagiarize.
If you are unsure about what is or is not appropriate, please ask the
professors before submitting anything; there will be no penalty for asking.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
R
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Almost every assignment will require you to use it. No other form of
computational work will be accepted. If you are not able to use R, or
do not have ready, reliable access to a computer on which you can run it, let
the professors know at once.
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- Paul Teetor, The R Cookbook, explains how to use R to
do many, many common tasks. (It's like the inverse of R's help: "What command
does X?", instead of "What does command Y do?"). It is one of the required
texts, and is available at the campus bookstore.
- The notes for 36-350, Introduction to
Statistical Computing
- There are now many books about R. Some recommendable ones:
- Joseph Adler, R in a Nutshell
(O'Reilly, 2009;
ISBN 9780596801700). Probably most useful for those with previous experience programming in another language.
- W. John Braun and
Duncan
J. Murdoch, A
First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
- John M. Chambers, Software for Data Analysis:
Programming with R
(Springer, 2008,
ISBN 978-0-387-75935-7).
The best book on writing clean and reliable R programs; probably more advanced
than you will need.
- Norman
Matloff, The Art of R Programming (No Starch Press, 2011,
ISBN 978-1-59327-384-2).
A good introduction to programming for complete novices, using R. Less statistics
than Braun and Murdoch, more programming skills.
Even if you know how to do some basic coding (or more), you should read the
page of Minimal Advice on
Programming.
Other Iterations of the Class
Some material is available from versions of this class taught in
other years. Copying from any solutions provided there is not only
cheating, it is very easily detected cheating.
Schedule
Subject to revision. Lecture notes, assignments and solutions
will all be linked here, as they are available.
Current revision of the complete textbook
- January 12 (Tuesday): Lecture 1, Introduction to the class; regression
- Reading: Chapter 1 (PDF,
selected R, 01.Rda data file for examples)
- Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
- Homework 1: assignment, CAPA.csv data file
- January 14 (Thursday): Lecture 2, The truth about linear regression
- Reading: Chapter 2 (PDF,
selected R)
- Optional reading: Faraway, rest of chapter 1
- January 19 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
- Reading: Notes, chapter 3 (PDF,
selected R)
- Optional reading: Cox and Donnelly, ch. 6
- Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
- January 21 (Thursday): Lecture 4, Smoothing methods in regression
- Reading: Chapter 4 (PDF, selected R)
- Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
- Homework 1 due (at 11:59 pm the night before)
- Homework 2: assignment, data file, starter code
- January 26 (Tuesday): Lecture 5, Writing R Code
- In-class examples: knitted HTML, R Markdown
- Reading: Appendix on writing R code (PDF, R for selected examples)
- January 28 (Thursday): Lecture 6, Simulation
- In-class examples: commented R file
- Reading: Chapter 5 (PDF,
R for selected examples)
- Homework 2 due (at 11:59 pm the night before)
- Homework 3: assignment,
stock_history.csv data file
- February 2 (Tuesday): Lecture 7, The Bootstrap
- Reading: Chapter 6 (PDF,
R for selected examples)
- Optional reading: Cox and Donnelly, chapter 8
- February 4 (Thursday): Lecture 8, Heteroskedasticity, weighted least
squares, and variance estimation
- Reading: Chapter 7 (PDF,
R for selected examples)
- Optional reading: Faraway, section 11.3
- Homework 3 due (at 11:59 pm the night before)
- Homework 4: assignment,
nampd.csv data set,
MoM.txt data set
- February 9 (Tuesday): Lecture 9, Splines
- In-class examples: HTML, Rmd
- Reading: Chapter 8 (PDF,
R for selected examples)
- Optional reading: Faraway, section 11.2
- February 11 (Thursday): Lecture 10, Additive models
- Reading: Chapter 9 (PDF,
R for selected examples)
- Optional reading: Faraway, chapter 12
- Homework 4 due (at 11:59 pm)
- Homework 5: assignment,
gmp-2006.csv
- February 16 (Tuesday): Lecture 11, Testing Regression Specifications
- Reading: Chapter 10 (PDF, R for selected examples)
- In-class demo: knitted HTML,
R Markdown source file
- Optional reading: Cox and Donnelly, chapter 7
- February 18 (Thursday): Lecture 12, Logistic Regression
- Reading: Chapter 11 (PDF,
R)
- Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
- Homework 5 due (at 11:59 pm)
- Homework 6: assignment,
ch.csv data file
- February 23 (Tuesday): Lecture 13, Generalized linear models and
generalized additive models
- Reading: Chapter 12 (PDF)
- Optional reading: Faraway, section 3.1 and chapter 6
- February 25 (Thursday): Lecture 14, GLMs and GAMs continued
- Reading and optional reading: Same as lecture 13
- Homework 6 due (at 11:59 pm)
- Exam 1: assignment,
RAJ.csv
- March 1 (Tuesday): Lecture 15, Multivariate Distributions
- Reading: Appendix on multivariate distributions (PDF)
- March 3 (Thursday): Lecture 16, Density Estimation
- Reading: Chapter 14 (PDF)
- Exam 1 due (at 11:59 pm)
- Homework 7 assigned
- March 8 and 10: Spring break
- March 15 (Tuesday): Lecture 17, Principal Components Analysis
- Reading: Chapter 16 (PDF)
- March 17 (Thursday): Lecture 18, Factor Models
- Reading: Chapter 17 (PDF)
- Homework 7 due (at 11:59 pm)
- Homework 8: assignment,
stockData.RData file
- March 22 (Tuesday): Lecture 19, Mixture Models
- Reading: Chapter 19 (PDF,
R for selected examples)
- March 24 (Thursday): Lecture 20, Missing Data
- Reading: TBD
- Optional reading: Cox and Donnelly, chapter 5
- Homework 8 due (at 11:59 pm)
- Homework 9: assignment
- March 29 (Tuesday): Lecture 21, Graphical Models
- Reading: Chapter 20 (PDF)
- March 31 (Thursday): Lecture 22, Graphical Causal Models
- Reading: Chapter 24 (PDF)
- Optional reading: Cox and Donnelly, chapters 6 and 9;
Pearl, "Causal
Inference in Statistics", sections 1, 2, and 3 through 3.2
- Homework 9 due (at 11:59 pm)
- Homework 10: assignment,
data file
- April 5 (Tuesday): Lecture 23, Identifying Causal Effects from Observations
- Reading: Chapter 25 (PDF)
- Optional reading:
Pearl, "Causal
Inference in Statistics", sections 3.3--3.5, 4, and 5.1
- April 7 (Thursday): Lecture 24, Estimating Causal Effects from Observations
- Reading: Chapter 27 (PDF)
- Homework 10 due (at 11:59 pm)
- Exam 2: assignment,
paristan.csv
- April 12 (Tuesday): Lecture 25, Discovering Causal Structure from Observations
- Reading: Chapter 28 (PDF)
- April 14 (Thursday): Carnival, no class
- April 19 (Tuesday): Lecture 26, Time Series I
- Reading: Chapter 21 (PDF)
- April 21 (Thursday): Lecture 27, Time Series II
- Reading: Chapter 21 (PDF)
- Exam 2 due (at 11:59 pm)
- Homework 11: assignment; for data set, see homework 5
- April 26 (Tuesday): Lecture 28, Survival Analysis
- April 28 (Thursday): Lecture 29, Principles
- Homework 11 due (at 11:59 pm)
- Exam 3: assignment, macro.csv
- May 9 (Monday)
- Final exam due at 10:30 am