Cosma Shalizi

36-402, Undergraduate Advanced Data Analysis

Spring 2016

Tuesdays and Thursdays, 10:30--11:50 Wean Hall 7500

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

36-608: Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professors only.

Prerequisites

36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.

Instructors

Professors:
  Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C
  Max G'Sell, mgsell [at] stat.cmu.edu, Baker Hall 132B
Teaching assistants:
  Ms. Purvasha Chakravarti
  Mr. Jaehyeok Shin
  Mr. Michael Spece
  Mr. Michael Stanley

Topics, Notes, Readings

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
Yet More Linear Regression: what is regression, really?; what ordinary linear regression actually does; what it cannot do; extensions
Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
Generalized linear and additive models: logistic regression; generalized linear models; generalized additive models
Latent variables and structured data: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
Causality: graphical causal models; causal inference from randomized experiments; identification of causal effects from observations; estimation of causal effects; discovering causal structure
Dependent data: Markov models for time series without latent variables; hidden Markov models for time series with latent variables; smoothing and modeling for spatial and network data
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.

Course Mechanics

Homework will be 60% of the grade, two midterms 10% each, and the final 20%.

Homework

The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. There will be 11 homework assignments, nearly one every week; they will all be due on Thursdays at 11:59 pm (i.e., the night after Thursday classes), through Blackboard. All homeworks count equally, totaling 60% of your grade. The lowest three homework grades will be dropped; consequently, no late homework will be accepted for any reason whatsoever.
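To make the arithmetic of the grading scheme concrete, here is a minimal sketch of how the weights stated above might combine into a course percentage. It assumes every score is on a common 0-100 scale, that a missing homework counts as 0, and that the homework component is the plain average of the grades remaining after the lowest three are dropped; the function name `course_grade` and the example scores are hypothetical, and the professors' actual gradebook calculation may differ in detail.

```python
def course_grade(homework, midterms, final, n_dropped=3):
    """Combine scores (each assumed to be on a 0-100 scale) into a course percentage.

    Assumed scheme: homework 60%, two midterms 10% each, final 20%,
    with the lowest `n_dropped` homework grades dropped before averaging.
    """
    if len(homework) <= n_dropped:
        raise ValueError("need more homework grades than are dropped")
    if len(midterms) != 2:
        raise ValueError("expected exactly two midterm scores")

    # Sort ascending and discard the lowest n_dropped grades.
    kept = sorted(homework)[n_dropped:]
    homework_average = sum(kept) / len(kept)

    return (0.60 * homework_average
            + 0.10 * midterms[0]
            + 0.10 * midterms[1]
            + 0.20 * final)


# Hypothetical student: 11 homeworks, three of them missed (scored as 0).
example = course_grade(
    homework=[90, 85, 0, 100, 95, 70, 0, 88, 92, 0, 80],
    midterms=[80, 90],
    final=85,
)
print(round(example, 2))
```

In this made-up example the three missed assignments are exactly the ones dropped, so they do not lower the homework average at all; a fourth missed assignment would.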

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it. This portion of the assignment will be graded, along with the other questions. As always, raw computer output and R code are not acceptable; your document must be humanly readable. You should submit an R Markdown or knitr file, integrating text, figures and R code; submit both your knitted file and the source. If that is not feasible, contact the professors as soon as possible. Microsoft Word files will not be graded.

For help on using R Markdown, see "Using R Markdown for Class Reports".

Unlike PDF or plain text, Word files do not display consistently across different machines, different versions of the program on the same machine, etc., so avoiding them removes any doubt about whether what we grade matches what you think you wrote. Word files are also much more of a security hole than PDF or (especially) plain text. Finally, it is obnoxious to force people to buy commercial, closed-source software just to read what you write. (It would be obnoxious even if Microsoft paid you for marketing its wares that way, but it doesn't.)

Exams

There will be two take-home midterm exams (10% each), due at 11:59 pm on March 3rd and April 21st. You will have one week to work on each midterm. There will be no homework in those weeks. These due dates will not be moved; please schedule job interviews and other extra-curricular activities around them. There will also be a take-home final exam (20%), due at 10:30 am on May 9th.

Exams must also be submitted through Blackboard, under the same rules about file formats as homework.

Solutions

We will provide solutions for all homework and exams after their due date, through Blackboard. Do not share them with anyone.

Interviews

To help give more informative feedback about the progress of the class, every week (after the first week of classes), six students will be selected at random, and will meet with one of the professors for 10--15 minutes each, to explain their work and to answer questions about it. You may be selected on multiple weeks, if that's how the random numbers come up. This is not a punishment, but a way to see whether the problem sets are really measuring learning of the course material; being selected will not hurt your grade in any way (and might even help).
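The selection procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the roster, the function name `weekly_draws`, and the seed are all hypothetical, and nothing here is claimed to be the professors' actual method. Each week's six students are drawn without replacement within the week, but the weeks are drawn independently of each other, which is why the same student can come up more than once.

```python
import random


def weekly_draws(roster, n_weeks, k=6, seed=None):
    """Draw k distinct students per week, independently across weeks."""
    rng = random.Random(seed)  # private generator, so a seed makes draws reproducible
    return [rng.sample(roster, k) for _ in range(n_weeks)]


# Hypothetical roster of 40 students.
roster = [f"student_{i:02d}" for i in range(1, 41)]

draws = weekly_draws(roster, n_weeks=3, seed=402)
for week, selected in enumerate(draws, start=2):  # interviews begin after week 1
    print(f"Week {week}: {selected}")
```

Because each week is a fresh draw from the full roster, being selected once does not change anyone's chances in later weeks.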

Grading Issues

Direct any questions or complaints about your grades to the professors; the teaching assistants have no authority to make changes.

Office Hours

If you want help with computing, please bring your laptop.

Monday     3:00--4:00    Mr. Shin          Porter Hall 117
Tuesday    2:30--3:30    Mr. Spece         Porter Hall 117
Wednesday  1:00--2:00    Prof. Shalizi     Baker Hall 229A
Wednesday  4:30--5:30    Mr. Stanley       Porter Hall 117
Thursday   12:30--1:30   Prof. G'Sell      Doherty Hall 2122
Thursday   2:00--3:00    Ms. Chakravarti   Porter Hall 117
Thursday   3:00--4:00    Prof. Shalizi     Baker Hall 229A

If you cannot make office hours, please e-mail the professors about making an appointment.

Piazza

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous posting of questions and replies will be allowed, at least initially; if this leads to problems it may go away.

Blackboard

Blackboard will be used for submitting assignments electronically, and as a gradebook. All properly enrolled students should have access to the Blackboard site by the beginning of classes.

Textbook

The primary textbook for the course will be the draft Advanced Data Analysis from an Elementary Point of View. Chapters will be linked to here as they become needed. You are expected to read these chapters, and are unlikely to be able to do the assignments without doing so. (There will be a prize for the student who identifies the most errors by 27 April, presented at the last class meeting.) In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7) is required as a reference.

Cox and Donnelly, Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8); Faraway, Extending the Linear Model with R (Chapman Hall/CRC Press, 2006, ISBN 978-1-58488-424-8; errata); and Venables and Ripley, Modern Applied Statistics with S (Springer, 2003; ISBN 9780387954578) will be optional. The campus bookstore should have copies of all of these.

Collaboration, Cheating and Plagiarism

In general, you are free to discuss homework with each other, though all the work you turn in must be your own; you must not copy mathematical derivations, computer output and input, figures or writing from anyone or anywhere else, without reporting the source within your work. (This includes copying from solutions to previous assignments in this class.) You may not refer to solutions provided in previous semesters of the course. You cannot discuss take-home exams with anyone except the professors and teaching assistants. Unacknowledged copying or unauthorized collaboration will lead to severe disciplinary action. Please read the CMU Policy on Academic Integrity, and don't plagiarize.

If you are unsure about what is or is not appropriate, please ask the professors before submitting anything; there will be no penalty for asking.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

R

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before), and this class presumes that you have. Almost every assignment will require you to use it. No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let the professors know at once.

Here are some resources for learning R:

Even if you know how to do some basic coding (or more), you should read the page of Minimal Advice on Programming.

Other Iterations of the Class

Some material is available from versions of this class taught in other years. Copying from any solutions provided there is not only cheating, it is very easily detected cheating.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

Current revision of the complete textbook

January 12 (Tuesday): Lecture 1, Introduction to the class; regression
Reading: Chapter 1 (PDF, selected R, 01.Rda data file for examples)
Optional reading: Cox and Donnelly, chapter 1; Faraway, chapter 1 (especially up to p. 17).
Homework 1: assignment, CAPA.csv data file
January 14 (Thursday): Lecture 2, The truth about linear regression
Reading: Chapter 2 (PDF, selected R)
Optional reading: Faraway, rest of chapter 1
January 19 (Tuesday): Lecture 3, Evaluation of Models: Error and inference
Reading: Chapter 3 (PDF, selected R)
Optional reading: Cox and Donnelly, ch. 6
Handout: "predict and Friends: Common Methods for Predictive Models in R" (PDF, R Markdown)
January 21 (Thursday): Lecture 4, Smoothing methods in regression
Reading: Chapter 4 (PDF, selected R)
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
Homework 1 due (at 11:59 pm the night before)
Homework 2: assignment, data file, starter code
January 26 (Tuesday): Lecture 5, Writing R Code
In-class examples: knitted HTML, R Markdown
Reading: Appendix on writing R code (PDF, R for selected examples)
January 28 (Thursday): Lecture 6, Simulation
In-class examples: commented R file
Reading: Chapter 5 (PDF, R for selected examples)
Homework 2 due (at 11:59 pm the night before)
Homework 3: assignment, stock_history.csv data file
February 2 (Tuesday): Lecture 7, The Bootstrap
Reading: Chapter 6 (PDF, R for selected examples)
Optional reading: Cox and Donnelly, chapter 8
February 4 (Thursday): Lecture 8, Heteroskedasticity, weighted least squares, and variance estimation
Reading: Chapter 7 (PDF, R for selected examples)
Optional reading: Faraway, section 11.3
Homework 3 due (at 11:59 pm the night before)
Homework 4: assignment, nampd.csv data set, MoM.txt data set
February 9 (Tuesday): Lecture 9, Splines
In-class examples: HTML, Rmd
Reading: Chapter 8 (PDF, R for selected examples)
Optional reading: Faraway, section 11.2
February 11 (Thursday): Lecture 10, Additive models
Reading: Chapter 9 (PDF, R for selected examples)
Optional reading: Faraway, chapter 12
Homework 4 due (at 11:59 pm)
Homework 5: assignment, gmp-2006.csv
February 16 (Tuesday): Lecture 11, Testing Regression Specifications
Reading: Chapter 10 (PDF, R for selected examples)
In-class demo: knitted HTML, R Markdown source file
Optional reading: Cox and Donnelly, chapter 7
February 18 (Thursday): Lecture 12, Logistic Regression
Reading: Chapter 11 (PDF, R)
Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
Homework 5 due (at 11:59 pm)
Homework 6: assignment, ch.csv data file
February 23 (Tuesday): Lecture 13, Generalized linear models and generalized additive models
Reading: Chapter 12 (PDF)
Optional reading: Faraway, section 3.1 and chapter 6
February 25 (Thursday): Lecture 14, GLMs and GAMs continued
Reading and optional reading: Same as lecture 13
Homework 6 due (at 11:59 pm)
Exam 1: assignment, RAJ.csv
March 1 (Tuesday): Lecture 15, Multivariate Distributions
Reading: Appendix on multivariate distributions (PDF)
March 3 (Thursday): Lecture 16, Density Estimation
Reading: Chapter 14 (PDF)
Exam 1 due (at 11:59 pm)
Homework 7 assigned
March 8 and 10: Spring break
March 15 (Tuesday): Lecture 17, Principal Components Analysis
Reading: Chapter 16 (PDF)
March 17 (Thursday): Lecture 18, Factor Models
Reading: Chapter 17 (PDF)
Homework 7 due (at 11:59 pm)
Homework 8: assignment, stockData.RData file
March 22 (Tuesday): Lecture 19, Mixture Models
Reading: Chapter 19 (PDF, R for selected examples)
March 24 (Thursday): Lecture 20, Missing Data
Reading: TBD
Optional reading: Cox and Donnelly, chapter 5
Homework 8 due (at 11:59 pm)
Homework 9: assignment
March 29 (Tuesday): Lecture 21, Graphical Models
Reading: Chapter 20 (PDF)
March 31 (Thursday): Lecture 22, Graphical Causal Models
Reading: Chapter 24 (PDF)
Optional reading: Cox and Donnelly, chapters 6 and 9; Pearl, "Causal Inference in Statistics", sections 1, 2, and 3 through 3.2
Homework 9 due (at 11:59 pm)
Homework 10: assignment, data file
April 5 (Tuesday): Lecture 23, Identifying Causal Effects from Observations
Reading: Chapter 25 (PDF)
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
April 7 (Thursday): Lecture 24, Estimating Causal Effects from Observations
Reading: Chapter 27 (PDF)
Homework 10 due (at 11:59 pm)
Exam 2: assignment, paristan.csv
April 12 (Tuesday): Lecture 25, Discovering Causal Structure from Observations
Reading: Chapter 28 (PDF)
April 14 (Thursday): Carnival, no class
April 19 (Tuesday): Lecture 26, Time Series I
Reading: Chapter 21 (PDF)
April 21 (Thursday): Lecture 27, Time Series II
Reading: Chapter 21 (PDF)
Exam 2 due (at 11:59 pm)
Homework 11: assignment; for data set, see homework 5
April 26 (Tuesday): Lecture 28, Survival Analysis
April 28 (Thursday): Lecture 29, Principles
Homework 11 due (at 11:59 pm)
Exam 3: assignment, macro.csv
May 9 (Monday)
Final exam due at 10:30 am