Cosma Shalizi

36-401, Modern Regression, Section B

Fall 2015

Section B: Tuesdays and Thursdays, 3:00--4:20, Baker Hall 136A

Here's the official description:

This course is an introduction to the real world of statistics and data analysis. We will explore real data sets, examine various models for the data, assess the validity of their assumptions, and determine which conclusions we can make (if any). Data analysis is a bit of an art; there may be several valid approaches. We will strongly emphasize the importance of critical thinking about the data and the question of interest. Our overall goal is to use a basic set of modeling tools to explore and analyze data and to present the results in a scientific report. A minimum grade of C in any one of the pre-requisites is required. A grade of C is required to move on to 36-402 or any 36-46x course.

This is a class on linear statistical models: the oldest, most widely used, and mathematically simplest sort of statistical model. It serves as a first course in serious data analysis, as an introduction to statistical modeling and prediction, and as an initiation into a community of inquiry which has developed over two centuries and grown to include every branch of science, technology and policy.

During the class, you will do data analyses with existing software, and begin learning to write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-607". Enrollment for 36-607 is very limited, and by permission of the professor only.

Prerequisites

Mathematical statistics: one of 36-226, 36-326 or 36-625, with at least a grade of C; linear algebra, one of 21-240, 21-241 or 21-242, with at least a grade of C. These requirements will not be waived for undergraduates under any circumstances. Graduate students wishing to enroll in 36-607 will need to have had equivalent courses (as determined by the instructor).

Having previously taken 36-350, introduction to statistical computing, or taking it concurrently, is strongly recommended but not required.

Instructors

Professor: Dr. Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C
Teaching assistants: Ms. Natalie Klein, Ms. Amanda Luby, Mr. Michael Spece-Ibañez

Topics, Notes, Readings

This is currently a tentative listing of topics, in order.

Simple linear regression: Statistical prediction by least squares. Simple linear regression: using one quantitative variable to predict another. Optimal linear prediction. Estimation of the simple linear regression model. Gaussian estimation theory for the simple linear model. Assumption-checking and regression diagnostics. Prediction intervals.
Multiple linear regression: Linear predictive models with multiple predictor variables. "Population" form of multiple regression. Answering "what if" questions with multiple regression models. Ordinary least squares estimation of multiple regression. Standard errors. Gaussian estimation theory, confidence and prediction intervals. Regression diagnostics. Categorical predictor variables; analysis of variance.
Variable selection: Review of hypothesis testing theory from mathematical statistics. Significance tests for regression coefficients; confidence sets for coefficients. Common fallacies about "significant" coefficients, and how to avoid them. Model and variable selection.
Beyond strictly linear ordinary least squares: Interaction terms. Transformation of predictor variables. Transformation of response variable; common fallacies about transformed responses, and how to avoid them. Weighted least squares for non-constant variance; generalized least squares for time series.
Truly modern regression: Prediction and cross-validation for model and variable selection. Resampling and bootstrap for statistical inference. Regression trees.
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.

Course Mechanics

Exams will be 30% of your grade, data-analysis projects 45%, and homework 25%.

Theory Exams

There will be two in-class mid-term exams, both focusing on the theoretical portions of the course. Both exams will be cumulative. Each exam will be 15% of your final grade.

Data Analysis Projects

There will be three take-home projects where you will analyze real data sets, and write up your findings in the form of a scientific report. You will be graded both on the technical correctness of your work and on your ability to communicate your findings clearly; in particular, raw computer output or code is not acceptable. (Rubrics and example reports will be made available before the first DAP is assigned.)

The DAPs are exams; consequently, collaboration is not allowed. Each DAP will count for 15% of your final grade.

Homework

The homework assignments will give you practice in using the techniques you are learning to analyze data and to interpret the analyses. They will also include some theory questions, requiring you to do calculations or prove results mathematically. There will be one homework assignment every week in which there is not an exam. Every assignment will count equally towards 25% of your grade. Your lowest two homework grades will be dropped; consequently, no late homework will be accepted for any reason.

Communicating your results to others is as important as getting good results in the first place. A portion of the points available for every homework will be set aside to reflect the clarity of your writing, figures, data presentation, and other marks of communication. (Rubrics will be provided for each assignment.) In addition, at least two homeworks will be practice DAPs, where you will have to write reports in the same manner as the data analysis projects.

Formats and Submission of Assignments

Except as otherwise noted in the schedule, all assignments will be due at 3 pm on Thursdays (i.e., at the beginning of class), through Blackboard. Late assignments are not accepted for any reason. Coming late to class because you are uploading an assignment is unacceptable.

You will submit a PDF or HTML file containing a readable version of all your write-ups, mathematics, figures, tables, and selected portions of code as relevant. Word files will not be graded. (You may write in Word if you must, but you need to submit either PDF or HTML.)

You are strongly encouraged to use R Markdown to integrate text, code, images and mathematics. If you do, you will submit both the "knitted" PDF or HTML file, and the source .Rmd file. If you choose not to use R Markdown, you will submit both a humanly-readable file, as PDF or HTML, and a separate plain-text file containing all your R code, clearly commented and formatted to indicate which code section goes with which problem.

If you do not use an equation editor, LaTeX, etc., you may include pictures or scans of hand-written mathematics as needed.

Interviews

To help gauge how well the class is going, and how well the grading reflects actual understanding, every week (after the first week of classes), six students will be selected at random, and will meet with the professor for 10--15 minutes each, to explain their work and to answer questions about it. You may be selected on multiple weeks, if that's how the random numbers come up. This is not a punishment, but a way for the professor to see whether the problem sets are really measuring learning of the course material; being selected will not hurt your grade in any way (and might even help). Refusing to participate on a week you are selected will, however, automatically drop your final grade by one letter.

Office Hours

If you want help with computing, please bring your laptop.

Mondays, 2--3 pm: Ms. Klein, Porter Hall 117
Mondays, 3--4 pm: Mr. Spece-Ibañez, Porter Hall 117
Wednesdays, noon--1 pm: Prof. Shalizi, Baker Hall 229C
Wednesdays, 4--5 pm: Ms. Luby, Porter Hall 117
Thursdays, noon--1 pm: Prof. Shalizi, Baker Hall 229C

If you cannot make the scheduled office hours, please e-mail the professor about making an appointment.

Blackboard

Blackboard will be used for submitting assignments electronically, and as a gradebook. All properly enrolled students should have access to the Blackboard site by the beginning of classes.

Textbook

The primary textbook for the course will be Kutner, Nachtsheim and Neter's Applied Linear Regression Models, 4th edition (McGraw-Hill, 2004, ISBN 0-07-238691-6). This is required. (The fifth edition is also acceptable, though if you use it, when specific problems or readings are assigned from the text, you are responsible for ensuring that they match up with what's intended.)

Four other books are recommended:

Collaboration, Cheating and Plagiarism

In general, you are free to discuss homework with each other, though all the work you turn in must be your own; you must not copy mathematical derivations, computer output and input, or written descriptions from anyone or anywhere else, without reporting the source within your work. (This includes copying from solutions provided in previous semesters of the course.) Unacknowledged copying or unauthorized collaboration will lead to severe disciplinary action, beginning with an automatic grade of zero for all involved and escalating from there. Please read the CMU Policy on Cheating and Plagiarism, and don't plagiarize.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

Computational Work

R is a free, open-source software package/programming language for statistical computing. Many of you will have some prior exposure to the language; for the rest, now is a great time to start learning. Almost every assignment will require you to use it. No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

R Markdown is an extension to R which lets you embed your code, and the calculations it produces, in ordinary text, which can also be formatted, contain figures and equations, etc. Using R Markdown is strongly encouraged. If you do, you need to submit both your "knitted" file (HTML or PDF, not Word), and the original .Rmd file.
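If you work from the R console rather than the RStudio "Knit" button, one way to produce the knitted file is sketched below; the file name hw1.Rmd is just a placeholder.

    # Knit an R Markdown write-up to PDF (or HTML) from within R.
    # Assumes the rmarkdown package is installed and hw1.Rmd is in the working directory.
    library(rmarkdown)
    render("hw1.Rmd", output_format = "pdf_document")   # or "html_document"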

If you choose not to use R Markdown, for all computational assignments you need to submit both a properly formatted, humanly-readable write-up, as PDF or HTML, and a plain-text file containing your R code, commented so that it is clear which pieces of code go with which problem. Word files will not be graded.

R Resources

Here are some resources for learning R:

Even if you know how to do some basic coding (or more), you should read the page of Minimal Advice on Programming.

Other Iterations of the Class

In fall 2015, Section A of the class is being taught by Prof. Xizhen Cai; the two sections will be closely coordinated but are separate classes.

If you came here from a search engine, you may be looking for information on previous versions of the class, as taught by Prof. Rebecca Nugent.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available. All readings are from the textbook by Kutner et al., unless otherwise noted.
September 1, Lecture 1: Introduction to the course
Course mechanics; random variables and probability review; statistical prediction; optimal linear prediction.
Reading: Appendix A (on Blackboard if you do not yet have the textbook)
Homework 1: Assignment, fha.csv data set
September 3, Lecture 2: Exploratory data analysis and R
Office hours will be held in computing labs today and on selected days next week; see Blackboard for details. Attendance at one of these is optional but strongly encouraged.
Readings: "Introduction to R Selected Handouts for 36-401" (by Prof. Nugent), and "36-401 Fall 2015 R Introduction"
September 8, Lecture 3: About Statistical Modeling
An example data set. Drawing lines through scatterplots. Why prefer one line over another? Statistical models as data summaries; models as tools for inference. Sources of uncertainty in inference: sampling, measurement error, fluctuations. Models as assumptions on the data-generating process. Some examples. Inference within a model vs. checking model assumptions. Introducing the simple linear regression model.
For LaTeX/knitr users: the .Rnw file used to generate the notes
Reading for the week: sections 1.1--1.5 (on Blackboard)
September 10, Lecture 4: Simple linear regression models.
The simple linear regression model: once more with feeling. Consistency, unbiasedness and variance of the plug-in estimator. "The method of least squares". The Gaussian noise ("normal error") simple linear regression model.
For LaTeX/knitr users: the .Rnw file used to generate the notes
Homework 1 due; solutions on Blackboard (please don't share beyond this class)
Homework 2: assignment
September 15, Lecture 5: Estimating simple linear regression I
The method of least squares. Assumptions of the method. Properties of the estimates. Predictive inference. Least-squares estimation in R. Reading: sections 1.6 and 1.7.
.Rnw file used to generate the notes
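For concreteness, here is a rough sketch (not part of the assigned reading) of what least-squares estimation looks like in R; df, x and y are placeholder names for a data frame and its columns.

    # Fit a simple linear regression of y on x by ordinary least squares
    fit <- lm(y ~ x, data = df)
    coef(fit)        # estimated intercept and slope
    fitted(fit)      # fitted values
    residuals(fit)   # residuals
    summary(fit)     # estimates, standard errors, and more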
September 17, Lecture 6: Estimating simple linear regression II
The Gaussian model. Assumptions of the model. Consequences: maximum likelihood estimation; properties of the MLE. Reading: section 1.8.
R for in-class demos
Homework 2 due
Homework 3: assignment
September 22, Lecture 7: Diagnostics and Transformations
Assumption checking for the simple linear model; assumption checking for the simple linear model with Gaussian noise. Generalization out of sample. Nonlinearities: transforming the predictor; nonlinear least squares; nonparametric smoothing. Transformations of the response to make the assumptions hold; Box-Cox transformations. Cautions about transforming the response: changed interpretation, changed model of noise, utter lack of motivation for most common transformations. What the residuals look like under mis-specification.
.Rnw file which produced the notes
See also: supplement, based on class discussion: Interpreting models after transformations
Readings: sections 3.1--3.3 and 3.8--3.9.
R for in-class demos
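As a rough illustration of the transformation ideas from this lecture (the in-class demos use real data; df, x and y here are placeholders):

    # A log transformation of the response, and a Box-Cox check
    fit.log <- lm(log(y) ~ x, data = df)        # only sensible when y > 0
    library(MASS)                               # provides boxcox()
    bc <- boxcox(lm(y ~ x, data = df))          # profile likelihood over lambda
    bc$x[which.max(bc$y)]                       # lambda with the highest likelihood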
September 24, Lecture 8: Inference in simple linear regression I
Inference for coefficients: standard errors; confidence sets; hypothesis tests; reminders about translating between confidence sets and hypothesis tests; reminders that statistical significance is not practical importance. Readings: sections 2.1--2.3.
.Rnw file which produced the notes
Homework 3 due
Homework 4: assignment, auto-mpg.csv, abalone.csv
September 29, Lecture 9: Inference in simple linear regression II
Inference for expected values: standard errors, confidence sets. Inference for new measurements: standard errors, confidence sets. Readings: sections 2.4--2.6.
.Rnw source file for the notes
Supplement, based on class discussion: Interpreting models after transformations
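In R, the intervals from Lectures 8 and 9 come from confint() and predict(); a minimal sketch, assuming a fitted model fit <- lm(y ~ x, data = df) and a hypothetical new predictor value x0:

    confint(fit)                                                        # confidence intervals for the coefficients
    new <- data.frame(x = x0)
    predict(fit, newdata = new, interval = "confidence", level = 0.95)  # for the regression line at x0
    predict(fit, newdata = new, interval = "prediction", level = 0.95)  # for a new observation at x0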
October 1, Lecture 10: F tests, R^2 and other distractions
The F test for whether the slope is 0; F tests for linear models generally. Likelihood ratio tests as a more general alternative to F tests. R^2: distraction or nuisance? Correlation and regression coefficients; "does anyone know when the correlation coefficient is useful?". How to honor tradition in science. Readings: sections 2.7--2.9.
Homework 4 due
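A sketch of the F test from Lecture 10, comparing the intercept-only model to the simple linear regression model (df, x and y are placeholders):

    fit0 <- lm(y ~ 1, data = df)   # intercept only
    fit1 <- lm(y ~ x, data = df)   # simple linear regression
    anova(fit0, fit1)              # F test that the slope is zero; matches summary(fit1)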
October 6, Lecture 11: Exam 1 review
October 8, Lecture 12: Theory exam 1
Data analysis project 1: project, mobility.csv
October 13, Lecture 13: Linear regression and linear algebra
Simple linear regression in matrix form. Readings: chapter 5 (all of it).
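To connect the matrix formulas to computation, here is a sketch of computing the OLS estimate directly (df, x and y are placeholders); lm() does this more carefully behind the scenes.

    X <- model.matrix(~ x, data = df)                     # design matrix, with intercept column
    beta.hat <- solve(crossprod(X), crossprod(X, df$y))   # solves (X'X) beta = X'y
    beta.hat
    coef(lm(y ~ x, data = df))                            # should agree, up to rounding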
October 15, Lecture 14: Multiple linear regression
Linear models with multiple predictor variables. Ordinary least squares estimation. Why multiple regression doesn't just add up simple regressions. Readings: sections 6.1--6.4.
Data analysis project 1 due
Homework 5: assignment, gpa.txt, commercial.txt
October 20, Lecture 15: Diagnostics and Inference
Assumption-checking for multiple linear regression; diagnostics. Inference for ordinary least squares: sampling distributions, degrees of freedom, confidence sets and hypothesis tests. Readings: sections 6.6--6.8.
.Rnw source file for the lecture
October 22, Lecture 16: Polynomials and Categorical Predictors
Dealing with non-linearities by adding polynomial terms. Cautions about polynomials. Dealing with categorical predictors by adding "dummy" or "indicator" variables. Interpretation of coefficients on categoricals. Readings: sections 8.1--8.7.
.Rnw source file for the lecture
Homework 5 due
Homework 6: assignment, SENIC data set (see Blackboard for the excerpt from the textbook describing this file)
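A sketch of how Lecture 16's polynomial and categorical terms are written in R (df is a placeholder data frame; g should be a factor):

    fit.poly <- lm(y ~ x + I(x^2), data = df)   # quadratic in x
    fit.cat  <- lm(y ~ x + g, data = df)        # R builds indicator ("dummy") variables for g
    head(model.matrix(fit.cat))                 # shows the dummy coding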
October 27, Lecture 17: Multicollinearity
Multicollinearity: what it is and why it's a problem. Identifying collinearity from pairs plots; why multicollinearity may not show up this way. Dealing with collinearity by dropping variables. Picking out multicollinearity from eigenvalues and eigenvectors; principal components regression. Ridge regression for multicollinearity and for stabilizing estimates. High dimensional regression.
Readings: sections 7.1--7.3 and 10.1--10.5.
.Rnw source file
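One rough way to look for the multicollinearity discussed in Lecture 17 (df and the predictors are placeholders):

    fit <- lm(y ~ x1 + x2 + x3, data = df)
    pairs(df)                          # pairwise scatterplots of the variables
    X <- model.matrix(fit)[, -1]       # predictor columns, without the intercept
    eigen(cor(X))$values               # eigenvalues near zero signal collinearity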
October 29, Lecture 18: Testing and Confidence Sets for Multiple Coefficients
Tests for individual coefficients (in the context of a specific larger model). "Partial" F tests and likelihood ratio tests for groups of coefficients (in the context of a larger model). "Full" F tests and likelihood ratio tests for all the slopes at once (in the context of a larger model). Cautions about these tests. Confidence rectangles for multiple coefficients; confidence ellipsoids for multiple coefficients.
.Rnw source file for the notes
Readings: sections 7.3--7.4.
Homework 6 due
Homework 7: assignment, water.txt data file
November 3, Lecture 19: Interactions
General concept of interactions between variables. Conventional form of interactions in linear models. Interactions between numerical and categorical variables. Readings: sections 8.1--8.2.
.Rnw source file for the lecture
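Interactions from Lecture 19 are written with * or : in R model formulas; a minimal sketch with a numerical x and a factor g (both placeholders):

    fit.int <- lm(y ~ x * g, data = df)          # expands to x + g + x:g
    summary(fit.int)                             # different slopes by level of g
    anova(lm(y ~ x + g, data = df), fit.int)     # does the interaction improve the fit?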
November 5, Lecture 20: Influential points and outliers
"Influence" of a data point on OLS estimates. Outlier detection. Dealing with outliers and influential points: by deletion; by robust (non-OLS) regression. Readings: section 10.1--10.5.
.Rnw source file for the lecture
Homework 7 due
Homework 8: assignment, real-estate.csv
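Some of Lecture 20's diagnostics, sketched for a previously fitted lm object called fit:

    h <- hatvalues(fit)          # leverages (diagonal of the hat matrix)
    d <- cooks.distance(fit)     # Cook's distances
    plot(fit, which = 4)         # Cook's distance, by observation
    plot(fit, which = 5)         # residuals vs. leverage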
November 10, Lecture 21: Model selection
Comparing competing models. Traditional approaches. Sound approaches. Difficulties of inference after selection. Readings: sections 9.1--9.4.
.Rnw source file
November 12, Lecture 22: Midterm review
Practice Exam 2
November 13
Homework 8 due at 4:30 pm
November 17, Lecture 23: Theory exam 2
Data analysis project 2: assignment, bikes.csv
November 19, Lecture 24: Non-Constant Noise Variance (special topics I)
"Heteroskedasticity" = changing noise variance. Dealing with heteroskedasticity by weighted least squares. WLS estimation in practice. Where do the weights come from? Readings: section 11.1; lecture notes.
November 24, Lecture 25: Correlated noise (special topics II)
Dealing with correlations in the noise by generalized least squares. GLS estimation in practice. Where do the correlations come from? Readings: chapter 12; lecture notes.
Data analysis project 2 due
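One way (not the only one) to fit a model with correlated noise in R is gls() from the nlme package; a sketch assuming a placeholder data frame df whose rows are in time order:

    library(nlme)
    fit.gls <- gls(y ~ x, data = df, correlation = corAR1())   # AR(1) noise
    summary(fit.gls)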
December 1, Lecture 26: Variable Selection (special topics III)
Variable selection as a special case of model selection. Why p-values are very bad guides to which variables are important. Cross-validation for variable selection: leave-one-out and k-fold. Stepwise regression; stepwise regression in R. Cautions about inference after selection, again.
Readings: Re-read lecture 21!
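Two of the approaches from this lecture, sketched on a placeholder data frame df with response y and predictors x1, x2, ...:

    full <- lm(y ~ ., data = df)
    step(full, direction = "backward")    # stepwise selection by AIC
    # 5-fold cross-validation estimate of prediction error for one candidate model
    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(df)))
    cv.mse <- sapply(1:k, function(i) {
      fit <- lm(y ~ x1 + x2, data = df[folds != i, ])
      mean((df$y[folds == i] - predict(fit, newdata = df[folds == i, ]))^2)
    })
    mean(cv.mse)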
December 3, Lecture 27: Regression trees (special topics IV)
"Regressograms": regression by averaging over discretized variables. Partitioning and trees. Interpretation of regression trees. Nonlinearity and interaction; average predictive comparisons. Fitting trees with cross-validation.
Note: Sections 1 and 2 of the lecture notes for today are the most relevant; section 3 is about what to do when the response variable is categorical.
Homework 9: assignment --- due on Tuesday, 8 December
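A sketch of growing and pruning a regression tree with the rpart package (df is a placeholder data frame with response y):

    library(rpart)
    tree <- rpart(y ~ ., data = df)
    printcp(tree)                    # cross-validated error by tree size
    best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    pruned <- prune(tree, cp = best.cp)
    plot(pruned); text(pruned)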
December 8, Lecture 28: Bootstrap I (special topics V)
Sampling distributions and the bootstrap principle. Resampling. Inference when Gaussian assumptions are shaky. Bootstrap standard errors and confidence intervals. Readings: section 11.5; handouts.
Homework 9 due
Data analysis project 3: assignment. Your personalized data set has been e-mailed to your Andrew address; contact the professor as soon as possible if you have any problem with the data set.
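A sketch of the case-resampling bootstrap for regression coefficients (df, x and y are placeholders):

    B <- 1000
    boot.coefs <- replicate(B, {
      rows <- sample(nrow(df), replace = TRUE)   # resample rows with replacement
      coef(lm(y ~ x, data = df[rows, ]))
    })
    apply(boot.coefs, 1, sd)                                    # bootstrap standard errors
    apply(boot.coefs, 1, quantile, probs = c(0.025, 0.975))     # percentile intervals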
December 10, Lecture 29: Bootstrap II (special topics VI)
More resampling. Bootstrap prediction intervals. Bootstrap plus model selection. When will bootstrapping not work?
December 15
Data analysis project 3 due at 5 pm
Onwards to 36-402!