Cosma Shalizi
36-401, Modern Regression, Section B
Fall 2015
Section B: Tuesdays and Thursdays, 3:00--4:20, Baker Hall 136A
Here's the official description:
This course is an introduction to the real world of statistics and
data analysis. We will explore real data sets, examine various models for the
data, assess the validity of their assumptions, and determine which conclusions
we can draw (if any). Data analysis is a bit of an art; there may be several
valid approaches. We will strongly emphasize the importance of critical
thinking about the data and the question of interest. Our overall goal is to
use a basic set of modeling tools to explore and analyze data and to present
the results in a scientific report. A minimum grade of C in any one of the
prerequisites is required. A grade of C is required to move on to 36-402 or
any 36-46x course.
This is a class on linear statistical models: the oldest, most widely used,
and mathematically simplest sort of statistical model. It serves
as a first course in serious data analysis, as an introduction to
statistical modeling and prediction, and as an initiation into a community
of inquiry which has developed over two centuries and grown to include
every branch of science, technology and policy.
During the class, you will do data analyses with existing software, and
begin learning to write your own simple programs to implement and extend key
techniques. You will also have to write reports about your analyses.
Graduate students from other departments wishing to take this course should
register for it under the number "36-607". Enrollment for 36-607 is very
limited, and by permission of the professor only.
Prerequisites
Mathematical statistics: one of 36-226, 36-326 or 36-625, with at least a
grade of C; linear algebra, one of 21-240, 21-241 or 21-242, with at least a
grade of C. These requirements will not be waived for undergraduates under any
circumstances. Graduate students wishing to enroll in 36-607 will need to have
had equivalent courses (as determined by the instructor).
Having previously
taken 36-350,
Introduction to Statistical Computing, or taking it concurrently, is strongly
recommended but not required.
Instructors
Professor | Dr. Cosma Shalizi | cshalizi [at] cmu.edu |
| | Baker Hall 229C |
Teaching assistants | Ms. Natalie Klein | |
| Ms. Amanda Luby | |
| Mr. Michael Spece-Ibañez | |
Topics, Notes, Readings
This is currently a tentative listing of topics, in order.
- Simple linear regression: Statistical prediction by least
squares; using one quantitative variable to predict another. Optimal linear
prediction. Estimation of the simple linear regression model. Gaussian
estimation theory for the simple linear model. Assumption-checking and
regression diagnostics. Prediction intervals. (A minimal R sketch follows
this list.)
- Multiple linear regression: Linear predictive models with
multiple predictor variables. "Population" form of multiple regression.
Answering "what if" questions with multiple regression models. Ordinary least
squares estimation of multiple regression. Standard errors. Gaussian
estimation theory, confidence and prediction intervals. Regression
diagnostics. Categorical predictor variables; analysis of variance.
- Variable selection: Review of hypothesis testing theory
from mathematical statistics. Significance tests for regression coefficients;
confidence sets for coefficients. Common fallacies about "significant"
coefficients, and how to avoid them. Model and variable selection.
- Beyond strictly linear ordinary least squares: Interaction
terms. Transformation of predictor variables. Transformation of response
variable; common fallacies about transformed responses, and how to avoid them.
Weighted least squares for non-constant variance; generalized least squares for
time series.
- Truly modern regression: Prediction and cross-validation
for model and variable selection. Resampling and bootstrap for statistical
inference. Regression trees.
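To give a concrete taste of what the very first of these topics looks like in
practice, here is a minimal sketch of fitting a simple linear regression in R;
the built-in cars data set is my choice for illustration, not a course data
set.

    # Simple linear regression of stopping distance on speed,
    # using R's built-in cars data set (illustration only)
    fit <- lm(dist ~ speed, data = cars)
    summary(fit)    # coefficient estimates, standard errors, R^2
    plot(cars$speed, cars$dist,
         xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
    abline(fit)     # add the fitted least-squares line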
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Exams will be 30% of your grade, data-analysis projects 45%, and homework
25%.
Theory Exams
There will be two in-class mid-term exams, both focusing on
the theoretical portions of the course. Both exams will be cumulative. Each
exam will be 15% of your final grade.
Data Analysis Projects
There will be three take-home projects where you will analyze real data
sets, and write up your findings in the form of a scientific report. You will
be graded both on the technical correctness of your work and on your ability to
clearly communicate your findings; in particular, raw computer output or code
is not acceptable. (Rubrics and example reports will be made available before
the first DAP is assigned.)
The DAPs are exams; consequently, collaboration is not allowed.
Each DAP will count for 15% of your final grade.
Homework
The homework will give you practice in using the techniques you are learning
to analyze data, and to interpret the analyses. They will also include some
theory questions, requiring you to do calculations or prove results
mathematically. There will be one homework assignment every week in which
there is not an exam. Every assignment will count equally towards 25% of your
grade. Your lowest two homework grades will be dropped;
consequently, no late homework will be accepted for any
reason.
Communicating your results to others is as important as getting good results
in the first place. A portion of the points available for every homework will
be set aside to reflect the clarity of your writing, figures, data
presentation, and other marks of communication. (Rubrics will be provided for
each assignment.) In addition, at least two homeworks will be practice DAPs,
where you will have to write reports in the same manner as the data analysis
projects.
Formats and Submission of Assignments
Except as otherwise noted in the schedule, all assignments will be due at 3
pm on Thursdays (i.e., at the beginning of class), through Blackboard. Late
assignments are not accepted for any reason. Coming late to class because you
are uploading an assignment is unacceptable.
You will submit a PDF or HTML file containing a readable version of all your
write-ups, mathematics, figures, tables, and selected portions of code
as relevant. Word files will not be graded. (You may
write in Word if you must, but you need to submit either PDF or HTML.)
You are strongly encouraged to use R
Markdown to integrate text, code, images and mathematics. If you do, you
will submit both the "knitted" PDF or HTML file, and the source .Rmd file. If
you choose not to use R Markdown, you will submit both a humanly-readable file,
as PDF or HTML, and a separate plain-text file containing all your R code,
clearly commented and formatted to indicate which code section goes with which
problem.
If you do not use an equation editor, LaTeX, etc., you may include pictures
or scans of hand-written mathematics as needed.
Interviews
To help gauge how well the class is going, and how well the grading reflects
actual understanding, every week (after the first week of classes), six
students will be selected at random, and will meet with the professor for
10--15 minutes each, to explain their work and to answer questions about it.
You may be selected on multiple weeks, if that's how the random numbers come
up. This is not a punishment, but a way for the professor to see
whether the problem sets are really measuring learning of the course material;
being selected will not hurt your grade in any way (and might even help).
Refusing to participate on a week you are selected will, however, automatically
drop your final grade by one letter.
Office Hours
If you want help with computing, please bring your laptop.
Mondays, 2--3 pm | Ms. Klein | Porter Hall 117 |
Mondays, 3--4 pm | Mr. Spece-Ibañez | Porter Hall 117 |
Wednesdays, noon--1 pm | Prof. Shalizi | Baker Hall 229C |
Wednesdays, 4--5 pm | Ms. Luby | Porter Hall 117 |
Thursdays, noon--1 pm | Prof. Shalizi | Baker Hall 229C |
If you cannot make the scheduled office hours, please e-mail the professor
about making an appointment.
Blackboard
Blackboard will be used for submitting assignments electronically, and as a
gradebook. All properly enrolled students should have access to the Blackboard
site by the beginning of classes.
Textbook
The primary textbook for the course will be Kutner, Nachtsheim and
Neter's Applied Linear Regression Models, 4th edition
(McGraw-Hill, 2004,
ISBN 0-07-238691-6).
This is required. (The fifth edition is also acceptable,
though if you use it, when specific problems or readings are assigned from the
text, you are responsible for ensuring that they match up with what's
intended.)
Four other books are recommended:
- Julian J. Faraway, Linear Models with R, second edition (CRC Press, 2014,
ISBN 978-1-439-88733-2)
- Paul Teetor, The R Cookbook (O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
- D. R. Cox and Christl Donnelly, Principles of Applied Statistics (Cambridge
University Press, 2011,
ISBN 978-1-107-64445-8)
- Richard A. Berk, Regression Analysis: A Constructive Critique
(Sage Press, 2004, ISBN 978-0-7619-2904-8)
Collaboration, Cheating and Plagiarism
In general, you are free to discuss homework with each other,
though all the work you turn in must be your own; you must not copy
mathematical derivations, computer output and input, or written descriptions
from anyone or anywhere else, without reporting the source within your work.
(This includes copying from solutions provided in previous semesters of the
course.) Unacknowledged copying or unauthorized collaboration will lead to
severe disciplinary action, beginning with an automatic grade of zero for all
involved and escalating from there. Please read the
CMU Policy on
Cheating and Plagiarism, and don't plagiarize.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
Computational Work
R is a free, open-source software
package/programming language for statistical computing. Many of you will have
some prior exposure to the language; for the rest, now is a great time to start
learning. Almost every assignment will require you to use it. No other form
of computational work will be accepted. If you are not able to use R,
or do not have ready, reliable access to a computer on which you can do so, let
me know at once.
R Markdown is an extension to R
which lets you embed your code, and the calculations it produces, in ordinary
text, which can also be formatted, contain figures and equations, etc. Using R
Markdown is strongly encouraged. If you do, you need to
submit both your "knitted" file (HTML or PDF, not Word), and the original
.Rmd file.
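For instance, here is a minimal, hypothetical .Rmd file (the title and
contents are mine, just to show the format); knitting it produces a PDF or
HTML document with the code and its output interleaved.

    ---
    title: "Homework 1"
    author: "Your Name Here"
    output: pdf_document
    ---

    The slope below is estimated by ordinary least squares.

    ```{r}
    fit <- lm(dist ~ speed, data = cars)  # built-in data set, for illustration
    coef(fit)
    ```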
If you choose not to use R Markdown, for all computational assignments
you need to submit both a properly-formatted, humanly-readable write-up,
as PDF or HTML, and a separate plain-text file containing your R code,
commented so that it is clear which pieces of code go with which problem.
Word files will not be graded.
R Resources
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- Paul Teetor, The R Cookbook, explains how to use R to
do many, many common tasks. (It's like the inverse to R's help: "What command
does X?", instead of "What does command Y do?").
- The notes for 36-350, Introduction to
Statistical Computing
- There are now many books about R. Some recommendable ones:
- Joseph Adler, R in a Nutshell
(O'Reilly, 2009;
ISBN 978-0-596-80170-0). Probably most useful for those with previous experience programming in another language.
- W. John Braun and
Duncan
J. Murdoch, A
First Course in Statistical Programming with R (Cambridge University Press, 2008; ISBN 978-0-521-69424-7)
- John M. Chambers, Software for Data Analysis:
Programming with R
(Springer, 2008,
ISBN 978-0-387-75935-7).
The best book on writing clean and reliable R programs; probably more advanced
than you will need.
- Norman
Matloff, The Art of R Programming (No Starch Press, 2011,
ISBN 978-1-59327-384-2).
Good introduction to programming for complete novices using R. Less statistics
than Braun and Murdoch, more programming skills.
- The R Markdown Cheat Sheet
Even if you know how to do some basic coding (or more), you
should read the page of Minimal
Advice on Programming.
Other Iterations of the Class
In fall 2015, Section A of the class is being taught by Prof. Xizhen Cai;
the two sections will be closely coordinated but are separate classes.
If you came here from a search engine, you may be looking for
information on previous versions of the class, as taught by
Prof. Rebecca Nugent.
Schedule
Subject to revision. Lecture notes, assignments and solutions
will all be linked here, as they are available. All readings are from
the textbook by Kutner et al., unless otherwise noted.
- September 1, Lecture 1: Introduction to the course
- Course mechanics; random variables and probability review; statistical prediction; optimal linear prediction.
- Reading: Appendix A (on Blackboard if you do not yet have the textbook)
- Homework 1: Assignment,
fha.csv data set
- September 3, Lecture 2: Exploratory data analysis and R
- Office hours will be held in computing labs today and on selected
days next week; see Blackboard for details. Attendance at one of these is optional but strongly encouraged.
- Readings: "Introduction to R Selected Handouts for 36-401" (by Prof. Nugent),
and "36-401 Fall 2015 R Introduction"
- September 8, Lecture 3: About
Statistical Modeling
- An example data set. Drawing lines through scatterplots. Why
prefer one line over another? Statistical models as data
summaries; models as tools for inference. Sources of uncertainty
in inference: sampling, measurement error, fluctuations. Models as
assumptions on the data-generating process. Some examples. Inference within
a model vs. checking model assumptions. Introducing the simple linear
regression model.
- For LaTeX/knitr users: the .Rnw file used to generate the notes
- Reading for the week: sections 1.1--1.5 (on Blackboard)
- September 10, Lecture 4: Simple linear regression models.
- The simple linear regression model: once more with feeling.
Consistency, unbiasedness and variance of the plug-in estimator. "The
method of least squares". The Gaussian noise ("normal error") simple
linear regression model.
- For LaTeX/knitr users: the .Rnw file used to generate the notes
- Homework 1 due; solutions on Blackboard (please don't share beyond this class)
- Homework 2: assignment
- September 15, Lecture 5: Estimating simple linear regression I
- The method of least squares. Assumptions of the method. Properties of the estimates. Predictive inference. Least-squares estimation in R. Reading: sections 1.6 and 1.7.
- .Rnw file used to generate the notes
- September 17, Lecture 6: Estimating simple linear regression II
- The Gaussian model. Assumptions of the model. Consequences:
maximum likelihood estimation; properties of the MLE. Reading: section 1.8.
- R for in-class demos
- Homework 2 due
- Homework 3: assignment
- September 22, Lecture 7: Diagnostics and Transformations
- Assumption checking for the simple linear model; assumption checking
for the simple linear model with Gaussian noise. Generalization out of sample.
Nonlinearities: transforming the predictor; nonlinear least squares;
nonparametric smoothing. Transformations of the response to make the
assumptions hold; Box-Cox transformations. Cautions about transforming the
response: changed interpretation, changed model of noise, utter lack of
motivation for most common transformations. What the residuals look like
under mis-specification.
- .Rnw file which produced the notes
- See also: supplement, based on class discussion: Interpreting models after transformations
- Readings: sections 3.1--3.3 and 3.8--3.9.
- R for in-class demos
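As a companion to Lecture 7, here is a minimal sketch of these diagnostics in
R, again using the built-in cars data for illustration (my choice, not the
course's):

    fit <- lm(dist ~ speed, data = cars)
    plot(fitted(fit), residuals(fit))  # look for curvature or changing spread
    abline(h = 0, lty = 2)
    qqnorm(residuals(fit)); qqline(residuals(fit))  # check approximate Gaussianity
    library(MASS)  # ships with R
    boxcox(fit)    # profile log-likelihood over Box-Cox transformations of dist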
- September 24, Lecture 8: Inference in simple linear regression I
- Inference for coefficients: standard errors; confidence sets; hypothesis
tests; reminders about translating between confidence sets and hypothesis
tests; reminders that statistical significance is not practical importance.
Readings: sections 2.1--2.3.
- .Rnw file which produced the notes
- Homework 3 due
- Homework 4: assignment,
auto-mpg.csv,
abalone.csv
- September 29, Lecture 9: Inference in simple linear regression II
- Inference for expected values: standard errors, confidence
sets. Inference for new measurements: standard errors, confidence
sets. Readings: sections 2.4--2.6.
- .Rnw source file for the notes
- Supplement, based on class discussion: Interpreting models after transformations
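A minimal sketch of these two kinds of intervals in R, using the built-in cars
data for illustration (my choice, not the course's):

    fit <- lm(dist ~ speed, data = cars)
    new <- data.frame(speed = c(10, 20))
    predict(fit, newdata = new, interval = "confidence")  # CI for the conditional mean
    predict(fit, newdata = new, interval = "prediction")  # wider interval for a new case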
- October 1, Lecture 10: F tests, R^2 and other distractions
- The F test for whether the slope is 0; F tests for linear
models generally. Likelihood ratio tests as a more general alternative
to F tests. R^2: distraction or nuisance? Correlation
and regression coefficients; "does anyone know when the correlation coefficient is useful?". How to honor tradition in science.
Readings: sections 2.7--2.9.
- Homework 4 due
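A minimal sketch of the F test in R, comparing the intercept-only model to the
model with a slope (built-in cars data, my illustration):

    fit0 <- lm(dist ~ 1, data = cars)      # intercept-only model
    fit1 <- lm(dist ~ speed, data = cars)
    anova(fit0, fit1)                      # F test that the slope is zero
    summary(fit1)$r.squared                # R^2, for what it's worth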
- October 6, Lecture 11: Exam 1 review
- October 8, Lecture 12: Theory exam 1
- Data analysis project 1: project,
mobility.csv
- October 13, Lecture 13: Linear regression and linear algebra
- Simple linear regression in matrix
form. Readings: chapter 5 (all of it).
- October 15, Lecture 14: Multiple linear regression
- Linear models with multiple predictor variables. Ordinary
least squares estimation. Why multiple regression doesn't just add
up simple regressions. Readings: sections 6.1--6.4.
- Data analysis project 1 due
- Homework 5: assignment,
gpa.txt, commercial.txt
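A minimal sketch of why multiple regression is not just a stack of simple
regressions, using R's built-in mtcars data (my illustration, not a course
data set):

    fit.simple <- lm(mpg ~ wt, data = mtcars)
    fit.multi  <- lm(mpg ~ wt + hp, data = mtcars)
    coef(fit.simple)
    coef(fit.multi)  # the coefficient on wt changes once hp is held fixed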
- October 20, Lecture 15: Diagnostics and Inference
- Assumption-checking for multiple linear regression; diagnostics.
Inference for ordinary least squares: sampling distributions, degrees
of freedom, confidence sets and hypothesis tests. Readings: sections 6.6--6.8.
- .Rnw source file for the lecture
- October 22, Lecture 16: Polynomials and Categorical Predictors
- Dealing with non-linearities by adding polynomial terms. Cautions
about polynomials. Dealing with categorical predictors by adding "dummy" or
"indicator" variables. Interpretation of coefficients on categoricals.
Readings: sections 8.1--8.7.
- .Rnw source file for the lecture
- Homework 5 due
- Homework 6: assignment,
SENIC data set (see Blackboard for
the excerpt from the textbook describing this file)
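A minimal sketch of polynomial and categorical terms in R (built-in mtcars
data, my illustration):

    fit.poly <- lm(mpg ~ poly(hp, 2), data = mtcars)       # quadratic in hp
    fit.cat  <- lm(mpg ~ wt + factor(cyl), data = mtcars)  # cyl as a categorical
    summary(fit.cat)  # coefficients on the cyl dummies are contrasts
                      # against the baseline level (4 cylinders)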
- October 27, Lecture 17: Multicollinearity
- Multicollinearity: what it is and why it's a problem. Identifying
collinearity from pairs plots; why multicollinearity may not show up this way.
Dealing with collinearity by dropping variables. Picking out multicollinearity
from eigenvalues and eigenvectors; principal components regression. Ridge
regression for multicollinearity and for stabilizing estimates. High
dimensional regression.
- Readings: sections 7.1--7.3 and 10.1--10.5.
- .Rnw source file
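A minimal sketch of checking for multicollinearity in R (built-in mtcars data,
my illustration):

    X <- mtcars[, c("wt", "hp", "disp")]
    eigen(cor(X))$values  # eigenvalues near zero signal collinearity
    # variance inflation factor for wt, computed by hand:
    r2 <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
    1 / (1 - r2)          # large values mean wt is nearly redundant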
- October 29, Lecture 18: Testing and Confidence Sets for Multiple
Coefficients
- Tests for individual coefficients (in the context of a specific larger model).
"Partial" F tests and likelihood ratio tests for groups of coefficients (in the context of a larger model).
"Full" F tests and likelihood ratio tests for all the slopes at once (in the context of a larger model).
Cautions about these tests.
Confidence rectangles for multiple coefficients; confidence ellipsoids for
multiple coefficients.
- .Rnw source file for the notes
- Readings: sections 7.3--7.4.
- Homework 6 due
- Homework 7: assignment, water.txt data file
- November 3, Lecture 19: Interactions
- General concept of interactions between variables. Conventional
form of interactions in linear models. Interactions between numerical
and categorical variables. Readings: sections 8.1--8.2.
- .Rnw source file for the lecture
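A minimal sketch of an interaction between a numerical and a categorical
predictor in R (built-in mtcars data, my illustration):

    # separate slopes of mpg on wt for automatic vs. manual transmissions
    fit <- lm(mpg ~ wt * factor(am), data = mtcars)
    coef(fit)  # the wt:factor(am)1 coefficient is the difference in slopes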
- November 5, Lecture 20: Influential points and outliers
- "Influence" of a data point on OLS estimates. Outlier
detection. Dealing with outliers and influential points: by deletion;
by robust (non-OLS) regression. Readings: sections 10.1--10.5.
- .Rnw source file for the lecture
- Homework 7 due
- Homework 8: assignment,
real-estate.csv
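A minimal sketch of influence diagnostics in R (built-in cars data, my
illustration):

    fit <- lm(dist ~ speed, data = cars)
    hatvalues(fit)        # leverage of each point
    rstandard(fit)        # standardized residuals, for spotting outliers
    cooks.distance(fit)   # each point's influence on the coefficients
    plot(fit, which = 5)  # residuals vs. leverage, with Cook's distance contours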
- November 10, Lecture 21: Model selection
- Comparing competing models. Traditional approaches. Sound
approaches. Difficulties of inference after selection. Readings: sections
9.1--9.4.
- .Rnw source file
- November 12, Lecture 22: Exam 2 review
- Practice Exam 2
- November 13
- Homework 8 due at 4:30 pm
- November 17, Lecture 23: Theory exam 2
- Data analysis project 2: assignment, bikes.csv
- November 19, Lecture 24: Non-Constant Noise Variance (special topics I)
- "Heteroskedasticity" = changing noise variance. Dealing with
heteroskedasticity by weighted least squares. WLS estimation in practice.
Where do the weights come from? Readings: section 11.1; lecture notes.
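A minimal sketch of weighted least squares in R; the weights here are
hypothetical, assuming (just for illustration) that the noise standard
deviation grows linearly with speed:

    # built-in cars data; weights are inverse to the assumed noise variance
    fit.wls <- lm(dist ~ speed, data = cars, weights = 1 / speed^2)
    summary(fit.wls)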
- November 24, Lecture 25: Correlated noise (special topics II)
- Dealing
with correlations in the noise by generalized least squares. GLS estimation
in practice. Where do the correlations come from? Readings: chapter 12; lecture
notes.
- Data analysis project 2 due
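A minimal sketch of generalized least squares in R, via the nlme package
(which ships with R); treating the cars data as though its rows were ordered
in time is purely for illustration:

    library(nlme)
    # GLS with AR(1)-correlated noise along the row order (illustration only)
    fit.gls <- gls(dist ~ speed, data = cars, correlation = corAR1(form = ~ 1))
    summary(fit.gls)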
- December 1, Lecture 26: Variable Selection (special topics III)
- Variable selection as a special case of model selection. Why p-values
are very bad guides to which variables are important. Cross-validation for
variable selection: leave-one-out and k-fold. Stepwise regression;
stepwise regression in R. Cautions about inference after selection, again.
- Readings: Re-read lecture 21!
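A minimal sketch of k-fold cross-validation for choosing between two candidate
models (built-in mtcars data, my illustration):

    folds <- sample(rep(1:5, length.out = nrow(mtcars)))  # five random folds
    mse <- matrix(NA, nrow = 5, ncol = 2)
    for (k in 1:5) {
      train <- mtcars[folds != k, ]
      test  <- mtcars[folds == k, ]
      m1 <- lm(mpg ~ wt, data = train)
      m2 <- lm(mpg ~ wt + hp, data = train)
      mse[k, 1] <- mean((test$mpg - predict(m1, test))^2)
      mse[k, 2] <- mean((test$mpg - predict(m2, test))^2)
    }
    colMeans(mse)  # prefer the model with the smaller cross-validated MSE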
- December 3, Lecture 27: Regression trees (special topics IV)
- "Regressograms": regression by averaging over discretized variables.
Partitioning and trees. Interpretation of regression trees. Nonlinearity and
interaction; average predictive comparisons. Fitting trees with
cross-validation.
- Note: Sections 1 and 2 of the lecture notes for today are the most relevant; section 3 is about what to do when the response variable is categorical.
- Homework 9: assignment --- due on Tuesday, 8 December
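A minimal sketch of fitting a regression tree in R, via the rpart package (one
of R's recommended packages); built-in mtcars data, my illustration:

    library(rpart)
    tree <- rpart(mpg ~ wt + hp + disp, data = mtcars)
    print(tree)  # the sequence of splits and the mean response in each leaf
    plot(tree); text(tree)  # draw the tree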
- December 8, Lecture 28: Bootstrap I (special topics V)
- Sampling distributions and the bootstrap principle. Resampling.
Inference when Gaussian assumptions are shaky. Bootstrap standard errors and
confidence intervals. Readings: section 11.5; handouts.
- Homework 9 due
- Data analysis project 3: assignment. Your personalized data set has been e-mailed to your Andrew address; contact the professor as soon as possible if you have any problem with the data set.
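A minimal sketch of resampling cases to get a bootstrap standard error for a
regression slope (built-in cars data, my illustration):

    B <- 1000  # number of bootstrap replicates
    boot.slopes <- replicate(B, {
      idx <- sample(nrow(cars), replace = TRUE)       # resample rows
      coef(lm(dist ~ speed, data = cars[idx, ]))["speed"]
    })
    sd(boot.slopes)                          # bootstrap standard error
    quantile(boot.slopes, c(0.025, 0.975))   # crude 95% confidence interval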
- December 10, Lecture 29: Bootstrap II (special topics VI)
- More resampling. Bootstrap prediction intervals. Bootstrap plus
model selection. When will bootstrapping not work?
- December 15
- Data analysis project 3 due at 5 pm