Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis
Spring 2011
This page has information about the 2011 version of the class. The 2012
version is over here.
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
The goal of this class is to train students in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401,
extending it to more general functional forms, and more general kinds of data,
emphasizing the computation-intensive methods introduced since the 1980s.
After taking the class, when you're faced with a new data-analysis problem, you
should be able to (1) select appropriate methods, (2) use statistical software
to implement them, (3) critically evaluate the resulting statistical models,
and (4) communicate the results of your analyses to collaborators and to
non-statisticians.
Graduate students from other departments wishing to take this course should
register for it under the number "36-608".
Prerequisites
36-401,
or, in unusual circumstances, an equivalent course approved by the instructor.
Instructors
Professor: Cosma Shalizi, cshalizi [at] cmu.edu, 229 C Baker Hall, 268-7826
Teaching assistants: Gaia Bellone, gbellone [at] stat.cmu.edu
                     Zachary Kurtz, zkurtz [at] stat.cmu.edu
                     Shuhei Okumura, sokumura [at] stat.cmu.edu
Topics, Notes, Readings
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; bootstrap; penalized
fitting; information criteria; mis-specification checks; model averaging
- Yet More Linear Regression: what is regression, really?;
review of ordinary linear regression and its limits; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; classification and regression
trees; kernel density estimation
- GAMs: logistic regression; generalized
linear models; generalized additive models.
- Latent variables and structured data: principal
components; factor analysis and latent variables; graphical models in general;
latent cluster/mixture models; hierarchical models and partial pooling
- Causality: estimating causal
effects; discovering causal structure
- Time series: Markov models for time series without
latent variables; hidden Markov models for time series with latent variables
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Homework will be 60% of the grade, two midterms 10% each, and the final
20%.
Homework
There will be eleven homework assignments, nearly one every week; they will
all count equally, and together make up 60% of your grade.
The homework will give you practice in using the techniques you are learning to
analyze data, and to interpret the analyses. Communicating your results to
others is as important as getting good results in the first place. Raw
computer output and R code are not acceptable as a write-up; instead, they
should be put in an appendix to each assignment.
Homework will be due, in hard copy, at the beginning of class on Tuesdays. The
lowest three homework grades will be dropped; consequently, no late homework
will be accepted.
Exams
There will be two take-home mid-term exams (10% each), due at 5 pm on March
1st and April 12th. (Please let me know as soon as possible if you have a
conflict with either date.) You will have one week to work on each midterm.
There will be no homework in those weeks, and lecture on the day they are due
will be replaced with special office hours. There will also be a take-home
final exam (20%), due at 10 am on May 9, which you will have two weeks to do.
Office Hours
Prof. Shalizi will hold office hours Mondays, 2--4 pm, in Baker Hall 229A, or
by appointment. Ms. Bellone will hold office hours Fridays, 1:30--2:30 pm, and
Mr. Okumura Thursdays, 1--2 pm, both in Wean Hall 8110. If you want help with
computation, please bring your laptop.
Blackboard
Blackboard will be used only for
announcements, grades, and a discussion forum. Assignments and solutions will
be posted on this page.
Textbook
The required textbook is Julian Faraway, Extending the Linear Model with
R (Chapman & Hall/CRC Press, 2006, ISBN 978-1-58488-424-8).
(Faraway's page on the book,
with help and errata.) Adler's R in a Nutshell
(O'Reilly, 2009;
ISBN 9780596801700),
Berk's Statistical Learning From a Regression Perspective
(Springer,
2008;
ISBN 9780387775005),
and Venables and Ripley's Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will all be optional. The campus bookstore should have
copies.
Collaboration, Cheating and Plagiarism
Feel free to discuss all aspects of the course with one another, including
homework and exams. However, the work you hand in must be your own. You must
not copy mathematical derivations, computer output and input, or written
descriptions from anyone or anywhere else, without reporting the source within
your work. Please review the
CMU Policy on
Cheating and Plagiarism.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
R
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Many of the problems will be easier with R, and some of them will require R.
You should not expect assistance from the instructors with programming in any
other language. If you are not able to use R,
or do not have ready, reliable access to a computer on which you can do so,
let me know at once.
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- There are now many books about R. Adler's R in a Nutshell and Venables and Ripley's Modern Applied Statistics with S will be available at the campus bookstore.
John M. Chambers, Software for Data Analysis:
Programming with R
(Springer, 2008, ISBN 978-0-387-75935-7) is the best book on writing programs in R, but we will not have
to do much actual programming.
You should read the Notes on
Writing R Functions, and Re-writing
Your Code. Even if you know how to do some basic coding (or more), you
should read the page of Minimal
Advice on Programming.
Schedule
Subject to revision. Lecture notes, assignments and solutions will all be
linked here, as they are available.
- January 11 (Tuesday): Lecture 1, Introduction to the class
- Statistics is the science which studies methods for learning from imperfect
data. Regression is a statistical model of functional relationships between
variables. Getting relationships right means being able to predict well. The
least-squares optimal prediction is the expectation value; the conditional
expectation function is the regression function. The regression function must
be estimated from data; the bias-variance trade-off controls this estimation.
Ordinary least squares revisited as a smoothing method. Other linear smoothers:
nearest-neighbor averaging, kernel-weighted averaging.
- PDF, R, example data for the lecture
- Homework 1; data set
- January 13 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity.
Consistency of ordinary least squares estimates under weak conditions. Linear
regression coefficients will change with the distribution of the input
variables: examples. Why R^2 is usually a distraction. Linear
regression coefficients will change with the distribution of unobserved
variables (omitted variable effects). Errors in variables. Transformations of
inputs and of outputs. Utility of probabilistic assumptions; the importance of
looking at the residuals. What "controlled for in a linear regression" really
means.
- PDF,
R
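- An illustrative R sketch (not part of the course materials; the names here are invented): a small simulation of the point above, showing how the estimated coefficient on one input shifts when a correlated input is omitted and the relationship between the two inputs changes.
      ## Truth: Y = X1 + X2 + noise, with X2 = alpha*X1 + noise.
      ## Regressing Y on X1 alone recovers 1 + alpha, not the structural coefficient 1.
      set.seed(42)
      simulate.slope <- function(alpha, n = 1000) {
        x1 <- rnorm(n)
        x2 <- alpha * x1 + rnorm(n)   # the omitted variable depends on x1
        y <- x1 + x2 + rnorm(n)
        coef(lm(y ~ x1))["x1"]        # slope from the short regression
      }
      sapply(c(0, 0.5, 1, 2), simulate.slope)   # drifts as alpha changes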
- January 18 (Tuesday): Lecture 3, Evaluation: Error and inference
- Goals of statistical analysis: summaries, prediction, scientific inference.
Evaluating predictions: in-sample error, generalization error; over-fitting.
Cross-validation for estimating generalization error and for model
selection.
- PDF, R for figures
- Homework 1 due: solutions
- Homework 2; R for problem #2
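- An illustrative R sketch (not part of the course materials): bare-bones k-fold cross-validation for comparing regression models on synthetic data; the function and variable names are invented for this example.
      set.seed(1)
      cv.mse <- function(formula, data, nfolds = 5) {
        fold <- sample(rep(1:nfolds, length.out = nrow(data)))  # random fold labels
        errs <- numeric(nfolds)
        for (k in 1:nfolds) {
          fit <- lm(formula, data = data[fold != k, ])          # train on the other folds
          pred <- predict(fit, newdata = data[fold == k, ])
          errs[k] <- mean((data$y[fold == k] - pred)^2)         # assumes the response is named y
        }
        mean(errs)   # estimated generalization (out-of-sample) MSE
      }
      df <- data.frame(x = runif(200, -2, 2))
      df$y <- df$x^2 + rnorm(200, sd = 0.5)
      c(linear = cv.mse(y ~ x, df), quadratic = cv.mse(y ~ x + I(x^2), df))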
- January 20 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth.
Adapting to unknown roughness with cross-validation; detailed examples.
Using kernel regression with multiple inputs: multivariate kernels, product
kernels. Using smoothing to automatically discover interactions.
Plots to help interpret multivariate smoothing results.
- PDF notes, R
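- An illustrative R sketch (not the lecture code): a hand-rolled Nadaraya-Watson estimator with a product Gaussian kernel, to show the mechanics of multivariate kernel regression; in practice a package such as np (npregbw/npreg) also handles bandwidth selection.
      nw.predict <- function(x0, X, y, h) {
        ## X: n x p input matrix, x0: length-p point, h: length-p vector of bandwidths
        w <- apply(X, 1, function(xi) prod(dnorm((xi - x0) / h)))  # product-kernel weights
        sum(w * y) / sum(w)                                        # locally weighted average
      }
      set.seed(2)
      X <- matrix(runif(400), ncol = 2)
      y <- sin(2 * pi * X[, 1]) + X[, 2]^2 + rnorm(200, sd = 0.1)
      nw.predict(c(0.5, 0.5), X, y, h = c(0.1, 0.1))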
- January 25 (Tuesday): Lecture 5, Heteroskedasticity, weighted least
squares, and variance estimation
- Average predictive comparisons. Weighted least squares estimates.
Heteroskedasticity and the problems it causes for inference. How weighted
least squares gets around the problems of heteroskedasticity, if we know the
variance function. Estimating the variance function from regression residuals.
An iterative method for estimating the regression function and the variance
function together. Locally constant and locally linear modeling. Lowess.
- PDF handout
- Homework 2 due: PDF of solutions,
R
- Homework 3 out: Assignment
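- An illustrative R sketch (not the course code): the iterative scheme summarized above, alternating between smoothing the log squared residuals to estimate the variance function and re-fitting by weighted least squares.
      set.seed(3)
      x <- runif(300, 0, 10)
      y <- 3 + 2 * x + rnorm(300, sd = 0.5 + 0.3 * x)   # heteroskedastic noise
      fit <- lm(y ~ x)                                   # initial unweighted fit
      for (i in 1:5) {
        varfit <- loess(log(residuals(fit)^2) ~ x)       # smooth the log squared residuals
        w <- 1 / exp(fitted(varfit))                     # weights proportional to 1/variance
        fit <- lm(y ~ x, weights = w)                    # weighted least squares re-fit
      }
      coef(fit)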
- January 27 (Thursday): Lecture 6, Density estimation
- The desirability of estimating not just conditional means, variances, etc.,
but whole distribution functions. Parametric maximum likelihood is a solution,
if the parametric model is right. Histograms and empirical cumulative
distribution functions are non-parametric ways of estimating the distribution:
do they work? The Glivenko-Cantelli law on the convergence of empirical
distribution functions, a.k.a. "the fundamental theorem of statistics". More
on histograms: they converge on the right density, if bins keep shrinking but
the number of samples per bin keeps growing. Kernel density estimation and its
properties. An example with homework data. Estimating conditional densities;
another example with homework data. Some issues with likelihood, maximum
likelihood, and non-parametric estimation.
- PDF notes, R for figures
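- An illustrative R sketch (not from the lecture figures): a histogram versus a Gaussian kernel density estimate, using base R's density() with the Sheather-Jones bandwidth selector.
      set.seed(4)
      x <- c(rnorm(200, mean = 0), rnorm(100, mean = 4))   # a bimodal sample
      hist(x, breaks = 30, freq = FALSE)                   # non-parametric, but blocky
      lines(density(x, bw = "SJ"), lwd = 2)                # kernel density estimate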
- February 1 (Tuesday): Lecture 7, Simulation
- Simulation: implementing the story encoded in the model, step by step, to
produce something data-like. Stochastic models have random components and so
require some random steps. Stochastic models specified through conditional
distributions are simulated by chaining together random numbers. Means of
generating random numbers with specified distributions. Simulation shows us
what a model predicts (expectations, higher moments, correlations, regression
functions, sampling distributions); analytical probability calculations are
short-cuts for exhaustive simulation. Simulation lets us check aspects of the
model: does the data look like typical simulation output? if we repeat our
exploratory analysis on the simulation output, do we get the same results?
Simulation-based estimation: the method of simulated moments.
- PDF notes,
R
- Homework 3 due: solutions, R
- Homework 4 out: Assignment, SPhistory.short.csv
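- An illustrative R sketch (not part of the assignment): simulating from a fitted Gaussian linear model to check whether the observed data look like typical model output; here the check statistic is the largest absolute residual.
      set.seed(5)
      x <- runif(100)
      y <- 1 + 2 * x + rt(100, df = 3)                  # "real" data with heavy-tailed noise
      fit <- lm(y ~ x)
      sim.stat <- function() {
        y.sim <- fitted(fit) + rnorm(100, sd = summary(fit)$sigma)  # the model's own story
        max(abs(y.sim - fitted(fit)))
      }
      sims <- replicate(1000, sim.stat())
      mean(sims >= max(abs(residuals(fit))))            # how often simulations look as extreme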
- February 3 (Thursday): Lecture 8, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The
bootstrap principle: sampling distributions under a good estimate of the truth
are close to the true sampling distributions. Parametric bootstrapping.
Non-parametric bootstrapping. Many examples. When does the bootstrap
fail?
- PDF notes,
R for figures and examples
- pareto.R, wealth.dat
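- An illustrative R sketch (much simpler than the lecture examples): a resampling-cases (non-parametric) bootstrap for a regression slope, summarized by a percentile confidence interval.
      set.seed(6)
      df <- data.frame(x = runif(100))
      df$y <- 1 + 2 * df$x + (rexp(100) - 1)             # skewed, non-Gaussian noise
      slope <- function(d) coef(lm(y ~ x, data = d))["x"]
      boot.slopes <- replicate(2000, slope(df[sample(nrow(df), replace = TRUE), ]))
      quantile(boot.slopes, c(0.025, 0.975))             # percentile bootstrap interval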
- February 8 (Tuesday): Lecture 9, Catch-up and consolidation day
- Reviewing the course so far.
- Homework 4 due: Solutions
- Homework 5 out: Assignment
- February 10 (Thursday): Lecture 10, Testing regression specifications (guest lecture by Prof. Rinaldo)
- Non-parametric smoothers can be used to test parametric models. Forms of
tests: differences in in-sample performance; differences in generalization
performance; whether the parametric model's residuals have expectation zero
everywhere. Constructing a test statistic based on in-sample performance.
Using bootstrapping from the parametric model to find the null distribution of
the test statistic. An example where the parametric model is correctly
specified, and one where it is not. Cautions on the interpretation of
goodness-of-fit tests. Why use parametric models at all? Answers: speed of
convergence when correctly specified; and the scientific interpretation of
parameters, if the model actually comes from a scientific theory. Mis-specified
parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because
of their favorable bias-variance characteristics; an example.
- PDF notes, incorporating R examples
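- An illustrative R sketch (simplified relative to the lecture notes): testing a linear specification by comparing its in-sample MSE to a spline smoother's, with the null distribution of the difference obtained by bootstrapping from the fitted linear model.
      set.seed(7)
      x <- runif(200, 0, 3)
      y <- log(1 + x) + rnorm(200, sd = 0.1)             # mildly non-linear truth
      t.stat <- function(x, y) {                         # improvement of the smoother over lm
        mse.lin <- mean(residuals(lm(y ~ x))^2)
        mse.np <- mean((y - predict(smooth.spline(x, y), x)$y)^2)
        mse.lin - mse.np
      }
      fit0 <- lm(y ~ x)
      sigma0 <- summary(fit0)$sigma
      null.t <- replicate(500, t.stat(x, fitted(fit0) + rnorm(200, sd = sigma0)))
      mean(null.t >= t.stat(x, y))                       # bootstrap p-value for linearity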
- February 15 (Tuesday): Lecture 11, Splines
- Kernel regression controls the amount of smoothing indirectly by bandwidth;
why not control the irregularity of the smoothed curve directly? The spline
smoothing problem is a penalized least squares problem: minimize mean squared
error, plus a penalty term proportional to average curvature of the
function over space. The solution is always a continuous piecewise cubic
polynomial, with continuous first and second derivatives. Altering the
strength of the penalty moves along a bias-variance trade-off, from pure OLS at
one extreme to pure interpolation at the other; changing the strength of the
penalty is equivalent to minimizing the mean squared error under a constraint
on the average curvature. To ensure consistency, the penalty/constraint should
weaken as the data grows; the appropriate size is selected by cross-validation.
An example with the data from homework 4, including confidence bands. Writing
splines as basis functions, and fitting as least squares on transformations of
the data, plus a regularization term. A brief look at splines in multiple
dimensions. Splines versus kernel regression. Appendix: Lagrange multipliers
and the correspondence between constrained and penalized optimization.
- PDF notes, incorporating R examples
- Homework 5 due: Solutions
- Homework 6 out: Assignment; data
files: gmp_2006.csv, pcgmp_2006.csv
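- An illustrative R sketch (not the homework analysis): base R's smooth.spline, which picks the penalty strength by cross-validation.
      set.seed(8)
      x <- sort(runif(150, 0, 10))
      y <- sin(x) + rnorm(150, sd = 0.3)
      fit <- smooth.spline(x, y, cv = TRUE)   # leave-one-out CV for the smoothing penalty
      plot(x, y)
      lines(predict(fit, x), lwd = 2)
      fit$lambda                              # the selected penalty strength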
- February 17 (Thursday): Lecture 12, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric
regression in problems with many variables: bias remains under control, but
variance grows rapidly with dimensionality. Parametric models do not have this
problem, but have bias and do not let us discover anything about the
true function. Structured or constrained non-parametric regression
compromises, by adding some bias so as to reduce variance. Additive models are
an example, where each input variable has a "partial response function", which
add together to get the total regression function; the partial response
functions are unconstrained. This generalizes linear models but still evades
the curse of dimensionality. Fitting additive models is done iteratively,
starting with some initial guess about each partial response function and then
doing one-dimensional smoothing, so that the guesses correct each other until a
self-consistent solution is reached. Examples in R using the California
house-price data. Conclusion: there is hardly ever any reason to prefer linear
models to additive ones, and the continued thoughtless use of linear regression
is a scandal.
- PDF notes,
incorporating R examples
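- An illustrative R sketch (the lecture itself uses the California house-price data): fitting an additive model with the mgcv package, with one smooth partial-response function per input.
      library(mgcv)
      set.seed(9)
      n <- 500
      df <- data.frame(x1 = runif(n), x2 = runif(n))
      df$y <- sin(2 * pi * df$x1) + (df$x2 - 0.5)^2 + rnorm(n, sd = 0.2)
      fit <- gam(y ~ s(x1) + s(x2), data = df)   # one smooth partial-response per input
      plot(fit, pages = 1)                       # the estimated partial-response functions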
- February 22 (Tuesday): Lecture 13, More about Hypothesis Testing
- Homework 6 due: PDF solutions, R code
- Midterm 1 out: Exam; your data set was e-mailed to your Andrew account
- February 24 (Thursday): No lecture
- March 1 (Tuesday): Q & A session
- Midterm 1 due (at 5 pm): PDF
solutions, R, master
data set
- March 3 (Thursday): Consolidation and examples
- With an emphasis on exam debriefing
- March 8 and March 10 (Tuesday and Thursday)
- Spring break
- March 15 (Tuesday): Lecture 14, Logistic regression
- Modeling conditional probabilities; using regression to model
probabilities; transforming probabilities to work better with regression; the
logistic regression model; maximum likelihood; numerical maximum likelihood by
Newton's method and by iteratively re-weighted least squares; comparing
logistic regression to logistic-additive models
- PDF notes
- Homework 7 out: PDF assignment
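- An illustrative R sketch (not from the lecture notes): logistic regression with glm(), which R fits by iteratively re-weighted least squares.
      set.seed(10)
      x <- rnorm(400)
      p <- 1 / (1 + exp(-(-1 + 2 * x)))            # true conditional probabilities
      y <- rbinom(400, size = 1, prob = p)
      fit <- glm(y ~ x, family = binomial)         # the logit link is the default
      coef(fit)
      predict(fit, newdata = data.frame(x = 0), type = "response")  # estimated P(Y=1|X=0)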
- March 17 (Thursday): Lecture 15, Generalized linear models and generalized additive models
- Poisson regression and other generalized linear models; over-dispersion;
generalized additive models
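- An illustrative R sketch (not from the lecture): Poisson regression on over-dispersed counts, with a Pearson-statistic check of the dispersion and a quasi-Poisson re-fit.
      set.seed(11)
      x <- runif(300)
      y <- rnbinom(300, mu = exp(1 + x), size = 2)   # negative-binomial counts: over-dispersed
      fit <- glm(y ~ x, family = poisson)
      sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # should be near 1 if Poisson holds
      fit.q <- glm(y ~ x, family = quasipoisson)     # same point estimates, wider standard errors
      summary(fit.q)$dispersion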
- March 22 (Tuesday): Lecture 16, Consolidation and examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic
regression. Exploratory examination of the data. Predicting wet or dry days
from the amount of precipitation the previous day. First logistic regression
model. Finding predicted probabilities and confidence intervals for them.
Comparison to spline smoothing and a generalized additive model. Model
comparison test detects significant mis-specification. Re-specifying the
model: dry days are special. The second logistic regression model and its
comparison to the data. Checking the calibration of the second model.
- PDF
handout, snoqualmie.csv
data set,
R
- Homework 8 out: assignment; Fair, 1978
- March 24 (Thursday): Lecture 17, Principal components analysis
- Principal components: the simplest, oldest and most robust of
dimensionality-reduction techniques. PCA works by finding the line (plane,
hyperplane) which passes closest, on average, to all of the data points. This
is equivalent to maximizing the variance of the coordinates of projections on
to the line/plane/hyperplane. Actually finding those principal components
reduces to finding eigenvalues and eigenvectors of the sample covariance
matrix. Why PCA is a data-analytic technique, and not a form of statistical
inference. An example with cars. PCA with words: "latent semantic analysis";
an example with real newspaper articles. Visualization with PCA and
multidimensional scaling. Cautions about PCA; the perils of reification;
illustration with genetic maps.
- PDF handout,
pca.R for
examples, cars data
set, R workspace for the New
York Times examples
- Homework 7 due (extended due to server outage): solutions
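- An illustrative R sketch (the lecture's examples use a cars data set and New York Times articles instead): principal components with prcomp() on a small data set shipped with R.
      pca <- prcomp(USArrests, scale. = TRUE)  # standardize the variables first
      pca$rotation                             # loadings: directions of the components
      summary(pca)                             # share of variance captured by each component
      biplot(pca)                              # data projected onto the first two components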
- March 29 (Tuesday): Lecture 18, Factor analysis
- Adding noise to PCA to get a statistical model. The factor analysis model,
or linear regression with unobserved independent variables. Assumptions of the
factor analysis model. Implications of the model: observable variables are
correlated only through shared factors; "tetrad equations" for one factor
models, more general correlation patterns for multiple factors. (Our first
look at latent variables and conditional independence.) Geometrically, the
factor model says the data have a Gaussian distribution on some low-dimensional
plane, plus noise moving them off the plane. Estimation by heroic linear
algebra; estimation by maximum likelihood. The rotation problem, and why it is
unwise to reify factors. Other models which produce the same correlation
patterns as factor models.
- PDF handout;
lecture-18.R computational
examples you should step through (not done in
class); correlates of sleep in
mammals data set for those
examples; thomson-model.R
- Homework 8 due: solutions; Li and Racine, 2004
- Homework 9: assignment, fx.csv data set
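- An illustrative R sketch (not the mammal-sleep example): simulating a one-factor model and recovering it by maximum likelihood with factanal().
      set.seed(12)
      n <- 500
      f <- rnorm(n)                                       # the latent factor
      loadings <- c(0.9, 0.8, 0.7, 0.6)
      X <- sapply(loadings, function(l) l * f + rnorm(n, sd = sqrt(1 - l^2)))
      colnames(X) <- paste0("x", 1:4)
      fa <- factanal(X, factors = 1)                      # maximum-likelihood factor analysis
      fa$loadings                                         # estimated loadings
      fa$PVAL                                             # test of the one-factor model's fit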
- March 31 (Thursday): Lecture 19, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to
be discrete. From kernel density estimation to mixture models by reducing the
number of points with copies of the kernel. Probabilistic formulation of
mixture models. Geometry. Clustering. Estimation of mixture models by
maximum likelihood, and why it leads to a vicious circle. The
expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle
with iterative approximation. More on the EM algorithm: convexity, Jensen's
inequality, optimizing a lower bound, proving that each step of EM increases
the likelihood. Mixtures of regressions. Other extensions.
- PDF handout
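- An illustrative R sketch (not the lecture code): a bare-bones EM algorithm for a two-component Gaussian mixture in one dimension, to make the E- and M-steps concrete; packages such as mixtools (normalmixEM) do this more carefully.
      em.2gauss <- function(x, iters = 100) {
        p <- 0.5; mu <- quantile(x, c(0.25, 0.75)); sigma <- rep(sd(x), 2)
        for (i in 1:iters) {
          ## E-step: posterior responsibility of component 1 for each point
          d1 <- p * dnorm(x, mu[1], sigma[1])
          d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
          r <- d1 / (d1 + d2)
          ## M-step: re-estimate the mixing weight, means, and standard deviations
          p <- mean(r)
          mu <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
          sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
                     sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
        }
        list(p = p, mu = mu, sigma = sigma)
      }
      set.seed(13)
      x <- c(rnorm(300, 0, 1), rnorm(200, 4, 0.5))
      em.2gauss(x)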
- April 5 (Tuesday): Lecture 20, Mixture model examples and complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component
Gaussian mixture; examining the fitted distribution; checking calibration.
Using cross-validation to select the number of components to use. Examination
of the selected mixture model. Suspicious patterns in the parameters of the
selected model. Approximating complicated distributions vs. revealing hidden
structure. Using bootstrap hypothesis testing to select the number of mixture
components. The multivariate Gaussian distribution: definition, relation to
the univariate or scalar Gaussian distribution; effect of linear
transformations on the parameters; plotting probability density contours in two
dimensions; using eigenvalues and eigenvectors to understand the geometry of
multivariate Gaussians; estimation by maximum likelihood; computational aspects,
specifically in R.
- PDF, R; bootcomp.R
(patch graciously provided by Dr. Derek Young)
- Homework 9 due: solutions
- Midterm 2 out: Assignment; your data set
was mailed to you
- April 7 (Thursday): Lecture 21, Graphical models
- Conditional independence and dependence properties in factor models. The
generalization to graphical models. Directed acyclic graphs. DAG models.
Factor, mixture, and Markov models as DAGs. The graphical Markov property.
Reading conditional independence properties from a DAG. Creating conditional
dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with
DAGs; does asbestos whiten teeth? Appendix: undirected graphical models, the
Gibbs-Markov theorem; directed but cyclic graphical models. Appendix: Some
basic notions of graph theory; Guthrie diagrams.
- PDF
- April 12 (Tuesday) Lecture 22, Graphical causal models
- Statistical dependence, counterfactuals, causation. Probabilistic
prediction (selecting a sub-ensemble) vs. causal prediction (generating a new
ensemble). Graphical causal models, structural equation models. The causal
Markov property. Faithfulness. Counterfactual prediction by "surgery" on
causal graphical models. The d-separation criterion. Path diagram rules.
Appendix: mutual information and independence; conditional mutual
information and conditional independence.
- PDF notes
- Midterm 2 due: Solutions, R for solutions
- Homework 10 out: assignment, fake-smoke.csv
- April 14 (Thursday): Spring carnival
- April 19 (Tuesday): Lecture 23, Estimating causal effects from observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when
you can do the experiment?" Experimentation by controlling everything
(Galileo) and by randomizing (Fisher). Confounding and identifiability. The
back-door criterion for identifying causal effects: condition on covariates
which block undesired paths. The front-door criterion for identification: find
isolated and exhaustive causal mechanisms. Deciding how many black boxes to
open up. Instrumental variables for identification: finding some exogenous
source of variation and tracing its effects. Critique of instrumental
variables: vital role of theory, its fragility, consequences of weak
instruments. Irremovable confounding: an example with the detection of social
influence; the possibility of bounding unidentifiable effects. Matching and
propensity scores as computational short-cuts in back-door adjustment. Summary
recommendations for identifying and estimating causal effects.
- PDF notes
- Homework 10 due: Solutions
- Homework 11 out: Assignment
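- An illustrative R sketch (synthetic data, invented variable names): back-door adjustment in miniature, conditioning on the confounder and then averaging over its distribution.
      set.seed(14)
      n <- 5000
      z <- rnorm(n)                        # common cause of treatment and outcome
      x <- rbinom(n, 1, plogis(z))         # treatment depends on z
      y <- 2 * x + 3 * z + rnorm(n)        # true causal effect of x on y is 2
      coef(lm(y ~ x))["x"]                 # naive regression: confounded
      fit <- lm(y ~ x + z)                 # block the back-door path through z
      mean(predict(fit, data.frame(x = 1, z = z)) -
           predict(fit, data.frame(x = 0, z = z)))   # average over the distribution of z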
- April 21 (Thursday): Lecture 24, Discovering causal structure from observations
- How do we get our causal graph? Comparing rival DAGs by testing selected
conditional independence relations (or dependencies). The crucial difference
between common causes and common effects. Identifying colliders, and using
them to orient arrows. Inducing orientation to enforce consistency. The SGS
algorithm for discovering causal graphs; why it works. Refinements of the SGS
algorithm (the PC algorithm). What about latent variables?
Software: TETRAD and pcalg. Limits to observational causal
discovery: universal consistency is possible (and achieved), but uniform
consistency is not.
- PDF notes
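- An illustrative R sketch of the pcalg interface mentioned above (a hedged sketch of the usual calling convention; check the package documentation for exact arguments): running the PC algorithm on simulated Gaussian data with a known DAG.
      library(pcalg)
      set.seed(15)
      n <- 2000
      x1 <- rnorm(n)
      x2 <- x1 + rnorm(n)
      x3 <- x1 + rnorm(n)
      x4 <- x2 + x3 + rnorm(n)                    # x4 is a collider's child for x2 and x3
      dat <- data.frame(x1, x2, x3, x4)
      suff <- list(C = cor(dat), n = n)           # sufficient statistics for Gaussian CI tests
      fit <- pc(suffStat = suff, indepTest = gaussCItest,
                alpha = 0.01, labels = colnames(dat))
      fit                                         # the estimated CPDAG (equivalence class of DAGs)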
- April 26 (Tuesday): Lecture 25, Recap on estimating causal effects
- Substituting consistent estimators into the formulas for front and back
door identification. Tricks to avoid estimating marginal distributions.
Uncertainty in estimates of effects.
- Homework 11 due: Solutions
- Final exam out: Assignment
- April 28 (Thursday): General review
- May 9 (Monday): Final exam due at 10 am