Cosma Shalizi
36-402, Undergraduate Advanced Data Analysis
Spring 2011
This page has information about the 2011 version of the class. The 2012
version is over here.
Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100
The goal of this class is to train students in using statistical models to
analyze data — as data summaries, as predictive instruments, and as tools
for scientific inference. We will build on the theory and applications of the
linear model, introduced
in 36-401,
extending it to more general functional forms, and more general kinds of data,
emphasizing the computation-intensive methods introduced since the 1980s.
After taking the class, when you're faced with a new data-analysis problem, you
should be able to (1) select appropriate methods, (2) use statistical software
to implement them, (3) critically evaluate the resulting statistical models,
and (4) communicate the results of your analyses to collaborators and to
non-statisticians.
Graduate students from other departments wishing to take this course should
register for it under the number "36-608".
Prerequisites
36-401,
or, in unusual circumstances, an equivalent course approved by the instructor.
Instructors
Professor: Cosma Shalizi, cshalizi [at] cmu.edu, 229 C Baker Hall, 268-7826
Teaching assistants: Gaia Bellone, gbellone [at] stat.cmu.edu
                     Zachary Kurtz, zkurtz [at] stat.cmu.edu
                     Shuhei Okumura, sokumura [at] stat.cmu.edu
Topics, Notes, Readings
- Model evaluation: statistical inference, prediction, and
scientific inference; in-sample and out-of-sample errors, generalization and
over-fitting, cross-validation; evaluating by simulating; bootstrap; penalized
fitting; information criteria; mis-specification checks; model averaging
- Yet More Linear Regression: what is regression, really?;
review of ordinary linear regression and its limits; extensions
- Smoothing: kernel smoothing, including local polynomial
regression; splines; additive models; classification and regression
trees; kernel density estimation
- GAMs: logistic regression; generalized
linear models; generalized additive models.
- Latent variables and structured data: principal
components; factor analysis and latent variables; graphical models in general;
latent cluster/mixture models; hierarchical models and partial pooling
- Causality: estimating causal
effects; discovering causal structure
- Time series: Markov models for time series without
latent variables; hidden Markov models for time series with latent variables
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.
Course Mechanics
Homework will be 60% of the grade, two midterms 10% each, and the final
20%.
Homework
There will be eleven homework assignments, nearly one every week; they will
all count equally, and together make up 60% of your grade.
The homework will give you practice in using the techniques you are learning to
analyze data, and to interpret the analyses. Communicating your results to
others is as important as getting good results in the first place. Raw
computer output and R code are not acceptable as a write-up; instead, they
should be put in an appendix to each assignment.
Homework will be due, in hard copy, at the beginning of class on Tuesdays. The
lowest three homework grades will be dropped; consequently, no late homework
will be accepted.
Exams
There will be two take-home mid-term exams (10% each), due at 5 pm on March
1st and April 12th. (Please let me know as soon as possible if you have a
conflict with either date.) You will have one week to work on each midterm.
There will be no homework in those weeks, and lecture on the day they are due
will be replaced with special office hours. There will also be a take-home
final exam (20%), due at 10 am on May 9, which you will have two weeks to do.
Office Hours
Prof. Shalizi will hold office hours Mondays, 2--4 pm, in Baker Hall 229A, or
by appointment. Ms. Bellone will hold office hours Fridays, 1:30--2:30 pm, and
Mr. Okumura Thursdays, 1--2 pm, both in Wean Hall 8110. If you want help with
computation, please bring your laptop.
Blackboard
Blackboard will be used only for
announcements, grades, and a discussion forum. Assignments and solutions will
be posted on this page.
Textbook
The required textbook is Julian Faraway, Extending the Linear Model with
R (Chapman & Hall/CRC Press, 2006, ISBN 978-1-58488-424-8).
(Faraway's page on the book,
with help and errata.) Adler's R in a Nutshell
(O'Reilly, 2009;
ISBN 9780596801700),
Berk's Statistical Learning From a Regression Perspective
(Springer,
2008;
ISBN 9780387775005),
and Venables and Ripley's Modern Applied Statistics with S
(Springer,
2003;
ISBN 9780387954578)
will all be optional. The campus bookstore should have
copies.
Collaboration, Cheating and Plagiarism
Feel free to discuss all aspects of the course with one another, including
homework and exams. However, the work you hand in must be your own. You must
not copy mathematical derivations, computer output and input, or written
descriptions from anyone or anywhere else, without reporting the source within
your work. Please review the
CMU Policy on
Cheating and Plagiarism.
Physically Disabled and Learning Disabled Students
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012.
R
R is a free, open-source software
package/programming language for statistical computing. You should have begun
to learn it in 36-401 (if not before), and this class presumes that you have.
Many of the problems will be easier with R, and some of them will require R.
You should not expect assistance from the instructors with programming in any
other language. If you are not able to use R,
or do not have ready, reliable access to a computer on which you can do so,
let me know at once.
Here are some resources for learning R:
- The official intro, "An Introduction to R", available online in
HTML
and PDF
- John Verzani, "simpleR",
in PDF
- Quick-R. This is
primarily aimed at those who already know a commercial statistics package like
SAS, SPSS or Stata, but it's very clear and well-organized, and others may find
it useful as well.
- Patrick
Burns, The R
Inferno. "If you are using R and you think you're in hell, this is a map
for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques"
(large
PDF)
- There are now many books about R. Adler's R in a Nutshell and Venables and Ripley's Modern Applied Statistics with S will be available at the campus bookstore.
John M. Chambers, Software for Data Analysis:
Programming with R
(Springer, 2008, ISBN 978-0-387-75935-7) is the best book on writing programs in R, but we will not have
to do much actual programming.
You should read the Notes on
Writing R Functions, and Re-writing
Your Code. Even if you know how to do some basic coding (or more), you
should read the page of Minimal
Advice on Programming.
Schedule
Subject to revision. Lecture notes, assignments and solutions will all be
linked here, as they are available.
- January 11 (Tuesday): Lecture 1, Introduction to the class
- Statistics is the science which studies methods for learning from imperfect
data. Regression is a statistical model of functional relationships between
variables. Getting relationships right means being able to predict well. The
least-squares optimal prediction is the expectation value; the conditional
expectation function is the regression function. The regression function must
be estimated from data; the bias-variance trade-off controls this estimation.
Ordinary least squares revisited as a smoothing method. Other linear smoothers:
nearest-neighbor averaging, kernel-weighted averaging.
- PDF, R, example data for the lecture
- Homework 1; data set
- January 13 (Thursday): Lecture 2, The truth about linear regression
- Using Taylor's theorem to justify linear regression locally. Collinearity.
Consistency of ordinary least squares estimates under weak conditions. Linear
regression coefficients will change with the distribution of the input
variables: examples. Why R^2 is usually a distraction. Linear
regression coefficients will change with the distribution of unobserved
variables (omitted variable effects). Errors in variables. Transformations of
inputs and of outputs. Utility of probabilistic assumptions; the importance of
looking at the residuals. What "controlled for in a linear regression" really
means.
- PDF,
R
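- An illustrative R sketch (not part of the course materials; the names here are invented): a small simulation of the point above, showing how the estimated coefficient on one input shifts when a correlated input is omitted and the relationship between the two inputs changes.
      ## Truth: Y = X1 + X2 + noise, with X2 = alpha*X1 + noise.
      ## Regressing Y on X1 alone recovers 1 + alpha, not the structural coefficient 1.
      set.seed(42)
      simulate.slope <- function(alpha, n = 1000) {
        x1 <- rnorm(n)
        x2 <- alpha * x1 + rnorm(n)   # the omitted variable depends on x1
        y <- x1 + x2 + rnorm(n)
        coef(lm(y ~ x1))["x1"]        # slope from the short regression
      }
      sapply(c(0, 0.5, 1, 2), simulate.slope)   # drifts as alpha changes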
- January 18 (Tuesday): Lecture 3, Evaluation: Error and inference
- Goals of statistical analysis: summaries, prediction, scientific inference.
Evaluating predictions: in-sample error, generalization error; over-fitting.
Cross-validation for estimating generalization error and for model
selection.
- PDF, R for figures
- Homework 1 due: solutions
- Homework 2; R for problem #2
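- An illustrative R sketch (not part of the course materials): bare-bones k-fold cross-validation for comparing regression models on synthetic data; the function and variable names are invented for this example.
      set.seed(1)
      cv.mse <- function(formula, data, nfolds = 5) {
        fold <- sample(rep(1:nfolds, length.out = nrow(data)))  # random fold labels
        errs <- numeric(nfolds)
        for (k in 1:nfolds) {
          fit <- lm(formula, data = data[fold != k, ])          # train on the other folds
          pred <- predict(fit, newdata = data[fold == k, ])
          errs[k] <- mean((data$y[fold == k] - pred)^2)         # assumes the response is named y
        }
        mean(errs)   # estimated generalization (out-of-sample) MSE
      }
      df <- data.frame(x = runif(200, -2, 2))
      df$y <- df$x^2 + rnorm(200, sd = 0.5)
      c(linear = cv.mse(y ~ x, df), quadratic = cv.mse(y ~ x + I(x^2), df))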
- January 20 (Thursday): Lecture 4, Smoothing methods in regression
- The bias-variance trade-off tells us how much we should smooth.
Adapting to unknown roughness with cross-validation; detailed examples.
Using kernel regression with multiple inputs: multivariate kernels, product
kernels. Using smoothing to automatically discover interactions.
Plots to help interpret multivariate smoothing results.
- PDF notes, R
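- An illustrative R sketch (not the lecture code): a hand-rolled Nadaraya-Watson estimator with a product Gaussian kernel, to show the mechanics of multivariate kernel regression; in practice a package such as np (npregbw/npreg) also handles bandwidth selection.
      nw.predict <- function(x0, X, y, h) {
        ## X: n x p input matrix, x0: length-p point, h: length-p vector of bandwidths
        w <- apply(X, 1, function(xi) prod(dnorm((xi - x0) / h)))  # product-kernel weights
        sum(w * y) / sum(w)                                        # locally weighted average
      }
      set.seed(2)
      X <- matrix(runif(400), ncol = 2)
      y <- sin(2 * pi * X[, 1]) + X[, 2]^2 + rnorm(200, sd = 0.1)
      nw.predict(c(0.5, 0.5), X, y, h = c(0.1, 0.1))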
- January 25 (Tuesday): Lecture 5, Heteroskedasticity, weighted least
squares, and variance estimation
- Average predictive comparisons. Weighted least squares estimates.
Heteroskedasticity and the problems it causes for inference. How weighted
least squares gets around the problems of heteroskedasticity, if we know the
variance function. Estimating the variance function from regression residuals.
An iterative method for estimating the regression function and the variance
function together. Locally constant and locally linear modeling. Lowess.
- PDF handout
- Homework 2 due: PDF of solutions,
R
- Homework 3 out: Assignment
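- An illustrative R sketch (not the course code): the iterative scheme summarized above, alternating between smoothing the log squared residuals to estimate the variance function and re-fitting by weighted least squares.
      set.seed(3)
      x <- runif(300, 0, 10)
      y <- 3 + 2 * x + rnorm(300, sd = 0.5 + 0.3 * x)   # heteroskedastic noise
      fit <- lm(y ~ x)                                   # initial unweighted fit
      for (i in 1:5) {
        varfit <- loess(log(residuals(fit)^2) ~ x)       # smooth the log squared residuals
        w <- 1 / exp(fitted(varfit))                     # weights proportional to 1/variance
        fit <- lm(y ~ x, weights = w)                    # weighted least squares re-fit
      }
      coef(fit)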
- January 27 (Thursday): Lecture 6, Density estimation
- The desirability of estimating not just conditional means, variances, etc.,
but whole distribution functions. Parametric maximum likelihood is a solution,
if the parametric model is right. Histograms and empirical cumulative
distribution functions are non-parametric ways of estimating the distribution:
do they work? The Glivenko-Cantelli law on the convergence of empirical
distribution functions, a.k.a. "the fundamental theorem of statistics". More
on histograms: they converge on the right density, if bins keep shrinking but
the number of samples per bin keeps growing. Kernel density estimation and its
properties. An example with homework data. Estimating conditional densities;
another example with homework data. Some issues with likelihood, maximum
likelihood, and non-parametric estimation.
- PDF notes, R for figures
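- An illustrative R sketch (not from the lecture figures): a histogram versus a Gaussian kernel density estimate, using base R's density() with the Sheather-Jones bandwidth selector.
      set.seed(4)
      x <- c(rnorm(200, mean = 0), rnorm(100, mean = 4))   # a bimodal sample
      hist(x, breaks = 30, freq = FALSE)                   # non-parametric, but blocky
      lines(density(x, bw = "SJ"), lwd = 2)                # kernel density estimate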
- February 1 (Tuesday): Lecture 7, Simulation
- Simulation: implementing the story encoded in the model, step by step, to
produce something data-like. Stochastic models have random components and so
require some random steps. Stochastic models specified through conditional
distributions are simulated by chaining together random numbers. Means of
generating random numbers with specified distributions. Simulation shows us
what a model predicts (expectations, higher moments, correlations, regression
functions, sampling distributions); analytical probability calculations are
short-cuts for exhaustive simulation. Simulation lets us check aspects of the
model: does the data look like typical simulation output? if we repeat our
exploratory analysis on the simulation output, do we get the same results?
Simulation-based estimation: the method of simulated moments.
- PDF notes,
R
- Homework 3 due: solutions, R
- Homework 4 out: Assignment, SPhistory.short.csv
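- An illustrative R sketch (not part of the assignment): simulating from a fitted Gaussian linear model to check whether the observed data look like typical model output; here the check statistic is the largest absolute residual.
      set.seed(5)
      x <- runif(100)
      y <- 1 + 2 * x + rt(100, df = 3)                  # "real" data with heavy-tailed noise
      fit <- lm(y ~ x)
      sim.stat <- function() {
        y.sim <- fitted(fit) + rnorm(100, sd = summary(fit)$sigma)  # the model's own story
        max(abs(y.sim - fitted(fit)))
      }
      sims <- replicate(1000, sim.stat())
      mean(sims >= max(abs(residuals(fit))))            # how often simulations look as extreme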
- February 3 (Thursday): Lecture 8, The Bootstrap
- Quantifying uncertainty by looking at sampling distributions. The
bootstrap principle: sampling distributions under a good estimate of the truth
are close to the true sampling distributions. Parametric bootstrapping.
Non-parametric bootstrapping. Many examples. When does the bootstrap
fail?
- PDF notes,
R for figures and examples
- pareto.R, wealth.dat
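- An illustrative R sketch (much simpler than the lecture examples): a resampling-cases (non-parametric) bootstrap for a regression slope, summarized by a percentile confidence interval.
      set.seed(6)
      df <- data.frame(x = runif(100))
      df$y <- 1 + 2 * df$x + (rexp(100) - 1)             # skewed, non-Gaussian noise
      slope <- function(d) coef(lm(y ~ x, data = d))["x"]
      boot.slopes <- replicate(2000, slope(df[sample(nrow(df), replace = TRUE), ]))
      quantile(boot.slopes, c(0.025, 0.975))             # percentile bootstrap interval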
- February 8 (Tuesday): Lecture 9, Catch-up and consolidation day
- Reviewing the course so far.
- Homework 4 due: Solutions
- Homework 5 out: Assignment
- February 10 (Thursday): Lecture 10, Testing regression specifications (guest lecture by Prof. Rinaldo)
- Non-parametric smoothers can be used to test parametric models. Forms of
tests: differences in in-sample performance; differences in generalization
performance; whether the parametric model's residuals have expectation zero
everywhere. Constructing a test statistic based on in-sample performance.
Using bootstrapping from the parametric model to find the null distribution of
the test statistic. An example where the parametric model is correctly
specified, and one where it is not. Cautions on the interpretation of
goodness-of-fit tests. Why use parametric models at all? Answers: speed of
convergence when correctly specified; and the scientific interpretation of
parameters, if the model actually comes from a scientific theory. Mis-specified
parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because
of their favorable bias-variance characteristics; an example.
- PDF notes, incorporating R examples
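- An illustrative R sketch (simplified relative to the lecture notes): testing a linear specification by comparing its in-sample MSE to a spline smoother's, with the null distribution of the difference obtained by bootstrapping from the fitted linear model.
      set.seed(7)
      x <- runif(200, 0, 3)
      y <- log(1 + x) + rnorm(200, sd = 0.1)             # mildly non-linear truth
      t.stat <- function(x, y) {                         # improvement of the smoother over lm
        mse.lin <- mean(residuals(lm(y ~ x))^2)
        mse.np <- mean((y - predict(smooth.spline(x, y), x)$y)^2)
        mse.lin - mse.np
      }
      fit0 <- lm(y ~ x)
      sigma0 <- summary(fit0)$sigma
      null.t <- replicate(500, t.stat(x, fitted(fit0) + rnorm(200, sd = sigma0)))
      mean(null.t >= t.stat(x, y))                       # bootstrap p-value for linearity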
- February 15 (Tuesday): Lecture 11, Splines
- Kernel regression controls the amount of smoothing indirectly by bandwidth;
why not control the irregularity of the smoothed curve directly? The spline
smoothing problem is a penalized least squares problem: minimize mean squared
error, plus a penalty term proportional to average curvature of the
function over space. The solution is always a continuous piecewise cubic
polynomial, with continuous first and second derivatives. Altering the
strength of the penalty moves along a bias-variance trade-off, from pure OLS at
one extreme to pure interpolation at the other; changing the strength of the
penalty is equivalent to minimizing the mean squared error under a constraint
on the average curvature. To ensure consistency, the penalty/constraint should
weaken as the data grows; the appropriate size is selected by cross-validation.
An example with the data from homework 4, including confidence bands. Writing
splines as basis functions, and fitting as least squares on transformations of
the data, plus a regularization term. A brief look at splines in multiple
dimensions. Splines versus kernel regression. Appendix: Lagrange multipliers
and the correspondence between constrained and penalized optimization.
- PDF notes, incorporating R examples
- Homework 5 due: Solutions
- Homework 6 out: Assignment; data
files: gmp_2006.csv, pcgmp_2006.csv
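- An illustrative R sketch (not the homework analysis): base R's smooth.spline, which picks the penalty strength by cross-validation.
      set.seed(8)
      x <- sort(runif(150, 0, 10))
      y <- sin(x) + rnorm(150, sd = 0.3)
      fit <- smooth.spline(x, y, cv = TRUE)   # leave-one-out CV for the smoothing penalty
      plot(x, y)
      lines(predict(fit, x), lwd = 2)
      fit$lambda                              # the selected penalty strength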
- February 17 (Thursday): Lecture 12, Additive models
- The curse of dimensionality limits the usefulness of fully non-parametric
regression in problems with many variables: bias remains under control, but
variance grows rapidly with dimensionality. Parametric models do not have this
problem, but have bias and do not let us discover anything about the
true function. Structured or constrained non-parametric regression
compromises, by adding some bias so as to reduce variance. Additive models are
an example, where each input variable has a "partial response function", which
add together to get the total regression function; the partial response
functions are unconstrained. This generalizes linear models but still evades
the curse of dimensionality. Fitting additive models is done iteratively,
starting with some initial guess about each partial response function and then
doing one-dimensional smoothing, so that the guesses correct each other until a
self-consistent solution is reached. Examples in R using the California
house-price data. Conclusion: there is hardly ever any reason to prefer linear
models to additive ones, and the continued thoughtless use of linear regression
is a scandal.
- PDF notes,
incorporating R examples
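- An illustrative R sketch (the lecture itself uses the California house-price data): fitting an additive model with the mgcv package, with one smooth partial-response function per input.
      library(mgcv)
      set.seed(9)
      n <- 500
      df <- data.frame(x1 = runif(n), x2 = runif(n))
      df$y <- sin(2 * pi * df$x1) + (df$x2 - 0.5)^2 + rnorm(n, sd = 0.2)
      fit <- gam(y ~ s(x1) + s(x2), data = df)   # one smooth partial-response per input
      plot(fit, pages = 1)                       # the estimated partial-response functions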
- February 22 (Tuesday): Lecture 13, More about Hypothesis Testing
- Homework 6 due: PDF solutions, R code
- Midterm 1 out: Exam; your data set was e-mailed to your Andrew account
- February 24 (Thursday): No lecture
- March 1 (Tuesday): Q & A session
- Midterm 1 due (at 5 pm): PDF
solutions, R, master
data set
- March 3 (Thursday): Consolidation and examples
- With an emphasis on exam debriefing
- March 8 and March 10 (Tuesday and Thursday)
- Spring break
- March 15 (Tuesday): Lecture 14, Logistic regression
- Modeling conditional probabilities; using regression to model
probabilities; transforming probabilities to work better with regression; the
logistic regression model; maximum likelihood; numerical maximum likelihood by
Newton's method and by iteratively re-weighted least squares; comparing
logistic regression to logistic-additive models
- PDF notes
- Homework 7 out: PDF assignment
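- An illustrative R sketch (not from the lecture notes): logistic regression with glm(), which R fits by iteratively re-weighted least squares.
      set.seed(10)
      x <- rnorm(400)
      p <- 1 / (1 + exp(-(-1 + 2 * x)))            # true conditional probabilities
      y <- rbinom(400, size = 1, prob = p)
      fit <- glm(y ~ x, family = binomial)         # the logit link is the default
      coef(fit)
      predict(fit, newdata = data.frame(x = 0), type = "response")  # estimated P(Y=1|X=0)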
- March 17 (Thursday): Lecture 15, Generalized linear models and generalized additive models
- Poisson regression and other generalized linear models; over-dispersion;
generalized additive models
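- An illustrative R sketch (not from the lecture): Poisson regression on over-dispersed counts, with a Pearson-statistic check of the dispersion and a quasi-Poisson re-fit.
      set.seed(11)
      x <- runif(300)
      y <- rnbinom(300, mu = exp(1 + x), size = 2)   # negative-binomial counts: over-dispersed
      fit <- glm(y ~ x, family = poisson)
      sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # should be near 1 if Poisson holds
      fit.q <- glm(y ~ x, family = quasipoisson)     # same point estimates, wider standard errors
      summary(fit.q)$dispersion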
- March 22 (Tuesday): Lecture 16, Consolidation and examples
- Building a weather forecaster for Snoqualmie Falls, Wash., with logistic
regression. Exploratory examination of the data. Predicting wet or dry days
from the amount of precipitation the previous day. First logistic regression
model. Finding predicted probabilities and confidence intervals for them.
Comparison to spline smoothing and a generalized additive model. Model
comparison test detects significant mis-specification. Re-specifying the
model: dry days are special. The second logistic regression model and its
comparison to the data. Checking the calibration of the second model.
- PDF
handout, snoqualmie.csv
data set,
R
- Homework 8 out: assignment; Fair, 1978
- March 24 (Thursday): Lecture 17, Principal components analysis
- Principal components: the simplest, oldest and most robust of
dimensionality-reduction techniques. PCA works by finding the line (plane,
hyperplane) which passes closest, on average, to all of the data points. This
is equivalent to maximizing the variance of the coordinates of projections on
to the line/plane/hyperplane. Actually finding those principal components
reduces to finding eigenvalues and eigenvectors of the sample covariance
matrix. Why PCA is a data-analytic technique, and not a form of statistical
inference. An example with cars. PCA with words: "latent semantic analysis";
an example with real newspaper articles. Visualization with PCA and
multidimensional scaling. Cautions about PCA; the perils of reification;
illustration with genetic maps.
- PDF handout,
pca.R for
examples, cars data
set, R workspace for the New
York Times examples
- Homework 7 due (extended due to server outage): solutions
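- An illustrative R sketch (the lecture's examples use a cars data set and New York Times articles instead): principal components with prcomp() on a small data set shipped with R.
      pca <- prcomp(USArrests, scale. = TRUE)  # standardize the variables first
      pca$rotation                             # loadings: directions of the components
      summary(pca)                             # share of variance captured by each component
      biplot(pca)                              # data projected onto the first two components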
- March 29 (Tuesday): Lecture 18, Factor analysis
- Adding noise to PCA to get a statistical model. The factor analysis model,
or linear regression with unobserved independent variables. Assumptions of the
factor analysis model. Implications of the model: observable variables are
correlated only through shared factors; "tetrad equations" for one factor
models, more general correlation patterns for multiple factors. (Our first
look at latent variables and conditional independence.) Geometrically, the
factor model says the data have a Gaussian distribution on some low-dimensional
plane, plus noise moving them off the plane. Estimation by heroic linear
algebra; estimation by maximum likelihood. The rotation problem, and why it is
unwise to reify factors. Other models which produce the same correlation
patterns as factor models.
- PDF handout;
lecture-18.R computational
examples you should step through (not done in
class); correlates of sleep in
mammals data set for those
examples; thomson-model.R
- Homework 8 due: solutions; Li and Racine, 2004
- Homework 9: assignment, fx.csv data set
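- An illustrative R sketch (not the mammal-sleep example): simulating a one-factor model and recovering it by maximum likelihood with factanal().
      set.seed(12)
      n <- 500
      f <- rnorm(n)                                       # the latent factor
      loadings <- c(0.9, 0.8, 0.7, 0.6)
      X <- sapply(loadings, function(l) l * f + rnorm(n, sd = sqrt(1 - l^2)))
      colnames(X) <- paste0("x", 1:4)
      fa <- factanal(X, factors = 1)                      # maximum-likelihood factor analysis
      fa$loadings                                         # estimated loadings
      fa$PVAL                                             # test of the one-factor model's fit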
- March 31 (Thursday): Lecture 19, Mixture Models
- From factor analysis to mixture models by allowing the latent variable to
be discrete. From kernel density estimation to mixture models by reducing the
number of points with copies of the kernel. Probabilistic formulation of
mixture models. Geometry. Clustering. Estimation of mixture models by
maximum likelihood, and why it leads to a vicious circle. The
expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle
with iterative approximation. More on the EM algorithm: convexity, Jensen's
inequality, optimizing a lower bound, proving that each step of EM increases
the likelihood. Mixtures of regressions. Other extensions.
- PDF handout
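- An illustrative R sketch (not the lecture code): a bare-bones EM algorithm for a two-component Gaussian mixture in one dimension, to make the E- and M-steps concrete; packages such as mixtools (normalmixEM) do this more carefully.
      em.2gauss <- function(x, iters = 100) {
        p <- 0.5; mu <- quantile(x, c(0.25, 0.75)); sigma <- rep(sd(x), 2)
        for (i in 1:iters) {
          ## E-step: posterior responsibility of component 1 for each point
          d1 <- p * dnorm(x, mu[1], sigma[1])
          d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
          r <- d1 / (d1 + d2)
          ## M-step: re-estimate the mixing weight, means, and standard deviations
          p <- mean(r)
          mu <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
          sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
                     sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
        }
        list(p = p, mu = mu, sigma = sigma)
      }
      set.seed(13)
      x <- c(rnorm(300, 0, 1), rnorm(200, 4, 0.5))
      em.2gauss(x)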
- April 5 (Tuesday): Lecture 20, Mixture model examples and complements
- Precipitation in Snoqualmie Falls revisited. Fitting a two-component
Gaussian mixture; examining the fitted distribution; checking calibration.
Using cross-validation to select the number of components to use. Examination
of the selected mixture model. Suspicious patterns in the parameters of the
selected model. Approximating complicated distributions vs. revealing hidden
structure. Using bootstrap hypothesis testing to select the number of mixture
components. The multivariate Gaussian distribution: definition, relation to
the univariate or scalar Gaussian distribution; effect of linear
transformations on the parameters; plotting probability density contours in two
dimensions; using eigenvalues and eigenvectors to understand the geometry of
multivariate Gaussians; estimation by maximum likelihood; computational aspects,
specifically in R.
- PDF, R; bootcomp.R
(patch graciously provided by Dr. Derek Young)
- Homework 9 due: solutions
- Midterm 2 out: Assignment; your data set
was mailed to you
- April 7 (Thursday): Lecture 21, Graphical models
- Conditional independence and dependence properties in factor models. The
generalization to graphical models. Directed acyclic graphs. DAG models.
Factor, mixture, and Markov models as DAGs. The graphical Markov property.
Reading conditional independence properties from a DAG. Creating conditional
dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with
DAGs; does asbestos whiten teeth? Appendix: undirected graphical models, the
Gibbs-Markov theorem; directed but cyclic graphical models. Appendix: Some
basic notions of graph theory; Guthrie diagrams.
- PDF
- April 12 (Tuesday) Lecture 22, Graphical causal models
- Statistical dependence, counterfactuals, causation. Probabilistic
prediction (selecting a sub-ensemble) vs. causal prediction (generating a new
ensemble). Graphical causal models, structural equation models. The causal
Markov property. Faithfulness. Counterfactual prediction by "surgery" on
causal graphical models. The d-separation criterion. Path diagram rules.
Appendix: mutual information and independence; conditional mutual
information and conditional independence.
- PDF notes
- Midterm 2 due: Solutions, R for solutions
- Homework 10 out: assignment, fake-smoke.csv
- April 14 (Thursday): Spring carnival
- April 19 (Tuesday): Lecture 23, Estimating causal effects from observations
- Reprise of causal effects vs. probabilistic conditioning. "Why think, when
you can do the experiment?" Experimentation by controlling everything
(Galileo) and by randomizing (Fisher). Confounding and identifiability. The
back-door criterion for identifying causal effects: condition on covariates
which block undesired paths. The front-door criterion for identification: find
isolated and exhaustive causal mechanisms. Deciding how many black boxes to
open up. Instrumental variables for identification: finding some exogenous
source of variation and tracing its effects. Critique of instrumental
variables: vital role of theory, its fragility, consequences of weak
instruments. Irremovable confounding: an example with the detection of social
influence; the possibility of bounding unidentifiable effects. Matching and
propensity scores as computational short-cuts in back-door adjustment. Summary
recommendations for identifying and estimating causal effects.
- PDF notes
- Homework 10 due: Solutions
- Homework 11 out: Assignment
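- An illustrative R sketch (synthetic data, invented variable names): back-door adjustment in miniature, conditioning on the confounder and then averaging over its distribution.
      set.seed(14)
      n <- 5000
      z <- rnorm(n)                        # common cause of treatment and outcome
      x <- rbinom(n, 1, plogis(z))         # treatment depends on z
      y <- 2 * x + 3 * z + rnorm(n)        # true causal effect of x on y is 2
      coef(lm(y ~ x))["x"]                 # naive regression: confounded
      fit <- lm(y ~ x + z)                 # block the back-door path through z
      mean(predict(fit, data.frame(x = 1, z = z)) -
           predict(fit, data.frame(x = 0, z = z)))   # average over the distribution of z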
- April 21 (Thursday): Lecture 24, Discovering causal structure from observations
- How do we get our causal graph? Comparing rival DAGs by testing selected
conditional independence relations (or dependencies). The crucial difference
between common causes and common effects. Identifying colliders, and using
them to orient arrows. Inducing orientation to enforce consistency. The SGS
algorithm for discovering causal graphs; why it works. Refinements of the SGS
algorithm (the PC algorithm). What about latent variables?
Software: TETRAD and pcalg. Limits to observational causal
discovery: universal consistency is possible (and achieved), but uniform
consistency is not.
- PDF notes
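- An illustrative R sketch of the pcalg interface mentioned above (a hedged sketch of the usual calling convention; check the package documentation for exact arguments): running the PC algorithm on simulated Gaussian data with a known DAG.
      library(pcalg)
      set.seed(15)
      n <- 2000
      x1 <- rnorm(n)
      x2 <- x1 + rnorm(n)
      x3 <- x1 + rnorm(n)
      x4 <- x2 + x3 + rnorm(n)                    # x4 is a collider's child for x2 and x3
      dat <- data.frame(x1, x2, x3, x4)
      suff <- list(C = cor(dat), n = n)           # sufficient statistics for Gaussian CI tests
      fit <- pc(suffStat = suff, indepTest = gaussCItest,
                alpha = 0.01, labels = colnames(dat))
      fit                                         # the estimated CPDAG (equivalence class of DAGs)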
- April 26 (Tuesday): Lecture 25, Recap on estimating causal effects
- Substituting consistent estimators into the formulas for front and back
door identification. Tricks to avoid estimating marginal distributions.
Uncertainty in estimates of effects.
- Homework 11 due: Solutions
- Final exam out: Assignment
- April 28 (Thursday): General review
- May 9 (Monday): Final exam due at 10 am