Cosma Shalizi

36-402, Undergraduate Advanced Data Analysis

Spring 2013

Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professor only.

Prerequisites

36-401, or consent of the instructor. The latter is only granted under very unusual circumstances.

Instructors

Professor: Cosma Shalizi, cshalizi [at] cmu.edu, Baker Hall 229C, 268-7826
Teaching assistants: Mr. Beau Dabbs, Ms. Francesca Matano, Mr. Mingyu Tang, Ms. Xiaolin Yang

Topics, Notes, Readings

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; the bootstrap; penalized fitting; mis-specification checks
Yet More Linear Regression: what is regression, really?; what ordinary linear regression actually does; what it cannot do; extensions
Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; kernel density estimation
Generalized linear and additive models: logistic regression; generalized linear models; generalized additive models.
Latent variables and structured data: principal components; factor analysis and latent variables; latent cluster/mixture models; graphical models in general
Causality: graphical causal models; identification of causal effects from observations; estimation of causal effects; discovering causal structure
Dependent data: Markov models for time series without latent variables; hidden Markov models for time series with latent variables; longitudinal, spatial and network data
See the end for the current lecture schedule, subject to revision. Lecture notes will be linked there, as available.

Course Mechanics

Homework will count for 60% of the grade, the two midterms for 10% each, and the final exam for 20%.

Homework

The homework will give you practice in using the techniques you are learning to analyze data, and in interpreting the results. There will be twelve homework assignments, nearly one every week; each will be due on a Monday at 11:59 pm (i.e., the night before Tuesday's class), submitted through Blackboard. All homeworks count equally, totaling 60% of your grade. The lowest three homework grades will be dropped; consequently, no late homework will be accepted for any reason whatsoever.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it. This portion of the assignment will be graded along with the other questions. As always, raw computer output and R code are not acceptable in the body of your write-up; put them in an appendix to each assignment. Homework may be submitted either as a PDF (preferred) or as a plain text file (.txt). If you prepare your homework in Word, be sure to submit a PDF file; .doc, .docx, etc., files will not be graded.

Unlike PDF or plain text, Word files do not display consistently across different machines, different versions of the program on the same machine, etc., so avoiding them removes any possibility that what we grade differs from what you think you wrote. Word files are also much more of a security risk than PDF or (especially) plain text. Finally, it is obnoxious to force people to buy commercial, closed-source software just to read what you write. (It would be obnoxious even if Microsoft paid you for marketing its wares that way, but it doesn't.)

Exams

There will be two take-home mid-term exams (10% each), due at 11:59 pm on March 4th and April 15th. You will have one week to work on each midterm, and there will be no homework in those weeks. There will also be a take-home final exam (20%), due at 10:30 am on May 13; you will have two weeks to work on it.

Exams must also be submitted through Blackboard, under the same rules as homework.

Quality Control

To help control the quality of the grading, every week (after the first week of classes), six students will be selected at random, and will meet with the professor for ten minutes each, to explain their work and to answer questions about it. You may be selected on multiple weeks, if that's how the random numbers come up. This is not a punishment, but a way for the instructor to see whether the problem sets are really measuring learning of the course material; being selected will not hurt your grade in any way.

Office Hours

Prof. Shalizi: Baker Hall 229A, Monday 11:00--12:00; Baker Hall 229C, Thursday 12:00--1:00; Baker Hall 229C, Friday 3:30--4:30
Mr. Dabbs: FMS 320, Monday 2:00--4:00; FMS 320, Thursday 3:00--4:00
If you want help with computing, please bring your laptop.

Blackboard

Blackboard will be used for submitting assignments electronically, and as a gradebook. All properly enrolled students should have access to the Blackboard site by the beginning of classes.

Textbook

The primary textbook for the course will be the draft Advanced Data Analysis from an Elementary Point of View. Chapters will be linked here as they are needed. You are expected to read these notes, and are unlikely to be able to do the assignments without doing so. In addition, Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7) is required.

Cox and Donnelly's Principles of Applied Statistics (Cambridge University Press, 2011, ISBN 978-1-107-64445-8) will also have required readings, but we will not use all of it. If you are unable to purchase it, contact the professor for photocopies.

Julian Faraway, Extending the Linear Model with R (Chapman & Hall/CRC Press, 2006, ISBN 978-1-58488-424-8), and Venables and Ripley's Modern Applied Statistics with S (Springer, 2003, ISBN 978-0-387-95457-8) will be optional. (Faraway's page on the book, with help and errata.) The campus bookstore should have copies of all of these.

Collaboration, Cheating and Plagiarism

Feel free to discuss all aspects of the course with one another, including homework and exams. However, the work you hand in must be your own. You must not copy mathematical derivations, computer output and input, or written descriptions from anyone or anywhere else, without reporting the source within your work. This includes copying from solutions provided in previous semesters of the course. Unacknowledged copying will lead to severe disciplinary action. Please read the CMU Policy on Cheating and Plagiarism, and don't plagiarize.

Physically Disabled and Learning Disabled Students

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012.

R

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before), and this class presumes that you have. Almost every assignment will require you to use it. No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Here are some resources for learning R:

Even if you know how to do some basic coding (or more), you should read the page of Minimal Advice on Programming.

Reminders

Some handouts on stuff all of you should already know, but where evidently some of you could use refreshers:
  1. Uncorrelated vs. Independent
  2. Propagation of Error
  3. Which Bootstrap When?

Other Iterations of the Class

Some material is available from versions of this class taught in other years. Copying from any solutions provided there is not only cheating, it is very easily detected cheating.

Schedule

Subject to revision. Lecture notes, assignments and solutions will all be linked here, as they are available.

Current revision of the complete notes

January 15 (Tuesday): Lecture 1, Introduction to the class; regression
Statistics is the branch of mathematical engineering which designs and analyzes methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
Reading: Notes, chapter 1 (examples.dat for running example; ckm.csv data set for optional exercises); Cox and Donnelly, chapter 1
Optional reading: Faraway, chapter 1 (especially up to p. 17)
Homework 1: assignment, data
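As an illustration of the kernel-weighted averaging mentioned in this lecture (not part of the assigned notes), here is a minimal R sketch with simulated data; the function name, bandwidth, and data are all made up for the example.

    # Kernel smoother: estimate the regression function at each grid point
    # by a weighted average of the responses, with Gaussian weights.
    kernel.smoother <- function(x, y, x.grid, h) {
      sapply(x.grid, function(x0) {
        w <- dnorm((x - x0) / h)   # weights shrink with distance from x0
        sum(w * y) / sum(w)        # locally weighted average
      })
    }
    # Simulated example
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.3)
    x.grid <- seq(0, 10, length.out = 100)
    plot(x, y)
    lines(x.grid, kernel.smoother(x, y, x.grid, h = 0.5), lwd = 2)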
January 17 (Thursday): Lecture 2, The truth about linear regression
Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What it really means when coefficients are significantly non-zero. What "controlled for in a linear regression" really means.
Reading: Notes, chapter 2 (R); Notes, appendix B
Optional reading: Faraway, rest of chapter 1
January 22 (Tuesday): Lecture 3, Evaluation: Error and inference
Statistical models have three main uses: as ways of summarizing (reducing, compressing) the data; as scientific models, facilitating actual scientific inference; and as predictors. Both summarizing and scientific inference are linked to prediction (though in different ways), so we'll focus on prediction. In particular, for now we focus on the average error of prediction, under some particular measure of error. The distinction between in-sample error and generalization error, and why the former is almost invariably optimistic about the latter. Over-fitting. Examples of just how spectacularly one can over-fit really very harmless data. A brief sketch of the ideas of learning theory and capacity control. Data-set splitting as a first attempt at practically controlling over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences.
Reading: Notes, chapter 3 (R)
Cox and Donnelly, ch. 6 (on Blackboard)
Homework 1 due at midnight on Monday
Homework 2: assignment, R, penn-select.csv data file
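For concreteness (and not as part of the assigned notes), a minimal sketch of k-fold cross-validation for comparing regression specifications; the helper cv.mse and the simulated data frame are hypothetical.

    # Estimate generalization error by 5-fold cross-validation:
    # fit on all but one fold, predict the held-out fold, average the errors.
    cv.mse <- function(formula, data, nfolds = 5) {
      fold <- sample(rep(1:nfolds, length.out = nrow(data)))
      response <- all.vars(formula)[1]
      errs <- sapply(1:nfolds, function(k) {
        fit <- lm(formula, data = data[fold != k, ])
        preds <- predict(fit, newdata = data[fold == k, ])
        mean((data[fold == k, response] - preds)^2)
      })
      mean(errs)
    }
    # Compare a linear and a quadratic specification on simulated data
    df <- data.frame(x = runif(100))
    df$y <- df$x^2 + rnorm(100, sd = 0.1)
    c(linear = cv.mse(y ~ x, df), quadratic = cv.mse(y ~ poly(x, 2), df))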
January 24 (Thursday): Lecture 4, Smoothing methods in regression
The bias-variance trade-off tells us how much we should smooth. Adapting to unknown roughness with cross-validation; detailed examples. How quickly does kernel smoothing converge on the truth? Using kernel regression with multiple inputs. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
Reading: Notes, chapter 4 (R)
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
January 29 (Tuesday): Lecture 5, Simulation
Simulation: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. How to generate random variables with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
Reading: Notes, chapter 5 (but sections 5.4--5.6 are optional); R
Homework 2 due at midnight on Monday
Homework 3 assigned: assignment, SPhistory.short.csv
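A minimal sketch (not from the notes) of simulating from an estimated model to see whether the data look like typical simulation output; the linear-Gaussian model and all names are illustrative only.

    # Fit a simple model, then simulate new data sets from the fit and
    # compare a summary statistic of the simulations to the real data.
    x <- runif(100, 0, 5)
    y <- 2 + 3 * x + rnorm(100)
    fit <- lm(y ~ x)
    simulate.from.fit <- function(fit, x) {
      b <- coef(fit)
      b[1] + b[2] * x + rnorm(length(x), sd = summary(fit)$sigma)
    }
    sim.sds <- replicate(1000, sd(simulate.from.fit(fit, x)))
    hist(sim.sds); abline(v = sd(y), lwd = 2)   # does the data look typical?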
January 31 (Thursday): Lecture 6, The Bootstrap
Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping: simulating from a model. Non-parametric bootstrapping: re-sampling the data. Special issues for regression: re-sampling residuals vs. re-sampling cases. Many examples. When does the bootstrap fail?
Reading: Notes, chapter 6 (R for figures and examples; pareto.R; wealth.dat)
Lecture slides; R for in-class examples
Cox and Donnelly, chapter 8
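As a small illustration (not the code used in the notes), a sketch of the non-parametric bootstrap by re-sampling cases, here to get a standard error for a regression slope from simulated data.

    # Re-sample rows of the data with replacement, re-estimate the slope
    # each time, and use the spread of the re-estimates as the uncertainty.
    df <- data.frame(x = runif(100))
    df$y <- 1 + 2 * df$x + rnorm(100, sd = 0.5)
    boot.slopes <- replicate(1000, {
      resample <- df[sample(nrow(df), replace = TRUE), ]
      coef(lm(y ~ x, data = resample))[2]
    })
    sd(boot.slopes)                          # bootstrap standard error
    quantile(boot.slopes, c(0.025, 0.975))   # crude percentile interval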
February 5 (Tuesday): Lecture 7, Writing R Code
R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
Reading: Notes, Appendix A
Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 15.
Homework 3 due at midnight on Monday
Homework 4 Assignment, nampd.csv data set, code for the assignment
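To illustrate the pattern of encapsulating a calculation in a function and writing simple sanity checks, here is a made-up example (not one from the notes):

    # A Huber-type loss: quadratic near zero, linear in the tails.
    huber.loss <- function(x, c = 1) {
      stopifnot(is.numeric(x), c > 0)   # sanity-check the arguments
      ifelse(abs(x) <= c, x^2, 2 * c * abs(x) - c^2)
    }
    # Simple tests: known values and expected symmetries
    stopifnot(huber.loss(0) == 0,
              huber.loss(1) == 1,
              huber.loss(-3) == huber.loss(3))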
February 7 (Thursday): Lecture 8, Heteroskedasticity, weighted least squares, and variance estimation
Weighted least squares estimates, to give more emphasis to particular data points. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the conditional variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Examples of conditional variance estimation. Locally constant and locally linear modeling. Lowess.
Reading: Notes, chapter 7
Optional reading: Faraway, section 11.3
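A minimal sketch (not from the notes) of the iterative idea described in this lecture: estimate the regression, smooth the log squared residuals to estimate the conditional variance, and re-fit by weighted least squares. Data are simulated and convergence checks are omitted.

    x <- runif(200, 0, 10)
    y <- x + rnorm(200, sd = 0.5 + 0.3 * x)         # heteroskedastic noise
    fit <- lm(y ~ x)
    for (i in 1:5) {
      var.fit <- loess(log(residuals(fit)^2) ~ x)   # smooth the squared residuals
      w <- 1 / exp(fitted(var.fit))                 # weights = 1 / estimated variance
      fit <- lm(y ~ x, weights = w)                 # weighted least squares re-fit
    }
    summary(fit)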
February 12 (Tuesday): Lecture 9, Splines
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure linear regression at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
Reading: Notes, chapter 8
Optional reading: Faraway, section 11.2
Homework 4 due at midnight on Monday
Homework 5 Assignment
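A minimal sketch of smoothing splines in base R, with the penalty strength chosen by cross-validation; the data are simulated and the code is only illustrative.

    x <- runif(300, 0, 10)
    y <- sin(x) + rnorm(300, sd = 0.4)
    fit <- smooth.spline(x, y, cv = TRUE)   # leave-one-out CV picks the penalty
    plot(x, y)
    lines(fit, lwd = 2)                     # the fitted piecewise-cubic curve
    fit$lambda                              # the selected penalty strength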
February 14 (Thursday): Lecture 10, Additive models
The curse of dimensionality limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example: each input variable gets its own unconstrained "partial response function", and these add together to give the total regression function. This generalizes linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Examples in R using the California house-price data. Conclusion: there is hardly ever any reason to prefer linear models to additive ones, and the continued thoughtless use of linear regression is a scandal.
Reading: Notes, chapter 9 (mapper.R)
Optional reading: Faraway, chapter 12
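A minimal sketch of fitting an additive model with the mgcv package (one common R implementation); the variables here are simulated stand-ins, not the California house-price data used in the notes.

    library(mgcv)
    df <- data.frame(x1 = runif(500), x2 = runif(500))
    df$y <- sin(2 * pi * df$x1) + df$x2^2 + rnorm(500, sd = 0.2)
    fit <- gam(y ~ s(x1) + s(x2), data = df)   # one smooth partial response per input
    summary(fit)
    plot(fit, pages = 1)                       # the estimated partial response functions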
February 19 (Tuesday): Lecture 11, Testing Regression Specifications
Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
Reading: Notes, chapter 10; R for in-class demos
Cox and Donnelly, chapter 7
Homework 5 due at midnight on Monday
Homework 6: assignment
February 21 (Thursday): Lecture 12, More about Hypothesis Testing
The logic of hypothesis testing: significance, power, the will to believe, and the (shadow) price of power. Severe tests of hypotheses: severity of rejection vs. severity of acceptance. Common abuses. Confidence sets as the "dual" to hypothesis tests. Crucial role of sampling distributions. Examples, right and wrong.
Reading: Notes, chapter 11
February 26 (Tuesday): Lecture 13, Logistic regression
Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
Reading: Notes, chapter 12
Optional reading: Faraway, chapter 2 (omitting sections 2.11 and 2.12)
Homework 6 due at midnight on Monday
Midterm 1: assignment. Your data-set has been e-mailed to you.
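For reference (not part of the assignment), a minimal sketch of logistic regression with glm() on simulated data; R fits it by iteratively re-weighted least squares behind the scenes.

    x <- runif(500, -3, 3)
    y <- rbinom(500, size = 1, prob = 1 / (1 + exp(-(1 + 2 * x))))
    fit <- glm(y ~ x, family = binomial)   # logistic regression
    coef(fit)                              # estimates of the true (1, 2)
    head(predict(fit, type = "response"))  # predicted probabilities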
February 28 (Thursday): Lecture 14, Generalized linear models and generalized additive models
Poisson regression for counts; iteratively re-weighted least squares again. The general pattern of generalized linear models; over-dispersion. Generalized additive models.
Reading: Notes, first half of chapter 13
Optional reading: Faraway, section 3.1 and chapter 6
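A minimal sketch (with simulated counts, not from the notes) of Poisson regression with glm(), plus a rough check for over-dispersion.

    x <- runif(300, 0, 2)
    counts <- rpois(300, lambda = exp(0.5 + 1.2 * x))
    fit <- glm(counts ~ x, family = poisson)
    summary(fit)
    deviance(fit) / df.residual(fit)   # values far above 1 suggest over-dispersion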
March 5 (Tuesday): Lecture 15, GLM and GAM Examples
Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
Reading: Notes, second half of chapter 13
Optional reading: Faraway, chapters 6 and 7 (continued from previous lecture)
Midterm 1 due at midnight on Monday
March 7 (Thursday): Lecture 16, Multivariate Distributions
Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
Reading: Notes, chapter 14
March 12 and 14: Spring break
March 19 (Tuesday): Lecture 17, Density Estimation
The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
Reading: Notes, chapter 15
Homework 7 assignment, n90_pol.csv data
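A minimal sketch of kernel density estimation with base R's density(); the bimodal data are simulated, and the bandwidth selectors named in the comments are just two of the built-in options.

    x <- c(rnorm(200, mean = -2), rnorm(200, mean = 2))   # bimodal sample
    d <- density(x, bw = "SJ")     # Sheather-Jones plug-in bandwidth
    plot(d); rug(x)
    density(x, bw = "ucv")         # alternative: unbiased cross-validation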
March 21 (Thursday): Lecture 18, Relative Distributions and Smooth Tests of Goodness-of-Fit
Applying the right CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
Reading: Notes, chapter 16
Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"
March 26 (Tuesday): Lecture 19, Principal Components Analysis
Principal components analysis is the simplest, oldest, and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
Reading: Notes, chapter 17; pca.R, pca-examples.Rdata, and cars-fixed04.dat
Homework 7 due at midnight on Monday
Homework 8 assignment, MOM data file
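A minimal sketch of PCA with prcomp(), using a built-in data set rather than the cars data linked above.

    states <- as.data.frame(state.x77)      # built-in US states data
    pca <- prcomp(states, scale. = TRUE)    # scale variables before PCA
    summary(pca)                            # variance captured by each component
    biplot(pca)                             # projections plus variable loadings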
March 28 (Thursday): Lecture 20, Factor Analysis
Adding noise to PCA to get a statistical model. The factor analysis model, or linear regression with unobserved independent variables. Assumptions of the factor analysis model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one-factor models, more general correlation patterns for multiple factors. (Our first look at latent variables and conditional independence.) Geometrically, the factor model says the data have a Gaussian distribution on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
Reading: Notes, chapter 18; factors.R and sleep.txt
April 2 (Tuesday): Lecture 21, Mixture Models
From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
Reading: Notes, first half of chapter 19
Homework 8 due at midnight on Monday
Homework 9 (cancelled)
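A minimal sketch (crude initialization, fixed number of iterations) of the EM algorithm for a two-component Gaussian mixture, written from scratch for illustration rather than taken from the notes.

    x <- c(rnorm(300, 0, 1), rnorm(200, 4, 1))   # simulated mixture data
    mu <- c(-1, 5); sigma <- c(1, 1); p1 <- 0.5  # rough starting values
    for (i in 1:100) {
      # E step: responsibility of component 1 for each point
      d1 <- p1 * dnorm(x, mu[1], sigma[1])
      d2 <- (1 - p1) * dnorm(x, mu[2], sigma[2])
      r <- d1 / (d1 + d2)
      # M step: weighted re-estimates of the parameters
      p1 <- mean(r)
      mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
      sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
                 sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
    }
    c(p1 = p1, mu = mu, sigma = sigma)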
April 4 (Thursday): Lecture 22, Mixture Model Examples and Complements
Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.
Reading: Notes, second half of chapter 19; mixture-examples.R
April 9 (Tuesday): Lecture 23, Graphical Models
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?
Reading: Notes, chapter 20
Homework 9 due at midnight on Monday
Exam 2: assignment (your data set was mailed to you)
April 11 (Thursday): Lecture 24, Graphical Causal Models
Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.
Reading: Notes, chapter 21
Optional reading: Cox and Donnelly, chapter 9; Pearl, "Causal Inference in Statistics", section 1, 2, and 3 through 3.2
April 16 (Tuesday): Lecture 25, Identifying Causal Effects from Observations
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.
Reading: Notes, chapter 22
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
Midterm 2 due at midnight on Monday
Homework 10: assignment, sesame.csv
April 18 (Thursday): Carnival, no class
April 23 (Tuesday): Lecture 26, Estimating Causal Effects from Observations
Estimating causal effects from graphical models: substituting consistent estimators into the formulas for front- and back-door identification; average effects and regression; tricks to avoid estimating marginal distributions; propensity scores, and matching on propensity scores, as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least squares. Summary recommendations for estimating causal effects.
Reading: Notes, chapter 23
Homework 10 due at midnight on Monday
Homework 11 assignment, debt.csv
April 25 (Thursday): Lecture 27, Discovering Causal Structure from Observations
How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). Equivalence classes of graphs. Causal arrows never go away no matter what you condition on ("no causation without association"). The crucial difference between common causes and common effects: conditioning on common causes makes their effects independent, conditioning on common effects makes their causes dependent. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. The PC algorithm: the SGS algorithm for lazy people. What about latent variables? Software: TETRAD and pcalg; examples of working with pcalg. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
Reading: Notes, chapter 24
April 30 (Tuesday): Lecture 28, Time Series I
What time series are. Properties: autocorrelation or serial correlation; strong and weak stationarity. The correlation time, the world's simplest ergodic theorem, effective sample size. The meaning of ergodicity: a single sufficiently long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.
Reading: Notes, chapter 25; R for examples; gdp-pc.csv
Homework 11 due at midnight on Monday
Final exam assignment; strikes.csv and macro.csv data sets
Help installing pcalg
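A minimal sketch (simulated series, hypothetical variable names) of an additive autoregression from this lecture: regress the series on its own lagged value with a smoother, here via mgcv.

    library(mgcv)
    n <- 500
    x <- numeric(n)
    for (t in 2:n) x[t] <- 0.8 * sin(x[t - 1]) + rnorm(1, sd = 0.3)
    lagged <- data.frame(now = x[-1], before = x[-n])   # pair x_t with x_{t-1}
    fit <- gam(now ~ s(before), data = lagged)
    plot(fit)   # the estimated (nonlinear) autoregression function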
May 2 (Thursday): Lecture 29, Time Series II
Cross-validation for time series. Change-points and "structural breaks". Moving averages: spurious correlations (Yule effect) and oscillations (Slutsky effect). State-space or hidden Markov models; moving average and ARMA models as state-space models. The EM algorithm for hidden Markov models; particle filtering. Multiple time series: "dynamic" graphical models; "Granger" causality (which is not causal); the possibility of real causality.
May 13
Final exam due at 10:30 am