Data over Space and Time (36-467/667)

Fall 2018

If you are looking for another iteration of this class, perhaps because you are taking it, see here

Cosma Shalizi
Tuesdays and Thursdays, 10:30--11:50, Posner Mellon Auditorium

This course is an introduction to the opportunities and challenges of analyzing data from processes unfolding over space and time. It will cover basic descriptive statistics for spatial and temporal patterns; linear methods for interpolating, extrapolating, and smoothing spatio-temporal data; basic nonlinear modeling; and statistical inference with dependent observations. Class work will combine practical exercises in R, some mathematics of the underlying theory, and case studies analyzing real data from various fields (economics, history, meteorology, ecology, etc.). Depending on available time and class interest, additional topics may include: statistics of Markov and hidden-Markov (state-space) models; statistics of point processes; simulation and simulation-based inference; agent-based modeling; dynamical systems theory.

Co-requisite: For undergraduates taking the course as 36-467, 36-401. For graduate students taking the course as 36-667, consent of the professor.

Note: Graduate students must register for the course as 36-667; even if the system lets you sign up for 36-467, you will be dropped from the roster. Undergraduates (whether statistics majors or not) must register for 36-467.

This webpage will serve as the class syllabus. Course materials (notes, homework assignments, etc.) will be posted here, as available.

Goals and Learning Outcomes

(Accreditation officials look here)

The goal of this class is to train you in using statistical models to analyze interdependent data spread out over space, time, or both, using the models as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory of statistical inference for independent data taught in 36-226, and complement the theory and applications of the linear model, introduced in 36-401. After taking the class, when you're faced with a new temporal, spatial, or spatio-temporal data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

Topics

This class will not give much coverage to ARIMA models of time series, a subject treated extensively in 36-618.

Course Mechanics

Textbooks

The only required textbook is
Gidon Eshel, Spatiotemporal Data Analysis (Princeton, New Jersey: Princeton University Press, 2011, ISBN 978-0-691-12891-7, available on JSTOR).
The CMU library has electronic access to the full text, in PDF, through the JSTOR service. (You will need to either be on campus, or logged in to the university library.) Links to individual chapters will be posted as appropriate.

In addition, we will assign some sections from

Peter Guttorp, Stochastic Modeling of Scientific Data (Boca Raton, Florida: Chapman & Hall / CRC Press, 1995, ISBN 978-0-412-99281-0).
Because this book is expensive, the library doesn't have electronic access, and a lot of it is about (interesting and important) topics outside the scope of the class, it is not required. Instead, scans of the appropriate sections will be distributed via Canvas.

You will also be doing a lot of computational work in R, so

Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7)
is recommended. R's help files answer "What does command X do?" questions. This book is organized to answer "What commands do I use to do Y?" questions.

Assignments

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky become clear when you actually do them, and sometimes trying to do something shows you what you only thought you understood.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be three kinds of assignment in this course.

In-class exercises
Most lectures will have in-class exercises. These will be short (10--20 minutes) assignments, emphasizing problem solving, done in class in small groups. The assignments will be given out in class, and must be handed in on paper by the end of class. On some days, a randomly-selected group may be asked to present their solution to the class.
Homework
Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results. The in-class exercises will either be problems from that week's homework, or close enough that seeing how to do the exercise should tell you how to do some of the problems.
All homework will be submitted electronically through Canvas. Most weeks, homework will be due at 6:00 pm on Wednesday. There will be a few weeks, clearly noted on the syllabus and on the assignments, when Thursday lecture will be canceled and homework will be due at noon on Thursday, i.e., the end of the lecture period. (When this means that there are only six days for the next homework, it will be shortened accordingly.)
There are specific formatting requirements for homework --- see below.
Exams
There will be both a midterm and a final exam. Each of these will require you to analyze a real-world data set, answering questions posed about it in the exam, and to write up your analysis in the form of a scientific report. The exam assignments will provide the data set, the specific questions, and a rubric for your report.
Both exams will be take-home, and you will have at least one week to work on each, without homework (from this class anyway). Both exams will be cumulative.
Exams are to be submitted through Canvas, and follow the same formatting requirements as the homework --- see below.
The midterm will be due on October 11, and the final on December 14. If you might have a conflict with these dates, contact me as soon as possible.

Grading

Grades will be broken down as follows:

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.

If you think a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments.

As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Lectures

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. They are also when you will do the in-class assignments which will help with your homework, and are part of your grade.

You are expected to do the readings before coming to class.

Do not use any electronic devices during lecture: no laptops, no tablets, no phones, no watches that do more than tell time. If you need to use an electronic assistive device, please make arrangements with me beforehand. (Experiments show, pretty clearly, that students learn more in electronics-free classrooms, not least because your device isn't distracting your neighbors.)

Office Hours

If you want help with computing, please bring a laptop.

Monday, 2:30--3:30: Mr. Elliott, Baker Hall 229A (on 10 September only: Porter Hall A20A)
Tuesday, 3:00--4:00: Mr. Elliott, Baker Hall 132Q
Wednesday, 1:00--3:00: Prof. Shalizi, Baker Hall 229C

If you cannot make the regular office hours, or have concerns you'd rather discuss privately, please e-mail me about making an appointment.

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). All computational work in this course must be done in R; no other form will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code. For help on using R Markdown, see "Using R Markdown for Class Reports".
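To make this concrete, here is a minimal sketch of what an R Markdown source file looks like. (The title is arbitrary, kyoto.csv is the homework 1 data file, and the column name used below is hypothetical.)

    ---
    title: "Homework 1"
    author: "Your Name"
    output: pdf_document
    ---

    The mean flowering date is computed from the data below.

    ```{r}
    kyoto <- read.csv("kyoto.csv")           # homework 1 data file
    mean(kyoto$flowering.doy, na.rm = TRUE)  # sample mean; column name is hypothetical
    ```

Knitting this file (with the "Knit" button in RStudio, or with rmarkdown::render()) produces a document in which the text, the code, and the code's output all appear together.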

Format Requirements for Homework and Exams

For each assignment, you should submit two, and only two, files: an R Markdown source file, integrating text, generated figures and R code, and the "knitted", humanly-readable document, in either PDF (preferred) or HTML format. (I cannot read Word files, and you will lose points if you submit them.) I will be re-running the R Markdown file of randomly selected students; you should expect to be picked for this about once in the semester. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making obvious allowances for random numbers, etc.).

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by homework 12.

Canvas and Piazza

Homework and exams will be submitted electronically through Canvas, which will also be used as the gradebook. Some readings and course materials will also be distributed through Canvas.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. Anonymity will go away for everyone if it is abused.

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbook and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students or the instructors until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

During the take-home exams, you are not allowed to discuss the content of the exams with anyone other than the instructors; in particular, you may not discuss the content of the exam with other students in the course.

Any use of solutions provided for any assignment in this course in previous years is strictly prohibited, both for homework and for exams. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will be no penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how your mis-deeds might be rectified. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Schedule

SUBJECT TO CHANGE (with notice)
(The readings will be made specific at least one week before they're assigned.)
August 28 (Tuesday): Lecture 1, Introduction to the course
Welcome; course mechanics; data distributed over space and time; goals and challenges; basic EDA by way of pictures
Slides for lecture 1 (R Markdown source file)
Homework 0: assignment; see above for this course's policy on cheating, collaboration and plagiarism; here for CMU's policy on academic integrity; and on Piazza for the excerpt from Turabian's Manual for Writers
Homework 1: assignment, kyoto.csv data file
August 30 (Thursday): Lecture 2, Smoothing, Trends, Detrending I
Smoothing by local averaging. The idea of a trend, and de-trending. Smoothing as EDA. Some of the math of smoothing: the hat matrix, degrees of freedom. Expanding in eigenvectors. (A short R sketch of local averaging follows this entry.)
Notes for lectures 2 and 3 (Rmd source file)
Reading: Eshel, chapter 7; Guttorp, introduction and chapter 1; Turabian, excerpts (on Piazza)
Optional reading: Karen Kafadar, "Smoothing Geographical Data, Particularly Rates of Disease", Statistics in Medicine 15 (1996): 2539--2560
Homework 0 due
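To give a first taste of lecture 2's material in R, here is a minimal sketch of smoothing by local averaging, on a made-up series; the 11-point window width is arbitrary.

    t <- 1:200
    x <- sin(2 * pi * t / 100) + rnorm(200, sd = 0.5)   # made-up trend plus noise
    x.sm <- stats::filter(x, rep(1/11, 11), sides = 2)  # centered 11-point moving average
    plot(t, x, col = "grey")
    lines(t, x.sm, lwd = 2)                             # smoothed estimate of the trend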
September 4 (Tuesday): Lecture 3, Smoothing, Trends, Detrending II
The hat matrix as the source of all knowledge. Residuals after de-trending as estimates of the fluctuations. The Yule-Slutsky effect. Picking how much to smooth by cross-validation. Special considerations for ratios (Kafadar).
Slides for lecture 3 (Rmd source file)
Reading: Eshel, chapter 8
September 6 (Thursday): Lecture 4, Principal Components I
The goal of principal components: finding simpler, linear structure in complicated, high-dimensional data. Math of principal components: linear approximation -> preserving variance -> eigenproblem. Reminders from linear algebra about eigenproblems. Mathematical solution to PCA. How to do PCA in R. (A short R sketch follows this entry.)
Slides (Rmd source)
Reading: Eshel, chapter 4, and skim chapter 5
Homework 1 due at 6 pm on Wednesday, September 5
Homework 2: assignment; smoother.matrix.R for use in problem 1
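Since "how to do PCA in R" is on lecture 4's agenda, here is a minimal sketch using base R's prcomp(), on a stand-in data matrix.

    X <- matrix(rnorm(100 * 5), nrow = 100)  # stand-in for 100 observations of 5 variables
    pca <- prcomp(X, scale. = TRUE)          # PCA after centering and standardizing
    summary(pca)                             # share of variance along each component
    pca$rotation[, 1]                        # weights (loadings) of the first component
    head(pca$x)                              # scores: the data in the new coordinates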
September 11 (Tuesday): Lecture 5, Principal Components II
Brief recap on PCA. Applying PCA to multiple time series. Applying PCA to spatial data. Applying PCA to spatio-temporal data. Interpreting PCA results. Why PCA can be good exploratory analysis, but is not statistical inference. Glimpses of some alternatives to PCA: independent component analysis, slow feature analysis, etc.
Slides (Rmd source)
Reading: Eshel, chapter 11, sections 11.1--11.7 and 11.9--11.10 (i.e., skipping 11.8 and 11.11--11.12)
September 13 (Thursday): No class
Homework 2 due at noon on Thursday, September 13
Homework 3: assignment; soccomp.irep1.csv data file; soccomp.csv data file.
September 18 (Tuesday): Lecture 6, Optimal Linear Prediction
Mathematics of prediction. Mathematics of optimal linear prediction, in any context whatsoever. Ordinary least squares as an estimator of the optimal linear predictor. Why we need the covariance functions. (A short R sketch follows this entry.)
Slides (Rmd)
Reading: Eshel, chapter 9, sections 9.1--9.3
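A minimal sketch of lecture 6's point that ordinary least squares estimates the optimal linear predictor, on made-up data.

    z <- rnorm(200)
    y <- 1 + 2 * z + rnorm(200)  # made-up data with a linear signal
    coef(lm(y ~ z))              # OLS estimates of the optimal linear predictor's coefficients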
September 20 (Thursday): Lecture 7, Linear Interpolation and Extrapolation of Time Series
Applying the linear-predictor idea to time series: interpolating between observations; extrapolating into the future (or past). The concept of stationarity. Auto- and cross-covariance. Covariance functions as EDA. Basic covariance estimation in R. (A short R sketch follows this entry.)
Slides (Rmd)
Reading: Eshel, chapter 9, section 9.5 (skipping 9.5.3 and 9.5.4)
Homework 3 due at 6:00 pm on Wednesday, September 19
Homework 4: assignment
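For lecture 7's "basic covariance estimation in R", a minimal sketch on a simulated series; the AR(1) coefficient 0.7 is arbitrary.

    x <- arima.sim(model = list(ar = 0.7), n = 500)  # simulated AR(1) series
    acf(x, type = "covariance")                      # sample autocovariance vs. lag
    acf(x, type = "correlation")                     # sample autocorrelation vs. lag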
September 25 (Tuesday): Lecture 8, Linear Interpolation and Extrapolation of Spatial and Spatio-Temporal Data
Applying the linear-predictor idea to spatial or spatio-temporal data ("kriging"): interpolating between observations, extrapolating into the unobserved. More advanced covariance estimation in spatial contexts. Concepts of stationarity, isotropy, separability, etc. (A short R sketch follows this entry.)
Slides (Rmd)
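A minimal sketch of the kriging calculation from lecture 8, with made-up sites and an assumed (not estimated) exponential covariance function; a real analysis would estimate the covariance first.

    obs.xy <- cbind(runif(20), runif(20))  # made-up observation locations
    z <- rnorm(20)                         # made-up observed values (mean zero)
    cov.fun <- function(d) exp(-d / 0.3)   # assumed covariance, as a function of distance
    K <- cov.fun(as.matrix(dist(obs.xy)))  # covariances among the observations
    new.xy <- c(0.5, 0.5)                  # location where we want a prediction
    k <- cov.fun(sqrt(colSums((t(obs.xy) - new.xy)^2)))  # covariances to the new location
    sum(solve(K, k) * z)                   # the optimal linear (simple kriging) prediction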
September 27 (Thursday): Lecture 9, Separating Signal and Noise with Linear Methods
Applying the linear-predictor idea to remove observational noise ("the Wiener filter"). Extracting periodic components and seasonal adjustment. (A short R sketch follows this entry.)
Slides (Rmd)
Homework 4 due at 6:00 pm on Wednesday, September 26
Homework 5: assignment
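A minimal sketch of seasonal adjustment in R, using the built-in monthly co2 series; note this uses classical moving-average decomposition rather than the Wiener filter itself.

    dec <- decompose(co2)         # built-in monthly Mauna Loa CO2 series
    plot(dec)                     # trend, seasonal, and irregular components
    co2.sa <- co2 - dec$seasonal  # the seasonally adjusted series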
October 2 (Tuesday): Lecture 10, Fourier Methods I
Decomposing time series into periodic signals, a.k.a. "going from the time domain to the frequency domain", a.k.a. "spectral analysis". The Fourier transform and the inverse Fourier transform. Fourier transform of a time series. Fourier transform of an autocovariance function, a.k.a. "the power spectrum". Wiener-Khinchin theorem. Interpreting the power spectrum; hunting for periodic components. Estimating the power spectrum: the periodogram. (A short R sketch follows this entry.)
Slides (Rmd)
Reading: Eshel, sections 4.3.2 and 9.5.4
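A minimal sketch of estimating the power spectrum with the periodogram, on a made-up noisy sinusoid of period 20.

    t <- 1:512
    x <- cos(2 * pi * t / 20) + rnorm(512)    # made-up periodic signal plus noise
    spec.pgram(x, taper = 0, detrend = TRUE)  # raw periodogram; expect a peak near frequency 1/20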
October 4 (Thursday): Lecture 11, Fourier Methods II; midterm review
Recap on Fourier analysis. More on estimating the power spectrum: smoothed periodograms. Frequency-domain covariance estimation. Spatial and spatio-temporal Fourier transforms. More interpretation. Fourier analysis vs. PCA. A brief glimpse of wavelets. Generating new time series from the spectrum.
Homework 5 due at 6 pm on Wednesday, 3 October
Midterm exam: Assignment, ccw.csv data-set
October 9 (Tuesday): Guest lecture by Prof. Patrick Manning: "African Population and Migration: Statistical Estimates, 1650--1900"
Reading: Handout distributed in class on 4 October [PDF]
October 11 (Thursday): No class
Midterm exam due at noon
Homework 6: assignment
October 16 (Tuesday): Lecture 12, Linear Generative Models for Time Series
Linear generative models for random sequences: autoregressions. Deterministic dynamical systems; more fun with eigenvalues and eigenvectors. Stochastic aspects. Vector auto-regressions.
Slides (Rmd)
Handout: AR(p) vs. VAR(1) models
Reading: Eshel, sections 9.5 and 9.7
October 18 (Thursday): Lecture 13, Linear Generative Models for Spatial and Spatio-Temporal Data
Simultaneous vs. conditional autoregressions for random fields. The "Gibbs sampler" trick. Autoregressions for spatio-temporal processes.
Slides (Rmd)
Homework 6 due at 6:00 pm on Wednesday, October 17
Homework 7: assignment, sial.csv data file
October 23 (Tuesday): Lecture 14, Statistical Inference with Dependent Data I
Reminder: why maximum likelihood and Gaussian approximations work for IID data. Consistency from convergence (law of large numbers); Gaussian approximation from fluctuations (central limit theorem). The "sandwich covariance" for general estimators. How these ideas carry over to dependent data.
Slides (.Rmd, pareto.R)
Reading: Guttorp, Appendix A
October 25 (Thursday): Lecture 15, Inference with Dependent Data II
Ergodic theory, a.k.a. laws of large numbers for dependent data. Basic ergodic theory for stochastic processes. Correlation times and effective sample size. Inference with autoregressions. Gestures at more advanced ergodic theory. (A short R sketch follows this entry.)
Reading: N/A
Slides (.Rmd)
Homework 7 due at 6:00 pm on Wednesday, October 24
Homework 8: Assignment, data (=simulation) files demorun.csv and remorun.csv
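A minimal sketch of lecture 15's correlation-time idea, using the effective-sample-size formula for the mean of an AR(1) process; the coefficient 0.9 is arbitrary.

    x <- arima.sim(model = list(ar = 0.9), n = 1000)  # strongly correlated series
    rho <- acf(x, plot = FALSE)$acf[2]                # estimated lag-1 autocorrelation
    1000 * (1 - rho) / (1 + rho)                      # effective sample size: far fewer than 1000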
October 30 (Tuesday), Lecture 16: CANCELLED
No class meeting today.
November 1 (Thursday): Lecture 17, Simulation
General idea of simulating a statistical model. The "Monte Carlo method": using simulation to compute probabilities, expected values, etc. (A short R sketch follows this entry.)
Slides (.Rmd)
Homework 8 due at 6:00 pm on Wednesday, October 31
Homework 9: canceled (no new assignment this week)
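A minimal sketch of the Monte Carlo method from lecture 17: estimating a probability by simulation. (This particular probability can also be found exactly, which makes it a good check.)

    sims <- replicate(1e4, max(rnorm(10)))  # simulate the max of 10 standard Gaussians, many times
    mean(sims > 2)                          # Monte Carlo estimate of P(max > 2)
    1 - pnorm(2)^10                         # the exact answer, for comparison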
November 6 (Tuesday): Lecture 18, Simulation for Inference I: The Bootstrap
The bootstrap principle: approximating the sample distribution by simulating a good estimate of the data-generating distribution. Uncertainty via model-based bootstraps. Uncertainty via resampling bootstraps for time series and for spatial processes. Related ideas: "surrogate data" tests of null hypotheses; ensemble forecasts. (A short R sketch follows this entry.)
Slides (.Rmd)
Reading: Shalizi, "The Bootstrap", American Scientist 98:3 (May-June 2010), 186--190
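A minimal sketch of a model-based bootstrap for a time series, as in lecture 18; the AR(1) model, sample size, and number of replicates are all made up for illustration.

    x <- arima.sim(model = list(ar = 0.5), n = 200)  # toy data
    a.hat <- ar(x, aic = FALSE, order.max = 1)$ar    # estimated AR(1) coefficient
    boot <- replicate(500, {
      x.star <- arima.sim(model = list(ar = a.hat), n = 200)  # simulate from the fitted model
      ar(x.star, aic = FALSE, order.max = 1)$ar               # re-estimate on the simulation
    })
    sd(boot)  # bootstrap standard error for a.hat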
November 8 (Thursday): Lecture 19, Simulation for Inference II: Matching Simulations to Data
Reminder about estimation in general. The method of moments. The method of simulated moments. Indirect inference. Some asymptotics.
Slides (.Rmd)
Homework 9: canceled (it would have been due at 6:00 pm on Wednesday, November 7)
Homework 10: assignment
November 13 (Tuesday): Lecture 20, Markov Chains I
Markov chains and the Markov property. Examples. Basic properties of Markov chains; special kinds of chain. Yet more fun with eigenvalues and eigenvectors. How one trajectory evolves vs. how a population evolves. Ergodicity and central limit theorems. Markov chain Monte Carlo. Higher-order Markov chains and related models. (A short R sketch follows this entry.)
Slides (.Rmd)
Reading: Guttorp, chapter 2, sections 2.1--2.6 (inclusive)
Optional reading: Handouts on "Monte Carlo and Markov Chains" (especially Section 2), and "Markov Chain Monte Carlo" from stat. computing 2013
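A minimal sketch of lecture 20's eigenvalue material: the stationary distribution of a made-up two-state chain, checked against a long simulated trajectory.

    P <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)  # made-up transition matrix
    p.stat <- eigen(t(P))$vectors[, 1]  # leading left eigenvector of P
    p.stat <- p.stat / sum(p.stat)      # normalized to a probability distribution
    n <- 1e4; s <- numeric(n); s[1] <- 1
    for (i in 2:n) s[i] <- sample(1:2, 1, prob = P[s[i - 1], ])  # simulate a trajectory
    rbind(theory = p.stat, simulation = as.vector(table(s) / n)) # long-run occupancy, two ways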
November 15 (Thursday): Lecture 21, Compartment Models
General idea of compartment models as a special kind of Markov model. Applications in demography, epidemiology, sociology, chemistry, etc.
Notes; .Rnw source file for the notes
Homework 10 due at 6:00 pm on Wednesday, November 14
Homework 11: assignment; ckm_nodes.csv data file; ckm_network data file (only needed for the extra credit)
November 20 (Tuesday): Information Theory and Optimal Prediction
Because this is Thanksgiving week, this is an optional special-topics lecture.
Information theory: entropy, mutual information, entropy rate, information rate. Measuring prediction quality with entropy. Mathematical construction of the prediction process. Optimality and Markov properties. Sketch of how to make this work on data.
Slides
Homework 11 due at 6:00 pm on Tuesday, 20 November
No new assignment this week --- enjoy Thanksgiving!
November 27 (Tuesday): Lecture 22, Markov Chains II
Likelihood inference for individual trajectories. Least-squares inference for population data. Conditional density estimates for continuous spaces. Model-checking.
Reading: Guttorp, chapter 2, sections 2.7--2.9 (inclusive)
Slides (.Rmd)
Optional reading: Maximum Likelihood Estimation for Markov Chains handout from the (no longer taught) 36-462, 2009 (its notation differs slightly from ours)
Homework 12: assignment, dicty-seq-1.dat and dicty-seq-2.dat data files
November 29 (Thursday): Lecture 23, Markov Random Fields
Markov models in space. Applications: ecology; image analysis. Spatio-temporal Markov models: general idea; cellular automata; interacting particle systems. Inference.
Slides (.Rmd)
Reading: Guttorp, chapter 4, omitting section 4.6, and skimming section 4.3
December 4 (Tuesday): Lecture 24, State-Space or Hidden-Markov Models
Markov dynamics + distorting or noisy observations = Non-Markov observations. Model formulation. Inference: E-M algorithm, Kalman filter, particle filter, simulation-based methods. Spatio-temporal version: dynamic factor models.
Notes (.Rmd)
Reading: Guttorp, section 2.12
December 6 (Thursday): Lecture 25, Nonlinear Models
Using smoothing to estimate regression functions. Nonlinear autoregressions. Examples. R implementations. (A short R sketch follows this entry.)
Homework 12 due at 6:00 pm on Wednesday, December 5
Final exam: Assignment, sial.csv
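A minimal sketch of lecture 25's idea of estimating a possibly-nonlinear autoregression by smoothing the series against its own past; the series here is simulated.

    x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))  # stand-in time series
    now <- x[-1]; past <- x[-length(x)]      # pair each x[t] with x[t-1]
    fit <- loess(now ~ past)                 # smooth estimate of E[x_t | x_{t-1}]
    plot(past, now, col = "grey")
    o <- order(past)
    lines(past[o], fitted(fit)[o], lwd = 2)  # the estimated autoregression function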
December 14 (Friday): Final exam due at 10:30 am

Image credit: Pictures on this page are from my teacher David Griffeath's Particle Soup Kitchen website, except for Umberto Boccioni's Riot in the Galleria.