Data over Space and Time

Data over Space and Time (36-467/667)

Fall 2020

Cosma Shalizi
Tuesdays and Thursdays, 9:50--11:10, online only

This course is an introduction to the opportunities and challenges of analyzing data from processes unfolding over space and time. It will cover basic descriptive statistics for spatial and temporal patterns; linear methods for interpolating, extrapolating, and smoothing spatio-temporal data; basic nonlinear modeling; and statistical inference with dependent observations. Class work will combine practical exercises in R, some mathematics of the underlying theory, and case studies analyzing real data from various fields (history, meteorology, climatology, ecology, demography, etc.). Depending on available time and class interest, additional topics may include: statistics of Markov and hidden-Markov (state-space) models; statistics of point processes; simulation and simulation-based inference; agent-based modeling; dynamical systems theory.

This webpage will serve as the class syllabus. Course materials (notes, homework assignments, etc.) will be linked to from here, as available.

Undergraduates must register for the course as 36-467; graduate students must register for it as 36-667. If the system does let you register for the wrong section, you'll be dropped from the roster.

Pre-requisite: For undergraduates taking the course as 36-467, 36-401 with a grade of C or higher. For graduate students taking the course as 36-667, there are formally no pre-requisites, but you really will need to know how to do linear regression, both in theory (as taught in 401) and on real data using R (as also taught in 401), and the mathematics that forms 401's pre-requisites (linear algebra, calculus in multiple variables, probability, mathematical statistics). If you're not sure whether you're ready for 467, ask me!

Instructors

Professor            Dr. Cosma Shalizi       cshalizi [at] cmu [dot] edu
Teaching Assistants  Mr. Raghav Bansal       not to be bothered by e-mail
                     Mr. Mateo Dulce Rubio

Goals and Learning Outcomes

(Accreditation officials look here)

The goal of this class is to train you in using statistical models to analyze interdependent data spread out over space, time, or both, using the models as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory of statistical inference for independent data taught in 36-226, and complement the theory and applications of the linear model, introduced in 36-401. After taking the class, when you're faced with a new temporal, spatial, or spatio-temporal data-analysis problem, you should be able to (1) describe the statistical challenges the problem presents, (2) select appropriate methods, (3) use statistical software to implement those methods, (4) critically evaluate the resulting statistical models, and (5) communicate the results of your analyses to collaborators and to non-statisticians.

Topics Covered

This class will not give much coverage to ARIMA models of time series, a subject treated extensively in 36-618.

Course Mechanics

Lectures and Remote-Only Instruction

Lectures will be used to amplify the readings, provide examples and demos, and answer questions and generally discuss the material. You will usually find the readings more rewarding if you do them before lecture, rather than after (or during). Since this is an online-only class this semester, lectures will be held via Zoom; the link for each session will be on Canvas. I know that the class time is late at night or early in the morning for many of you; I nonetheless urge you to come to class and participate.

No Recordings: I will not be recording lectures. This is because the value of class meetings lies precisely in your chance to ask questions, discuss, and generally interact. (Otherwise, you could just read a book.) Recordings interfere with this in two ways:

  1. They tempt you to skip class and/or to zone out and/or try to multi-task during it. (Nobody is really any good at multi-tasking.) Even if you do watch the recording later, you will not learn as much from it as if you had attended in the first place.
  2. People are understandably reluctant to participate when they know they're being recorded. (It's only too easy to manipulate recordings to make anyone seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother me, much, because I'm protected by academic freedom and by tenure, but a good proportion of your classmates won't participate if they're being recorded, and that diminishes the value of the class for everyone.

Recording someone without their permission is illegal in many places, and more importantly is unethical everywhere, so don't make your own recordings of the class.

(Taking notes during class is fine and I strongly encourage it; taking notes forces you to think about what you are hearing and how to organize it, which helps you understand and remember the content.)

Textbooks

The only required textbook is
Gidon Eshel, Spatiotemporal Data Analysis (Princeton, New Jersey: Princeton University Press, 2011, ISBN 978-0-691-12891-7, available on JSTOR).
The CMU library has electronic access to the full text, in PDF, through the JSTOR service. (You will need to either be on campus, or logged in to the university library.) Links to individual chapters will be posted as appropriate.

In addition, we will assign some sections from

Peter Guttorp, Stochastic Modeling of Scientific Data (Boca Raton, Florida: Chapman & Hall / CRC Press, 1995. ISBN 978-0-412-99281-0).
Because this book is expensive, the library doesn't have electronic access, and a lot of it is about (interesting and important) topics outside the scope of the class, it is not required. Instead, scans of the appropriate sections will be distributed via Canvas. (I am working on getting electronic access through the library, and will update students if I succeed.)

You will also be doing a lot of computational work in R, so

Paul Teetor, The R Cookbook (O'Reilly Media, 2011, ISBN 978-0-596-80915-7)
is recommended. R's help files answer "What does command X do?" questions. This book is organized to answer "What commands do I use to do Y?" questions.

Assignments

There are three reasons you will get assignments in this course. In order of decreasing importance:
  1. Practice. Practice is essential to developing the skills you are learning in this class. It also actually helps you learn, because some things which seem murky clarify when you actually do them, and sometimes trying to do something shows what you only thought you understood.
  2. Feedback. By seeing what you can and cannot do, and what comes easily and what you struggle with, I can help you learn better, by giving advice and, if need be, adjusting the course.
  3. Evaluation. The university is, in the end, going to stake its reputation (and that of its faculty) on assuring the world that you have mastered the skills and learned the material that goes with your degree. Before doing that, it requires an assessment of how well you have, in fact, mastered the material and skills being taught in this course.

To serve these goals, there will be two kinds of assignment in this course.

After-class comprehension questions and exercises
Following every lecture, there will be a brief set of questions about the material covered in lecture. Sometimes these will be about specific points in the lecture, sometimes about specific aspects of the reading assigned to go with the lecture. These will be done on Canvas and will be due the day after each lecture. These should take no more than 10 minutes, but will be untimed (so no accommodations for extra time are necessary). If the questions ask you to do any math (and not all of them will!), a scan or photograph of hand-written math is OK, so long as the picture is clearly legible. (Black ink or dark pencil on white paper helps.)
Homework
Most weeks will have a homework assignment, divided into a series of questions or problems. These will have a common theme, and will usually build on each other, but different problems may involve statistical theory, analyzing real data sets on the computer, and communicating the results.
All homework will be submitted electronically through Gradescope/Canvas. Most weeks, homework will be due at 6:00 pm on Thursdays (Pittsburgh time). There will be a few weeks, clearly noted on the syllabus and on the assignments, when this won't be the case. When this results in less than seven days between an assignment's due date and the previous due date, the homework will be shortened.
There are specific formatting requirements for homework --- see below.

Time Expectations

You should expect to spend 5--7 hours on assignments every week, averaging over the semester. (This follows from the university's rules about how course credits translate into hours of student time.) If you find yourself spending significantly more time than that on the class, please come to talk to me.

Grading

Grades will be broken down as follows:

Grade boundaries will be as follows:
A [90, 100]
B [80, 90)
C [70, 80)
D [60, 70)
R < 60

To be fair to everyone, these boundaries will be held to strictly.

Grade changes and regrading: If you think that a particular assignment was wrongly graded, tell me as soon as possible. Direct any questions or complaints about your grades to me; the teaching assistants have no authority to make changes. (This also goes for your final letter grade.) Complaints that the thresholds for letter grades are unfair, that you deserve a higher grade, etc., will accomplish much less than pointing to concrete problems in the grading of specific assignments. As a final word of advice about grading, "what is the least amount of work I need to do in order to get the grade I want?" is a much worse way to approach higher education than "how can I learn the most from this class and from my teachers?".

Office Hours

For this semester, Zoom office hours will be times when I am available to answer questions in a Zoom chat. Piazza office hours will be times when an instructor will be logged in and answering questions on Piazza --- so if you want or need to have a back-and-forth, someone will be available.

Instructor        Day         Time (Pittsburgh)   Venue
Mr. Dulce Rubio   Mondays     9:30--10:30 am      Piazza
Mr. Bansal        Tuesdays    2:00--3:00 pm       Piazza
Prof. Shalizi     Wednesdays  2:00--3:00 pm       Zoom
Prof. Shalizi     Thursdays   2:00--3:00 pm       Piazza

If you cannot make the regular office hours, or have concerns you'd rather discuss privately (e.g., grades), please e-mail me about making an appointment to meet by Zoom.

R, R Markdown, and Reproducibility

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

(There are plenty of other perfectly good computing systems for data analysis --- I learned to do it using Fortran and C, so help me --- but a uniform language is a lot easier to grade, and statisticians have self-organized on R as a standard.)

Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.

All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code. For help on using R Markdown, see "Using R Markdown for Class Reports".
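One place where "get the same results" commonly breaks down is random number generation. As a minimal sketch in R (the seed value and variable names are arbitrary illustrations, not course requirements), fixing the seed with set.seed() before drawing random numbers makes the draws repeat exactly on a re-run:

```r
# Illustrative only: the seed (467) and the variable names are arbitrary choices.
set.seed(467)
first_draw <- rnorm(5)    # five standard-normal draws

set.seed(467)             # resetting the seed restarts the generator's sequence
second_draw <- rnorm(5)

# Check whether the two vectors match exactly, i.e. whether a re-run
# reproduces the same numbers.
print(identical(first_draw, second_draw))
```

Setting the seed once near the top of a document is usually enough; whether an analysis needs it at all depends on whether it uses simulation or resampling.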

Format Requirements for Homework

For each assignment, you should submit a single PDF file, which is the "knitted", humanly-readable document generated by your R Markdown source file. That source file should integrate all of your text, and the R code to generate all of your numerical results, figures and tables.

Some problems in the homework will require you to do math. R Markdown provides a simple but powerful system for type-setting math. (It's based on the LaTeX document-preparation system widely used in the sciences.) If you can't get it to work, you can hand-write the math and include scans or photos of your writing in the appropriate places in your R Markdown document. You will, however, lose points for doing so, starting with no penalty for homework 1, and growing to a 90% penalty (for those problems) by the final homework. For help on this aspect of using R Markdown, see "Using R Markdown for Class Reports".
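As a rough illustration of what the math syntax looks like in an R Markdown source file (the formula is just a familiar one from linear regression, picked as an example and not taken from any assignment): single dollar signs set math inline with the text, and double dollar signs set it on a line of its own.

```latex
The least-squares coefficients are $\hat{\beta} = (X^T X)^{-1} X^T y$.

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$
```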

Every week, I will randomly select some students and ask you to send me your R Markdown file. You will lose points if your R Markdown file does not, in fact, generate your knitted file (making the obvious allowances for random numbers, etc.). You should expect to be picked for this about once in the semester, but since it will be random sampling with replacement, you may be asked for your R Markdown more than once.
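To see why "about once" still leaves room for being asked several times, here is a back-of-the-envelope calculation. It assumes, hypothetically, that selections are independent across weeks, that there are $W$ weeks with homework, and that your chance of being picked in any one week is $1/W$; none of these specifics is stated above. Under those assumptions the number of times you are picked, $N$, is binomial, and the approximations below hold when $W$ is reasonably large:

```latex
\begin{align*}
N &\sim \mathrm{Binomial}(W, 1/W), & \mathbb{E}[N] &= W \cdot \frac{1}{W} = 1 \\
\Pr(N = 0) &= \left(1 - \frac{1}{W}\right)^{W} \approx e^{-1} \approx 0.37 \\
\Pr(N = 1) &= W \cdot \frac{1}{W} \left(1 - \frac{1}{W}\right)^{W-1} \approx e^{-1} \approx 0.37 \\
\Pr(N \geq 2) &= 1 - \Pr(N = 0) - \Pr(N = 1) \approx 1 - 2e^{-1} \approx 0.26
\end{align*}
```

So, under these assumed numbers, being asked more than once is not unusual, and neither is never being asked.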

Canvas and Piazza

Homework will be submitted electronically through Gradescope/Canvas. Canvas will also be used for the after-class questions, as a calendar showing all assignments and their due-dates, to distribute some readings, and as the official gradebook.

We will be using the Piazza website for question-answering. You will receive an invitation within the first week of class. Anonymous-to-other-students posting of questions and replies will be allowed, at least initially. Anonymity will go away for everyone if it is abused. During Piazza office hours, someone will be online to respond to questions (and follow-ups) in real time. You are welcome to post at any time, but outside of normal working hours you should expect that the instructors have lives.

Materials from Previous Versions of the Course

Public materials (lecture slides and notes, homework assignments and data files but not solutions, etc.) from other semesters I've taught this course can be found here. You're welcome to look at them, but lectures and assignments will change.

Collaboration, Cheating and Plagiarism

Except for explicit group exercises, everything you turn in for a grade must be your own work, or a clearly acknowledged borrowing from an approved source; this includes all mathematical derivations, computer code and output, figures, and text. Any use of permitted sources must be clearly acknowledged in your work, with citations letting the reader verify your source. You are free to consult the textbooks and recommended class texts, lecture slides and demos, any resources provided through the class website, solutions provided to this semester's previous assignments in this course, books and papers in the library, or legitimate online resources, though again, all use of these sources must be acknowledged in your work. (Websites which compile course materials are not legitimate online resources.)

In general, you are free to discuss homework with other students in the class, though not to share or compare work; such conversations must be acknowledged in your assignments. You may not discuss the content of assignments with anyone other than current students, the instructors, or your teachers in other current classes at CMU, until after the assignments are due. (Exceptions can be made, with prior permission, for approved tutors.) You are, naturally, free to complain, in general terms, about any aspect of the course, to whomever you like.

Any use of solutions provided for any assignment in this course, or in other courses, in previous semesters is strictly prohibited. This prohibition applies even to students who are re-taking the course. Do not copy the old solutions (in whole or in part), do not "consult" them, do not read them, do not ask your friend who took the course last year if they "happen to remember" or "can give you a hint". Doing any of these things, or anything like these things, is cheating, it is easily detected cheating, and those who thought they could get away with it in the past have failed the course. Even more importantly: doing any of those things means that the assignment doesn't give you a chance to practice; it makes any feedback you get meaningless; and of course it makes any evaluation based on that assignment unfair.

If you are unsure about what is or is not appropriate, please ask me before submitting anything; there will never be a penalty for asking. If you do violate these policies but then think better of it, it is your responsibility to tell me as soon as possible to discuss how to rectify matters. Otherwise, violations of any sort will lead to severe, formal disciplinary action, under the terms of the university's policy on academic integrity.

On the first day of class, every student will receive a written copy of the university's policy on academic integrity, a written copy of these course policies, and a "homework 0" on the content of these policies. This assignment will not factor into your grade, but you must complete it before you can get any credit for any other assignment.

Accommodations for Students with Disabilities

The Office of Equal Opportunity Services provides support services for both physically disabled and learning disabled students. For individualized academic adjustment based on a documented disability, contact Equal Opportunity Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate with me.

Inclusion and Respectful Participation

The university is a community of scholars, that is, of people seeking knowledge. All of our accumulated knowledge has to be re-learned by every new generation of scholars, and re-tested, which requires debate and discussion. Everyone enrolled in the course has a right to participate in the class discussions. This doesn't mean that everything everyone says is equally correct or equally important, but does mean that everyone needs to be treated with respect as persons, and criticism and debate should be directed at ideas and not at people. Don't dismiss (or enhance) anyone in the course because of where they come from, and don't use your participation in the class as a way of shutting up others. (Don't be rude, and don't go looking for things to be offended by.) While methods for spatio-temporal data analysis don't usually lead to heated debate, some of the subjects we'll be applying them to might. If someone else is saying something you think is really wrong-headed, and you think it's important to correct it, address why it doesn't make sense, and listen if they give a counter-argument.

The classroom is not a democracy; as the teacher, I have the right and the responsibility to guide the discussion in what I judge are productive directions. This may include shutting down discussions which are not helping us learn about statistics, even if those discussions are important to have elsewhere. I will do my best to guide the course in a way which respects everyone's dignity as a human being and as a member of the university.

Schedule

Lecture slides and/or notes will be linked to here after each class. This page will also give links to assignments and data files for homeworks (though they will also be linked to on Canvas).

Possible changes: Topics, and the order in which we cover topics, may change. The material up to the beginning of November is unlikely to alter, but the more advanced, and more miscellaneous, topics after that are less set, and will depend on students' expressed interests, what I can find good examples for, whether we need to make up for disruptions earlier in the semester, etc. I will give as much warning of any changes as I can.

Readings: Please do these before coming to class, if you possibly can --- things will make more sense if you do! (Obviously this doesn't apply to the first class.) Readings marked with a star (*) are optional, either because they're more advanced, or longer, or tangential. Readings marked with multiple stars are especially optional. There are currently some lectures for which I haven't fixed on readings, marked with "TBD"; these will be made specific at least three days before class.

Lecture 1 (Tuesday, 1 September): Introduction to the course

Lecture 2 (Thursday, 3 September): Graphics and Exploratory Analyses

Lecture 3 (Tuesday, 8 September): Smoothing, Trends, Detrending I

Lecture 4 (Thursday, 10 September): Smoothing, Trends, Detrending II

  • Slides (.Rmd source)
  • Homework:

Lecture 5 (Tuesday, 15 September): Principal Components I

Lecture 6 (Thursday, 17 September): Principal Components II

Lecture 7 (Tuesday, 22 September): Optimal Linear Prediction

Lecture 8 (Thursday, 24 September): Linear Interpolation and Extrapolation of Time Series

Lecture 9 (Tuesday, 29 September): Optimal Linear Prediction for Spatial and Spatio-Temporal Data

Lecture 10 (Thursday, 1 October): Separating Signal and Noise with Linear Methods

Lecture 11 (Tuesday, 6 October): Linear Generative Models for Time Series

Lecture 12 (Thursday, 8 October): Linear Generative Models for Spatial and Spatio-Temporal Data

Lecture 13 (Tuesday, 13 October): Statistical Inference with Dependent Data I: Really Understanding Inference with Independent Data

Lecture 14 (Thursday, 15 October): Inference with Dependent Data II

Lecture 15 (Tuesday, 20 October): Simulation

Lecture 16 (Thursday, 22 October): Simulation for Inference I: The Bootstrap

Lecture 17 (Tuesday, 27 October): Simulation for Inference II: Matching Simulations to Data

Lecture 18 (Thursday, 29 October): Markov Chains I

Election Day (Tuesday, 3 November): NO CLASS

Lecture 19 (Thursday, 5 November): Markov Chains II

Lecture 20 (Tuesday, 10 November): Epidemic Models

Lecture 21 (Thursday, 12 November): Compartment Models

Lecture 22 (Tuesday, 17 November): Markov Random Fields

Lecture 23 (Thursday, 19 November): State-Space or Hidden-Markov Models

NO CLASS on Tuesday, 24 November

Thanksgiving Day (Thursday, 26 November): NO CLASS

Lecture 24 (Tuesday, 1 December): Nonlinear Prediction I: Model-Agnostic Predictions

Lecture 25 (Thursday, 3 December): Nonlinear Prediction II: Model-Reliant Predictions

Lecture 26 (Tuesday, 8 December): Regressions with Dependent Observations

Lecture 27 (Thursday, 10 December): Causal Inference over Time

Image credit: Pictures on this page are from my teacher David Griffeath's Particle Soup Kitchen website, except for Umberto Boccioni's Riot in the Galleria.