Data over Space and Time (36-467/667)
Fall 2020
Cosma Shalizi
Tuesdays and Thursdays, 9:50--11:10, online only
This course is an introduction to the opportunities and challenges of
analyzing data from processes unfolding over space and time. It will cover
basic descriptive statistics for spatial and temporal patterns; linear methods
for interpolating, extrapolating, and smoothing spatio-temporal data; basic
nonlinear modeling; and statistical inference with dependent observations.
Class work will combine practical exercises in R, some mathematics of the
underlying theory, and case studies analyzing real data from various fields
(history, meteorology, climatology, ecology, demography, etc.). Depending on
available time and class interest, additional topics may include: statistics of
Markov and hidden-Markov (state-space) models; statistics of point processes;
simulation and simulation-based inference; agent-based modeling; dynamical
systems theory.
This webpage will serve as the class syllabus. Course materials (notes,
homework assignments, etc.) will be linked to from here, as available.
Undergraduates must register for the course as 36-467; graduate students
must register for it as 36-667. If the system does let you register for the
wrong section, you'll be dropped from the roster.
Pre-requisite: For undergraduates taking the course as
36-467, 36-401 with a
grade of C or higher. For graduate students taking the course as 36-667, there
are formally no pre-requisites, but you really will need to know how to do
linear regression, both in theory (as taught in 401) and on real data
using R (as
also taught in 401), and the mathematics that forms 401's pre-requisites
(linear algebra, calculus in multiple variables, probability, mathematical
statistics). If you're not sure whether you're ready for 467, ask me!
Instructors
Professor | Dr. Cosma Shalizi | cshalizi [at] cmu [dot] edu |
Teaching Assistants | Mr. Raghav Bansal | not to be bothered by e-mail |
| Mr. Mateo Dulce Rubio | |
Goals and Learning Outcomes
(Accreditation officials look here)
The goal of this class is to train you in using statistical models to
analyze interdependent data spread out over space, time, or both,
using the models as data summaries, as predictive instruments, and as tools for
scientific inference. We will build on the theory of statistical inference for
independent data taught in 36-226, and complement the theory and applications
of the linear model, introduced
in 36-401. After taking
the class, when you're faced with a new temporal, spatial, or spatio-temporal
data-analysis problem, you should be able to (1) describe the
statistical challenges the problem presents, (2) select appropriate
methods, (3) use statistical software to implement those methods,
(4) critically evaluate the resulting statistical models, and (5)
communicate the results of your analyses to collaborators and to
non-statisticians.
Topics Covered
- Exploratory data analysis for temporal and spatial data:
Graphics; levels vs. rates, stocks vs. flows; smoothing; trends vs. fluctuations, detrending; auto- and cross- covariances;
nonlinear association measures
- Optimal linear prediction and its uses: Theory of optimal
linear prediction; prediction for interpolation, extrapolation, and noise
removal; "Wiener filter"; "kriging"; estimating optimal linear predictors
- Inference with dependent data: Statistical estimation
with dependent data; ergodic properties; the bootstrap; simulation-based
inference
- Generative models: Linear autoregressive models for time
series and spatial processes; Markov chains and Markov processes; compartment
(especially epidemic) models; state-space or hidden-Markov models;
nonparametric, nonlinear autoregressions; Markov random fields
- Possible advanced or supplementary topics:
Longitudinal/panel data analysis, and regressions with dependent observations;
Fourier methods; point processes; nonlinear dynamical systems theory and chaos;
cellular automata and interacting particle systems; agent-based modeling;
causal inference across time series; stochastic differential equations; optimal
nonlinear prediction.
This class will not give much coverage to ARIMA models of time
series, a subject treated extensively in 36-618.
Course Mechanics
Lectures will be used to amplify the readings, provide examples and demos, and
answer questions and generally discuss the material. You will usually find the
readings more rewarding if you do them before lecture, rather
than after (or during). Since this is an online-only class this semester,
lectures will be held via Zoom; the link for each session will be on Canvas. I
know that the class time is late at night or early in the morning for many of
you; I nonetheless urge you to come to class and participate.
No Recordings: I will not be recording lectures.
This is because the value of class meetings lies precisely in your chance to
ask questions, discuss, and generally interact. (Otherwise, you could just
read a book.) Recordings interfere with this in two ways:
- They tempt you to skip class and/or to zone out and/or try to multi-task
during it. (Nobody is really any good at multi-tasking.) Even if
you do watch the recording later, you will not learn as much from it
as if you had attended in the first place.
- People are understandably reluctant to participate when they know they're
being recorded. (It's only too easy to manipulate recordings to make anyone
seem dumb and/or obnoxious.) Maybe this doesn't bother you; it doesn't bother
me, much, because I'm protected by academic freedom and by tenure, but a good
proportion of your classmates won't participate if they're being recorded,
and that diminishes the value of the class for everyone.
Recording someone without their permission is illegal in many places, and
more importantly is unethical everywhere, so don't make your own recordings
of the class.
(Taking notes during class is fine and I strongly encourage it; taking notes
forces you to think about what you are hearing and how to organize it, which
helps you understand and remember the content.)
Textbooks
The only required textbook is
Gidon
Eshel, Spatiotemporal
Data Analysis (Princeton, New Jersey: Princeton University Press,
2011, ISBN 978-0-691-12891-7, available
on JSTOR).
The CMU library
has electronic access to the
full text, in PDF, through the JSTOR service. (You will need to either be
on campus, or logged in to the university library.) Links to individual
chapters will be posted as appropriate.
In addition, we will assign some sections from
Peter Guttorp, Stochastic Modeling of Scientific Data
(Boca Raton, Florida: Chapman & Hall / CRC Press, 1995. ISBN
978-0-412-99281-0).
Because this book is expensive, the library doesn't have electronic access, and
a lot of it is about (interesting and important) topics outside the scope of
the class, it is not required. Instead, scans of the appropriate sections will
be distributed via Canvas. (I am working on getting electronic access
through the library, and will update students if I succeed.)
You will also be doing a lot of computational work in R, so
Paul Teetor, The R Cookbook
(O'Reilly Media, 2011,
ISBN 978-0-596-80915-7)
is recommended. R's help files answer "What does command X
do?" questions. This book is organized to answer "What commands do I use to do
Y?" questions.
Assignments
There are three reasons you will get assignments in this course. In order of
decreasing importance:
- Practice. Practice is essential to developing the skills you are
learning in this class. It also actually helps you learn, because some things
which seem murky clarify when you actually do them, and sometimes trying to do
something shows what you only thought you understood.
- Feedback. By seeing what you can and cannot do, and what comes
easily and what you struggle with, I can help you learn better, by giving
advice and, if need be, adjusting the course.
- Evaluation. The university is, in the end, going to stake its
reputation (and that of its faculty) on assuring the world that you have
mastered the skills and learned the material that goes with your degree.
Before doing that, it requires an assessment of how well you have, in fact,
mastered the material and skills being taught in this course.
To serve these goals, there will be two kinds of assignment in this
course.
- After-class comprehension questions and exercises
- Following every lecture, there will be a brief set of questions
about the material covered in lecture. Sometimes these will be about specific
points in the lecture, sometimes about specific aspects of the reading assigned
to go with the lecture. These will be done on Canvas and will be due the day
after each lecture. These should take no more than 10 minutes, but will be
untimed (so no accommodations for extra time are necessary). If the questions
ask you to do any math (and not all of them will!), a scan or photograph of
hand-written math is OK, so long as the picture is clearly legible. (Black ink
or dark pencil on white paper helps.)
- Homework
- Most weeks will have a homework assignment, divided into a series of
questions or problems. These will have a common theme, and will usually build
on each other, but different problems may involve statistical theory, analyzing
real data sets on the computer, and communicating the results.
- All homework will be submitted electronically through Gradescope/Canvas.
Most weeks, homework will be due at 6:00 pm on Thursdays
(Pittsburgh time). There will be a few weeks, clearly noted on the syllabus
and on the assignments, when this won't be the case. When this results in less
than seven days between an assignment's due date and the previous due date, the
homework will be shortened.
- There are specific formatting requirements for
homework --- see below.
Time Expectations
You should expect to spend 5--7 hours on assignments every week, averaging over
the semester. (This follows from the university's rules about how course
credits translate into hours of student time.) If you find yourself spending
significantly more time than that on the class, please come to talk to me.
Grading
Grades will be broken down as follows:
- After-class questions: 10%. All sets of questions will
have equal weight. The lowest 4 will be dropped, no questions asked.
- Homework: 90%. All homeworks will have equal weight. Your lowest 3
homework grades will be dropped, no questions asked. If you turn in all
homework assignments on time, for a grade of at least 60% (each), your lowest four homework grades
will be dropped. Late homework will not be accepted for any
reason.
Grade boundaries will be as follows:
A | [90, 100] |
B | [80, 90) |
C | [70, 80) |
D | [60, 70) |
R | < 60 |
To be fair to everyone, these boundaries will be held to strictly.
Grade changes and regrading: If you think that a particular
assignment was wrongly graded, tell me as soon as possible. Direct any
questions or complaints about your grades to me; the teaching assistants have
no authority to make changes. (This also goes for your final letter grade.)
Complaints that the thresholds for letter grades are unfair, that you deserve a
higher grade, etc., will accomplish much less than pointing to concrete
problems in the grading of specific assignments.
As a final word of advice about grading, "what is the least amount of work I
need to do in order to get the grade I want?" is a much worse way to approach
higher education than "how can I learn the most from this class and from my
teachers?".
Office Hours
For this semester, Zoom office hours will be times when I am available to
answer questions in a Zoom
chat. Piazza office hours
will be times when an instructor will be logged in and answering questions on
Piazza --- so if you want or need to have a back-and-forth, someone will be
available.
Instructor | Day | Time (Pittsburgh) | Venue |
Mr. Dulce Rubio | Mondays | 9:30--10:30 am | Piazza |
Mr. Bansal | Tuesdays | 2:00--3:00 pm | Piazza |
Prof. Shalizi | Wednesdays | 2:00--3:00 pm | Zoom |
Prof. Shalizi | Thursdays | 2:00--3:00 pm | Piazza |
If you cannot make the regular office hours, or have concerns you'd rather
discuss privately (e.g., grades), please e-mail me about making an appointment to meet by Zoom.
R, R Markdown, and Reproducibility
R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before). No other form of computational work will be accepted. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.
(There are plenty of other perfectly good computing systems for data
analysis --- I learned to do it using Fortran and C, so help me --- but
a uniform language is a lot easier to grade, and statisticians have self-organized on R as a standard.)
Communicating your results to others is as important as getting good results in the first place. Every homework assignment will require you to write about that week's data analysis and what you learned from it; this writing is part of the assignment and will be graded. Raw computer output and R code are not acceptable; your document must be humanly readable.
All homework and exam assignments are to be written up in R Markdown. (If you know what knitr is and would rather use it, ask first.) R Markdown is a system that lets you embed R code, and its output, into a single document. This helps ensure that your work is reproducible, meaning that other people can re-do your analysis and get the same results. It also helps ensure that what you report in your text and figures really is the result of your code. For help on using R Markdown, see "Using R Markdown for Class Reports".
For each assignment, you should submit a single PDF file, which is the
"knitted", humanly-readable document generated by your R Markdown source file.
That source file should integrate all of your text, and the R code to generate
all of your numerical results, figures and tables.
Some problems in the homework will require you to do math. R Markdown
provides a simple but powerful system for type-setting math. (It's based on
the LaTeX document-preparation system widely used in the sciences.) If you
can't get it to work, you can hand-write the math and include scans or photos
of your writing in the appropriate places in your R Markdown document. You
will, however, lose points for doing so, starting with no penalty for homework
1, and growing to a 90% penalty (for those problems) by the final homework.
For help on this aspect of using R Markdown,
see "Using R Markdown
for Class Reports".
Every week, I will randomly
select some students and ask them to send me their R Markdown files. You will
lose points if your R Markdown file does not, in fact, generate your knitted
file (making the obvious allowances for random numbers, etc.). You should
expect to be picked for this about once in the semester, but since it will
be random sampling with replacement, you may be asked for your R Markdown
more than once.
Canvas and Piazza
Homework will be submitted electronically through Gradescope/Canvas. Canvas
will also be used for the after-class questions, as a calendar showing all
assignments and their due-dates, to distribute some readings, and as the
official gradebook.
We will be using the Piazza
website for question-answering. You will receive an invitation within the
first week of class. Anonymous-to-other-students posting of questions and
replies will be allowed, at least initially. Anonymity will go away for
everyone if it is abused. During Piazza office hours, someone will be online
to respond to questions (and follow-ups) in real time. You are welcome to post
at any time, but outside of normal working hours you should expect that the
instructors have lives.
Materials from Previous Versions of the Course
Public materials (lecture slides and notes, homework assignments and data
files but not solutions, etc.) from other semesters I've taught this course
can be found here. You're welcome to look at them, but
lectures and assignments will change.
Collaboration, Cheating and Plagiarism
Except for explicit group exercises,
everything you turn in for a grade must be your own work, or a clearly
acknowledged borrowing from an approved source; this includes all mathematical
derivations, computer code and output, figures, and text. Any use of permitted
sources must be clearly acknowledged in your work, with citations letting the
reader verify your source. You are free to consult the textbooks and
recommended class texts, lecture slides and demos, any resources provided
through the class website, solutions provided to this semester's
previous assignments in this course, books and papers in the library, or
legitimate online resources, though again, all use of these sources must be
acknowledged in your work. (Websites which compile course materials
are not legitimate online resources.)
In general, you are free to discuss homework with other students in the
class, though not to share or compare work; such conversations must be
acknowledged in your assignments. You may not discuss the content of
assignments with anyone other than current students, the instructors, or your
teachers in other current classes at CMU, until after the assignments are due.
(Exceptions can be made, with prior permission, for approved tutors.) You are,
naturally, free to complain, in general terms, about any aspect of the course,
to whomever you like.
Any use of solutions provided for any assignment in this course, or in other
courses, in previous semesters is strictly prohibited. This prohibition
applies even to students who are re-taking the course. Do not copy the old
solutions (in whole or in part), do not "consult" them, do not read them, do
not ask your friend who took the course last year if they "happen to remember"
or "can give you a hint". Doing any of these things, or anything like these
things, is cheating, it is easily detected cheating, and those who thought they
could get away with it in the past have failed the course. Even more
importantly: doing any of those things means that the
assignment doesn't give you a chance to practice; it makes any
feedback you get meaningless; and of course it makes any evaluation based on
that assignment unfair.
If you are unsure about what is or is not appropriate, please ask me before
submitting anything; there will never be a penalty for asking. If you do
violate these policies but then think better of it, it is your responsibility
to tell me as soon as possible to discuss how to rectify matters. Otherwise,
violations of any sort will lead to severe, formal disciplinary action, under
the terms of the university's
policy
on academic integrity.
On the first day of class, every student will receive a written copy of the
university's policy on academic integrity, a written copy of these course
policies, and a "homework 0" on the content of these policies. This assignment
will not factor into your grade, but you must complete it before you
can get any credit for any other assignment.
Accommodations for Students with Disabilities
The Office of Equal Opportunity Services provides support services for both
physically disabled and learning disabled students. For individualized
academic adjustment based on a documented disability, contact Equal Opportunity
Services at eos [at] andrew.cmu.edu or (412) 268-2012; they will coordinate
with me.
Inclusion and Respectful Participation
The university is a community of scholars, that is, of people seeking
knowledge. All of our accumulated knowledge has to be re-learned by every new
generation of scholars, and re-tested, which requires debate and discussion.
Everyone enrolled in the course has a right to participate in the class
discussions. This doesn't mean that everything everyone says is equally
correct or equally important, but does mean that everyone needs to be treated
with respect as persons, and criticism and debate should be directed
at ideas and not at people. Don't dismiss (or enhance) anyone in the course
because of where they come from, and don't use your participation in the class
as a way of shutting up others. (Don't be rude, and don't go looking for
things to be offended by.) While methods for spatio-temporal data analysis
don't usually lead to heated debate, some of the subjects we'll be
applying them to might. If someone else is saying something you think is
really wrong-headed, and you think it's important to correct it, address why it
doesn't make sense, and listen if they give a counter-argument.
The classroom is not a democracy; as the teacher, I have the right and the
responsibility to guide the discussion in what I judge are productive
directions. This may include shutting down discussions which are not helping
us learn about statistics, even if those discussions are important to have
elsewhere. I will do my best to guide the course in a way which respects
everyone's dignity as a human being and as a member of the university.
Schedule
Lecture slides and/or notes will be linked to here after each class.
This page will also give links to assignments and data files for homeworks
(though they will also be linked to on Canvas).
Possible changes: Topics, and the order in which we cover
topics, may change. The material up to the beginning of November is unlikely
to alter, but the more advanced, and more miscellaneous, topics after that are
less set, and will depend on students' expressed interests, what I can find
good examples for, whether we need to make up for disruptions earlier in the
semester, etc. I will give as much warning of any changes as I can.
Readings: Please do these before coming to class,
if you possibly can --- things will make more sense if you do! (Obviously this doesn't apply to the first class.) Readings marked with a star (*) are optional, either because they're more advanced, or longer, or tangential. Readings marked with multiple stars are especially optional. There are currently some lectures for which I haven't fixed on readings, marked
with "TBD"; these will be made specific at least three days before class.
Lecture 1 (Tuesday, 1 September):
Introduction to the course
Lecture 2 (Thursday, 3 September):
Graphics and Exploratory Analyses
- Kinds of data and basic EDA by way of pictures. Data for regions (areas, intervals, periods) vs. data at points in space
and time. Plotting over time: basic ideas and pitfalls. Tricks for plotting
over time: index numbers (relative magnitude), differencing (rate of change
over time) and summing (accumulation over time), calendar time vs. time
relative to some event. Scatterplots of successive values to get at
dynamics. Relationships between two variables: why scatterplots are better
than plots with two vertical axes. Plotting over space: maps; types of maps; a
little bit about map projections. Relationships between variables over space
(scatterplots are better again). Re-doing all our usual EDA (boxplots,
histograms, tables, etc.) by region in space and/or time to get at
variability.
- Reading:
- Handout on Stocks, flows, growth rates, etc. (.Rmd)
- Kieran Healy, "America's Ur-Choropleths", 12 June 2015
- Optional readings
- (*) Kieran Healy, Data Visualization: A Practical Introduction (Princeton: Princeton University Press, 2019), especially Chapter 7, "Draw Maps" [Prof. Healy's website for the book includes the full text in draft form and code for reproducing examples]
- (**) Whitney Battle-Baptiste and Britt Rusert (eds.), W. E. B. Du Bois's Data Portraits: Visualizing Black America: The Color Line at the Turn of the Twentieth Century (New York: Princeton Architectural Press, 2018) [About half of Du Bois's plots depict change over time, variation over space, or both. If you read both this and Healy's book, ask yourself how Healy would re-do Du Bois's plots.]
- (**) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662--1938, especially Part I [This is, among other things, an in-depth look at how different techniques for plotting time series emerged from "commercial arithmetic"]
- Slides (with notes on a few points that came up during lecture), R Markdown file used to make the slides
- Homework:
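For a rough idea of what the Lecture 2 plotting tricks (index numbers, differencing, summing) look like in R, here is a minimal sketch. The series and the variable names are made up for illustration; they are not course data.

```r
# Hypothetical yearly series (made-up numbers, for illustration only)
x <- c(100, 110, 121, 115, 130)

# Index numbers: magnitude relative to the first observation (base = 100)
index <- 100 * x / x[1]

# Differencing: change from one time point to the next (a flow from a stock)
change <- diff(x)

# Summing: accumulating the changes recovers the stock, given the starting level
recovered <- x[1] + c(0, cumsum(change))

print(index)
print(change)
print(recovered)
```

Plotting `index`, `change`, and `x` against time gives three different views of the same series; which one is most useful depends on whether you care about relative size, rate of change, or level.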
Lecture 3 (Tuesday, 8 September):
Smoothing, Trends, Detrending I
Lecture 4 (Thursday, 10 September):
Smoothing, Trends, Detrending II
- The influence matrix as the source of all knowledge. Residuals after
de-trending as estimates of the fluctuations. The Yule-Slutsky effect.
Picking how much to smooth by cross-validation. Special considerations for ratios (Kafadar).
- Reading:
- Eshel, chapter 8
- Optional reading:
- Slides (.Rmd source)
- Homework:
Lecture 5 (Tuesday, 15 September):
Principal Components I
- The goal of principal components: finding simpler, linear structure
in complicated, high-dimensional data. Math of principal components: linear
approximation -> preserving variance -> eigenproblem. Reminders from linear
algebra about eigenproblems. Mathematical solution to PCA. How to do PCA in
R.
- Reading: Eshel, chapter 4, and skim chapter 5
- Slides (.Rmd)
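As a hedged illustration of the "preserving variance -> eigenproblem" step from Lecture 5, the sketch below runs PCA two ways on simulated data and checks that they agree. The data-generating setup and all variable names are assumptions made for this example, not anything taken from the lecture.

```r
# Simulated data (an assumption for illustration): 200 observations of
# 3 variables, the first two strongly correlated through a shared factor z
set.seed(467)
n <- 200
z <- rnorm(n)
x <- cbind(z + rnorm(n, sd = 0.5),
           2 * z + rnorm(n, sd = 0.5),
           rnorm(n))

# PCA via the built-in prcomp(), which centers the columns by default
pca <- prcomp(x)

# The same thing as an eigenproblem: eigen-decompose the sample covariance
eig <- eigen(cov(x))

# Variances along the principal components equal the eigenvalues
print(pca$sdev^2)
print(eig$values)

# The component directions agree up to sign, so this should be all ~0
print(abs(pca$rotation) - abs(eig$vectors))
```

Either route gives the same components; `prcomp()` is usually preferable in practice because it works from a singular value decomposition of the centered data rather than forming the covariance matrix explicitly.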
Lecture 6 (Thursday, 17 September):
Principal Components II
- Brief recap on PCA. Applying PCA to
spatial data. Applying PCA to multiple time series. Applying PCA to spatio-temporal data. Interpreting PCA results.
Why PCA can be good exploratory analysis, but is not statistical inference.
- Reading: Eshel, chapter 11, sections 11.1--11.7 and 11.9--11.10 (i.e., skipping 11.8 and 11.11--11.12)
- Slides (.Rmd)
- Homework:
- Homework 2 due
- Homework 3: Assignment (with links to the data files)
Lecture 7 (Tuesday, 22 September):
Optimal Linear Prediction
- Mathematics of prediction. Mathematics of optimal linear
prediction, in any context whatsoever. Ordinary least squares as an estimator
of the optimal linear predictor. Why we need the covariance functions.
- Reading: Eshel, chapter 9, sections 9.1--9.3
- Slides (.Rmd)
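To make the Lecture 7 link between covariances and ordinary least squares concrete, here is a minimal sketch. It uses the standard fact that the optimal linear predictor of Y from X has slopes Var(X)^{-1} Cov(X, Y), and checks that plugging in sample covariances reproduces the `lm()` coefficients. The simulated data (deliberately nonlinear in the predictors) and the variable names are illustrative assumptions.

```r
# Simulated data (illustrative assumption): Y depends nonlinearly on X1, X2
set.seed(36401)
n <- 500
x <- cbind(x1 = rnorm(n), x2 = runif(n))
y <- sin(x[, "x1"]) + 2 * x[, "x2"]^2 + rnorm(n, sd = 0.3)

# Plug-in estimate of the optimal linear predictor's slopes:
# beta = Var(X)^{-1} Cov(X, Y)
beta <- solve(var(x), cov(x, y))

# Intercept: chosen so the predictor passes through the means
alpha <- mean(y) - colMeans(x) %*% beta

# Ordinary least squares gives the same coefficients
ols <- coef(lm(y ~ x))

print(c(alpha, beta))
print(ols)
```

Note that nothing here assumes the true regression function is linear; the covariances alone determine the best linear approximation.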
Lecture 8 (Thursday, 24 September):
Linear Interpolation and Extrapolation of Time Series
- Applying the linear-predictor idea to time series: interpolating
between observations; extrapolating into the future (or past). The concept of stationarity. Auto- and
cross- covariance. Covariance functions as EDA. Basic covariance estimation
in R. Removing trends; stationary fluctuations after detrending. Historical notes: Wiener and Kolmogorov.
- Reading: Eshel, chapter 9, section 9.5 (skipping 9.5.3 and 9.5.4)
- Slides (.Rmd, with comments on the sample code provided)
- Homework:
Lecture 9 (Tuesday, 29 September):
Optimal Linear Prediction for Spatial and Spatio-Temporal Data
- Applying the linear-predictor idea to data spread over space or over space
and time ("kriging"). The importance of estimating covariance between spatial
locations. Assumptions restricting the form of the covariance and so enabling
estimation: stationarity, isotropy, separability. Estimating parametric
covariance functions. Examples.
- Reading: No required reading
- Slides (.Rmd)
Lecture 10 (Thursday, 1 October):
Separating Signal and Noise with Linear Methods
- Observational noise:
using the linear-predictor idea to remove observational
noise, a.k.a. "the Wiener filter". The mysteriously-named "nugget effect" (accounting for measurement noise that's not auto-correlated).
Periodicity:
noticing periodicity from time series; from autocorrelation functions. Extracting periodic components with a known period by averaging. "Climate" and
"anomaly". Seasonal adjustment of time series.
- Reading:
- No required reading
- Optional reading:
- (****) Norbert Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time-Series: with Engineering Applications (Cambridge, Massachusetts: The Technology Press, 1949 [but originally published as a classified technical report, National Defense Research Committee, 1942])
- Slides (.Rmd)
- Homework:
Lecture 11 (Tuesday, 6 October):
Linear Generative Models for Time Series
- Linear generative models for random sequences: autoregressions.
Deterministic dynamical systems; more fun with eigenvalues and eigenvectors.
Stochastic aspects. Vector auto-regressions.
- Reading:
- Eshel, sections 9.5 and 9.7
- Optional reading:
- (**) Judy L. Klein, Statistical Visions in Time: A History of Time Series Analysis, 1662--1938, especially Part II [This
studies how linear regression, a method developed to adjust for differences across a population at a single time, came to be used to predict changes over time in a single quantity, which sounds weird when you put it that way]
- Slides (.Rmd)
- Handout: AR(p) vs. higher-dimensional VAR(1)
Lecture 12 (Thursday, 8 October):
Linear Generative Models for Spatial and Spatio-Temporal Data
- Simultaneous vs. conditional autoregressions for random fields.
The "Gibbs sampler" trick. Autoregressions for spatio-temporal processes.
- Reading: None
- Slides (.Rmd)
- Homework:
Lecture 13 (Tuesday, 13 October):
Statistical Inference with Dependent Data I: Really Understanding Inference with Independent Data
- Reminder: why maximum likelihood and Gaussian approximations work for IID
data. Consistency from convergence (law of large numbers); Gaussian
approximation from fluctuations (central limit theorem). The "sandwich
covariance" for general estimators. Looking ahead at how these ideas carry
over to dependent data.
- Reading: Guttorp, Appendix A
- Slides (.Rmd)
Lecture 14 (Thursday, 15 October):
Inference with Dependent Data II
- Ergodic theory, a.k.a. laws of large numbers for dependent data.
Basic ergodic theory for stochastic processes. Correlation times and effective
sample size. Inference with
autoregressions. Gestures at more advanced ergodic theory. Likelihood-based inference for dependent data.
- Slides (.Rmd)
- Homework:
- Homework 6 due
- Homework 7: Assignment. Note: Because I messed up posting this on time, this is now due at the same time as HW 8, and will be extra credit, replacing your lowest grade on the other homeworks.
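A small sketch of the "effective sample size" idea from Lecture 14, for the special case of a stationary AR(1) series. There the variance of the sample mean is inflated by roughly (1 + phi) / (1 - phi), where phi is the lag-one autocorrelation, so n dependent observations are worth about n (1 - phi) / (1 + phi) independent ones. The function name `effective_n`, the choice phi = 0.8, and the simulation settings are assumptions made for illustration.

```r
# Effective sample size for the sample mean of a stationary AR(1) series
# with lag-one autocorrelation phi: n_eff = n * (1 - phi) / (1 + phi)
effective_n <- function(n, phi) {
  n * (1 - phi) / (1 + phi)
}

# Simulate an AR(1) series (phi = 0.8 is an arbitrary illustrative choice)
set.seed(667)
n <- 1000
x <- arima.sim(model = list(ar = 0.8), n = n)

# Estimate phi by the sample autocorrelation at lag 1
# (the first element of $acf is lag 0, the second is lag 1)
phi_hat <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

print(phi_hat)
print(effective_n(n, phi_hat))  # far fewer than n independent-equivalent points
print(effective_n(n, 0))        # no autocorrelation: n_eff equals n
```

This formula is specific to AR(1)-type correlation decay; for other processes the inflation factor involves the sum of the whole autocorrelation function.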
Lecture 15 (Tuesday, 20 October):
Simulation
- General idea of simulating a statistical model. The "Monte Carlo
method": using simulation to compute probabilities, expected values, etc.
- Slides (.Rmd)
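As a minimal sketch of the Monte Carlo idea from Lecture 15, the code below estimates a probability by simulation and compares it to the exact value, which is available in this deliberately easy case. The particular event, the number of simulations, and the variable names are illustrative assumptions.

```r
# Monte Carlo estimate of P(X1 + X2 > 3) for independent standard Gaussians
set.seed(2020)
n_sims <- 1e5
s <- rnorm(n_sims) + rnorm(n_sims)
p_hat <- mean(s > 3)

# Standard error of the estimate, from the binomial variance
se <- sqrt(p_hat * (1 - p_hat) / n_sims)

# Exact answer for comparison: the sum is Gaussian with variance 2
p_exact <- pnorm(3, mean = 0, sd = sqrt(2), lower.tail = FALSE)

print(c(estimate = p_hat, std_error = se, exact = p_exact))
```

The point of the method is that the first half still works when no exact answer exists, and the standard error tells you how many simulations you need.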
Lecture 16 (Thursday, 22 October):
Simulation for Inference I: The Bootstrap
- The bootstrap principle: approximating the sampling distribution by
simulating a good estimate of the data-generating distribution. Uncertainty
via model-based bootstraps. Uncertainty via resampling bootstraps for time
series and for spatial processes. Related ideas: "surrogate data" tests of null
hypotheses; ensemble forecasts.
- Reading:
- Slides (.Rmd)
- Homework:
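Here is a rough sketch of a model-based bootstrap for a time series, in the spirit of Lecture 16: fit an AR(1) model, simulate new series from the fitted model, re-estimate on each, and use the spread of the re-estimates as a measure of uncertainty. The helper names `fit_ar1` and `sim_ar1`, the Gaussian innovations, and all the settings are assumptions made for this example, not the course's own code.

```r
# "Observed" data: simulated here so the example is self-contained
set.seed(16)
n <- 300
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))

# Estimate an AR(1) by least squares, regressing x[t] on x[t-1]
fit_ar1 <- function(series) {
  m <- length(series)
  fit <- lm(series[-1] ~ series[-m])
  list(intercept = unname(coef(fit)[1]),
       phi = unname(coef(fit)[2]),
       sigma = summary(fit)$sigma)
}

# Simulate one new series of length m from a fitted AR(1)
# (Gaussian innovations are an assumption of this sketch)
sim_ar1 <- function(est, m, start) {
  out <- numeric(m)
  out[1] <- start
  for (t in 2:m) {
    out[t] <- est$intercept + est$phi * out[t - 1] + rnorm(1, sd = est$sigma)
  }
  out
}

est <- fit_ar1(x)

# Re-estimate phi on 500 series simulated from the fitted model
boot_phis <- replicate(500, fit_ar1(sim_ar1(est, n, x[1]))$phi)

print(est$phi)
print(sd(boot_phis))                          # bootstrap standard error
print(quantile(boot_phis, c(0.025, 0.975)))   # spread of the re-estimates
```

A resampling (block) bootstrap would replace `sim_ar1` with a step that reshuffles blocks of the original series, trading the model assumption for an assumption about how quickly dependence decays.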
Lecture 17 (Tuesday, 27 October):
Simulation for Inference II: Matching Simulations to Data
- Reminder about estimation in general. The method of moments.
The method of simulated moments. "Indirect" inference: matching the parameters estimated from an "auxiliary" or "working" model. Some asymptotics.
- Optional reading:
- Slides (.Rmd)
Lecture 18 (Thursday, 29 October):
Markov Chains I
- Markov chains and the Markov property. Examples. Basic properties
of Markov chains; special kinds of chain. Yet more fun with eigenvalues
and eigenvectors. How one trajectory evolves vs. how a population evolves. Ergodicity and central limit
theorems. Higher-order Markov chains and
related models. Markov chain Monte Carlo.
- Reading:
- Slides (.Rmd)
- Homework:
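To illustrate the Lecture 18 themes of eigenvectors and "one trajectory vs. a population", the sketch below finds the invariant distribution of a small Markov chain three ways. The transition matrix is hypothetical, chosen only so the answer is easy to check by hand; the variable names are likewise made up.

```r
# A hypothetical 3-state transition matrix; each row sums to 1
P <- matrix(c(0.9, 0.1, 0.0,
              0.2, 0.7, 0.1,
              0.0, 0.3, 0.7),
            nrow = 3, byrow = TRUE)

# (1) Eigenvectors: the invariant distribution is a left eigenvector of P
# with eigenvalue 1, i.e., an ordinary eigenvector of t(P)
e <- eigen(t(P))
idx <- which.min(abs(e$values - 1))
v <- Re(e$vectors[, idx])
pi_inv <- v / sum(v)   # for this matrix, works out to (0.6, 0.3, 0.1)

# (2) How a population evolves: push a distribution forward many steps
p <- c(1, 0, 0)
for (t in 1:200) p <- as.numeric(p %*% P)

# (3) How one trajectory evolves: simulate a single long path and
# record the fraction of time spent in each state
set.seed(18)
n_steps <- 1e4
state <- integer(n_steps)
state[1] <- 1
for (t in 2:n_steps) {
  state[t] <- sample(1:3, size = 1, prob = P[state[t - 1], ])
}
freq <- tabulate(state, nbins = 3) / n_steps

print(pi_inv)
print(p)
print(freq)
```

The first two answers agree to numerical precision; the third agrees only approximately, and how fast it converges is exactly what ergodic theorems and central limit theorems for Markov chains describe.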
Election Day (Tuesday, 3 November): NO CLASS
Lecture 19 (Thursday, 5 November):
Markov Chains II
- Likelihood inference for individual trajectories. Least-squares
inference for population data. Conditional density estimates for
continuous spaces. Model-checking.
- Reading:
- Slides (.Rmd)
- Homework:
Lecture 20 (Tuesday, 10 November):
Epidemic Models
- The basic "susceptible-infectious-removed" (SIR) epidemic model. The
probability model and its deterministic limit. The idea of the "basic
reproductive number" R0 and how it relates to the rates of transmission and
removal. Why diseases do not necessarily evolve to be less lethal to their
hosts. The epidemic threshold when R0=1.
Complications: gaps between being infected and becoming infectious; the
possibility of being infectious without showing symptoms; re-infection.
Epidemics in social networks, and how network structure affects the epidemic
threshold; why high-degree people tend to be among the first infected, and
disease-control strategies based on "destroying the hubs". Statistical issues
in connecting epidemic models to data.
- Reading:
- Zeynep Tufekci, "Don’t Believe the COVID-19 Models: That’s not what they’re for", The Atlantic 2 April 2020
- Optional readings:
- (**) Mark E. J. Newman, "The spread of epidemic disease on networks",
Physical Review E 66 (2002): 016128, arxiv:cond-mat/0205009
- (*) Tom Britton, "Epidemic models on social networks -- with inference", arxiv:1908.05517
- (**) Romualdo Pastor-Satorras and Alessandro Vespignani, "Immunization of complex networks", Physical Review E 65 (2002): 036104,
arxiv:cond-mat/0107066
- (*) Lisa Sattenspiel (with contributions by Alun Lloyd),
The Geographic Spread of Infectious Diseases: Models and Applications (Princeton, New Jersey: Princeton University Press, 2009) [Full-text access via JSTOR]
- Slides (.Rmd)
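A rough sketch of the deterministic limit of the SIR model and of the epidemic threshold, in Python rather than the course's R. It assumes the standard parameterization in which S, I and R are population fractions, `beta` is the transmission rate and `gamma` the removal rate, so that R0 = beta/gamma. The equations are integrated with a simple Euler scheme, and the parameter values are made up.

```python
def sir_final_size(beta, gamma, i0=1e-3, dt=0.01, t_max=500.0):
    """Euler-integrate dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I.

    Returns the removed fraction at t_max, i.e. roughly the share of the
    population that was ever infected.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    for _ in range(int(t_max / dt)):
        new_infections = beta * s * i * dt
        new_removals = gamma * i * dt
        s, i, r = s - new_infections, i + new_infections - new_removals, r + new_removals
    return r

below = sir_final_size(beta=0.5, gamma=1.0)  # R0 = 0.5: the outbreak fizzles
above = sir_final_size(beta=1.5, gamma=1.0)  # R0 = 1.5: a large epidemic
print(below, above)
```

With R0 below 1 the infection dies out after reaching only a tiny fraction of the population, while with R0 above 1 a substantial fraction is eventually infected.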
Lecture 21 (Thursday, 12 November):
Compartment Models
- General idea of compartment models as a special kind of Markov
model. Applications in demography, epidemiology, sociology, chemistry, etc.
- Reading: handout (.Rnw)
- Slides (.Rmd)
- Homework:
Lecture 22 (Tuesday, 17 November):
Markov Random Fields
- Markov models in space. The Gibbs-Markov equivalence. The Gibbs sampler
again. Examples with the Ising model. Inference. Spatio-temporal Markov models: general idea; cellular automata.
- Reading: Guttorp, chapter 4, omitting section 4.6, and skimming section 4.3
- Slides (.Rmd)
Lecture 23 (Thursday, 19 November):
State-Space or Hidden-Markov Models
- Markov dynamics + distorting or noisy observations = non-Markov observations. Model formulation. Inference: E-M algorithm, Kalman
filter, particle filter, simulation-based methods. Spatio-temporal version: dynamic factor models.
- Reading: Guttorp, section 2.12
- Slides (.Rmd); the more detailed handout (.Rmd)
- Homework:
NO CLASS on Tuesday, 24 November
- There was going to be an optional lecture on point processes, but it's become clear that there really won't be enough attendance to justify this. I'll still post the slides/notes, but we won't be meeting.
- Reading:
- Guttorp, ch. 5
- Alex Reinhart, "A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications", Statistical Science 33 (2018): 299--318, arxiv:1708.02647
- Brad Luen and Philip B. Stark, "Testing Earthquake Predictions",
pp. 302--315 in
Deborah Nolan and Terry Speed (eds.), Probability and Statistics: Essays in Honor of David A. Freedman (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2008)
- Extra-optional more advanced reading:
- (***) Seth Flaxman, Yee Whye Teh, and Dino Sejdinovic, "Poisson intensity estimation with reproducing kernels", AISTATS 2017, arxiv:1610.08623
- (*) Charles Loeffler and Seth Flaxman, "Is Gun Violence Contagious? A Spatiotemporal Test",
Journal of Quantitative Criminology 34 (2018): 999--1017,
arxiv:1611.06713
- (*) Alex Reinhart and Joel Greenhouse, "Self-exciting point processes with spatial covariates: modeling the dynamics of crime",
Journal of the Royal Statistical Society C 67 (2018): 1305--1329, arxiv:1708.03579
Thanksgiving Day (Thursday, 26 November): NO CLASS
Lecture 24 (Tuesday, 1 December):
Nonlinear Prediction I: Model-Agnostic Predictions
- Using smoothing to estimate regression functions. Nonlinear
autoregressions. Additive autoregressions. The "time-delay embedding" method
and the question of "how many lags?" When can we expect model-agnostic methods
to work?
- Readings (all advanced and optional):
- (*) Jianqing Fan and Qiwei Yao, Nonlinear Time Series: Nonparametric and Parametric Methods (Berlin: Springer-Verlag, 2003) [Full-text access via Springerlink]
- (*) Holger Kantz and Thomas Schreiber, Nonlinear Time Series Analysis (2nd edition, Cambridge, UK: Cambridge University Press, 2004)
- (**) Norman H. Packard, James P. Crutchfield, J. Doyne
Farmer and Robert S. Shaw, "Geometry from a Time Series",
Physical Review Letters 45 (1980): 712--716
- (***) Norbert Wiener, "Nonlinear Prediction and Dynamics",
vol. III, pp. 247--252 in
Jerzy Neyman (ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Berkeley: University of California Press, 1956)
- Slides (.Rmd)
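A minimal sketch of time-delay embedding used for model-agnostic prediction, in Python rather than the R used in class. The predictor here is a plain nearest-neighbor average, chosen for brevity and not necessarily the smoother used in lecture. The test series is a logistic map with an arbitrary parameter, and the function names are hypothetical.

```python
def delay_embed(x, n_lags):
    """Pairs of (window of n_lags consecutive values, the value that followed)."""
    return [(x[t - n_lags:t], x[t]) for t in range(n_lags, len(x))]

def nearest_neighbor_forecast(x, n_lags, k=3):
    """Predict the next value by averaging what followed the k most similar windows."""
    pairs = delay_embed(x, n_lags)
    query = x[-n_lags:]  # the most recent window, whose successor is unknown

    def squared_distance(window):
        return sum((a - b) ** 2 for a, b in zip(window, query))

    nearest = sorted(pairs, key=lambda pair: squared_distance(pair[0]))[:k]
    return sum(following for _, following in nearest) / k

# Toy nonlinear series: logistic map x[t+1] = 3.9 * x[t] * (1 - x[t])
series = [0.3]
for _ in range(2000):
    series.append(3.9 * series[-1] * (1.0 - series[-1]))

history, truth = series[:-1], series[-1]  # hold out the last value
forecast = nearest_neighbor_forecast(history, n_lags=2)
print(forecast, truth)
```

The choice of `n_lags` is the "how many lags?" question: too few and the windows do not pin down the state, too many and near neighbors become scarce.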
Lecture 25 (Thursday, 3 December):
Nonlinear Prediction II: Model-Reliant Predictions
- Estimating the parameters of a model. Estimating the state of a model.
Extrapolating the estimated state forward in time using the estimated
parameters. Ensemble-based forecasts to handle uncertainty. Issues with
extremes and "functional box-plots". The importance of model-checking.
- Slides (.Rmd); handout on propagation of error
- Reading: optional readings on the last page of the slides
- Homework:
Lecture 26 (Tuesday, 8 December):
Regressions with Dependent Observations
- Reminders about why regression theory usually assumes observations are IID.
Situations where this breaks down: "panel" or "longitudinal" data and
correlations within a "unit" over time; correlations between countries or
regions in spatial cross-sections; correlations because of shared "ancestry".
Effects on linear regression: OLS is still unbiased but inefficient, and all
your inferential statistics are wrong. Solution: generalized least squares
would be efficient if we knew the covariance structure; ways of figuring out
the covariance without knowing it to start with. Some examples, and some
case studies in what goes wrong when we ignore these issues.
- Reading:
- Recent, fairly easy-to-read papers that highlight important issues:
- Morgan Kelly, "The Standard Errors of Persistence",
SSRN/3398303 (2019)
- Youjin Lee and Elizabeth L. Ogburn, "Testing for Network and Spatial Autocorrelation", pp. 91--104 in Naoki Masuda, Kwang-Il Goh, Tao Jia, Junichi Yamanoi and Hiroki Sayama (eds.), Proceedings of NetSci-X 2020: Sixth International Winter School and Conference on Network Science, arxiv:1710.03296
- Youjin Lee and Elizabeth L. Ogburn, "Network Dependence Can Lead to Spurious Associations and Invalid Inference", Journal of the American Statistical Association forthcoming (2020), arxiv:1908.00520
- Thomas B. Pepinsky, "On Whorfian Socioeconomics",
SSRN/3321347 (2019)
- Classic, harder-to-read papers about methods:
- (**) Peter Diggle, Kung-Yee Liang and Scott L. Zeger, Analysis of Longitudinal Data (Oxford: Oxford University Press, 1994) [This is actually pretty easy to read, if you have the time for a full-length book; the CMU library has electronic access]
- (***) Kung-Yee Liang and Scott L. Zeger, "Longitudinal data analysis using generalized linear models", Biometrika 73 (1986): 13--22
- (***) Scott L. Zeger and Kung-Yee Liang, "An overview of methods for the analysis of longitudinal data", Statistics in Medicine 11 (1992): 1825--1839
- Slides (.Rmd)
Lecture 27 (Thursday, 10 December):
Causal Inference over Time
- What do statisticians mean by "causality"? What "Granger causality" is,
and why it's usually not interesting.
Graphical causal models.
Defining causal effects in terms of "surgery" on the graph. Graphical causal
models for variables evolving over time. Discovering the right graph,
assuming additive dependence.
- Reading: (**) Tianjiao Chu and Clark Glymour, "Search for Additive Nonlinear Time Series Causal Models", Journal of Machine Learning Research 9 (2008): 967--991
- Slides (.Rmd)
- Homework:
Image credit: Pictures on this page are from my teacher David Griffeath's
Particle Soup Kitchen website, except for Umberto Boccioni's Riot in the Galleria.