Introduction to the Course

Cosma Shalizi

28 August 2018, 36-467/667

Data over Space and Time

Statistical study of processes that unfold over time, or space, or both
History, science, and lots of practical policy & technology
Not everything:
- Good experiments
- Good surveys
- Wishful thinking

Special Statistical Issues

The goal: learn how here and now depends on there and then
The two problems:
- Everything depends on everything else \(\Rightarrow\) basic theory doesn’t apply
- We don’t get multiple samples of time or of space
- \(\Rightarrow\) \(n=1\), always

Course Mechanics

Syllabus: http://www.stat.cmu.edu/~cshalizi/dst/18
- All assignments will be posted there
- Information about readings and other course resources ditto
Canvas as a gradebook and to distribute some electronic readings
Piazza for question-answering

Office hours:

Me, Wednesdays 1:00–3:00 in Baker Hall 229C
TAs, TBD

Textbooks

Gidon Eshel, Spatiotemporal Data Analysis

Required, but full text PDF through JSTOR

Expect to read some of this nearly every week

Textbooks

Peter Guttorp, Stochastic Modeling of Scientific Data

Recommended; required excerpts will be distributed through Canvas

Textbooks

Paul Teetor, The R Cookbook

Recommended; consult as needed

Lectures

Clarification, amplification, examples, alternatives
Complements to the readings
- \(\therefore\) Do the readings ahead of time
No electronics, except for documented need
Graded in-class exercises most lectures

Assignments

In-class exercises in small groups (15%)
- In class, on paper, 15% of your grade
- Will get you started on the homework, or get you over hurdles
- Expect one of these most class meetings
Weekly homework \(\times 12\) (40%)
- Data analysis & computing & a little theory
- Always turned in electronically via Canvas
- Usually due at 6 pm on Wednesdays
- Lowest 2 dropped, no questions asked
- If you turn in all 12, lowest 3 dropped
- NO LATE HOMEWORK FOR ANY REASON
Take-home mid-term (20%) and final exams (25%)
- Midterm due 11 October; final due 14 December
- One week (at least) to work on each
- Electronic submission via Canvas

R Markdown

R Markdown is a straightforward way to combine text, math, code, and code output in one document
Enforces reproducibility: your numbers, figures, etc., are generated by the code, so results always match
Use is required for this class
- Your homework and exams will lose points if you don’t use it
- You will lose more points as the semester goes on
- Resources for learning R Markdown on the syllabus page

Collaboration, Cheating & Plagiarism

Except for the in-class group assignments, you need to work on your own
Full policy for this class is part of the syllabus
Read that policy and the CMU policy on academic integrity
Turn in Homework 0 on these policies by Thursday
- Not graded, but you won’t get credit for any work until you complete it successfully

What are the big issues?

We see \(X\) at time \(t\) and point \(r\), \(X(r,t)\), and want to know how it relates to \(Y\) at time \(s\) and point \(q\), \(Y(q,s)\)

Problems:

Basic statistical theory is about independent, identically distributed (IID) data.
But we (usually) only see one realization of a whole process.
Every observation is dependent on every other observation.
Basic statistical theory then says that \(n=1\) and we can’t draw any inferences.
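
A minimal sketch of why dependence matters, under loud assumptions: the AR(1) model, the autocorrelation `phi`, and the series length are all made up for illustration and are not from the course data. It compares how much the sample mean varies for dependent series and for IID series with the same marginal variance.

```r
# Sketch: how dependence shrinks the information in n observations.
# The AR(1) model and every parameter value here are illustrative assumptions.
set.seed(467)
n <- 200          # length of each series
phi <- 0.9        # assumed correlation between successive values
n.reps <- 2000    # number of simulated series of each kind

# Sample mean of one AR(1) series (unit innovation variance)
mean.ar1 <- function() { mean(arima.sim(model = list(ar = phi), n = n)) }
# Sample mean of n IID Gaussians with the same marginal variance as the AR(1)
marginal.sd <- sqrt(1 / (1 - phi^2))
mean.iid <- function() { mean(rnorm(n, sd = marginal.sd)) }

var.dependent <- var(replicate(n.reps, mean.ar1()))
var.iid <- var(replicate(n.reps, mean.iid()))

# A ratio above 1 means the dependent series acts like a smaller IID sample
var.ratio <- var.dependent / var.iid
effective.n <- n / var.ratio
```

Here `effective.n` is a rough "number of independent observations" the dependent series is worth for estimating a mean; it comes out far below `n`, though not all the way down to 1.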

How are we going to deal with these issues?

Methods for describing relationships and finding patterns in the data
Especially methods for predicting \(X(r,t)\) from \(Y(q,s)\)
Incorporate dependence into statistical theory, so we can say when methods will work
Quantifying uncertainty is best done through modeling and simulation

Why is this worth knowing?

Italo Calvino, “All at One Point” (1963) imagines the world un-extended in time or space¹

Otherwise, every branch of science deals with data spread over space and time
Our examples will come from
- Geology
- Climatology
- Meteorology
- Ecology
- Epidemiology
- Demography
- Economics
- Neuroscience
- Physics
Could (and might) add examples from technology, policy, business, etc.

So what are we going to cover?

Exploratory methods for spatio-temporal data
Linear prediction
Inference with dependent data
Generative models
Simulations and simulation-based inference
Organized by method, skip around on subject matter

What we will not be covering

ARIMA (etc.) models — take 36-618
Finance — have a care for your soul

Exploratory data analysis

You know that we always start with EDA
Impose few or no assumptions, try to find patterns
- Or: check whether some pattern holds
All the usual stuff can still be useful
- Scatter-plots, histograms, cross-tabulations…
Special tools for spatio-temporal data:
- Local averaging to remove noise and reveal trends
- Removing trends to reveal random fluctuations
- Measuring dependence through covariance
- Finding patterns of covariance
- Finding periodic oscillations
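
A rough sketch of these tools on a synthetic series, using only base R. The series (linear trend plus a period-25 cycle plus white noise) and all the names are assumptions made up for illustration; the course may well use different functions for each step.

```r
# Sketch: the special EDA tools applied to a made-up series.
set.seed(36467)
time.index <- 1:300
# Assumed ingredients: linear trend + period-25 oscillation + white noise
x <- 0.02 * time.index + 2 * sin(2 * pi * time.index / 25) + rnorm(300)

# Local averaging: centered 25-point moving average (NA at the ends)
local.average <- stats::filter(x, rep(1 / 25, 25), sides = 2)

# Removing the trend: residuals from a straight-line fit
detrended <- residuals(lm(x ~ time.index))

# Measuring dependence: sample autocovariance at lags 0 through 50
acov <- acf(detrended, lag.max = 50, type = "covariance", plot = FALSE)

# Finding periodic oscillations: raw periodogram of the detrended series
pgram <- spec.pgram(detrended, taper = 0, detrend = FALSE, plot = FALSE)
dominant.period <- 1 / pgram$freq[which.max(pgram$spec)]
```

The window of 25 was chosen to span one full cycle, so the moving average wipes out the oscillation and leaves the trend; a real analysis would not know the period in advance.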

Start with something vivid, which is also HW 1

Cherry blossoms in Kyoto

Cherries at the Hirano shrine in Kyoto (David Montasco on flickr)

Flowering of cherry trees has been a central part of Japanese art & culture for well over a millennium

Hanami

Kitao Shigemasa, Sangatsu, Asukayama Hanami = Third Lunar Month, Blossom Viewing at Asuka Hill, c. 1776, via Library of Congress

Notice the date in the title!

This is data!

Ancient diaries², poetry, etc., and modern newspapers, record when cherry trees in Kyoto came into bloom
- Kyoto because it’s the ancient capital and has been a continuous seat of art & culture

Cherry blossoms track climate

Japan gets cold

Snow at the Hirano shrine (yopparainokobito on flickr)

Cherries only blossom when it gets warm enough
\(\therefore\) The date when cherry trees are in full flower tells us about how warm the year was
- Date of first flowering is also informative, but less often recorded

A data set

Assembled by Prof. Yasuyuki Aono
- Data points going back to the early 800s
- Almost continuous for modern times
- Re-formatted version at http://www.stat.cmu.edu/~cshalizi/dst/18/data/kyoto.csv
For each year, the day of the year (1–366) of full bloom
- April 1 = 91 or 92
- Aono had to search out the ancient records, poems, histories, etc.
- and convert dates before 1873, when Meiji-era Japan adopted the Gregorian calendar, to our (Gregorian) calendar and (common) era
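
To make the day-of-year coding concrete, here is a small sketch in base R. The helper name `day.of.year` is hypothetical, and it only handles Gregorian dates; converting from the old Japanese calendar is the hard part and is not attempted here.

```r
# Sketch: day of the year (1-366) for a Gregorian calendar date.
# day.of.year is a made-up helper name, not part of any package.
day.of.year <- function(date) {
  as.POSIXlt(as.Date(date))$yday + 1   # $yday counts from 0, so add 1
}

day.of.year("2018-04-01")   # 91: ordinary year
day.of.year("2020-04-01")   # 92: leap year, February has 29 days
```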

A data set

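# Load the re-formatted version of Aono's series from the course website
kyoto <- read.csv("http://www.stat.cmu.edu/~cshalizi/dst/18/data/kyoto.csv")
# Plot the day of full flowering against the year, as a line
plot(Flowering.DOY ~ Year.AD, data=kyoto, type="l",
     ylab="Day in year of full flowering", xlab="Year (AD)",
     main="Cherry blossoms in Kyoto")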

Problems

How to fill in the data between the observations? (interpolation)
- When did the cherries bloom in 1015?
How to extend beyond the observations? (extrapolation)
- Make a guess for 2020, or for 800
How to remove measurement noise in the observations? (filtering)
- Poets can be sloppy about dates
How to separate year-to-year fluctuations from longer-term trends?
- How do we model fluctuations?
- How do we model trends?
How seriously should we treat the warming since about 1800? (inference)
- Could it be an artifact of denser observations?
- Does the climate just do stuff like this occasionally?
- How long do we need to wait to be confident?
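
A crude sketch of the first two problems, on made-up numbers rather than the Kyoto data: the years, bloom days, and which years are "recorded" are all invented. It uses base R's `approx()` for linear interpolation, which is only one of many possible choices.

```r
# Sketch: interpolation vs. extrapolation on a hypothetical gappy series.
set.seed(801)
years <- 1000:1100
bloom.day <- 105 + 3 * sin(2 * pi * years / 40) + rnorm(length(years), sd = 2)
observed <- sort(sample(seq_along(years), 40))   # only 40 of 101 years recorded

# Interpolation: straight lines between neighbouring recorded years
filled <- approx(x = years[observed], y = bloom.day[observed], xout = years)

# Extrapolation: by default (rule = 1) approx() returns NA outside the data
beyond <- approx(x = years[observed], y = bloom.day[observed], xout = 1150)$y
```

Linear interpolation reproduces the recorded values exactly, noise and all, so it does no filtering; and its refusal to extrapolate is a reminder that guessing beyond the observations needs a model of the trend.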

Smoothed cherry blossoms

Both curves come from averaging nearby values
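
One way to get two such curves is a moving average with two different window widths. This is a sketch on synthetic data, and the smoother, the widths (5 and 51), and the helper name `moving.average` are assumptions; the curves on the slide may have been produced differently.

```r
# Sketch: two smooths of the same made-up series, differing only in width.
set.seed(2018)
year <- 1:400
day <- 100 + 4 * sin(2 * pi * year / 200) + rnorm(length(year), sd = 3)

# Centered moving average over `width` neighbouring values (NA near the ends)
moving.average <- function(x, width) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 2))
}
narrow <- moving.average(day, 5)    # follows the data closely, stays wiggly
wide <- moving.average(day, 51)     # much smoother, may blur real changes

plot(year, day, type = "l", col = "grey",
     xlab = "Year", ylab = "Day in year")
lines(year, narrow, col = "blue")
lines(year, wide, col = "red")
```

Neither width is obviously right: the narrow window keeps more noise, the wide one risks averaging away genuine features.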

Break for small-group exercise

Which of the two curves do you think is better? Or are they both bad? Or, if you can’t tell, what more information would you need?

We’re going to need to build some concepts

Stochastic process = collection of random variables over time or space or both, typically dependent, say \(X(t)\)
Trend = central tendency of the process, say \(\mu(t) = \mathbb{E}[X(t)]\)
Fluctuations = difference from the trend
How can we figure out the trend if we just see \(X\) once?
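
A sketch of why the last question is hard. If we could re-run the process many times, averaging across realizations at each \(t\) would recover \(\mu(t)\); with one realization we cannot do that. The linear trend, the AR(1) fluctuations, and all names below are assumptions for illustration.

```r
# Sketch: ensemble averaging recovers the trend; one realization does not.
set.seed(1)
time.index <- 1:100
mu <- 10 + 0.05 * time.index        # assumed true trend, mu(t) = E[X(t)]
n.realizations <- 500

# One realization: trend plus dependent (AR(1)) fluctuations
simulate.process <- function() {
  mu + as.numeric(arima.sim(model = list(ar = 0.8), n = length(time.index)))
}
ensemble <- replicate(n.realizations, simulate.process())  # 100 x 500 matrix

# Average across realizations at each time point
ensemble.mean <- rowMeans(ensemble)
one.realization <- ensemble[, 1]

max.error.ensemble <- max(abs(ensemble.mean - mu))
max.error.single <- max(abs(one.realization - mu))
```

With real data we only ever get `one.realization`, so the averaging has to happen over nearby times (or places) instead of over repeated runs.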

By Thursday:

Make sure you’re on Canvas & Piazza for the course
- Let me know if not
Homework 0
- Read the course policy on collaboration, cheating and plagiarism
- Read the university policy on academic integrity
- Read the excerpt from Turabian’s Manual for Writers (Canvas)
- Complete Homework 0
Readings: Eshel, ch. 7; Guttorp, introduction and ch. 1

Take-aways

Everything is statistically dependent on everything else
Dependence means what happens here and now gives us information about there and then, so we can predict (interpolate, extrapolate, filter, forecast)
Dependence means the statistical theory we’ve learned needs to be fixed
For now, we will focus on describing dependence
Next time: trends and smoothing

¹ Calvino’s Cosmicomics is a precious part of our common cultural heritage.
² The Pillow Book of Sei Shonagon is a precious part of our common cultural heritage.
