Introduction to the Course

Cosma Shalizi

28 August 2018, 36-467/667

Data over Space and Time

Special Statistical Issues

Course Mechanics

Office hours:

Textbooks

Gidon Eshel, Spatiotemporal Data Analysis

Required, but the full text is available as a PDF through JSTOR

Expect to read some of this nearly every week

Peter Guttorp, Stochastic Modeling of Scientific Data

Paul Teetor, The R Cookbook

Recommended; consult as needed

Lectures

Assignments

  1. In-class exercises in small groups (15%)
    • Done in class, on paper
    • Will get you started on the homework, or get you over hurdles
    • Expect one of these most class meetings
  2. Weekly homework \(\times 12\) (40%)
    • Data analysis & computing & a little theory
    • Always turned in electronically via Canvas
    • Usually due at 6 pm on Wednesdays
    • Lowest 2 dropped, no questions asked
    • If you turn in all 12, lowest 3 dropped
    • NO LATE HOMEWORK FOR ANY REASON
  3. Take-home midterm (20%) and final (25%) exams
    • Midterm due 11 October; final due 14 December
    • One week (at least) to work on each
    • Electronic submission via Canvas

R Markdown

Collaboration, Cheating & Plagiarism

Any questions?

What are the big issues?

Problems:

  1. Basic statistical theory is about independent, identically distributed (IID) data.
  2. But we (usually) only see one realization of a whole process.
  3. Every observation is dependent on every other observation.
  4. Basic statistical theory then says that \(n=1\) and that we can’t draw any inferences.
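
To make the cost of dependence concrete, here is a minimal simulation sketch in R; it is an illustration, not something from the original slides. It compares the sample mean of IID Gaussian data with the sample mean of a first-order autoregression, which is used here only as an assumed, convenient model of dependence. The names `n`, `phi`, `n_reps` and `sim_mean_ar1` and all the numerical settings are hypothetical choices.

```r
set.seed(467)   # arbitrary seed, for reproducibility only

n      <- 100   # length of each simulated series (assumed)
phi    <- 0.9   # AR(1) coefficient (assumed; strong positive dependence)
n_reps <- 2000  # number of independent replications

# Sample mean of one simulated stationary AR(1) series of length n.
# The innovation sd is scaled so each observation has marginal variance 1,
# matching the IID case below.
sim_mean_ar1 <- function(n, phi) {
  x <- arima.sim(model = list(ar = phi), n = n, sd = sqrt(1 - phi^2))
  mean(x)
}

iid_means <- replicate(n_reps, mean(rnorm(n)))
ar1_means <- replicate(n_reps, sim_mean_ar1(n, phi))

var_iid <- var(iid_means)  # should be close to 1/n
var_ar1 <- var(ar1_means)  # much larger, though each value still has variance 1

# Rough "effective sample size": the number of IID observations
# that would give the same precision for the mean
n_eff <- 1 / var_ar1

c(var_iid = var_iid, var_ar1 = var_ar1, n_eff = n_eff)
```

Under these assumptions the dependent series carries far less information about the mean than its nominal length suggests, but more than a single observation would. That gap between "\(n=1\)" and "\(n\) IID points" is where the methods of the course operate.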

How are we going to deal with these issues?

Why is this worth knowing?

So what are we going to cover?

What we will not be covering

Exploratory data analysis

Start with something vivid, which is also HW 1

Cherry blossoms in Kyoto

Cherries at the Hirano shrine in Kyoto (David Montasco on flickr)

Flowering of cherry trees has been a central part of Japanese art & culture for well over a millennium

Hanami

Kitao Shigemasa, Sangatsu, Asukayama Hanami = Third Lunar Month, Blossom Viewing at Asuka Hill, c. 1776, via Library of Congress

Notice the date in the title!

This is data!

Cherry blossoms track climate

Snow at the Hirano shrine (yopparainokobito on flickr)

A data set

# Load the Kyoto flowering record, then plot day of full flowering against year
kyoto <- read.csv("http://www.stat.cmu.edu/~cshalizi/dst/18/data/kyoto.csv")
plot(Flowering.DOY ~ Year.AD, data=kyoto, type="l",
     ylab="Day in year of full flowering", xlab="Year (AD)",
     main="Cherry blossoms in Kyoto")

Problems

Smoothed cherry blossoms

Both curves come from averaging nearby values
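
The slides do not say which smoother produced the two curves, so the following is only a sketch of the general idea of "averaging nearby values", using a centered moving average with two window widths as a stand-in. The data are synthetic (a slow trend plus noise), not the Kyoto series, and the helper `moving_average` and both window widths are hypothetical choices.

```r
set.seed(36)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for a yearly series: slow trend plus noise
year <- 1:300
y <- 100 + 5 * sin(2 * pi * year / 150) + rnorm(length(year), sd = 4)

# Centered moving average over `width` consecutive values (width should be odd).
# Positions without a full window on both sides come back as NA.
moving_average <- function(x, width) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 2))
}

smooth_narrow <- moving_average(y, width = 5)   # follows the data closely
smooth_wide   <- moving_average(y, width = 51)  # much smoother, may blur real features

plot(year, y, type = "l", col = "grey",
     xlab = "Year", ylab = "Simulated value")
lines(year, smooth_narrow, col = "blue")
lines(year, smooth_wide, col = "red")
```

The only difference between the two curves is how far "nearby" reaches. A narrow window keeps more of the noise, while a wide one averages more of it away at the risk of flattening genuine structure.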

More concrete problems

Break for small-group exercise

Which of the two curves do you think is better? Or are they both bad? Or, if you can’t tell, what more information would you need?

We’re going to need to build some concepts

By Thursday:

Take-aways


  1. Calvino’s Cosmicomics is a precious part of our common cultural heritage.

  2. The Pillow Book of Sei Shonagon is a precious part of our common cultural heritage.