Exploratory Data Analysis / Plotting

36-467/667, Fall 2020

3 September 2020 (Lecture 2)

House-keeping

Kinds of spatio-temporal data

  1. Data are some kind of measurement for regions of space or intervals of time (“region” data, “areal” data)
    • Often called “time series” for temporal data
    • Often equally-sized time intervals or areas (every day or week, every square mile)
    • But not always (every county)
  2. Data are the specific coordinates where some event occurs (“point” data)
    • E.g., times and locations of earthquakes
    • E.g., times at which a nerve cell “spiked”
    • E.g., locations of trees
    • E.g., times and locations of (reported) crimes
    • Point data can come with “marks”, qualitative or quantitative:
      • Magnitude of earthquake
      • Species of tree
      • Type of crime
      • (Spikes are pretty much identical, so it doesn’t make sense to mark them)
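
As a rough illustration of the two kinds of data, here is a sketch using two data sets that ship with R's `datasets` package. These are my choice of examples, not necessarily the ones plotted in this lecture: `lynx` is an annual time series, and `quakes` records point events (earthquakes near Fiji) with magnitude as a quantitative mark.

```r
library(datasets)

# Interval data: one measurement per equally-sized time interval
data(lynx)
class(lynx)        # a "ts" object: values plus a regular time index
frequency(lynx)    # number of observations per year
range(time(lynx))  # first and last year covered

# Point data: each row is one event, located by its coordinates
data(quakes)
head(quakes[, c("lat", "long")])  # where each event happened
summary(quakes$mag)               # a quantitative mark attached to each event
```

The time series stores only values (the times are implied by the regular spacing), while the point data has to store the coordinates of every event explicitly.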

Plotting over time: Time series

Back to Kyoto

Plotting over time: point data / events

Plotting over time: events

Earthquakes, in a rectangle that’s (roughly) Japan, with a magnitude of at least 5.5 on the Richter scale, since January 1 2000

(This has a lot of wasted vertical space!)
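
The remark about wasted vertical space can be seen in a small sketch. The earthquake catalog itself is not reproduced here, so the event times and "magnitudes" below are simulated and purely hypothetical. When all we have is event times, only the horizontal position carries information; one option is to use the vertical axis for a mark.

```r
set.seed(467)  # arbitrary seed so the sketch is reproducible

# Hypothetical event times: 50 events scattered uniformly over 2000-2020
event.times <- sort(runif(50, min = 2000, max = 2020))

# One tick per event along a single axis; the vertical axis shows nothing
plot(event.times, rep(0, length(event.times)), pch = "|", yaxt = "n",
     xlab = "time (years)", ylab = "", main = "Simulated event times")

# If events carry a quantitative mark, the vertical axis can display it
event.marks <- runif(50, min = 5.5, max = 8)  # made-up magnitudes
plot(event.times, event.marks, type = "h", xlab = "time (years)",
     ylab = "mark (made-up magnitude)", main = "Simulated marked events")
```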

Plotting over time: events

Plotting over time: tricks: index numbers

Plotting over time: tricks: indexing

“Yukon Lynx Family” by David Cartier, Sr.

Plotting over time: tricks: indexing

library(datasets)
data(lynx)

Plotting over time: tricks: indexing

lynx.index.numbers <- 100 * lynx/lynx[1]
plot(lynx.index.numbers, main = "Relative abundance of lynxes", ylab = "Index (1821==100)")
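
A small sketch of what the choice of base year does, assuming we re-base to 1900 (an arbitrary year picked only for illustration). Changing the base rescales the whole series by a constant, so the shape of the plot is unchanged.

```r
library(datasets)
data(lynx)

# Index numbers relative to the first year (1821), as above
index.1821 <- 100 * lynx / lynx[1]

# Re-base to another year: divide by that year's value instead
base.value <- as.numeric(window(lynx, start = 1900, end = 1900))
index.1900 <- 100 * lynx / base.value

index.1821[1]                                             # 100 by construction
as.numeric(window(index.1900, start = 1900, end = 1900))  # also 100

# The ratio of the two indices is the same in every year
range(index.1900 / index.1821)
```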

Plotting over time: tricks: differencing

Differencing

Differencing:

nrow(pop)
## [1] 828
delta.pop <- diff(pop$y)
length(delta.pop)
## [1] 827
pop$rate.of.change <- 12 * c(delta.pop, NA)

(Why 12? Why pad with the NA?)
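
One way to think about those two questions, as a sketch on a made-up series. It assumes the observations are monthly; the source does not say so, though 828 rows would be 69 years of months. The data frame `toy` is hypothetical.

```r
# Hypothetical monthly series growing by exactly 10 units per month
toy <- data.frame(month = 1:24, y = 1000 + 10 * (1:24))

delta <- diff(toy$y)  # change from each month to the next
length(delta)         # one fewer than nrow(toy): n values have n - 1 gaps

# Pad with NA so the vector has one entry per row of the data frame;
# the last month has no following month to difference against
toy$rate.of.change <- 12 * c(delta, NA)  # per-month change -> per-year rate

head(toy$rate.of.change)     # 120 units per year throughout
tail(toy$rate.of.change, 1)  # NA for the final month
```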

Differencing:

plot(rate.of.change ~ year, data = pop, type = "l", xlab = "year", main = "US population rate of change", 
    ylab = "people/yr")

What do you think happened in 2010?

Plotting over time: tricks: accumulating
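
A minimal sketch of accumulating, on the assumption that what is meant is a running total (the reverse of differencing). It uses the `lynx` series only because it is built in; whether a cumulative count is meaningful depends on the data.

```r
library(datasets)
data(lynx)

# Running total of the annual counts, kept as a time series with the same dates
lynx.cumulative <- ts(cumsum(as.numeric(lynx)), start = start(lynx),
                      frequency = frequency(lynx))

plot(lynx.cumulative, main = "Cumulative lynx count", ylab = "running total")

# Differencing the running total recovers the original series (after year 1)
recovered <- as.numeric(diff(lynx.cumulative))
all.equal(recovered, as.numeric(lynx)[-1])
```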

Plotting over time: tricks: relative time

Relative time

Plotting over time: tricks: cycle time

Cycle time

What’s the difference between winter and summer?
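
A sketch of plotting against cycle time, using the built-in `nottem` series (monthly average temperatures at Nottingham, 1920-1939). This is a stand-in chosen for illustration, not the data shown in the lecture.

```r
library(datasets)
data(nottem)

# cycle() gives each observation's position within the year (1 = January)
month.of.year <- as.numeric(cycle(nottem))

# Plot against position in the cycle rather than against calendar time
plot(month.of.year, as.numeric(nottem), xlab = "month of year",
     ylab = "temperature (degrees F)",
     main = "Nottingham temperatures by month")

# Average for each month across all years
monthly.means <- tapply(as.numeric(nottem), month.of.year, mean)
monthly.means
```

Every year is folded onto the same 12-point axis, so the seasonal pattern is visible directly, along with how much each month varies from year to year.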

Dynamics: scatterplots

Dynamics
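
One common way to look at dynamics with a scatterplot is to plot each value against the value one step earlier. I am assuming that is the sort of plot intended here; the sketch uses the `lynx` series again.

```r
library(datasets)
data(lynx)

x <- as.numeric(lynx)
n <- length(x)

# Pair each year's count with the following year's count
this.year <- x[-n]
next.year <- x[-1]

plot(this.year, next.year, xlab = "lynx at time t", ylab = "lynx at time t+1",
     main = "Lynx: next year vs. this year")
abline(0, 1, lty = "dashed")  # points on this line would mean no change

cor(this.year, next.year)     # strength of the linear relationship
```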

Relationships between two variables

(Spurious Correlations)

Relationships between two variables

Plotting over space: maps

Plotting over space: maps: complications

Mapping earthquakes

Exercise: Modify color or size of the points to indicate magnitude
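
One possible approach to the exercise, sketched on the built-in `quakes` data (earthquakes near Fiji) because the Japanese catalog is not reproduced here. The scaling constants and the 10-step palette are arbitrary choices.

```r
library(datasets)
data(quakes)  # NOT the Japanese data from the lecture

# Rescale magnitude to [0, 1] so it can drive both size and color
mag.scaled <- (quakes$mag - min(quakes$mag)) / diff(range(quakes$mag))

# Larger points for larger magnitudes
point.size <- 0.5 + 2 * mag.scaled

# Pale colors for small magnitudes, red for large ones
palette.steps <- 10
shades <- rev(heat.colors(palette.steps))
point.color <- shades[1 + floor(mag.scaled * (palette.steps - 1))]

plot(lat ~ long, data = quakes, cex = point.size, col = point.color, pch = 19,
     xlab = "longitude", ylab = "latitude", main = "Earthquakes near Fiji")
```

This plots raw longitude and latitude with no coastline or projection, so it is a scatterplot of coordinates rather than a proper map.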

Plotting over space: maps: projections

Plotting over space: relationships between variables

Spatial relationships

XKCD no. 1138, “Heatmaps”

Spatial Relationships

Kieran Healy, “America’s Ur-Choropleths”

Spatio-temporal data

Other forms of EDA

Conventional EDA, Sub-Divided

(You can persuade R to show both of these in one plot)

Final thought

Take-aways

After-notes

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2018. Deep Roots: How Slavery Still Shapes Southern Politics. Princeton, New Jersey: Princeton University Press.

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton, New Jersey: Princeton University Press.

Hough, Susan. 2009. Predicting the Unpredictable: The Tumultuous Science of Earthquake Prediction. Princeton, New Jersey: Princeton University Press.

Luen, Brad, and Philip B. Stark. 2008. “Testing Earthquake Predictions.” In Probability and Statistics: Essays in Honor of David A. Freedman, edited by Deborah Nolan and Terry Speed, 302–15. Beachwood, Ohio: Institute of Mathematical Statistics. https://doi.org/10.1214/193940307000000509.

McCleary, John. 2006. A First Course in Topology: Continuity and Dimension. Providence, Rhode Island: American Mathematical Society.

Schutz, Bernard F. 1980. Geometrical Methods of Mathematical Physics. Cambridge, England: Cambridge University Press.

Sigmund, Karl. 1996. Games of Life: Explorations in Ecology, Evolution and Behavior. London: Penguin.

Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, Massachusetts: Harvard University Press.

Wilkerson, Isabel. 2010. The Warmth of Other Suns: The Epic Story of America’s Great Migration. New York: Random House.


  1. To be precise, this is the simplest kind of index number. When measuring the economy, we often want a summary of multiple pieces of information which gives an overall sense of change over time, and these summaries are also called “index numbers”. For instance, if we want to measure inflation, we have to look at the prices of, in principle, all the different sorts of goods and services that people buy, and how the prices of (comparable) goods change over time. But different goods will have prices changing at different rates, so we need to somehow combine that into one number. The “consumer price index” (CPI), which is what people usually mean when they talk about inflation, is a weighted average of the index numbers for a “basket” of goods and services, with the weights reflecting how consumers spent their money at the start of the time period. (“Chained” indices adjust the weights every so often.) Other choices of weights would give different indices: in the US there are actually two main CPIs, one for all urban consumers and one for urban wage earners and clerical workers; a “producer price index” geared to what businesses rather than households spend money on; and so on. Sometimes, when econometricians talk about the “index number problem”, they mean the problem of how to pick the weights; sometimes they mean the fact that no set of weights is ideal for all purposes.
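
A toy version of the weighted-average construction, with an entirely made-up three-good basket (the goods, prices and spending shares are hypothetical, and real price indices involve far more detail):

```r
# Made-up basket: prices at the start and end of the period, and the share
# of spending on each good at the start of the period
basket <- data.frame(
  good        = c("food", "housing", "transport"),
  price.start = c(10, 100, 20),
  price.end   = c(11, 105, 26),
  share.start = c(0.3, 0.5, 0.2)
)

# Simple index number for each good (start of period == 100)
basket$index <- 100 * basket$price.end / basket$price.start

# Composite index: weighted average of the individual index numbers
composite <- weighted.mean(basket$index, w = basket$share.start)
composite      # 111.5

# The same prices with equal weights give a different index
equal.weights <- mean(basket$index)
equal.weights  # 115
```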

  2. This is actually a kind of profound mathematical fact. A continuous function \(f\) from a set of points \(A\) to a set of points \(B\), with a continuous inverse \(f^{-1}\), can exist only if \(A\) and \(B\) have the same dimension. The sphere is three dimensional and the page is two dimensional, so we can’t. If you want to follow up on this thought, and be convinced of the theorem, I strongly recommend McCleary (2006) and Schutz (1980) (which isn’t just for physicists).

  3. “Small” here means “small compared to the curvature of the Earth”. Pretend that the Earth is perfectly spherical. (It isn’t; working out exactly how it departs from being a sphere [“geodesy”] was very important to the development of statistics (Stigler 1986).) Pick two points on the surface of the Earth, \(A\) and \(B\), and now imagine drawing lines from the center of the Earth \(C\) to those points. The angle \(\angle ACB\) has a certain magnitude, call it \(\theta\) in radians. (Remember \(2\pi\) radians = \(360\) degrees.) The distance between \(A\) and \(B\) if we measure along the surface of the Earth (“as the crow flies”) is \(r\theta\), where \(r = \overline{AC} = \overline{BC}\) is the radius of the Earth. But the distance between \(A\) and \(B\) on a straight line, \(\overline{AB}\), which would cut through the Earth, is \(2 r \sin{(\theta/2)}\). (Find the mid-point between \(A\) and \(B\) on the line between them, say \(D\). Draw the triangles \(\triangle ACD\) and \(\triangle BCD\). These are both right triangles with hypotenuse \(r\), and the angles \(\angle ACD\) and \(\angle BCD\) are both \(\theta/2\), so the sides opposite those angles (\(\overline{AD}\) and \(\overline{BD}\)) have length \(r \sin{(\theta/2)}\).) For small angles \(x\), \(\sin{x} \approx x\), so for small angles, the distance along the surface of the Earth, \(r\theta\), is approximately the same as the straight-line distance, \(2 r \sin{(\theta/2)} \approx 2r \theta/2 = r\theta\), and we don’t really care that the Earth is round. When it starts to matter depends on how much inaccuracy we’re willing to tolerate. If we’re OK with a 1% error in distances, we can go out to a \(\theta\) where \(2\sin{(\theta/2)}/\theta = 0.99\), which numerically is about 0.4906 radians. Since the radius of the Earth is 6378 kilometers, this corresponds to a surface distance of about 3129 kilometers. But, as the proverb says, getting 99% of the way across the ocean still means drowning; if we want an error of only \(0.01\)%, then we’re looking at an angle of only about 0.0490 radians, and so only about 312.5 kilometers along the surface.
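
The numerical step in the last footnote can be checked with a short sketch. The function names `chord.to.arc` and `max.angle` are mine, and the search interval and tolerance are arbitrary choices; the radius of 6378 km is the value quoted in the footnote.

```r
# Ratio of straight-line (chord) distance to surface (arc) distance
chord.to.arc <- function(theta) { 2 * sin(theta / 2) / theta }

# Largest angle at which the chord is within a given relative error of the arc
max.angle <- function(rel.error) {
  uniroot(function(theta) { chord.to.arc(theta) - (1 - rel.error) },
          interval = c(1e-6, pi), tol = 1e-12)$root
}

earth.radius.km <- 6378

theta.1pct <- max.angle(0.01)
theta.1pct                      # angle in radians for a 1% error
earth.radius.km * theta.1pct    # corresponding surface distance in km

theta.01pct <- max.angle(1e-4)
theta.01pct                     # angle in radians for a 0.01% error
earth.radius.km * theta.01pct   # corresponding surface distance in km
```

The ratio falls steadily from 1 as the angle grows from 0 to \(\pi\), so there is exactly one root in the search interval for each tolerance.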