Exploratory Data Analysis / Plotting

36-467/667, Fall 2020

3 September 2020 (Lecture 2)

House-keeping

Signed up for Canvas and Piazza?
Read the policies and Turabian? Done HW 0?
Downloaded HW 1 and the data set?

Kinds of spatio-temporal data

Data are some kind of measurement for regions of space or intervals of time (“region” data, “areal” data)
- Often called “time series” for temporal data
- Often equally-sized time intervals or areas (every day or week, every square mile)
- But not always (every county)
Data are the specific coordinates where some event occurs (“point” data)
- E.g., times and locations of earthquakes
- E.g., times at which a nerve cell “spiked”
- E.g., locations of trees
- E.g., times and locations of (reported) crimes
- Point data can come with “marks”, qualitative or quantitative:
  - Magnitude of earthquake
  - Species of tree
  - Type of crime
  - (Spikes are pretty much identical, so it doesn’t make sense to mark them)

Plotting over time: Time series

Use a scatterplot: time on horizontal axis, value on vertical (always)
- Often a good idea to draw lines connecting values
  - Be careful when there are NAs!
Worry about the vertical scale
- Are you exaggerating small changes?
- Are you downplaying meaningful changes?
- “Adjust the vertical scale to the range of the data” and “Make sure the vertical scale includes zero” are both good rules of thumb but can conflict
- \(\therefore\) You need to think
Often nice: rug plot on vertical axis to show distribution
- Sometimes nice to add to horizontal to show where data is present

Back to Kyoto

Vertical axis doesn’t go to 0
- (Cherry trees have never flowered in Kyoto on January 1st, and never will)
Used type="l" so it only draws connecting lines, with gaps for NAs
- No marking for points with NAs on both sides!

Back to Kyoto

Now we see all the data values, even the isolated ones
Many repeated values, so rug gives more a sense of the range than of the distribution
- Add a little visual noise through jitter()
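A minimal sketch of the two plotting styles just described, using a made-up series (the vector `value` and the positions of its NAs are assumptions for illustration, not the Kyoto cherry-blossom data). Lines alone leave gaps and silently drop isolated values; adding points brings them back; a jittered rug separates repeated values.

```r
# Hypothetical yearly series with missing values (NOT the Kyoto data)
set.seed(1)
year <- 1:40
value <- round(100 + cumsum(rnorm(40)))
value[c(5, 7, 20:24)] <- NA   # year 6 now has NAs on both sides

# Lines only: gaps appear at the NAs, and year 6 is not drawn at all
plot(year, value, type = "l", xlab = "year", ylab = "value")
# Adding points shows every observed value, including isolated ones
points(year, value, pch = 16, cex = 0.5)
# Rug on the vertical axis (side = 2); jitter() spreads out repeated values
rug(jitter(value[!is.na(value)]), side = 2)

# Which observed values have NAs on both sides (and so vanish with type = "l")?
n <- length(value)
left.na <- c(TRUE, is.na(value[-n]))    # is the previous value missing?
right.na <- c(is.na(value[-1]), TRUE)   # is the next value missing?
isolated <- which(!is.na(value) & left.na & right.na)
isolated
```

Checking for isolated values before plotting is one way to decide whether lines alone are enough.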

Plotting over time: point data / events

Use a timeline
- Spread out the events
- If marked, use shape, color or size to indicate the mark
- Continuous color and size better for quantitative marks than shape because there’s a more intuitive mapping
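A rough sketch of a timeline with a quantitative mark shown by symbol size. The event times and magnitudes are simulated stand-ins (not the earthquake catalogue used in these slides), and the scaling rule for `point.size` is an arbitrary choice.

```r
# Hypothetical marked events (NOT the Japanese earthquake data)
set.seed(2)
n.events <- 30
event.day <- sort(runif(n.events, min = 0, max = 365))
magnitude <- 5.5 + rexp(n.events, rate = 2)

# Map the quantitative mark to symbol size: bigger mark, bigger circle
point.size <- 0.5 + 2 * (magnitude - min(magnitude))

# A timeline: events along the horizontal axis, nothing on the vertical axis
plot(event.day, rep(0, n.events), cex = point.size, yaxt = "n", ylab = "",
     xlab = "day of year", main = "Timeline of events, sized by magnitude")
```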

Plotting over time: events

Earthquakes, in a rectangle that’s (roughly) Japan, with a magnitude of at least 5.5 on the Richter scale, since January 1 2000

(This has a lot of wasted vertical space!)

Plotting over time: events

Alternative: plot the cumulative number of events from the beginning of the data up to time \(t\)
- Sometimes called the counting process

There’s a jump at each event
- There’s a big jump if many events happen at once
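One possible way to build and plot a counting process in base R, sketched on a short made-up vector of event times with ties (the name `event.time` and its values are assumptions). `stepfun()` gives a right-continuous step function, so the count at an event time already includes that event.

```r
# Hypothetical event times, with several events at the same time
event.time <- c(3, 10, 10, 10, 25, 31, 31, 60)

jump.times <- sort(unique(event.time))
# Running total of events at each distinct time
count.at.jump <- as.vector(cumsum(table(event.time)))

# N(t) = number of events up to and including time t
N <- stepfun(jump.times, c(0, count.at.jump))
plot(N, do.points = FALSE, xlab = "time",
     ylab = "cumulative number of events", main = "Counting process")

N(9)    # 1: only the event at time 3 has happened
N(10)   # 4: a jump of size 3, because three events share time 10
```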

Plotting over time: tricks: index numbers

Have measurements \(X(t_1), X(t_2), \ldots, X(t_n)\)
Pick one time, \(t^*\), and then plot \(X(t_i)/X(t^*)\)
- Often \(t^*=t_1\) but not necessarily
These ratios are called index numbers¹
- Often multiplied by 100
Lets you avoid thinking about the actual units and look at just magnitude relative to some baseline
- Often useful when the actual magnitude isn’t illuminating or we want to make comparisons across different time series

Plotting over time: tricks: indexing

“Yukon Lynx Family” by David Cartier, Sr.

Plotting over time: tricks: indexing

library(datasets)
data(lynx)

A famous data set showing the number of lynxes trapped each year in the Mackenzie River region of Canada
Ecologists don’t actually care about this number, but about how many lynxes there were
Assume total number of lynxes \(\propto\) number of trapped lynxes
- But we don’t know the proportion!
- (Yes, people have looked into whether the assumption makes sense here)
Use an index number, basing it off say the first year in the data:

Plotting over time: tricks: indexing

lynx.index.numbers <- 100 * lynx/lynx[1]
plot(lynx.index.numbers, main = "Relative abundance of lynxes", ylab = "Index (1821==100)")
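The baseline \(t^*\) need not be the first year. As a sketch, the same recipe with a different base year; the choice of 1900 and the names `base.year`, `base.value` and `lynx.index.1900` are arbitrary and only for illustration.

```r
library(datasets)
data(lynx)

# Same index-number recipe, with a different baseline year t*
base.year <- 1900   # an arbitrary choice, for illustration only
base.value <- as.numeric(window(lynx, start = base.year, end = base.year))
lynx.index.1900 <- 100 * lynx / base.value

plot(lynx.index.1900, main = "Relative abundance of lynxes",
     ylab = paste0("Index (", base.year, " = 100)"))
abline(h = 100, lty = 2)   # the baseline level
```

Changing the base year rescales the whole curve but does not change its shape.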

Plotting over time: tricks: differencing

Rate of change is \[ \Delta X(t_i) = \frac{X(t_i) - X(t_{i-1})}{t_i - t_{i-1}} \]
The numerator is the first difference of the series
- What’d the second difference be?
If observations are equally spaced in time, can just look at \(X(t) - X(t-1)\)
This is like looking at the derivative, \(\frac{dX}{dt}\)
In R, we take differences with diff() (and it gives us something 1 shorter than the original vector)
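A small worked example of the rate-of-change formula when observations are not equally spaced. The vectors `obs.time` and `x` are made up.

```r
# Hypothetical measurements at unequally spaced times
obs.time <- c(0, 1, 2, 4, 7)
x <- c(10, 12, 15, 15, 21)

first.difference <- diff(x)                # X(t_i) - X(t_{i-1}): 2 3 0 6
rate.of.change <- diff(x) / diff(obs.time) # divide by the time elapsed

rate.of.change           # 2 3 0 2
length(x)                # 5
length(rate.of.change)   # 4: one shorter than the original vector
```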

Differencing

Start with the population of the US over time:

Differencing:

nrow(pop)

## [1] 828

delta.pop <- diff(pop$y)
length(delta.pop)

## [1] 827

pop$rate.of.change <- 12 * c(delta.pop, NA)

(Why 12? Why pad with the NA?)

Differencing:

plot(rate.of.change ~ year, data = pop, type = "l", xlab = "year", main = "US population rate of change", 
    ylab = "people/yr")

What do you think happened in 2010?

Plotting over time: tricks: accumulating

Go the other way: add up rates of change
Like taking the integral
R: cumsum() (for cumulative summation)
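A quick check, on a made-up vector, that accumulating undoes differencing. As with an integral, the starting value has to be supplied separately (the analogue of the constant of integration).

```r
x <- c(10, 12, 15, 15, 21)   # hypothetical, equally spaced series
changes <- diff(x)           # 2 3 0 6

# Start from the initial value and add up the changes
rebuilt <- x[1] + c(0, cumsum(changes))
all.equal(rebuilt, x)        # TRUE
```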

Plotting over time: tricks: relative time

Instead of making one plot with one line over time, make multiple lines where we re-start when something happens, and plot time relative to those events
- Times at which those events happen are sometimes called “epochs”

Relative time

Make the epoch-defining events earthquakes with magnitude at least 7
Look at a window of \(\pm 30\) days around each one
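A sketch of how a relative-time plot might be assembled. The daily series `x` and the event days in `epochs` are simulated placeholders (not the earthquake data), and the epochs are chosen far enough from the ends that every window fits.

```r
set.seed(3)
n.days <- 1000
x <- rpois(n.days, lambda = 2)   # hypothetical daily counts
epochs <- c(100, 400, 750)       # hypothetical days of the epoch-defining events
lags <- -30:30                   # 30 days on either side

# One column per epoch, one row per day relative to the event
windows <- sapply(epochs, function(e) x[e + lags])

matplot(lags, windows, type = "l", lty = 1,
        xlab = "days relative to event", ylab = "daily count")
abline(v = 0, lty = 2)           # the event itself
```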

Plotting over time: tricks: cycle time

Like relative time, but the “event” happens on a regular cycle, say annually

Cycle time

What’s the difference between winter and summer?
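A cycle-time plot can be sketched with a built-in monthly series. Here `nottem` (monthly average air temperatures at Nottingham, 1920 to 1939) is swapped in as an example; it is not the data behind the slide.

```r
temp <- as.numeric(datasets::nottem)             # degrees Fahrenheit
month <- as.integer(cycle(datasets::nottem))     # position in the annual cycle, 1..12

# One column per year, one row per month of the cycle
by.year <- matrix(temp, nrow = 12)
matplot(1:12, by.year, type = "l", lty = 1, col = "grey",
        xlab = "month of year", ylab = "temperature (F)")

# Average over years at each point in the cycle
monthly.mean <- tapply(temp, month, mean)
lines(1:12, monthly.mean, lwd = 2)
```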

Dynamics: scatterplots

Plot \(X(t+1)\) vs \(X(t)\)
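For instance, with the lynx series from earlier, one way to line up each value with its successor (the names `now` and `nxt` are just illustrative):

```r
library(datasets)
data(lynx)

n <- length(lynx)
now <- as.numeric(lynx)[-n]   # X(t),   dropping the last year
nxt <- as.numeric(lynx)[-1]   # X(t+1), dropping the first year

plot(now, nxt, xlab = "X(t)", ylab = "X(t+1)",
     main = "Lynx: next year vs. this year")
abline(0, 1, lty = 2)         # points on this line are years with no change
```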

Dynamics

Of course you might also try plotting \(X(t+1)\) vs \((X(t), X(t-1))\)…

Relationships between two variables

You often see plots with two variables with different vertical scales

(Spurious Correlations)

This can be OK but it’s often not a great idea, because you can make the relationship look very different by changing the axes
- Imagine having the red vertical axis go down to 0!
Usually a better idea to use a scatterplot of one variable against another
- More obvious if you’re pulling a fast one with the axes
- Use lines to connect successive measurements
  - Or use numbers as plotting symbols, etc.
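A sketch of a scatterplot of one series against another, with lines connecting successive measurements and years as plotting symbols. The built-in `mdeaths` and `fdeaths` series (monthly UK lung-disease deaths, 1974 to 1979) are used as a stand-in pair of variables; they are not the data from the slide.

```r
# Annual totals from two built-in monthly series
male <- aggregate(datasets::mdeaths, nfrequency = 1, FUN = sum)
female <- aggregate(datasets::fdeaths, nfrequency = 1, FUN = sum)
years <- as.integer(round(time(male)))

# Grey line connects successive years; the year itself is the plotting symbol
plot(as.numeric(male), as.numeric(female), type = "l", col = "grey",
     xlab = "male deaths per year", ylab = "female deaths per year")
text(as.numeric(male), as.numeric(female), labels = years)
```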

Relationships between two variables

There are also issues when each variable is correlated with itself
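- Say \(X(t)\) is positively correlated with \(X(t+1)\), and likewise \(Y(t)\) with \(Y(t+1)\)
- Then if \(X(t)\) is above/below average, we expect \(X(t+1)\) to also be above/below average
- The same for \(Y(t)\) and \(Y(t+1)\)
- \(\therefore\) if \(X\) and \(Y\) happen to be above average together at one time, they tend to stay that way for a while, and knowing if \(X(t)\) is above average seems to help us guess if \(Y(t)\) is above average, even when the two series are unrelated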
We will come back to dealing with auto-correlation later in the course
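A small simulation sketch of why this matters. Both pairs of series below are generated independently, so the true correlation is zero in each case; the autoregressive coefficient 0.9, the series length and the number of replications are arbitrary choices.

```r
set.seed(4)
n <- 100        # length of each series
n.sims <- 200   # number of simulated pairs

# Sample correlation between two independent, uncorrelated-in-time series
cor.white <- replicate(n.sims, cor(rnorm(n), rnorm(n)))

# The same, but each series is positively correlated with its own past
one.ar.series <- function() as.numeric(arima.sim(model = list(ar = 0.9), n = n))
cor.ar <- replicate(n.sims, cor(one.ar.series(), one.ar.series()))

# Both sets of sample correlations centre on 0,
# but the auto-correlated series give a much wider spread
sd(cor.white)
sd(cor.ar)
boxplot(list(uncorrelated = cor.white, autocorrelated = cor.ar),
        ylab = "sample correlation between independent series")
```

Large sample correlations between unrelated series are much more common when each series is auto-correlated.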

Plotting over space: maps

Coordinates of the plot match the coordinates in space
For point data: end of story
- Unless there are marks; use shape, color or size
For region data: you need to indicate the value of \(X(r)\)
- Put a point at the coordinate: use shape, color or size
- Show the whole region: use color
  - (or lines for fill-in; more old-fashioned)
  - The jargon name for a map that shades each region to represent a value is a choropleth (Greek choros “region” + plethos “many”) (I tend to mis-spell it as chloropleth…)

Plotting over space: maps: complications

If you have data on a regular grid, especially a rectangular grid, R has many nice tools for making your plots
- E.g., image()
If you have data for irregularly-shaped areas (counties, states, countries, …), then you need:
1. A computer-friendly way of naming each area
  - For the US, there are standardized “FIPS” codes giving numerical identifiers for states and counties
2. A definition for the shape and location of each area, usually a polygon
  - The maps package may offer what you need
3. A good way to fill in each polygon with the right color or texture from the data set
There are very elaborate geographic information system (GIS) software packages for doing stuff like this
For straightforward examples I recommend looking at Healy (2018), ch. 7 (online)
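For the regular-grid case, a minimal sketch with image(), using the built-in `volcano` matrix (heights of Maunga Whau on a 10 m by 10 m grid) as example data. Irregular areas need polygon definitions from a package and are not attempted here.

```r
z <- datasets::volcano        # 87 x 61 matrix of heights on a regular grid
x <- 10 * seq_len(nrow(z))    # grid spacing is 10 m
y <- 10 * seq_len(ncol(z))

# Color shows the value in each grid cell
image(x, y, z, col = terrain.colors(20),
      xlab = "x (m)", ylab = "y (m)", main = "Maunga Whau heights")
contour(x, y, z, add = TRUE)  # optional: overlay contour lines
```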

Mapping earthquakes

Exercise: Modify color or size of the points to indicate magnitude

Plotting over space: maps: projections

The Earth is round
The page (or screen) is flat
No coordinate system for a sphere goes on to a flat page without distortion and discontinuity²
Not a big deal on a small³ scale
On a large scale, you need to pick how to “project” the round Earth on to the flat page
Different projection schemes all have different virtues/drawbacks
- Using longitude and latitude as \(x\) and \(y\) is easy and familiar
- Other projections preserve area
- Yet other projections preserve orientation
  - the “Mercator” projection that makes Greenland look bigger than Africa (unless you flip the north and south poles)
- Nothing preserves distance
If you want to calculate distance between points using latitude and longitude, you’ve got to do some trigonometry
- Many R packages have functions to do the trig for you
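To show what that trigonometry looks like, here is a base-R sketch of the haversine formula for distance along the surface of a sphere. The function name `great.circle.km` is hypothetical, and the radius of 6378 km and the assumption of a perfectly spherical Earth follow the footnote in these notes.

```r
# Great-circle distance between (lon1, lat1) and (lon2, lat2), in degrees,
# via the haversine formula, assuming a perfectly spherical Earth
great.circle.km <- function(lon1, lat1, lon2, lat2, earth.radius = 6378) {
  to.rad <- pi / 180
  phi1 <- lat1 * to.rad
  phi2 <- lat2 * to.rad
  dphi <- phi2 - phi1
  dlambda <- (lon2 - lon1) * to.rad
  h <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin() guards against rounding pushing sqrt(h) slightly above 1
  2 * earth.radius * asin(pmin(1, sqrt(h)))
}

# A quarter of the way around the equator should be r * pi / 2
great.circle.km(0, 0, 90, 0)
6378 * pi / 2
```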

Plotting over space: relationships between variables

You’ll often see side-by-side maps
Usually not a great way to establish a relationship
- Partly because of the multiple-scales issue (as with time series)
- Partly because of auto-correlation issues (as with time series)
- Partly because you’re often just showing they’re both tracking population (or whatever)
Recommendation: make a scatter plot
- And consider regressing-out things like population, or at least normalizing by population
- Again, we’ll come back to how to deal with auto-correlation
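A sketch of the "both are just tracking population" problem. The populations come from the built-in `state.x77` matrix; the two count variables are simulated so that each depends only on population, with independent noise, and the multipliers 2 and 3 are arbitrary.

```r
set.seed(5)
population <- datasets::state.x77[, "Population"]   # 1975 estimates, in thousands

# Two made-up counts that each just track population
count.a <- rpois(length(population), lambda = 2 * population)
count.b <- rpois(length(population), lambda = 3 * population)
cor(count.a, count.b)     # very close to 1, purely because of population

# Normalize by population before comparing
rate.a <- count.a / population
rate.b <- count.b / population
cor(rate.a, rate.b)       # whatever is left is just noise

par(mfrow = c(1, 2))
plot(count.a, count.b, main = "Raw counts")
plot(rate.a, rate.b, main = "Per-capita rates")
```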

Spatial relationships

XKCD no. 1138, “Heatmaps”

Spatio-temporal data

Make a 3D plot, one axis being time
- Often useful for point data
- Or when there’s some sort of “background” you can make invisible
- Not so good for coloring every area over time, usually
Make a movie
- Lots of packages for animations
Make lots of side-by-side maps
- Arrange them in order of time!
Make lots of side-by-side time series plots
- Might try putting each plot in the right spatial location but that can get crowded, especially if areas have different sizes

Other forms of EDA

Everything we knew how to do with non-spatial non-temporal data is still valid: boxplots, histograms, tables, scatterplots showing relations between variables
Often useful to sub-divide into different regions of time or space, and do the EDA for each one
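As an illustration of sub-dividing, a sketch using the built-in `quakes` data (1000 seismic events near Fiji), which is a stand-in and not the data set from these slides. The dividing line at longitude 175 is an arbitrary choice.

```r
quakes <- datasets::quakes

# Sub-divide space into two regions at an arbitrary longitude
region <- ifelse(quakes$long < 175, "west of 175E", "east of 175E")
table(region)

# Conventional EDA, done separately for each region, shown in one plot
boxplot(depth ~ region, data = quakes, ylab = "depth (km)")
depth.median <- tapply(quakes$depth, region, median)
depth.median
```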

Conventional EDA, Sub-Divided

(You can persuade R to show both of these in one plot)

Final thought

Nearby values are usually similar and differences are noisy and accidental
We can get rid of noise by averaging
Plots are often improved by smoothing
We’ll spend next week looking at that

Take-aways

Know your data type (points vs. regions)
Make plots
Plots over time are pretty simple:
- Horizontal coordinate shows time, vertical coordinate shows value
- Some tricks can highlight relative change and/or repeating patterns
- Use a scatter-plot rather than two vertical axes
Plots over space are usually maps:
- 2D plot coordinates show spatial coordinates
- Values are shown by a 3rd axis or by color
- You may need to remember that the Earth is round
- You may need to remember that most of the Earth has few people on it
Don’t forget your old EDAs, but do try sub-dividing the data to look for variation in time or space

After-notes

On the way earthquakes cluster in time and space but are still hard to predict, see Hough (2009)
- So far as I know, the best way to predict earthquakes is still the rule proposed by Luen and Stark (2008), which is basically “they either happen at random, or near a big quake, shortly after a big quake”
- On the 2011 Tohoku earthquake, and the subsequent disaster at the Fukushima nuclear power plant, the Wikipedia article is surprisingly decent
- It’s also worth tracking down Peter Galison’s film Containment
The lynxes: we’ll be revisiting them multiple times over the course, but if you’re impatient, look at Sigmund (1996), ch. 3
On the “Great Migration” of blacks from the South to cities in the North (and West), Wilkerson (2010) is a great popular history
- Some of what I was saying about the geographic roots of the “black belt” across the South, and its continuing influence, comes via Acharya, Blackwell, and Sen (2018)

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2018. Deep Roots: How Slavery Still Shapes Southern Politics. Princeton, New Jersey: Princeton University Press.

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton, New Jersey: Princeton University Press.

Hough, Susan. 2009. Predicting the Unpredictable: The Tumultuous Science of Earthquake Prediction. Princeton, New Jersey: Princeton University Press.

Luen, Brad, and Philip B. Stark. 2008. “Testing Earthquake Predictions.” In Probability and Statistics: Essays in Honor of David A. Freedman, edited by Deborah Nolan and Terry Speed, 302–15. Beachwood, Ohio: Institute of Mathematical Statistics. https://doi.org/10.1214/193940307000000509.

McCleary, John. 2006. A First Course in Topology: Continuity and Dimension. Providence, Rhode Island: American Mathematical Society.

Schutz, Bernard F. 1980. Geometrical Methods of Mathematical Physics. Cambridge, England: Cambridge University Press.

Sigmund, Karl. 1996. Games of Life: Explorations in Ecology, Evolution and Behavior. London: Penguin.

Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, Massachusetts: Harvard University Press.

Wilkerson, Isabel. 2010. The Warmth of Other Suns: The Epic Story of America’s Great Migration. New York: Random House.

¹ To be precise, this is the simplest kind of index number. When measuring the economy, we often want to come up with a sort of summary of multiple pieces of information which will give us an over-all sense of change over time, and these are also called “index numbers”. For instance, if we want to measure inflation, we have to look at the prices of, in principle, all the different sorts of goods and services that people buy, and how the prices of (comparable) goods change over time. But different goods will have prices changing at different rates, so we need to somehow combine that into one number. The “consumer price index” (CPI), which is what people usually mean when they talk about inflation, is a weighted average of the index numbers for a “basket” of goods and services, with the weights reflecting how consumers spent their money at the start of the time period. (“Chained” indices adjust the weights every so often.) Other choices of weights would give different indices: in the US there are actually two CPIs, one for all urban consumers and one for urban wage earners and clerical workers; a “producer price index” geared to what businesses rather than households spend money on; and so on. Sometimes, when econometricians talk about the “index number problem”, they mean the problem of how to pick the weights; sometimes they mean the fact that no set of weights is ideal for all purposes.
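² This is actually a kind of profound mathematical fact. A map of the whole Earth with no discontinuity would be a continuous function \(f\) from the sphere to the page with a continuous inverse \(f^{-1}\). If two sets of points \(A\) and \(B\) are related by such a function, they must have the same dimension, but having the same dimension is not enough: the surface of the sphere is two-dimensional, just like the page, yet every continuous function from the sphere to the plane sends some pair of opposite points to the same place, so no such \(f\) exists. If you want to follow up on this thought, and be convinced that such functions preserve dimension, I strongly recommend McCleary (2006) and Schutz (1980) (which isn’t just for physicists).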
³ “Small” here means “small compared to the curvature of the Earth”. Pretend that the Earth is perfectly spherical. (It isn’t; working out exactly how it departs from being a sphere [“geodesy”] was very important to the development of statistics (Stigler 1986).) Pick two points on the surface of the Earth, \(A\) and \(B\), and now imagine drawing lines from the center of the Earth \(C\) to those points. The angle \(\angle ACB\) has a certain magnitude, call it \(\theta\) in radians. (Remember \(2\pi\) radians = \(360\) degrees.) The distance between \(A\) and \(B\) if we measure along the surface of the Earth (“as the crow flies”) is \(r\theta\), where \(r = \overline{AC} = \overline{BC}\) is the radius of the Earth. But the distance between \(A\) and \(B\) on a straight line, \(\overline{AB}\), which would cut through the surface of the Earth, is \(2 r \sin{(\theta/2)}\). (Find the mid-point between \(A\) and \(B\) on the line between them, say \(D\). Draw the triangles \(\triangle ACD\) and \(\triangle BCD\). These are both right triangles with hypotenuse \(r\), and the angles \(\angle ACD\) and \(\angle BCD\) are both \(\theta/2\), so the sides opposite those angles (\(\overline{AD}\) and \(\overline{BD}\)) have length \(r \sin{(\theta/2)}\).) For small angles \(x\), \(\sin{x} \approx x\), so for small angles, the distance along the surface of the Earth, \(r\theta\), is approximately the same as the straight-line distance, \(\approx 2r \theta/2 = r\theta\), and we don’t really care that the Earth is round. When it starts to matter depends on how much inaccuracy we’re willing to tolerate. If we’re OK with a 1% error in distances, we can go out to a \(\theta\) where \(2\sin{(\theta/2)}/\theta = 0.99\), which numerically is about 0.4906 radians. Since the radius of the Earth is 6378 kilometers, this corresponds to a surface distance of 3129 kilometers. But, as the proverb says, getting 99% of the way across the ocean still means drowning; if we want an error of only \(0.01\)%, then we’re looking at an angle of only about 0.0490 radians, and so only about 312 kilometers along the surface.
