In this demo, we’ll first work with a TidyTuesday dataset on the number of PhD degrees awarded in the US.
# Read in the tidytuesday data
library(tidyverse)
phd_field <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")
##
## ── Column specification ────────────────────────────────────────────────────────────────
## cols(
## broad_field = col_character(),
## major_field = col_character(),
## field = col_character(),
## year = col_double(),
## n_phds = col_double()
## )
phd_field
## # A tibble: 3,370 x 5
## broad_field major_field field year n_phds
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Life sciences Agricultural sciences and na… Agricultural economics 2008 111
## 2 Life sciences Agricultural sciences and na… Agricultural and horticultu… 2008 28
## 3 Life sciences Agricultural sciences and na… Agricultural animal breeding 2008 3
## 4 Life sciences Agricultural sciences and na… Agronomy and crop science 2008 68
## 5 Life sciences Agricultural sciences and na… Animal nutrition 2008 41
## 6 Life sciences Agricultural sciences and na… Animal science, poultry or … 2008 18
## 7 Life sciences Agricultural sciences and na… Animal sciences, other 2008 77
## 8 Life sciences Agricultural sciences and na… Environmental science 2008 182
## 9 Life sciences Agricultural sciences and na… Fishing and fisheries scien… 2008 52
## 10 Life sciences Agricultural sciences and na… Food science 2008 96
## # … with 3,360 more rows
Let’s start by grabbing the rows corresponding to Statistics PhDs. While there are a number of ways to do this, we can grab the fields containing “statistics” (including biostatistics) with the str_detect() function.
stats_phds <- phd_field %>%
filter(str_detect(tolower(field), "statistics"))
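If you prefer base R, the same matching can be done with grepl(), which str_detect() mirrors here. A small sketch on made-up field names, not the full dataset:

```r
# grepl() returns TRUE for elements matching the pattern,
# just like str_detect(tolower(field), "statistics")
fields <- c("Statistics (mathematics)",
            "Biometrics and biostatistics",
            "Food science")
grepl("statistics", tolower(fields))
# TRUE TRUE FALSE
```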
What are the different fields that were captured?
table(stats_phds$field)
##
## Biometrics and biostatistics
## 10
## Educational statistics, research methods
## 10
## Management information systems, business statistics
## 10
## Mathematics and statistics, general
## 10
## Mathematics and statistics, other
## 10
## Statistics (mathematics)
## 10
## Statistics (social sciences)
## 10
To start, let’s just summarize the number of PhDs by year:
stat_phd_year_summary <- stats_phds %>%
group_by(year) %>%
summarize(n_phds = sum(n_phds))
Now, we’ll make the typical scatterplot display with n_phds on the y-axis and year on the x-axis:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
We should fix our x-axis here and make the breaks more informative. In this case, I’ll change it so that each year is labeled (that may not be appropriate for every visual, but it works out here).
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
# Modify the x-axis to make the axis breaks at the unique years and show their
# respective labels
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
To emphasize the ordering of the years along the x-axis, we’ll add a line connecting the points:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
We can drop the points, leaving only the connecting lines to emphasize trends:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
Another common way to display trends is by filling in the area under the line. However, this is only appropriate when the y-axis starts at 0! It’s also a redundant use of ink, so be careful when deciding whether or not to fill the area. We can fill the area under the line with the
geom_area()
geom - but note that it changes the y-axis by default to start at 0:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
# Fill the area under the line
geom_area(fill = "darkblue", alpha = 0.5) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
You can also make this plot using the ggridges package.
We’ll now switch to displaying the different Statistics fields separately with the stats_phds dataset. First, we should NOT display multiple time series with just points, as follows:
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_point() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Adjust the size of the legend's text
legend.text = element_text(size = 5),
legend.title = element_text(size = 6)) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
It’s much simpler to just display the lines to compare the trends:
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Adjust the size of the legend's text
legend.text = element_text(size = 5),
legend.title = element_text(size = 6)) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
The legend is pretty cluttered, though. Instead, we can directly label the displayed lines using the ggrepel package. We first need to create a dataset with just the final values (which in this case corresponds to year == 2017), and then add labels for these values. To make the labels visible, we need to increase our x-axis limits. Note that this is a “hack”, but you will rely on hacks to customize visuals in the future… The following code chunk demonstrates how to do this:
stats_phds_2017 <- stats_phds %>%
filter(year == 2017)
# Access the ggrepel package:
# install.packages("ggrepel")
library(ggrepel)
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
# Add the labels:
geom_text_repel(data = stats_phds_2017,
aes(label = field),
size = 2,
# Drop the segment connection:
segment.color = NA,
# Move labels up or down based on overlap
direction = "y",
# Try to align the labels horizontally on the left hand side
hjust = "left") +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year),
# Update the limits so that there is some padding on the
# x-axis but don't label the new maximum
limits = c(min(stat_phd_year_summary$year),
max(stat_phd_year_summary$year) + 3)) +
theme_bw() +
# Drop the legend
theme(legend.position = "none") +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
Next, let’s switch back to the original dataset phd_field. What happens if we plot a line for every field, attempting to use the color aesthetic to separate them?
phd_field %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "none") +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
## Warning: Removed 270 row(s) containing missing values (geom_path).
The plot above is obviously a disaster… When we are dealing with potentially way too many categories, we can instead highlight lines of interest while setting the background lines to gray, so we can still see the background trends. We need to use the group aesthetic to split the gray lines from each other. Plus, we should adjust the alpha due to the overlap. The following code chunk demonstrates how to do this for highlighting the “Statistics (mathematics)” and “Biometrics and biostatistics” lines. We essentially create separate plot layers by filtering on the field variable:
# First display the background lines using the full dataset with those two fields
# filtered out:
phd_field %>%
# The following line says: NOT (field in c("Biometrics and biostatistics", "Statistics (mathematics)"))
filter(!(field %in% c("Biometrics and biostatistics",
"Statistics (mathematics)"))) %>%
ggplot() +
# Add the background lines - need to specify the group to be the field
geom_line(aes(x = year, y = n_phds, group = field),
color = "gray", size = .5, alpha = .5) +
# Now add the layer with the lines of interest:
geom_line(data = filter(phd_field,
# Note this is just the opposite of the above since ! is removed
field %in% c("Biometrics and biostatistics",
"Statistics (mathematics)")),
aes(x = year, y = n_phds, color = field),
# Make the size larger
size = .75, alpha = 1) +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Drop the panel lines making the gray difficult to see
panel.grid = element_blank()) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
## Warning: Removed 270 row(s) containing missing values (geom_path).
Another way to visualize time series data is to display it in a cyclical pattern, using polar coordinates, as done in Florence Nightingale’s famous rose diagram. We can recreate the rose diagram by accessing the data in the HistData package. We’ll first load the data and print out its first few rows below:
library(HistData)
## Warning: package 'HistData' was built under R version 4.0.2
head(Nightingale)
## Date Month Year Army Disease Wounds Other Disease.rate Wounds.rate Other.rate
## 1 1854-04-01 Apr 1854 8571 1 0 5 1.4 0.0 7.0
## 2 1854-05-01 May 1854 23333 12 0 9 6.2 0.0 4.6
## 3 1854-06-01 Jun 1854 28333 11 0 6 4.7 0.0 2.5
## 4 1854-07-01 Jul 1854 28722 359 0 23 150.0 0.0 9.6
## 5 1854-08-01 Aug 1854 30246 828 1 30 328.5 0.4 11.9
## 6 1854-09-01 Sep 1854 30290 788 81 70 312.2 32.1 27.7
To recreate the plot, we’ll need to first make a longer version of the dataset with the Disease, Wounds, and Other columns separated into three rows. To do that, we’ll use the pivot_longer() function after selecting just the columns of interest for our plot:
crimean_war_data <- Nightingale %>%
dplyr::select(Date, Month, Year, Disease, Wounds, Other) %>%
# Now pivot those columns to take up separate rows:
pivot_longer(Disease:Other,
names_to = "cause", values_to = "count")
Next, we’ll make a label column matching Nightingale’s plot based on the Date column. We can condition on being above or below certain dates in a natural way:
crimean_war_data <- crimean_war_data %>%
mutate(time_period = ifelse(Date <= as.Date("1855-03-01"),
"April 1854 to March 1855",
"April 1855 to March 1856"))
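The ifelse() condition works because Date objects compare chronologically. A small base-R sketch with hand-picked dates (not the full dataset):

```r
# Dates on or before the cutoff fall in the first time period
dates <- as.Date(c("1854-04-01", "1855-03-01", "1855-04-01"))
dates <= as.Date("1855-03-01")
# TRUE TRUE FALSE
```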
And finally we can go ahead and display the rose diagram faceted by the time period (using similar colors to Nightingale’s):
crimean_war_data %>%
ggplot(aes(x = Month, y = count)) +
geom_col(aes(fill = cause), width = 1,
position = "identity", alpha = 0.5) +
coord_polar() +
facet_wrap(~ time_period, ncol = 2) +
scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_void() +
# All of this below is to just customize the theme in a way that we are
# close to resembling the original plot (ie lets make it look old!)
theme(axis.text.x = element_text(size = 9),
strip.text = element_text(size = 11),
legend.position = "bottom",
plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
plot.margin = unit(c(10, 10, 10, 10), "pt"),
plot.title = element_text(vjust = 5)) +
labs(title = "Diagram of the Causes of Mortality in the Army in the East")
This looks pretty close to the original diagram, except the order of the months does not match the original. We can of course change that by reordering the factor variable:
crimean_war_data %>%
# Manually relevel it to match the original plot
mutate(Month = fct_relevel(Month,
"Jul", "Aug", "Sep", "Oct", "Nov",
"Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
ggplot(aes(x = Month, y = count)) +
geom_col(aes(fill = cause), width = 1,
position = "identity", alpha = 0.5) +
coord_polar() +
facet_wrap(~ time_period, ncol = 2) +
scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_void() +
# All of this below is to just customize the theme in a way that we are
# close to resembling the original plot (ie lets make it look old!)
theme(axis.text.x = element_text(size = 9),
strip.text = element_text(size = 11),
legend.position = "bottom",
plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
plot.margin = unit(c(10, 10, 10, 10), "pt"),
plot.title = element_text(vjust = 5)) +
labs(title = "Diagram of the Causes of Mortality in the Army in the East")
How does this compare to just a simple line graph?
crimean_war_data %>%
ggplot(aes(x = Date, y = count, color = cause)) +
geom_line() +
# Add a reference line at the cutoff point
geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
color = "gray") +
scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Diagram of the Causes of Mortality in the Army in the East",
y = "sqrt(counts)", x = "Date")
We can customize the x-axis further using scale_x_date():
crimean_war_data %>%
ggplot(aes(x = Date, y = count, color = cause)) +
geom_line() +
# Add a reference line at the cutoff point
geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
color = "gray") +
scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
# Format to use abbreviate month %b with year %Y
scale_x_date(date_labels = "%b %Y") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Diagram of the Causes of Mortality in the Army in the East",
y = "sqrt(counts)", x = "Date")
Which one do you prefer? Maybe filling the area under the lines would be better here…
A time series measures a single variable over many points in time. Time intervals may be regularly or irregularly spaced, but we’ll only consider regularly spaced data for now. We’ll also focus on visualizing a single variable over time, since even that leaves plenty of choices and information to work with.
For this demo, we’re going to work with a dataset that’s actually already loaded when you start R: it’s defined under co2, and known as the “Mauna Loa Atmospheric CO2 Concentration” dataset. This dataset contains 468 monthly measurements of CO2 concentration from 1959 to 1997.
When you start R, if you type in co2, this is what you’ll see:
co2
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18 314.66
## 1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68 314.84
## 1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16 315.94
## 1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27 316.53
## 1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83 316.91
## 1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71 317.53
## 1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14 318.70
## 1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94 319.63
## 1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24 320.56
## 1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09 321.16
## 1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62 322.69
## 1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90 323.85
## 1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20 323.40 324.63
## 1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68 325.04 326.34
## 1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35 327.02 327.99
## 1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27 327.21 328.29
## 1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40 328.17 329.32
## 1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14 328.78 330.14
## 1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42 330.98 332.24
## 1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60 332.38 333.75
## 1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75 333.70 335.12
## 1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72 335.84 336.93
## 1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52 336.68 338.19
## 1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81 337.69 339.09
## 1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69 339.82 340.98
## 1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90 341.18 342.80
## 1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92 342.62 344.06
## 1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68 343.99 345.48
## 1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27 346.18 347.64
## 1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55 348.72 349.91
## 1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64 349.83 351.14
## 1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82 351.04 352.69
## 1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05 352.11 353.64
## 1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94 353.23 354.09
## 1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67 353.95 355.30
## 1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84 356.00 357.59
## 1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11 357.80 359.61
## 1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51 359.65 360.80
## 1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24 360.83 362.49
## Dec
## 1959 315.43
## 1960 316.03
## 1961 316.85
## 1962 317.53
## 1963 318.20
## 1964 318.55
## 1965 319.25
## 1966 320.87
## 1967 321.80
## 1968 322.74
## 1969 323.95
## 1970 324.96
## 1971 325.85
## 1972 327.39
## 1973 328.48
## 1974 329.41
## 1975 330.59
## 1976 331.52
## 1977 333.68
## 1978 334.78
## 1979 336.56
## 1980 338.04
## 1981 339.44
## 1982 340.32
## 1983 342.82
## 1984 344.04
## 1985 345.38
## 1986 346.72
## 1987 348.78
## 1988 351.18
## 1989 352.37
## 1990 354.07
## 1991 354.89
## 1992 355.33
## 1993 356.78
## 1994 359.05
## 1995 360.74
## 1996 362.38
## 1997 364.34
These numbers are expressed in parts per million (ppm). If you’ve taken a chemistry class (which I haven’t), I’m sure you know more about this than I do, but ppm is a very common measure for pollutants and contaminants.
The co2 object is actually defined with a class we haven’t seen yet, ts (time series):
class(co2)
## [1] "ts"
As a result, it contains extra attributes about the times associated with each value, and base R graphics can plot this automatically (we don’t need to specify the time range manually):
plot(co2)
This is typically called a line plot. Here’s how you would plot the same data in the gg style (note that you need the ggfortify package to do this), without constructing the typical long-table format we’re used to:
library(ggfortify)
autoplot(co2)
In time series, we are interested in checking for trends (does the variable tend to increase or decrease over time?), seasonality (are there tendencies that regularly occur? If so, at what intervals do they occur?), general variability (i.e., variation beyond average trends and seasonality), and outliers (unusual spikes or valleys). Which of these do we see in the co2 data?
In order to show you more general-purpose plotting methods, we’ll treat co2 as numeric instead of ts from now on. This is actually the form that many time series data take (they’re usually not in the ts format), so this will also show you some of the formatting/structuring issues you’ll have to work with in the wild.
In the code below, we create a dataset with 1:468 as the obs_i variable (the x-axis variable) and the CO2 concentration as the co2_val variable (the y-axis variable). Then, we use our usual ggplot approach to plot the data:
library(tidyverse)
co2_tbl <- tibble(co2_val = as.numeric(co2)) %>%
mutate(obs_i = 1:n())
co2_tbl %>%
ggplot(aes(x = obs_i, y = co2_val)) +
geom_line() +
labs(x = "Time index", y = "CO2 (ppm)")
Note the x-axis: this will not be helpful/informative to readers. It’s a common mistake for people to just put something like “index” on the x-axis without even telling you what they are indexing. Be sure not to make this mistake! To properly structure time series data and label it correctly, you’ll need to know how to work with Date type objects in R (which we will discuss next…).
Date data in R
The following code creates a new column in our dataset containing the monthly dates 1/1/1959, 2/1/1959, … , 12/1/1997, using the as.Date() function (given the description of the time range in help(co2)):
co2_tbl <- co2_tbl %>%
# We can use the seq() function with dates which is pretty useful!
mutate(obs_date = seq(as.Date("1959/1/1"), by = "month",
length.out = n()))
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line() +
labs(x = "Year", y = "CO2 (ppm)")
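To see the date sequence trick in isolation, here is a minimal base-R sketch (independent of the co2 data):

```r
# seq() understands Date objects, so we can step by month
seq(as.Date("1959/1/1"), by = "month", length.out = 3)
# "1959-01-01" "1959-02-01" "1959-03-01"
```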
Unfortunately, the default format for dates in as.Date() is Year/Month/Day. If you prefer another format, such as the common Month/Day/Year (as I used in the previous paragraph), you need to include the format argument within as.Date(), as follows (note that the Y is capitalized - yes, as.Date() is that picky):
co2_tbl <- co2_tbl %>%
mutate(obs_date = seq(as.Date("1/1/1959", format = "%m/%d/%Y"),
by = "month",
length.out = n()))
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line() +
labs(x = "Year", y = "CO2 (ppm)")
Note that all we needed to do was convert the obs_date variable to the Date class, and then we could use ggplot as is. This is because ggplot knows to use a special date scale for the x-axis when x has class Date.
As a result, we can easily play with the breaks on the date axis using scale_x_date(). For example:
- Set the axis breaks with date_breaks.
- Set the labels with date_labels. (See the Details section of ?strftime for the formatting options. Here, we choose abbreviated month %b and full year %Y.)
co2_tbl[1:26,] %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line() +
scale_x_date(date_breaks = "4 months", date_labels = "%b %Y") +
labs(x = "Year", y = "CO2 (ppm)") +
# Modify the x-axis text
theme(axis.text.x = element_text(angle = 45, hjust = 1))
One of the things we look for in time series data are trends. But measurements can be quite noisy, making it hard to see the underlying trend. For example: Most investors probably shouldn’t react by selling or buying every time the stock market jumps up or down a little bit, but it might make sense to respond to a longer-term underlying trend.
Moving averages are a great way to get an idea of underlying trends. The ggseas package has a function called stat_rollapplyr() for computing moving averages. Below is an example of computing (and displaying) a moving average using this function:
library(ggseas)
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color="red") +
stat_rollapplyr(width = 12, align = "right") +
labs(x = "Year", y = "CO2 (ppm)",
title = "Width = 12")
Let’s look at the stat_rollapplyr() line more closely. Setting width = 12 and align = "right" says, “For each point, plot the average of that point and the 11 points behind it.” Since this is monthly data, this is the average of a year’s worth of data. Given this, think about why there is no trend line for the first 11 points on the left-hand side of the graph.
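To make the right-aligned average concrete, here is a base-R sketch using stats::filter() (not the ggseas internals) on a toy series with width 3, so the first 2 values are undefined:

```r
x <- c(1, 2, 3, 4, 5, 6)
# Right-aligned moving average of width 3: each value is the mean of
# itself and the 2 points behind it, so the first 2 entries are NA
ma <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))
ma
# NA NA 2 3 4 5
```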
To get a better idea of what happens when we change the width argument, below is an example of setting width = 2 and width = 24.
library(patchwork)
wid2 <- co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color="red") +
stat_rollapplyr(width = 2, align = "right") +
labs(x = "Year", y = "CO2 (ppm)",
title = "Width = 2")
wid24 <- co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color="red") +
stat_rollapplyr(width = 24, align = "right") +
labs(x = "Year", y = "CO2 (ppm)",
title = "Width = 24")
wid2 + wid24
Setting width to be smaller gives us a line that fits the data much better, but it’s also much less smooth. (In fact, setting width = 1 is literally the data itself.) Meanwhile, setting width to be larger gives us a much smoother fit, but it doesn’t fit the data as precisely. Ultimately, choosing how much smoothness you want for your visual is very much an art: you want the trend to be smooth enough that you get an idea of the overall trend, but you also don’t want to deviate from your data too much. Some choices are better than others: for example, below is a trend line whose width is obviously too large (obviously, because the trend line doesn’t fit the data well at all).
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color="red") +
stat_rollapplyr(width = 100, align = "right") +
labs(x = "Year", y = "CO2 (ppm)",
title = "Width = 100")
## Warning: Removed 99 row(s) containing missing values (geom_path).
One last thing to note is that moving averages are extremely similar to loess smoothing. Moving averages compute the average outcome within a sliding window; meanwhile, loess computes a weighted linear regression within a sliding window. Indeed, loess can be used to understand the overall trend of a time series as well:
moving_ave_plot <- co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
labs(x = "Year", y = "CO2 (ppm)",
title = "Width = 12")
loess_plot <- co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
geom_smooth(method = "loess") +
labs(x = "Year", y = "CO2 (ppm)", title = "Loess")
moving_ave_plot + loess_plot
## Warning: Removed 11 row(s) containing missing values (geom_path).
## `geom_smooth()` using formula 'y ~ x'
Moving averages are, in fact, literally loess, but with the formula y ~ 1 instead of y ~ x (i.e., an intercept-only model) and using a rectangular kernel (which is not the default in loess() in R).
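The intercept-only claim is easy to check on a single window: the least-squares fit of y ~ 1 is exactly the window mean (a base-R sketch with made-up values):

```r
# Fitting an intercept-only model to a window returns its mean
window_vals <- c(315.42, 316.31, 316.50)
unname(coef(lm(window_vals ~ 1)))  # identical to mean(window_vals)
mean(window_vals)
```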
If a specific event in time has a notable effect on your time series data, it is common to mark that event with a vertical line, or with rectangular shading if that event occurs over a period of time. For example, it’s common for financial data to denote recessions with gray rectangular shading (this is often called “recession shading”), because economic recessions have notable impacts on financial data.
To add a vertical line to your time series, you just use geom_vline(). For example, here is a plot of the co2_tbl data with a vertical line for the beginning of 1980:
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
geom_vline(aes(xintercept = as.Date("1980-01-01"))) +
labs(x = "Year", y = "CO2 (ppm)",
title = "CO2 Emissions Over Time")
Note that the xintercept has to have a very particular format: it has to be a date in the same dating format as your data. For example, here is the exact same code, but with the more common Month-Day-Year format for the date; the line will not display (presumably because it thinks I’m trying to add a vertical line for Year 1):
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
labs(x = "Year", y = "CO2 (ppm)",
title = "CO2 Emissions Over Time")
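To see exactly what went wrong, we can inspect how as.Date() parses that Month-Day-Year string: base R tries the standard %Y-%m-%d format first, so the leading "01" is read as the year. A quick base-R check:

```r
# Without a format argument, "01-01-1980" is parsed as %Y-%m-%d,
# so the year comes out as 1, not 1980
bad_date <- as.Date("01-01-1980")
as.numeric(format(bad_date, "%Y"))
# 1
good_date <- as.Date("01-01-1980", format = "%m-%d-%Y")
format(good_date)
# "1980-01-01"
```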
Furthermore, note that you cannot just put an integer (e.g., the year) into geom_vline() when plotting a time series in this way. For example, here is what happens when I try to just input the number 1980 (thinking I’ll draw a line at the year 1980):
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
geom_vline(aes(xintercept = 1980)) +
labs(x = "Year", y = "CO2 (ppm)",
title = "CO2 Emissions Over Time")
Meanwhile, to add rectangular shading to a time series plot, you use geom_rect(), which we haven’t seen before. However, it’s very intuitive: within aes(), you just need to specify an xmin, xmax, ymin, and ymax. For example, below is our time series with gray shading between 1980 and 1990:
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_rect(aes(xmin = as.Date("1980-01-01"), xmax = as.Date("1989-12-31"),
ymin = -Inf, ymax = Inf),
fill = "grey", alpha = 0.5) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
labs(x = "Year", y = "CO2 (ppm)",
title = "CO2 Emissions Over Time")
There are just a few important things to keep in mind when using rectangular shading on time series:
- Make sure xmin and xmax are dates, written in the same dating format that’s used in your data.
- To shade the full vertical extent of the plot, set ymin = -Inf and ymax = Inf (as above). Note that you need to specify xmin, xmax, ymin, and ymax within geom_rect(); otherwise you will get an error.
- Place geom_rect() before you draw any lines on your plot. For example, I put geom_rect() before stat_rollapplyr() above. Otherwise, the rectangle will cover the line of your plot:
co2_tbl %>%
ggplot(aes(x = obs_date, y = co2_val)) +
geom_line(color = "red") +
stat_rollapplyr(width = 12, align = "right") +
geom_rect(aes(xmin = as.Date("1980-01-01"), xmax = as.Date("1989-12-31"),
ymin = -Inf, ymax = Inf),
fill = "grey", alpha = 0.5) +
geom_vline(aes(xintercept = as.Date("01-01-1980"))) +
labs(x = "Year", y = "CO2 (ppm)",
title = "CO2 Emissions Over Time")
Now we will walk through how to compute autocorrelations using lags.
Imagine computing the autocorrelation at each lag (1, 2, 3, …). For example, the autocorrelation at lag 1 is the correlation between the time series itself and the time series shifted back one timepoint. Computing the autocorrelation at each lag will thus give us a vector of autocorrelations, commonly called the autocorrelation function (ACF).
You can think of the ACF as taking in a lag \(\ell\) and spitting out a correlation between -1 and 1. The higher the correlation, the more dependent the time series measurements are on each other; and the larger the \(\ell\) with high correlation, the more this dependence “lasts” over time.
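To make the lag-1 case concrete, here is a base-R sketch computing it by hand on the built-in co2 series (a manual approximation; dedicated ACF routines use a slightly different normalization, so values agree only approximately):

```r
# Lag-1 autocorrelation by hand: correlate the series with itself
# shifted back one timepoint
x <- as.numeric(co2)
n <- length(x)
cor(x[-1], x[-n])  # x[t] paired with x[t-1]; close to 1 for co2
```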
Base R has the acf() function, which can be used to compute and plot the ACF for a large number of lags.
acf(co2_tbl$co2_val)
And the ggplot version is:
auto_corr <- acf(co2_tbl$co2_val, plot = FALSE)
autoplot(auto_corr)
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## Please use `mutate()` instead.
## See vignette('programming') for more help
By default, these plots display 95% confidence intervals, where autocorrelations outside this confidence interval are deemed “unusually large.” These intervals depend on the data itself (for example, in another application, the intervals may not be \(\pm 0.1\)), and the way these intervals are computed is outside the scope of this demo (but it is easily Googleable for those interested; see the end of this document, for example).
Because the autocorrelations are way above the confidence interval, we would say that this visual clearly shows that the co2 data is highly autocorrelated, due to the strong global trend. What if we remove the trend and then look at the ACF of the residuals? We can compute moving averages using rollapply() (in the zoo package, loaded when you load ggseas).
# Use rollapply to compute a Moving Average, then compute residuals (co2_val - mov_ave)
co2_tbl <- co2_tbl %>%
mutate(mov_ave =
zoo::rollapply(co2_val, width = 12, FUN = "mean",
align = "right", fill = NA),
res = co2_val - mov_ave)
#Just to understand what this is doing, let's look
#at the first few rows of the data:
co2_tbl[1:13,]
## # A tibble: 13 x 5
## co2_val obs_i obs_date mov_ave res
## <dbl> <int> <date> <dbl> <dbl>
## 1 315. 1 1959-01-01 NA NA
## 2 316. 2 1959-02-01 NA NA
## 3 316. 3 1959-03-01 NA NA
## 4 318. 4 1959-04-01 NA NA
## 5 318. 5 1959-05-01 NA NA
## 6 318 6 1959-06-01 NA NA
## 7 316. 7 1959-07-01 NA NA
## 8 315. 8 1959-08-01 NA NA
## 9 314. 9 1959-09-01 NA NA
## 10 313. 10 1959-10-01 NA NA
## 11 315. 11 1959-11-01 NA NA
## 12 315. 12 1959-12-01 316. -0.396
## 13 316. 13 1960-01-01 316. 0.373
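To confirm the arithmetic, we can recompute the first non-NA entry by hand. This is a sketch using the built-in co2 series (which co2_tbl appears to be built from): with align = "right" and width = 12, entry 12 of the rolling mean should simply be the mean of observations 1 through 12, and entries 1 through 11 should be NA.

```r
library(zoo)  # provides rollapply (also loaded when you load ggseas)

x <- as.numeric(co2)  # built-in monthly CO2 series
mov_ave <- rollapply(x, width = 12, FUN = mean, align = "right", fill = NA)

# entry 12 is the mean of the first 12 observations; 1:11 are NA
all.equal(mov_ave[12], mean(x[1:12]))
all(is.na(mov_ave[1:11]))
```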
Just to show that this is indeed the moving average:
co2_tbl %>%
ggplot(aes(x = obs_date)) +
geom_line(aes(y = mov_ave)) +
geom_line(aes(y = co2_val), color = "red") +
labs(x = "Year", y = "CO2 (ppm)")
## Warning: Removed 11 row(s) containing missing values (geom_path).
Now let’s look at a plot of the residuals (i.e., the observed outcomes minus the moving average), as well as an autocorrelation plot of the residuals:
# Plot the residuals
co2_tbl %>%
ggplot(aes(x = obs_date, y = res)) +
geom_line() +
labs(x = "Year", y = "Residuals of CO2 (ppm)")
## Warning: Removed 11 row(s) containing missing values (geom_path).
# Compute ACF of the residuals
# (except the first 11 cases, which are NA)
autoplot(acf(tail(co2_tbl$res, -11), plot = FALSE))
Now what pattern do we see in the ACF plot? Which lags have the most positive and most negative values? What does this mean for monthly data?
See this link for tips on mimicking the acf()
plot in ggplot
. For example, you can save the output of acf()
and then put it inside a data frame:
co2_acf <- acf(tail(co2_tbl$res, -11), plot = FALSE)
co2_acf_df <- with(co2_acf, data.frame(lag, acf))
#the resulting data.frame looks like this:
head(co2_acf_df)
## lag acf
## 1 0 1.00000000
## 2 1 0.83287000
## 3 2 0.42898732
## 4 3 -0.05635883
## 5 4 -0.47308749
## 6 5 -0.73412921
#Then, we can make a simple "bar" plot:
#(bar is in quotes because it's not displaying a categorical variable)
ggplot(co2_acf_df, aes(x = lag, y = acf)) + geom_bar(stat="identity")
#Alternatively, you could use:
# ggplot(co2_acf_df, aes(x = lag, y = acf)) + geom_col()
#which does the same thing.
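One thing the bar plot above is missing compared to acf() is the pair of dashed confidence bounds. A sketch of adding them, using the built-in co2 series for self-containment: acf() draws its default 95% bounds at \(\pm z_{0.975}/\sqrt{n}\), where \(n\) is the number of observations used (stored in the n.used element of the acf object).

```r
library(ggplot2)

# save the ACF and pull it into a data frame, as above
co2_acf <- acf(as.numeric(co2), plot = FALSE)
co2_acf_df <- with(co2_acf, data.frame(lag, acf))

# the same ~95% bounds that acf() draws by default
ci <- qnorm(0.975) / sqrt(co2_acf$n.used)

ggplot(co2_acf_df, aes(x = lag, y = acf)) +
  geom_col() +
  geom_hline(yintercept = c(-ci, ci), linetype = "dashed", color = "blue")
```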
There are three main characteristics of time series:
Trends: The average increase or decrease in a variable over time.
Seasonality: Changes in the variable that recur at regular intervals (e.g., every winter, every hour)
Noise: Variation in the variable beyond average trends and seasonality.
The idea above shows us that we can remove global trends from the dataset and then look for seasonal trends. The base R
function stl()
does this explicitly using loess smoothers: “Seasonal decomposition of Time series by Loess”. It actually also separates the seasonal trend from the irregular / residual noise.
In ggseas
, we can mimic this by using ggsdc()
instead of starting with the ggplot()
function. (The argument s.window
is a kind of loess smoother span for the seasonal extraction.)
co2_tbl %>%
ggsdc(aes(obs_date, co2_val),
frequency = 12, method = "stl", s.window = 12) +
geom_line() +
labs(x = "Year", y = "CO2 (ppm)")
The sum of “trend” + “seasonal” + “irregular” gives us back the original data, shown in “observed” (the top facet).
Does it look like stl
found all interesting global or periodic trends, or is there any signal left in the “irregular” component?
Here is the intuition for how the stl()
function works:
The “trend” component is obtained by fitting a loess curve to the time series. The smoothness of this loess is controlled by frequency
.
The “seasonal” component is obtained by fitting another loess curve on the residuals (observed - trend). The smoothness of this loess is controlled by s.window
.
The “irregular” component is obtained as the remaining residuals (observed - trend - seasonal).
Technically, this is not 100% correct, but it is very close to what actually happens within the stl()
function; see help(stl)
for details.
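The three-step intuition above can be sketched by hand. This is NOT how stl() is actually implemented (stl() iterates loess fits on the cycle-subseries, and span = 0.3 here is an arbitrary choice), but it shows how the pieces add back up to the observed data:

```r
# A rough hand-rolled decomposition of the built-in monthly CO2 series
x   <- as.numeric(co2)
idx <- seq_along(x)

# "trend": a loess curve fit to the series over time
trend     <- predict(loess(x ~ idx, span = 0.3))
detrended <- x - trend

# crude "seasonal": the average detrended value for each month
month    <- rep(1:12, length.out = length(x))
seasonal <- ave(detrended, month)

# "irregular": whatever is left over
irregular <- x - trend - seasonal

# by construction, the three components recover the observed data
all.equal(x, trend + seasonal + irregular)
```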