Merging data frames

Suppose you have two data frames X, Y, and you want to combine them

Simplest case: the data frames have exactly the same number of rows, that the rows represent exactly the same units, and you want all columns from both; just use, data.frame(X,Y)
Next best case: you know that the two data frames have the same rows, but you only want certain columns from each; just use, e.g., data.frame(X$col1,X$col5,Y$col3)
Next best case: same number of rows but in different order; put one of them in same order as the other, with order(). Alternatively, use merge()
Worse cases: different numbers of rows … hard to line up rows …

Example: big city drivers

People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?

Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:

fha = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/fha.csv",
               na.strings="NA", 
               colClasses=c("character","double","double","double"))
nrow(fha)

## [1] 498

head(fha)

##                                   City Population Miles.of.Road
## 1         New York--Newark, NY--NJ--CT   18351295         43893
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996         24877
## 3                      Chicago, IL--IN    8608208         25905
## 4                            Miami, FL    5502379         15641
## 5         Philadelphia, PA--NJ--DE--MD    5441567         19867
## 6    Dallas--Fort Worth--Arlington, TX    5121892         21610
##   Daily.Miles.Traveled
## 1               286101
## 2               270807
## 3               172708
## 4               125899
## 5                99190
## 6               125389

(Continued)

Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:

ua = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/ua.dat",
              sep=";")
nrow(ua)

## [1] 3598

head(ua)

##   UACE           NAME   POP    HU AREALAND AREALANDSQMI AREAWATER
## 1   37  Abbeville, LA 19824  8460 29222871        11.28    300497
## 2   64  Abbeville, SC  5243  2578 11315197         4.37     19786
## 3   91 Abbotsford, WI  3966  1616  5363441         2.07     13221
## 4  118   Aberdeen, MS  4666  2050  7416616         2.86     52732
## 5  145   Aberdeen, SD 25977 12114 33002447        12.74    247597
## 6  172   Aberdeen, WA 29856 13139 39997951        15.44   1929689
##   AREAWATERSQMI POPDEN LSADC
## 1          0.12 1757.0    76
## 2          0.01 1200.1    76
## 3          0.01 1915.2    76
## 4          0.02 1629.4    76
## 5          0.10 2038.6    76
## 6          0.75 1933.3    76

Difficulties in navigating the merge

$\approx 500$ cities versus $\approx 4000$ “urbanized areas”
fha orders cities by population, ua is alphabetical by name
Both have place names, but those don’t always agree
Not even common names for the shared columns

Lesson: find the unique identifier

But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:

length(unique(fha$Population)) == nrow(fha)

## [1] TRUE

ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))

## [1] 0

First way to merge

Reorder area column in second table by population, append to first table

ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)

## [1] 498

head(df1)

##                                   City Population Roads Mileage    Area
## 1         New York--Newark, NY--NJ--CT   18351295 43893  286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996 24877  270807 1736.02
## 3                      Chicago, IL--IN    8608208 25905  172708 2442.75
## 4                            Miami, FL    5502379 15641  125899 1238.61
## 5         Philadelphia, PA--NJ--DE--MD    5441567 19867   99190 1981.37
## 6    Dallas--Fort Worth--Arlington, TX    5121892 21610  125389 1779.13

Second way to merge

Use the merge() function

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)

## [1] 498

tail(df2, 3)

##     Population                                 City Miles.of.Road
## 496    8608208                      Chicago, IL--IN         25905
## 497   12150996 Los Angeles--Long Beach--Anaheim, CA         24877
## 498   18351295         New York--Newark, NY--NJ--CT         43893
##     Daily.Miles.Traveled  UACE                                 NAME
## 496               172708 16264                      Chicago, IL--IN
## 497               270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498               286101 63217         New York--Newark, NY--NJ--CT
##          HU   AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332      2442.75 105649916         40.79 3524.0    75
## 497 4217448 4496266014      1736.02  61141327         23.61 6999.3    75
## 498 7263095 8935981360      3450.20 533176599        205.86 5318.9    75

`merge()`

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y, by matching the columns “SomeXCol” and “SomeYCol”

Default (no by.x, by.y specified) is to match all columns with common names
Output will be a new data frame that has all the columns of both data frames
If you know databases, then merge() is doing a JOIN (if you don’t know what that means, you will by the end of the semester!)

Using `order()` and manual tricks versus `merge()`

Reordering is easier to grasp; merge() takes some learning
Reordering is simplest when there’s only one column to merge on; merge() handles many columns
Reorderng is simplest when the data frames are the same size; merge() handles different sizes automatically

So, do bigger cities mean more driving?

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="Miles driven per person per day",
     ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")

Merging Data