Statistical Computing, 36-350
Friday October 28, 2016
Suppose you have two data frames X
, Y
, and you want to combine them
data.frame(X,Y)
data.frame(X$col1,X$col5,Y$col3)
order()
. Alternatively, use merge()
People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?
Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:
fha = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/fha.csv",
na.strings="NA",
colClasses=c("character","double","double","double"))
nrow(fha)
## [1] 498
head(fha)
## City Population Miles.of.Road
## 1 New York--Newark, NY--NJ--CT 18351295 43893
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877
## 3 Chicago, IL--IN 8608208 25905
## 4 Miami, FL 5502379 15641
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610
## Daily.Miles.Traveled
## 1 286101
## 2 270807
## 3 172708
## 4 125899
## 5 99190
## 6 125389
Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:
ua = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/ua.dat",
sep=";")
nrow(ua)
## [1] 3598
head(ua)
## UACE NAME POP HU AREALAND AREALANDSQMI AREAWATER
## 1 37 Abbeville, LA 19824 8460 29222871 11.28 300497
## 2 64 Abbeville, SC 5243 2578 11315197 4.37 19786
## 3 91 Abbotsford, WI 3966 1616 5363441 2.07 13221
## 4 118 Aberdeen, MS 4666 2050 7416616 2.86 52732
## 5 145 Aberdeen, SD 25977 12114 33002447 12.74 247597
## 6 172 Aberdeen, WA 29856 13139 39997951 15.44 1929689
## AREAWATERSQMI POPDEN LSADC
## 1 0.12 1757.0 76
## 2 0.01 1200.1 76
## 3 0.01 1915.2 76
## 4 0.02 1629.4 76
## 5 0.10 2038.6 76
## 6 0.75 1933.3 76
fha
orders cities by population, ua
is alphabetical by nameBut both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:
length(unique(fha$Population)) == nrow(fha)
## [1] TRUE
ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))
## [1] 0
Reorder area column in second table by population, append to first table
ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)
## [1] 498
head(df1)
## City Population Roads Mileage Area
## 1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736.02
## 3 Chicago, IL--IN 8608208 25905 172708 2442.75
## 4 Miami, FL 5502379 15641 125899 1238.61
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981.37
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779.13
Use the merge()
function
df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)
## [1] 498
tail(df2, 3)
## Population City Miles.of.Road
## 496 8608208 Chicago, IL--IN 25905
## 497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
## 498 18351295 New York--Newark, NY--NJ--CT 43893
## Daily.Miles.Traveled UACE NAME
## 496 172708 16264 Chicago, IL--IN
## 497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498 286101 63217 New York--Newark, NY--NJ--CT
## HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332 2442.75 105649916 40.79 3524.0 75
## 497 4217448 4496266014 1736.02 61141327 23.61 6999.3 75
## 498 7263095 8935981360 3450.20 533176599 205.86 5318.9 75
merge()
The merge()
function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol")
, to join two data frames x
, y
, by matching the columns “SomeXCol” and “SomeYCol”
by.x
, by.y
specified) is to match all columns with common namesmerge()
is doing a JOIN (if you don’t know what that means, you will by the end of the semester!)order()
and manual tricks versus merge()
merge()
takes some learningmerge()
handles many columnsmerge()
handles different sizes automatically# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
xlab="Miles driven per person per day",
ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")