Statistical Computing, 36-350
Friday October 28, 2016
Suppose you have two data frames X, Y, and you want to combine them
data.frame(X,Y)data.frame(X$col1,X$col5,Y$col3)order(). Alternatively, use merge()People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?
Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:
fha = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/fha.csv",
               na.strings="NA", 
               colClasses=c("character","double","double","double"))
nrow(fha)## [1] 498head(fha)##                                   City Population Miles.of.Road
## 1         New York--Newark, NY--NJ--CT   18351295         43893
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996         24877
## 3                      Chicago, IL--IN    8608208         25905
## 4                            Miami, FL    5502379         15641
## 5         Philadelphia, PA--NJ--DE--MD    5441567         19867
## 6    Dallas--Fort Worth--Arlington, TX    5121892         21610
##   Daily.Miles.Traveled
## 1               286101
## 2               270807
## 3               172708
## 4               125899
## 5                99190
## 6               125389Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:
ua = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/ua.dat",
              sep=";")
nrow(ua)## [1] 3598head(ua)##   UACE           NAME   POP    HU AREALAND AREALANDSQMI AREAWATER
## 1   37  Abbeville, LA 19824  8460 29222871        11.28    300497
## 2   64  Abbeville, SC  5243  2578 11315197         4.37     19786
## 3   91 Abbotsford, WI  3966  1616  5363441         2.07     13221
## 4  118   Aberdeen, MS  4666  2050  7416616         2.86     52732
## 5  145   Aberdeen, SD 25977 12114 33002447        12.74    247597
## 6  172   Aberdeen, WA 29856 13139 39997951        15.44   1929689
##   AREAWATERSQMI POPDEN LSADC
## 1          0.12 1757.0    76
## 2          0.01 1200.1    76
## 3          0.01 1915.2    76
## 4          0.02 1629.4    76
## 5          0.10 2038.6    76
## 6          0.75 1933.3    76fha orders cities by population, ua is alphabetical by nameBut both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:
length(unique(fha$Population)) == nrow(fha)## [1] TRUEua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))## [1] 0Reorder area column in second table by population, append to first table
ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)## [1] 498head(df1)##                                   City Population Roads Mileage    Area
## 1         New York--Newark, NY--NJ--CT   18351295 43893  286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA   12150996 24877  270807 1736.02
## 3                      Chicago, IL--IN    8608208 25905  172708 2442.75
## 4                            Miami, FL    5502379 15641  125899 1238.61
## 5         Philadelphia, PA--NJ--DE--MD    5441567 19867   99190 1981.37
## 6    Dallas--Fort Worth--Arlington, TX    5121892 21610  125389 1779.13Use the merge() function
df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)## [1] 498tail(df2, 3)##     Population                                 City Miles.of.Road
## 496    8608208                      Chicago, IL--IN         25905
## 497   12150996 Los Angeles--Long Beach--Anaheim, CA         24877
## 498   18351295         New York--Newark, NY--NJ--CT         43893
##     Daily.Miles.Traveled  UACE                                 NAME
## 496               172708 16264                      Chicago, IL--IN
## 497               270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498               286101 63217         New York--Newark, NY--NJ--CT
##          HU   AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332      2442.75 105649916         40.79 3524.0    75
## 497 4217448 4496266014      1736.02  61141327         23.61 6999.3    75
## 498 7263095 8935981360      3450.20 533176599        205.86 5318.9    75merge()The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y, by matching the columns “SomeXCol” and “SomeYCol”
by.x, by.y specified) is to match all columns with common namesmerge() is doing a JOIN (if you don’t know what that means, you will by the end of the semester!)order() and manual tricks versus merge()merge() takes some learningmerge() handles many columnsmerge() handles different sizes automatically# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
     xlab="Miles driven per person per day",
     ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")