Statistical Computing, 36-350
Monday October 14, 2019
Reading in and reordering data
All along, we've already been reading in data from the outside, using:

- readLines(): read in lines of text from a file or webpage; returns a vector of strings (a quick sketch follows this list)
- read.table(): read in a data table from a file or webpage; returns a data frame
- read.csv(): like the above, but assumes comma separated data; returns a data frame

This week we'll focus on read.table() and read.csv(), and their counterparts write.table() and write.csv(), respectively
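The readLines() sketch mentioned above, reusing a URL that appears later in these notes, just to peek at the raw lines:

first.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/strikes.csv", n=3)
first.lines # a character vector, one string per line of the file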
Reminder that we've seen two ways to read/write data in specialized R formats:

- readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
- load(), save(): functions for reading/writing any number of R objects from/to a file

Advantage: these can be a lot more memory efficient than what we'll cover this week. Disadvantage: they're limited in scope, in the sense that they can only communicate with R
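As a quick reminder of how these look in practice (a minimal sketch; the file names here are made up, and the files go to your working directory):

saveRDS(mtcars, file="mtcars.rds")          # write a single R object to a file
mt = readRDS("mtcars.rds")                  # read it back; we can assign it any name
save(mtcars, letters, file="mystuff.rdata") # write any number of R objects to a file
load("mystuff.rdata")                       # restores them under their original names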
read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

- read.table(file=file.name, sep=" "), to read data from a local file on your computer called file.name, assuming (say) space separated data
- read.table(file=webpage.link, sep="\t"), to read data from a webpage up at webpage.link, assuming (say) tab separated data

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)
# This data table is comma separated, so we can use read.csv()
strike.df =
read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/strikes.csv")
head(strike.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
sapply(strike.df, class) # Check the class of each column
## country year strike.volume unemployment
## "factor" "integer" "integer" "numeric"
## inflation left.parliament centralization density
## "numeric" "numeric" "numeric" "numeric"
# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/anss.dat",
sep="\t")
head(anss.df)
## V1 V2 V3 V4 V5 V6
## 1 Date Time Lat Lon Depth Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000 10.00 6.00
## 3 2002/01/01 11:29:22.73 6.3030 125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830 178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000 167.8560 21.00 7.20
## 6 2002/01/03 07:05:27.67 36.0880 70.6870 129.30 6.20
sapply(anss.df, class) # Check the class of each column
## V1 V2 V3 V4 V5 V6
## "factor" "factor" "factor" "factor" "factor" "factor"
# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/anss.dat",
sep="\t", header=TRUE)
head(anss.df)
## Date Time Lat Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000 10.0 6.0
## 2 2002/01/01 11:29:22.73 6.303 125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983 178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600 167.856 21.0 7.2
## 5 2002/01/03 07:05:27.67 36.088 70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664 168.004 10.0 6.6
sapply(anss.df, class) # Check the class of each column again
## Date Time Lat Lon Depth Mag
## "factor" "factor" "numeric" "numeric" "numeric" "numeric"
The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general; e.g., header defaults to TRUE in read.csv() but FALSE in read.table(); see the quick check after this list):

- header: boolean, TRUE if the first line should be interpreted as column names
- sep: string, specifies what separates the entries; the empty string "" is interpreted to mean any white space
- quote: string, specifies what set of characters signify the beginning and end of quotes; the empty string "" disables quotes altogether

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()
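The quick check of the differing defaults, using formals() to inspect each function's default arguments:

formals(read.csv)$header    # TRUE
formals(read.table)$header  # FALSE
formals(read.csv)$sep       # ","
formals(read.table)$sep     # "" (meaning any white space)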
# This data table is tab separated, and it comes with column names
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
sep="\t", header=TRUE)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements
# Oops! It turns out it has some apostrophe marks in the City column,
# so we have to set quote=""
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
sep="\t", header=TRUE, quote="")
head(sprint.m.df)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 21.08.86 Berlin 16.08.2009
## 2 2 9.63 1.5 Usain Bolt JAM 21.08.86 London 05.08.2012
## 3 3 9.69 0.0 Usain Bolt JAM 21.08.86 Beijing 16.08.2008
## 4 3 9.69 2.0 Tyson Gay USA 09.08.82 Shanghai 20.09.2009
## 5 3 9.69 -0.1 Yohan Blake JAM 26.12.89 Lausanne 23.08.2012
## 6 6 9.71 0.9 Tyson Gay USA 09.08.82 Berlin 16.08.2009
## [1] "Villeneuve d'Ascq"
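The output above is the offending city name. The call that extracted it isn't shown in these notes, but one way to hunt down such entries yourself (a sketch):

city = as.character(sprint.m.df$City)  # City is read in as a factor, so convert first
unique(grep("'", city, value=TRUE))    # cities whose names contain an apostrophe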
write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

- write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE), to write my.df to the text file "my.df.txt" (to be created in your working directory), with (say) space separation, and no quotes around the entries
- write.csv(my.df, file="my.df.csv", quote=FALSE), to write my.df to the text file "my.df.csv" (to be created in your working directory), with comma separation, and no quotes

Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems almost always preferable (the default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (the defaults are row.names=TRUE and col.names=TRUE)
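A minimal round trip, as a sketch (this assumes the strike.df data frame from above, and creates a file in your working directory):

write.csv(strike.df, file="strikes.copy.csv", quote=FALSE, row.names=FALSE)
strike.copy = read.csv("strikes.copy.csv")  # read it back in
all(dim(strike.copy) == dim(strike.df))     # same dimensions as the original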
Reordering data

Sometimes it's convenient to reorder our data, say the rows of our data frame (or matrix). Recall:

- order() takes in a vector, and returns the vector of indices that put this vector in increasing order
- Use decreasing=TRUE in order() to get decreasing order

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.m.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.m.df.slow = sprint.m.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.m.df.slow)
## Rank Time Wind Name Country Birthdate City
## 2682 2691 10.09 1.8 Mel Lattany USA 10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9 Mel Lattany USA 10.08.59 Zürich
## 2684 2691 10.09 1.3 Carl Lewis USA 01.07.61 Walnut
## 2685 2691 10.09 0.6 Calvin Smith USA 08.01.61 Athens
## 2686 2691 10.09 -1.7 Carl Lewis USA 01.07.61 Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith USA 08.01.61 Zürich
## Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982
# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.m.df$Name) # By sprinter name
sprint.m.df.name = sprint.m.df[i.name,] # Reorder rows by name
head(sprint.m.df.name)
## Rank Time Wind Name Country Birthdate City
## 1373 1281 10.03 0.5 Aaron Armstrong TTO 14.10.77 Port of Spain
## 1552 1456 10.04 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 2326 2145 10.07 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 568 491 9.96 2.0 Aaron Brown CAN 27.05.92 Montverde
## 1114 1011 10.01 1.8 Aaron Brown CAN 27.05.92 Montverde
## 1836 1683 10.05 1.9 Aaron Brown CAN 27.05.92 Eugene
## Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568 11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013
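To see the mechanics of order() on something small (a toy vector, made up here):

x = c(9.8, 9.58, 10.1)
order(x)                     # 2 1 3: indices that put x in increasing order
x[order(x)]                  # 9.58 9.80 10.10
x[order(x, decreasing=TRUE)] # 10.10 9.80 9.58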
Merging data
Suppose you have two data frames X, Y, and you want to combine them:

- If the rows of X and Y are already aligned, you can just combine their columns with data.frame(X,Y)
- The same goes for selected columns, as in data.frame(X$col1, X$col5, Y$col3)
- If the rows aren't aligned, you can first reorder one data frame to match the other using order(). Alternatively, use merge()
People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?
Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:
fha = read.csv(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/fha.csv",
colClasses=c("character","double","double","double"), na.strings="NA")
nrow(fha)
## [1] 498
head(fha)
## City Population Miles.of.Road
## 1 New York--Newark, NY--NJ--CT 18351295 43893
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877
## 3 Chicago, IL--IN 8608208 25905
## 4 Miami, FL 5502379 15641
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610
## Daily.Miles.Traveled
## 1 286101
## 2 270807
## 3 172708
## 4 125899
## 5 99190
## 6 125389
Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:
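The call that read this table in doesn't appear in these notes. A plausible sketch is below, but the separator and header settings are assumptions about the file's format, not taken from the original:

# Sketch only: sep and header here are assumptions about the census file's format
ua = read.table(file="http://www2.census.gov/geo/ua/ua_list_all.txt",
                sep=";", header=TRUE)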
nrow(ua)
## [1] 3598
head(ua)
## UACE NAME POP HU AREALAND AREALANDSQMI AREAWATER
## 1 37 Abbeville, LA 19824 8460 29222871 11.28 300497
## 2 64 Abbeville, SC 5243 2578 11315197 4.37 19786
## 3 91 Abbotsford, WI 3966 1616 5363441 2.07 13221
## 4 118 Aberdeen, MS 4666 2050 7416616 2.86 52732
## 5 145 Aberdeen, SD 25977 12114 33002447 12.74 247597
## 6 172 Aberdeen, WA 29856 13139 39997951 15.44 1929689
## AREAWATERSQMI POPDEN LSADC
## 1 0.12 1757.0 76
## 2 0.01 1200.1 76
## 3 0.01 1915.2 76
## 4 0.02 1629.4 76
## 5 0.10 2038.6 76
## 6 0.75 1933.3 76
fha orders cities by population; ua is alphabetical by name. But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:
## [1] TRUE
## [1] 0
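The code behind the two checks above isn't shown in these notes; a sketch of one way to verify the claim (not the original code):

top.pop = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)] # top 498 census populations
all(top.pop == fha$Population)                       # should match fha's populations exactly
sum(duplicated(top.pop))                             # 0 means no ties among them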
Reorder the area column in the second table by population, and append it to the first table:
ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)
## [1] 498
head(df1)
## City Population Roads Mileage Area
## 1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736.02
## 3 Chicago, IL--IN 8608208 25905 172708 2442.75
## 4 Miami, FL 5502379 15641 125899 1238.61
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981.37
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779.13
Use the merge() function:
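The merge call itself doesn't appear in these notes; judging from the output below, it was presumably along these lines (a sketch, with the name df2 made up here):

df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP") # match city populations across tables
nrow(df2)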
## [1] 498
tail(df2, 3) # biggest cities last, since merge() sorts by the "by" column
## Population City Miles.of.Road
## 496 8608208 Chicago, IL--IN 25905
## 497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
## 498 18351295 New York--Newark, NY--NJ--CT 43893
## Daily.Miles.Traveled UACE NAME
## 496 172708 16264 Chicago, IL--IN
## 497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498 286101 63217 New York--Newark, NY--NJ--CT
## HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332 2442.75 105649916 40.79 3524.0 75
## 497 4217448 4496266014 1736.02 61141327 23.61 6999.3 75
## 498 7263095 8935981360 3450.20 533176599 205.86 5318.9 75
merge()

The merge() function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y, by matching the columns "SomeXCol" and "SomeYCol"

- The default (no by.x, by.y specified) is to match all columns with common names
- In database terms, what merge() is doing is a JOIN
- order() and manual tricks versus merge(): merge() takes some learning, but it handles many columns and different sizes automatically

# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
xlab="City area in square miles",
ylab="Miles driven per person per day")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")
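Finally, a toy illustration of merge()'s default behavior of matching on all common column names (the data here is made up):

x = data.frame(id=c("a","b","c"), score=c(1,2,3))
y = data.frame(id=c("b","c","d"), grade=c("B","C","D"))
merge(x, y)           # matches on the common column "id"; keeps only ids present in both
merge(x, y, all=TRUE) # keeps all ids, with NAs where there's no match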
Summary

- readRDS(), load(): read in data files in specialized R formats
- read.table(), read.csv(): read in data tables from text files or webpages
- write.table(), write.csv(): write out data tables to text files
- order(), rev(), and proper indexing: reorder the rows of a data frame (or matrix)
- merge(): merge two data frames by common columns; but can do it manually using reordering tricks