set.seed()
Reading in and reordering data
All along, we’ve already been reading in data from the outside, using:
readLines()
: reading in lines of text from a file or webpage; returns vector of stringsread.table()
: read in a data table from a file or webpage; returns data frameread.csv()
: like the above, but assumes comma separated data; returns data frameThis week we’ll focus on read.table()
, read.csv()
and their counterparts write.table()
, write.csv()
, respectively
Reminder that we’ve already learned two ways to read/write in data in specialized R formats:
readRDS()
, saveRDS()
: functions for reading/writing single R objects from/to a fileload()
, save()
: functions for reading/writing any number of R objects from/to a fileAdvantage: these can be a lot more memory efficient than what we’ll cover this week. Disadvantage: they’re limited in scope in the sense that they can only communicate with R
read.table()
, read.csv()
Have a table full of data, just not in the R file format? Then read.table()
is the function for you. It works as in:
read.table(file=file.name, sep=" ")
, to read data from a local file on your computer called file.name
, assuming (say) space separated dataread.table(file=webpage.link, sep="\t")
, to read data from a webpage up at webpage.link
, assuming (say) tab separated dataThe function read.csv()
is just a shortcut for using read.table()
with sep=","
. (But note: these two actually differ on some of the default inputs!)
# This data table is comma separated, so we can use read.csv()
strike.df =
read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
head(strike.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
sapply(strike.df, class)
## country year strike.volume unemployment
## "character" "integer" "integer" "numeric"
## inflation left.parliament centralization density
## "numeric" "numeric" "numeric" "numeric"
# This data table is tab separated, so let's specify sep="\t"
anss.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/anss.dat",
sep="\t")
head(anss.df)
## V1 V2 V3 V4 V5 V6
## 1 Date Time Lat Lon Depth Mag
## 2 2002/01/01 10:39:06.82 -55.2140 -129.0000 10.00 6.00
## 3 2002/01/01 11:29:22.73 6.3030 125.6500 138.10 6.30
## 4 2002/01/02 14:50:33.49 -17.9830 178.7440 665.80 6.20
## 5 2002/01/02 17:22:48.76 -17.6000 167.8560 21.00 7.20
## 6 2002/01/03 07:05:27.67 36.0880 70.6870 129.30 6.20
sapply(anss.df, class)
## V1 V2 V3 V4 V5 V6
## "character" "character" "character" "character" "character" "character"
# Oops! It comes with column names, so let's set header=TRUE
anss.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/anss.dat",
sep="\t", header=TRUE)
head(anss.df)
## Date Time Lat Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000 10.0 6.0
## 2 2002/01/01 11:29:22.73 6.303 125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983 178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600 167.856 21.0 7.2
## 5 2002/01/03 07:05:27.67 36.088 70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664 168.004 10.0 6.6
sapply(anss.df, class)
## Date Time Lat Lon Depth Mag
## "character" "character" "numeric" "numeric" "numeric" "numeric"
The following inputs apply to either read.table()
or read.csv()
(though these two functions actually have different default inputs in general—e.g., header
defaults to TRUE in read.table()
but FALSE in read.csv()
)
header
: boolean, TRUE is the first line should be interpreted as column namessep
: string, specifies what separates the entries; empty string “” is interpreted to mean any white spacequote
: string, specifies what set of characters signify the beginning and end of quotes; empty string “” disables quotes altogetherOther helpful inputs: skip
, row.names
, col.names
. You can read about them in the help file for read.table()
# This data table is tab separated, and it comes with column names
sprint.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/sprint.dat",
sep="\t", header=TRUE)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 981 did not have 8 elements
# Oops! It turns out it has some apostrophe marks in the City column,
# so we have to set quote=""
sprint.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/sprint.dat",
sep="\t", header=TRUE, quote="")
head(sprint.df)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 21.08.86 Berlin 16.08.2009
## 2 2 9.63 1.5 Usain Bolt JAM 21.08.86 London 05.08.2012
## 3 3 9.69 0.0 Usain Bolt JAM 21.08.86 Beijing 16.08.2008
## 4 3 9.69 2.0 Tyson Gay USA 09.08.82 Shanghai 20.09.2009
## 5 3 9.69 -0.1 Yohan Blake JAM 26.12.89 Lausanne 23.08.2012
## 6 6 9.71 0.9 Tyson Gay USA 09.08.82 Berlin 16.08.2009
sapply(sprint.df, class)
## Rank Time Wind Name Country Birthdate
## "integer" "numeric" "numeric" "character" "character" "character"
## City Date
## "character" "character"
unique(grep("'", sprint.df$City, value=TRUE)) # This is the troublemaker
## [1] "Villeneuve d'Ascq"
write.table()
, write.csv()
To write a data frame (or matrix) to a text file, use write.table()
or write.csv()
. These are the counterparts to read.table()
and read.csv()
, and they work as in:
write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE)
, to write my.dat
to the text file “my.df.txt” (to be created in your working directory), with (say) space separation, and no quotes around the entrieswrite.csv(my.df, file="my.df.csv", quote=FALSE)
, to write my.dat
to the text file “my.df.csv” (to be created in your working directory), with comma separation, and no quotesNote that quote=FALSE
, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE
). Also, setting row.names=FALSE
and col.names=FALSE
will disable the printing of row and column names (defaults are row.names=TRUE
and col.names=TRUE
)
Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:
order()
takes in a vector, and returns the vector of indices that put this vector in increasing orderdecreasing=TRUE
in order()
to get decreasing order# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.df.slow = sprint.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.df.slow)
## Rank Time Wind Name Country Birthdate City
## 2682 2691 10.09 1.8 Mel Lattany USA 10.08.59 Colorado Springs
## 2683 2691 10.09 -0.9 Mel Lattany USA 10.08.59 Zürich
## 2684 2691 10.09 1.3 Carl Lewis USA 01.07.61 Walnut
## 2685 2691 10.09 0.6 Calvin Smith USA 08.01.61 Athens
## 2686 2691 10.09 -1.7 Carl Lewis USA 01.07.61 Indianapolis
## 2687 2691 10.09 -0.9 Calvin Smith USA 08.01.61 Zürich
## Date
## 2682 30.07.1978
## 2683 19.08.1981
## 2684 25.04.1982
## 2685 14.05.1982
## 2686 02.07.1982
## 2687 18.08.1982
# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.df$Name) # By sprinter name
sprint.df.name = sprint.df[i.name,] # Reorder rows by name
head(sprint.df.name)
## Rank Time Wind Name Country Birthdate City
## 1373 1281 10.03 0.5 Aaron Armstrong TTO 14.10.77 Port of Spain
## 1552 1456 10.04 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 2326 2145 10.07 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 568 491 9.96 2.0 Aaron Brown CAN 27.05.92 Montverde
## 1114 1011 10.01 1.8 Aaron Brown CAN 27.05.92 Montverde
## 1836 1683 10.05 1.9 Aaron Brown CAN 27.05.92 Eugene
## Date
## 1373 20.06.2009
## 1552 25.06.2005
## 2326 13.08.2011
## 568 11.06.2016
## 1114 11.06.2016
## 1836 05.06.2013
Merging data
Suppose you have two data frames X
, Y
, and you want to combine them
data.frame(X,Y)
data.frame(X$col1,X$col5,Y$col3)
order()
. Alternatively, use merge()
People in larger cities (larger areas) drive more—seems like a reasonable hypothesis, but is it true?
Distance driven, and city population, from http://www.fhwa.dot.gov/policyinformation/statistics/2011/hm71.cfm:
fha = read.csv(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/fha.csv",
colClasses=c("character","double","double","double"), na.strings="NA")
nrow(fha)
## [1] 498
head(fha)
## City Population Miles.of.Road
## 1 New York--Newark, NY--NJ--CT 18351295 43893
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877
## 3 Chicago, IL--IN 8608208 25905
## 4 Miami, FL 5502379 15641
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610
## Daily.Miles.Traveled
## 1 286101
## 2 270807
## 3 172708
## 4 125899
## 5 99190
## 6 125389
Area and population of “urbanized areas”, from http://www2.census.gov/geo/ua/ua_list_all.txt:
ua = read.csv(file="http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/ua.dat",
sep=";")
nrow(ua)
## [1] 3598
head(ua)
## UACE NAME POP HU AREALAND AREALANDSQMI AREAWATER
## 1 37 Abbeville, LA 19824 8460 29222871 11.28 300497
## 2 64 Abbeville, SC 5243 2578 11315197 4.37 19786
## 3 91 Abbotsford, WI 3966 1616 5363441 2.07 13221
## 4 118 Aberdeen, MS 4666 2050 7416616 2.86 52732
## 5 145 Aberdeen, SD 25977 12114 33002447 12.74 247597
## 6 172 Aberdeen, WA 29856 13139 39997951 15.44 1929689
## AREAWATERSQMI POPDEN LSADC
## 1 0.12 1757.0 76
## 2 0.01 1200.1 76
## 3 0.01 1915.2 76
## 4 0.02 1629.4 76
## 5 0.10 2038.6 76
## 6 0.75 1933.3 76
But both use the same census figures for population! And it turns out every settlement (in the top 498) has a unique census population:
length(unique(fha$Population)) == nrow(fha)
## [1] TRUE
ua.pop.top498 = sort(ua$POP, decreasing=TRUE)[1:nrow(fha)]
max(abs(fha$Population - ua.pop.top498))
## [1] 0
Reorder area column in second table by population, append to first table
ind.pop = order(ua$POP, decreasing=TRUE) # Order by population
df1 = data.frame(fha, area=ua$AREALANDSQMI[ind.pop][1:nrow(fha)])
# Neaten up names
colnames(df1) = c("City","Population","Roads","Mileage","Area")
nrow(df1)
## [1] 498
head(df1)
## City Population Roads Mileage Area
## 1 New York--Newark, NY--NJ--CT 18351295 43893 286101 3450.20
## 2 Los Angeles--Long Beach--Anaheim, CA 12150996 24877 270807 1736.02
## 3 Chicago, IL--IN 8608208 25905 172708 2442.75
## 4 Miami, FL 5502379 15641 125899 1238.61
## 5 Philadelphia, PA--NJ--DE--MD 5441567 19867 99190 1981.37
## 6 Dallas--Fort Worth--Arlington, TX 5121892 21610 125389 1779.13
Use the merge()
function
df2 = merge(x=fha, y=ua, by.x="Population", by.y="POP")
nrow(df2)
## [1] 498
tail(df2, 3)
## Population City Miles.of.Road
## 496 8608208 Chicago, IL--IN 25905
## 497 12150996 Los Angeles--Long Beach--Anaheim, CA 24877
## 498 18351295 New York--Newark, NY--NJ--CT 43893
## Daily.Miles.Traveled UACE NAME
## 496 172708 16264 Chicago, IL--IN
## 497 270807 51445 Los Angeles--Long Beach--Anaheim, CA
## 498 286101 63217 New York--Newark, NY--NJ--CT
## HU AREALAND AREALANDSQMI AREAWATER AREAWATERSQMI POPDEN LSADC
## 496 3459257 6326686332 2442.75 105649916 40.79 3524.0 75
## 497 4217448 4496266014 1736.02 61141327 23.61 6999.3 75
## 498 7263095 8935981360 3450.20 533176599 205.86 5318.9 75
merge()
The merge()
function tries to merge two data frames according to common columns, as in: merge(x, y, by.x="SomeXCol", by.y="SomeYCol")
, to join two data frames x
, y
, by matching the columns “SomeXCol” and “SomeYCol”
by.x
, by.y
specified) is to match all columns with common namesmerge()
is doing a JOIN (if you don’t know what that means, you will by the end of the semester!)order()
and manual tricks versus merge()
merge()
takes some learningmerge()
handles many columnsmerge()
handles different sizes automatically# Convert 1,000s of miles to miles
df1$Mileage = 1000*df1$Mileage
# Plot daily miles per person vs. area
plot(x=df1$Area, y=df1$Mileage/df1$Population, log="x",
xlab="Miles driven per person per day",
ylab="City area in square miles")
# Impressively flat regression line
abline(lm(Mileage/Population ~ Area, data=df1), col="red")
readRDS()
, load()
read.table()
, read.csv()
read.table()
, read.csv()
; helps sometimes take a look at the original data files to see their structureorder()
, rev()
, and proper indexingmerge()
; but can do it manually using reordering tricks