Statistical Computing, 36-350
Wednesday October 26, 2016
All along, we’ve already been reading in data from the outside, using:
readLines()
: reading in lines of text from a file or webpage; returns vector of stringsread.table()
: read in a data table from a file or webpage; returns data frameread.csv()
: like the above, but assumes comma separated data; returns data frameToday we’ll focus on read.table()
, read.csv()
and their counterparts write.table()
, write.csv()
, respectively
Reminder that we’ve already learned two ways to read/write in data in specialized R formats:
readRDS()
, saveRDS()
: functions for reading/writing single R objects from/to a fileload()
, save()
: functions for reading/writing any number of R objects from/to a fileAdvantage: these can be a lot more memory efficient than what we’ll cover today. Disadvantage: they’re limited in scope in the sense that they can only communicate with R; to read in data from the “outside world”, or write data that other programs can deal with, we’ll need to use read.table()
, write.table()
, etc.
read.table()
, read.csv()
Have a table full of data, just not in the R file format? Then read.table()
is the function for you. It works as in:
read.table(file=file.name, sep=" ")
, to read data from a local file on your computer called file.name
, assuming (say) space separated dataread.table(file=webpage.link, sep="\t")
, to read data from a webpage up at webpage.link
, assuming (say) tab separated dataThe function read.csv()
is just a shortcut for using read.table()
with sep=","
. (But note: these two actually differ on some of the default inputs!)
# This data table is comma separated, so we can use read.csv()
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
class(strike.df)
## [1] "data.frame"
head(strike.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
The following inputs apply to either read.table()
or read.csv()
(though these two functions actually have different default inputs in general—e.g., header
defaults to TRUE in read.table()
but FALSE in read.csv()
)
header
: boolean, TRUE is the first line should be interpreted as column namessep
: string, specifies what separates the entries; empty string “” is interpreted to mean any white spacequote
: string, specifies what set of characters signify the beginning and end of quotes; empty string “” disables quotes altogetherOther helpful inputs: skip
, row.names
, col.names
. You can read about them in the help file for read.table()
# This data table is tab separated, and it comes with column names
anss.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/anss.dat",
sep="\t", header=TRUE)
class(anss.df)
## [1] "data.frame"
head(anss.df)
## Date Time Lat Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000 10.0 6.0
## 2 2002/01/01 11:29:22.73 6.303 125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983 178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600 167.856 21.0 7.2
## 5 2002/01/03 07:05:27.67 36.088 70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664 168.004 10.0 6.6
# This data table is tab separated, it comes with column names, and also has
# some apostrophe marks in the City column, so we have to set quote="". Also
# note that we've disabled the row names
sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
sep="\t", header=TRUE, quote="")
class(sprint.df)
## [1] "data.frame"
head(sprint.df)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 21.08.86 Berlin 16.08.2009
## 2 2 9.63 1.5 Usain Bolt JAM 21.08.86 London 05.08.2012
## 3 3 9.69 0.0 Usain Bolt JAM 21.08.86 Beijing 16.08.2008
## 4 3 9.69 2.0 Tyson Gay USA 09.08.82 Shanghai 20.09.2009
## 5 3 9.69 -0.1 Yohan Blake JAM 26.12.89 Lausanne 23.08.2012
## 6 6 9.71 0.9 Tyson Gay USA 09.08.82 Berlin 16.08.2009
unique(grep("'", sprint.df$City, value=TRUE)) # This is the troublemaker
## [1] "Villeneuve d'Ascq"
write.table()
, write.csv()
To write a data frame (or matrix) to a text file, use write.table()
or write.csv()
. These are the counterparts to read.table()
and read.csv()
, and they work as in:
write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE)
, to write my.dat
to the text file “my.df.txt” (to be created in your working directory), with (say) space separation, and no quotes around the entrieswrite.csv(my.df, file="my.df.csv", quote=FALSE)
, to write my.dat
to the text file “my.df.csv” (to be created in your working directory), with comma separation, and no quotesNote that quote=FALSE
, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE
). Also, setting row.names=FALSE
and col.names=FALSE
will disable the printing of row and column names (defaults are row.names=TRUE
and col.names=TRUE
)
Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:
order()
takes in a vector, and returns the vector of indices that put this vector in increasing orderdecreasing=TRUE
in order()
to get decreasing order# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.df.slow = sprint.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.df.slow)
## Rank Time Wind Name Country Birthdate City
## 2524 2532 10.09 1.8 Mel Lattany USA 10.08.59 Colorado Springs
## 2525 2532 10.09 -0.9 Mel Lattany USA 10.08.59 Zürich
## 2526 2532 10.09 1.3 Carl Lewis USA 01.07.61 Walnut
## 2527 2532 10.09 0.6 Calvin Smith USA 08.01.61 Athens
## 2528 2532 10.09 -1.7 Carl Lewis USA 01.07.61 Indianapolis
## 2529 2532 10.09 -0.9 Calvin Smith USA 08.01.61 Zürich
## Date
## 2524 30.07.1978
## 2525 19.08.1981
## 2526 25.04.1982
## 2527 14.05.1982
## 2528 02.07.1982
## 2529 18.08.1982
# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.df$Name) # By sprinter name
sprint.df.name = sprint.df[i.name,] # Reorder rows by name
head(sprint.df.name)
## Rank Time Wind Name Country Birthdate City
## 1294 1202 10.03 0.5 Aaron Armstrong TTO 14.10.77 Port of Spain
## 1462 1366 10.04 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 2192 2011 10.07 1.0 Aaron Armstrong TTO 14.10.77 Port of Spain
## 550 473 9.96 2.0 Aaron Brown CAN 27.05.92 Montverde
## 1057 954 10.01 1.8 Aaron Brown CAN 27.05.92 Montverde
## 1734 1581 10.05 1.9 Aaron Brown CAN 27.05.92 Eugene
## Date
## 1294 20.06.2009
## 1462 25.06.2005
## 2192 13.08.2011
## 550 11.06.2016
## 1057 11.06.2016
## 1734 05.06.2013