Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

readLines(): reading in lines of text from a file or webpage; returns vector of strings
read.table(): read in a data table from a file or webpage; returns data frame
read.csv(): like the above, but assumes comma separated data; returns data frame

Today we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively

Reading in data from a previous R session

Reminder that we’ve already learned two ways to read/write in data in specialized R formats:

readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
load(), save(): functions for reading/writing any number of R objects from/to a file

Advantage: these can be a lot more memory efficient than what we’ll cover today. Disadvantage: they’re limited in scope in the sense that they can only communicate with R; to read in data from the “outside world”, or write data that other programs can deal with, we’ll need to use read.table(), write.table(), etc.

`read.table()`, `read.csv()`

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

read.table(file=file.name, sep=" "), to read data from a local file on your computer called file.name, assuming (say) space separated data
read.table(file=webpage.link, sep="\t"), to read data from a webpage up at webpage.link, assuming (say) tab separated data

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)

Easy example of reading in a data table

# This data table is comma separated, so we can use read.csv()
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
class(strike.df)

## [1] "data.frame"

head(strike.df)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to TRUE in read.table() but FALSE in read.csv())

header: boolean, TRUE is the first line should be interpreted as column names
sep: string, specifies what separates the entries; empty string “” is interpreted to mean any white space
quote: string, specifies what set of characters signify the beginning and end of quotes; empty string “” disables quotes altogether

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()

More examples of reading in data tables

# This data table is tab separated, and it comes with column names
anss.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/anss.dat",
                     sep="\t", header=TRUE)
class(anss.df)

## [1] "data.frame"

head(anss.df)

##         Date        Time     Lat      Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000  10.0 6.0
## 2 2002/01/01 11:29:22.73   6.303  125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983  178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600  167.856  21.0 7.2
## 5 2002/01/03 07:05:27.67  36.088   70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664  168.004  10.0 6.6

# This data table is tab separated, it comes with column names, and also has 
# some apostrophe marks in the City column, so we have to set quote="". Also
# note that we've disabled the row names
sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                       sep="\t", header=TRUE, quote="")
class(sprint.df)

## [1] "data.frame"

head(sprint.df)

##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009

unique(grep("'", sprint.df$City, value=TRUE)) # This is the troublemaker

## [1] "Villeneuve d'Ascq"

`write.table()`, `write.csv()`

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

write.table(my.df, file="my.df.txt", sep=" ", quote=FALSE), to write my.dat to the text file “my.df.txt” (to be created in your working directory), with (say) space separation, and no quotes around the entries
write.csv(my.df, file="my.df.csv", quote=FALSE), to write my.dat to the text file “my.df.csv” (to be created in your working directory), with comma separation, and no quotes

Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (defaults are row.names=TRUE and col.names=TRUE)

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:

The function order() takes in a vector, and returns the vector of indices that put this vector in increasing order
Set the input decreasing=TRUE in order() to get decreasing order
We can apply compute an appropriate vector of indices, and then use this on rows of our data frame to reorder all of the columns simultaneously

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.df.slow = sprint.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.df.slow)

##      Rank  Time Wind         Name Country Birthdate             City
## 2524 2532 10.09  1.8  Mel Lattany     USA  10.08.59 Colorado Springs
## 2525 2532 10.09 -0.9  Mel Lattany     USA  10.08.59      Z&uuml;rich
## 2526 2532 10.09  1.3   Carl Lewis     USA  01.07.61           Walnut
## 2527 2532 10.09  0.6 Calvin Smith     USA  08.01.61           Athens
## 2528 2532 10.09 -1.7   Carl Lewis     USA  01.07.61     Indianapolis
## 2529 2532 10.09 -0.9 Calvin Smith     USA  08.01.61      Z&uuml;rich
##            Date
## 2524 30.07.1978
## 2525 19.08.1981
## 2526 25.04.1982
## 2527 14.05.1982
## 2528 02.07.1982
## 2529 18.08.1982

# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.df$Name) # By sprinter name
sprint.df.name = sprint.df[i.name,] # Reorder rows by name
head(sprint.df.name)

##      Rank  Time Wind            Name Country Birthdate          City
## 1294 1202 10.03  0.5 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 1462 1366 10.04  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 2192 2011 10.07  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 550   473  9.96  2.0     Aaron Brown     CAN  27.05.92     Montverde
## 1057  954 10.01  1.8     Aaron Brown     CAN  27.05.92     Montverde
## 1734 1581 10.05  1.9     Aaron Brown     CAN  27.05.92        Eugene
##            Date
## 1294 20.06.2009
## 1462 25.06.2005
## 2192 13.08.2011
## 550  11.06.2016
## 1057 11.06.2016
## 1734 05.06.2013

Reading in and Reordering Data