Reading in and Reordering Data

Statistical Computing, 36-350

Wednesday October 26, 2016

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

Today we’ll focus on read.table(), read.csv() and their counterparts write.table(), write.csv(), respectively

Reading in data from a previous R session

Reminder that we’ve already learned two ways to read/write in data in specialized R formats:

Advantage: these can be a lot more memory efficient than what we’ll cover today. Disadvantage: they’re limited in scope in the sense that they can only communicate with R; to read in data from the “outside world”, or write data that other programs can deal with, we’ll need to use read.table(), write.table(), etc.

read.table(), read.csv()

Have a table full of data, just not in the R file format? Then read.table() is the function for you. It works as in:

The function read.csv() is just a shortcut for using read.table() with sep=",". (But note: these two actually differ on some of the default inputs!)

Easy example of reading in a data table

# This data table is comma separated, so we can use read.csv()
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
class(strike.df)
## [1] "data.frame"
head(strike.df)
##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

Helpful input arguments

The following inputs apply to either read.table() or read.csv() (though these two functions actually have different default inputs in general—e.g., header defaults to TRUE in read.table() but FALSE in read.csv())

Other helpful inputs: skip, row.names, col.names. You can read about them in the help file for read.table()

More examples of reading in data tables

# This data table is tab separated, and it comes with column names
anss.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/anss.dat",
                     sep="\t", header=TRUE)
class(anss.df)
## [1] "data.frame"
head(anss.df)
##         Date        Time     Lat      Lon Depth Mag
## 1 2002/01/01 10:39:06.82 -55.214 -129.000  10.0 6.0
## 2 2002/01/01 11:29:22.73   6.303  125.650 138.1 6.3
## 3 2002/01/02 14:50:33.49 -17.983  178.744 665.8 6.2
## 4 2002/01/02 17:22:48.76 -17.600  167.856  21.0 7.2
## 5 2002/01/03 07:05:27.67  36.088   70.687 129.3 6.2
## 6 2002/01/03 10:17:36.30 -17.664  168.004  10.0 6.6
# This data table is tab separated, it comes with column names, and also has 
# some apostrophe marks in the City column, so we have to set quote="". Also
# note that we've disabled the row names
sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                       sep="\t", header=TRUE, quote="")
class(sprint.df)
## [1] "data.frame"
head(sprint.df)
##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009
unique(grep("'", sprint.df$City, value=TRUE)) # This is the troublemaker
## [1] "Villeneuve d'Ascq"

write.table(), write.csv()

To write a data frame (or matrix) to a text file, use write.table() or write.csv(). These are the counterparts to read.table() and read.csv(), and they work as in:

Note that quote=FALSE, signifying that no quotes should be put around the printed data entries, seems always preferable (default is quote=TRUE). Also, setting row.names=FALSE and col.names=FALSE will disable the printing of row and column names (defaults are row.names=TRUE and col.names=TRUE)

Reordering data

Sometimes it’s convenient to reorder our data, say the rows of our data frame (or matrix). Recall:

Examples of reordering

# The sprint data has its rows ordered by the fastest 100m time to the slowest.
# Suppose we wanted to reorder, from slowest to fastest
i.slow = order(sprint.df$Time, decreasing=TRUE) # By decreasing sprint time
sprint.df.slow = sprint.df[i.slow,] # Reorder rows by decreasing sprint time
head(sprint.df.slow)
##      Rank  Time Wind         Name Country Birthdate             City
## 2524 2532 10.09  1.8  Mel Lattany     USA  10.08.59 Colorado Springs
## 2525 2532 10.09 -0.9  Mel Lattany     USA  10.08.59      Zürich
## 2526 2532 10.09  1.3   Carl Lewis     USA  01.07.61           Walnut
## 2527 2532 10.09  0.6 Calvin Smith     USA  08.01.61           Athens
## 2528 2532 10.09 -1.7   Carl Lewis     USA  01.07.61     Indianapolis
## 2529 2532 10.09 -0.9 Calvin Smith     USA  08.01.61      Zürich
##            Date
## 2524 30.07.1978
## 2525 19.08.1981
## 2526 25.04.1982
## 2527 14.05.1982
## 2528 02.07.1982
## 2529 18.08.1982
# Suppose we wanted to reorder the rows by sprinter name, alphabetically
i.name = order(sprint.df$Name) # By sprinter name
sprint.df.name = sprint.df[i.name,] # Reorder rows by name
head(sprint.df.name)
##      Rank  Time Wind            Name Country Birthdate          City
## 1294 1202 10.03  0.5 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 1462 1366 10.04  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 2192 2011 10.07  1.0 Aaron Armstrong     TTO  14.10.77 Port of Spain
## 550   473  9.96  2.0     Aaron Brown     CAN  27.05.92     Montverde
## 1057  954 10.01  1.8     Aaron Brown     CAN  27.05.92     Montverde
## 1734 1581 10.05  1.9     Aaron Brown     CAN  27.05.92        Eugene
##            Date
## 1294 20.06.2009
## 1462 25.06.2005
## 2192 13.08.2011
## 550  11.06.2016
## 1057 11.06.2016
## 1734 05.06.2013