22 September 2014

In Previous Episodes

  • Seen functions to load data in passing
  • Learned about string manipulation and regexp

Agenda

  • Getting data into and out of the system when it's already in R format
  • Import and export when the data is already very structured and machine-readable
  • Dealing with less structured data
  • Web scraping

Reading Data from R

  • You can load and save R objects
    • R has its own format for this, which is shared across operating systems
    • It's an open, documented format if you really want to pry into it
  • save(thing, file="name") saves thing in a file called name (conventional extension: rda or Rda)
  • load("name") loads the object or objects stored in the file called name, with their old names

gmp <- read.table("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/06/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
## [1] FALSE
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
## [1] "MSA"   "gmp"   "pcgmp" "pop"
not_gmp
## [1] "gmp"

  • We can load or save more than one object at once; this is how RStudio will load your whole workspace when you're starting, and offer to save it when you're done
  • Many packages come with saved data objects; there's the convenience function data() to load them
data(cats, package="MASS")
summary(cats)
##  Sex         Bwt            Hwt       
##  F:47   Min.   :2.00   Min.   : 6.30  
##  M:97   1st Qu.:2.30   1st Qu.: 8.95  
##         Median :2.70   Median :10.10  
##         Mean   :2.72   Mean   :10.63  
##         3rd Qu.:3.02   3rd Qu.:12.12  
##         Max.   :3.90   Max.   :20.50

Note: data() returns the name of the loaded data file!

Non-R Data Tables

  • Tables full of data, just not in the R file format
  • Main function: read.table()
    • Presumes whitespace-separated fields, one line per row
    • Main argument is the file name or URL
    • Returns a dataframe
    • Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
  • read.csv() is a short-cut that sets the options for reading comma-separated value (CSV) files (example below)
    • Spreadsheets will usually read and write CSV
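
A minimal sketch of both readers; the file names and column types here are hypothetical stand-ins, and only a few of the options are shown:

my_data <- read.table("measurements.txt", header=TRUE)      # whitespace-separated, with a header row
my_data <- read.table("measurements.txt", header=TRUE, skip=3,
                      colClasses=c("character","numeric"))  # skip 3 preamble lines, force column types
my_csv  <- read.csv("measurements.csv")                     # comma-separated; header assumed by default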

Writing Dataframes

  • The counterpart functions write.table() and write.csv() write a dataframe into a file (example below)
  • Drawback: takes a lot more disk space than the binary files produced by save()
  • Advantage: the file can be read by other programs, or even edited manually
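
For instance, a sketch of round-tripping the gmp dataframe from above through a CSV (the file name is arbitrary):

write.csv(gmp, file="gmp.csv", row.names=FALSE)   # plain text, readable by spreadsheets and other programs
gmp_from_csv <- read.csv("gmp.csv")               # and back into R as a dataframe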

Less Friendly Data Formats

  • The foreign package on CRAN has tools for reading data files from lots of non-R statistical software (sketch below)
  • Spreadsheets are special
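
Back to foreign: a minimal sketch, assuming hypothetical Stata and SPSS files are to hand:

library(foreign)
stata_df <- read.dta("survey.dta")                        # Stata .dta file
spss_df  <- read.spss("survey.sav", to.data.frame=TRUE)   # SPSS .sav file, as a dataframe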

Spreadsheets Considered Harmful

  • Spreadsheets look like they should be dataframes
  • Real spreadsheets are full of ugly irregularities
    • Values or formulas?
    • Headers, footers, side-comments, notes
    • Columns change meaning half-way down
    • Whole separate programming languages apparently intended mostly to spread malware
  • Ought-to-be-notorious source of errors in both industry (1, 2) and science (e.g., Reinhart and Rogoff)

Spreadsheets, If You Have To

  • Save the spreadsheet as a CSV; read.csv()
  • Save the spreadsheet as a CSV; edit in a text editor; read.csv()
  • Use read.xls() from the gdata package
    • Tries very hard to work like read.csv(), can take a URL or filename
    • Can skip down to the first line that matches some pattern, select different sheets, etc.
  • You may still need to do a lot of tidying up after

require(gdata, quietly=TRUE)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
## 
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
## 
## Attaching package: 'gdata'
## 
## The following object is masked from 'package:stats':
## 
##     nobs
## 
## The following object is masked from 'package:utils':
## 
##     object.size

setwd("~/Downloads/")
gmp_2008_2013 <- read.xls("gdp_metro0914.xls",pattern="U.S.")
head(gmp_2008_2013)
##       U.S..metropolitan.areas X13.269.057 X12.994.636 X13.461.662
## 1                 Abilene, TX       5,725       5,239       5,429
## 2                   Akron, OH      28,663      27,761      28,616
## 3                  Albany, GA       4,795       4,957       4,928
## 4                  Albany, OR       3,235       3,064       3,050
## 5 Albany-Schenectady-Troy, NY      40,365      42,454      42,969
## 6             Albuquerque, NM      37,359      38,110      38,801
##   X13.953.082 X14.606.938 X15.079.920 .......
## 1       5,761       6,143       6,452     252
## 2      29,425      31,012      31,485      80
## 3       4,938       5,122       5,307     290
## 4       3,170       3,294       3,375     363
## 5      43,663      45,330      46,537      58
## 6      39,967      41,301      41,970      64

Semi-Structured Files, Odd Formats

  • Files with metadata (e.g., earthquake catalog)
  • Non-tabular arrangement
  • Generally, write a function that reads in one (or a few) lines and splits them into some nicer format (sketch below)
    • Generally involves a lot of regexps
    • Functions are easier to get right than code blocks in loops
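
As a sketch, suppose (hypothetically) that each line of a catalog looked like "MAG=5.1 LAT=-2.30 LON=120.51"; a line-parsing function might be:

parse_quake_line <- function(line) {
  pattern <- "MAG=([0-9.]+) +LAT=(-?[0-9.]+) +LON=(-?[0-9.]+)"
  fields <- regmatches(line, regexec(pattern, line))[[1]]  # whole match, then the three capture groups
  return(data.frame(mag=as.numeric(fields[2]), lat=as.numeric(fields[3]),
                    lon=as.numeric(fields[4])))
}
parse_quake_line("MAG=5.1 LAT=-2.30 LON=120.51")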

In Praise of Capture Groups

  • Parentheses don't just group for quantifiers; they also create capture groups, which the regexp engine remembers
  • Can be referred to later (\1, \2, etc.)
  • Can also be used to simplify getting stuff out
  • Examples in the handout on regexps, but let's reinforce the point (toy example below)
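
A toy example (hypothetical strings) showing both uses:

sub("([A-Za-z]+), ([A-Za-z]+)", "\\2 \\1", "Tukey, John")   # \\1, \\2 refer back to the capture groups
regmatches("Tukey, John", regexec("([A-Za-z]+), ([A-Za-z]+)", "Tukey, John"))  # whole match, then each group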

Scraping the Rich

  • Remember that the lines giving net worth looked like

        <td class="worth">$72 B</td>

    or

        <td class="worth">$5,3 B</td>

One regexp which catches this:

richhtml <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/labs/03/rich.html")
worth_pattern <- "\\$[0-9,]+ B"
worth_lines <- grep(worth_pattern, richhtml)
length(worth_lines)
## [1] 100

(the last line checks that we have the right number of matches)

Just using this gives us strings, including the markers we used to pin down where the information was:

worth_matches <- regexpr(worth_pattern, richhtml)
worths <- regmatches(richhtml, worth_matches)
head(worths)
## [1] "$72 B"   "$58,5 B" "$41 B"   "$36 B"   "$36 B"   "$35,4 B"

Now we'd need to get rid of the anchoring $ and B; we could use substr, but…

Adding a capture group doesn't change what we match:

worth_capture <- "\\$([0-9,]+) B"
capture_lines <- grep(worth_capture, richhtml)
identical(worth_lines, capture_lines)
## [1] TRUE

but it does have an advantage

Using regexec

worth_matches <- regmatches(richhtml[capture_lines], 
  regexec(worth_capture, richhtml[capture_lines]))
worth_matches[1:2]
## [[1]]
## [1] "$72 B" "72"   
## 
## [[2]]
## [1] "$58,5 B" "58,5"

A list with one element per matching line, giving the whole match and then each parenthesized sub-expression

Functions make the remaining manipulation easier:

second_element <- function(x) { return(x[2]) }
worth_strings <- sapply(worth_matches, second_element)
comma_to_dot <- function(x) {
  return(gsub(pattern=",",replacement=".",x))
}
worths <- as.numeric(sapply(worth_strings, comma_to_dot))
head(worths)
## [1] 72.0 58.5 41.0 36.0 36.0 35.4

Exercise: Write one function which takes a single line, gets the capture group, and converts it to a number

Web Scraping

  1. Take a webpage designed for humans to read
  2. Have the computer extract the information we actually want
  3. Iterate as appropriate

Take in unstructured pages, return rigidly formatted data

Being More Explicit in Step 2

  • The information we want is somewhere in the page, possibly in the HTML
  • There are usually markers surrounding it, probably in the HTML
  • We now know how to pick apart HTML using regular expressions

  • Figure out exactly what we want from the page
  • Understand how the information is organized on the page
    • What does a human use to find it?
    • Where do those cues appear in the HTML source?
  • Write a function to automate information extraction
    • Generally, this means regexps
    • Parenthesized capture groups are helpful
    • The function may need to iterate
    • You may need more than one function
  • Once you've got it working for one page, iterate over relevant pages (skeleton below)
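
A generic skeleton, assuming one regexp with a single capture group locates the information (the function and its arguments are hypothetical):

scrape_page <- function(url, pattern) {
  html <- readLines(url)                               # the page as a vector of text lines
  hits <- grep(pattern, html, value=TRUE)              # lines containing the markers
  matches <- regmatches(hits, regexec(pattern, hits))
  return(sapply(matches, function(m) { m[2] }))        # first capture group from each match
}
# results <- lapply(page_urls, scrape_page, pattern=my_pattern)   # then iterate over pages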

Example: Book Networks

  • Two books are linked if they're bought together at Amazon

  • Amazon gives this information away (to try to drive sales)

  • How would we replicate this?

  • Do we want "frequently bought together", or "customers who bought this also bought that"? Or even "what else do customers buy after viewing this"?
    • Let's say "customers who bought this also bought that"
  • Now look carefully at the HTML
    • There are over 14,000 lines in the HTML file for this page; you'll need a text editor
    • Fortunately most of it's irrelevant

<div class="shoveler" id="purchaseShvl">
    <div class="shoveler-heading">
        <h2>Customers Who Bought This Item Also Bought</h2>
    </div>

<div class="shoveler-pagination" style="display:none">

<span>&nbsp;</span>
<span>
Page <span class="page-number"></span>  of  <span class="num-pages"></span> 
<span class="start-over"><span class="a-text-separator"></span><a href="#" onclick="return false;" class="start-over-link">Start over</a></span>
</span>
</div>

    <div class="shoveler-button-wrapper" id="purchaseButtonWrapper">
        <a class="back-button" href="#Back" style="display:none" onclick="return false;"><span class="auiTestSprite s_shvlBack"><span>Back</span></span></a>
        <div class="shoveler-content">
            <ul tabindex="-1">

Here's the first of the also-bought books:

<li>
  <div class="new-faceout p13nimp"  id="purchase_0387981403" data-asin="0387981403" data-ref="pd_sim_b_1">
    
<a href="/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403/ref=pd_sim_b_1?ie=UTF8&refRID=1HZ0VDHEFFX3EM2WNWRH"  class="sim-img-title" > <div class="product-image">
                       <img src="http://ecx.images-amazon.com/images/I/31I22xsT%2BXL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
                    </div>
                    <span title="ggplot2: Elegant Graphics for Data Analysis (Use R!)">ggplot2: Elegant Graphics for Data &#133;</span> </a>

    <div class="byline">
        <span class="carat">&#8250</span> 

We could extract the ISBN from this, and then go on to the next book, and so forth…

<div id="purchaseSimsData" class="sims-data"
style="display:none" data-baseAsin="0387747303"
data-deviceType="desktop" data-featureId="pd_sim" data-isAUI="1" data-pageId="0387747303" data-pageRequestId="1HZ0VDHEFFX3EM2WNWRH" data-reftag="pd_sim_b" data-vt="0387747303"
data-wdg="book_display_on_website"
data-widgetName="purchase">0387981403,0596809158,1593273843,1449316956,
0387938362,144931208X,0387790535,0387886974,0470973927,0387759689,
1439810184,1461413648,1461471370,1782162143,1441998896,1429224622,
1612903436,1441996494,1461468485,1617291560,1439831769,0321888030,1449319793,
1119962846,0521762936,1446200469,1449358659,1935182390,0123814855,1599941651,
0387759352,1461476178,0387773169,0387922970,0073523321,141297514X,1439840954,
1612900275,1449339735,052168689X,0387781706,1584884509,0387848576,1420068725,
1441915753,1466572841,1107422221,111844714X,0716762196,0133412938,1482203537,
0963488406,1466586966,0470463635,1493909827,1420079336,0321898656,1461422981,
158488424X,1441926127,1466570229,1590475348,1430266406,0071794565,0071623663,
111866146X,1441977864,1782160604,1449340377,1449309038,0963488414,0137444265,
1461406846,0073014664,1449370780,144197864X,3642201911,0534243126,1461443423,
158488651X,1449357105,1118208781,1420099604,1107057132,1449355730,1118356853,
1449361323,0470890819,0387245448,0521518148,0521169828,1584888490,1461464455,
0387781889,0387759581,0387717617,0123748569,188652923X,0155061399,0201076160</div>

In this case there's a big block which gives us the ISBNs of all the also-bought books

Strategy:

  • Load the page as text
  • Search for the regexp which begins this block, contains at least one ISBN, and then ends
  • Extract the sequence of ISBNs as a string, and split it on commas (sketch after this list)
  • Record in a dataframe that Data Manipulation's ISBN is also bought with each of those ISBNs
  • Snowball sampling: Go to the webpage of each of those books and repeat
    • Stop when we get tired…
    • Or when Amazon gets annoyed with us
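
A sketch of the extraction steps, assuming the page has been saved locally as "data-manipulation.html" (a hypothetical file name) and that the ISBN block looks like the one above:

page <- paste(readLines("data-manipulation.html"), collapse="")   # whole page as one string
isbn_pattern <- 'data-widgetName="purchase">([0-9X,]+)</div>'     # capture the comma-separated ISBNs
isbns <- regmatches(page, regexec(isbn_pattern, page))[[1]]
also_bought <- strsplit(isbns[2], ",")[[1]]                       # split the capture group on commas
edges <- data.frame(from="0387747303", to=also_bought)            # this book's ASIN paired with each neighbor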

More considerations on web-scraping

  • You should really look at the site's robots.txt file and respect it
  • See https://github.com/hadley/rvest for a prototype of a package to automate a lot of the work of scraping webpages

Summary

  • Loading and saving R objects is very easy
  • Reading and writing dataframes is pretty easy
  • Extracting data from unstructured sources is about using regexps appropriately
    • Maybe not easy, but at least feasible