- Seen functions to load data in passing
- Learned about string manipulation and regexp
22 September 2014
- `save(thing, file="name")` saves `thing` in a file called `name` (conventional extension: `rda` or `Rda`)
- `load("name")` loads the object or objects stored in the file called `name`, with their old names

gmp <- read.table("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/06/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.Rda")
rm(gmp)
exists("gmp")
## [1] FALSE
not_gmp <- load(file="gmp.Rda")
colnames(gmp)
## [1] "MSA" "gmp" "pcgmp" "pop"
not_gmp
## [1] "gmp"
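The round trip above needs network access. A minimal self-contained sketch of the same `save()`/`load()` behaviour, assuming a made-up data frame (`toy_gmp`, not the real GMP data) and a temporary file, might look like this:

```r
# Hypothetical toy data standing in for gmp; the values are made up
toy_gmp <- data.frame(MSA = c("A", "B"), gmp = c(10, 20),
                      stringsAsFactors = FALSE)
path <- tempfile(fileext = ".Rda")  # temporary file, for illustration only

save(toy_gmp, file = path)          # store the object together with its name
rm(toy_gmp)                         # remove it from the workspace

# load() restores the object under its old name and returns
# the names of everything it restored, not the objects themselves
loaded_names <- load(path)
loaded_names
colnames(toy_gmp)
```

The return value of `load()` is only a character vector of names, which is why `not_gmp` above holds the string `"gmp"` rather than the data frame.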
- `data()` to load data sets that come with packages

data(cats,package="MASS")
summary(cats)
##  Sex         Bwt            Hwt
##  F:47   Min.   :2.00   Min.   : 6.30
##  M:97   1st Qu.:2.30   1st Qu.: 8.95
##         Median :2.70   Median :10.10
##         Mean   :2.72   Mean   :10.63
##         3rd Qu.:3.02   3rd Qu.:12.12
##         Max.   :3.90   Max.   :20.50
Note: `data()` returns the name of the loaded data file!
- `read.table()`
- `read.csv()` is a short-cut to set the options for reading comma-separated value (CSV) files
- `write.table()`, `write.csv()` write a dataframe into a file
- `load` or `save`
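As a rough illustration of how these functions fit together, the sketch below writes a small made-up data frame (`toy`, hypothetical) to a temporary CSV file and reads it back two ways. It assumes the file is simple enough that `read.table()` with `header` and `sep` set by hand behaves like `read.csv()`.

```r
# Hypothetical toy data; names and values are made up for illustration
toy <- data.frame(name = c("a", "b", "c"), x = c(1.5, 2, 3.25),
                  stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")

# Write the data frame as plain comma-separated text, without row names
write.csv(toy, file = csv_path, row.names = FALSE)
readLines(csv_path)   # the file is human-readable text

# read.csv() is read.table() with CSV-friendly defaults
toy_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
toy_tab <- read.table(csv_path, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)

all.equal(toy, toy_csv)
all.equal(toy_csv, toy_tab)
```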
- The `foreign` package on CRAN has tools for reading data files from lots of non-R statistical software
- `read.csv()`
- `read.xls()` from the `gdata` package
- Like `read.csv()`, it can take a URL or filename

require(gdata, quietly=TRUE)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
##
## The following object is masked from 'package:stats':
##
##     nobs
##
## The following object is masked from 'package:utils':
##
##     object.size
setwd("~/Downloads/")
gmp_2008_2013 <- read.xls("gdp_metro0914.xls",pattern="U.S.")
head(gmp_2008_2013)
##       U.S..metropolitan.areas X13.269.057 X12.994.636 X13.461.662
## 1                 Abilene, TX       5,725       5,239       5,429
## 2                   Akron, OH      28,663      27,761      28,616
## 3                  Albany, GA       4,795       4,957       4,928
## 4                  Albany, OR       3,235       3,064       3,050
## 5 Albany-Schenectady-Troy, NY      40,365      42,454      42,969
## 6             Albuquerque, NM      37,359      38,110      38,801
##   X13.953.082 X14.606.938 X15.079.920 .......
## 1       5,761       6,143       6,452     252
## 2      29,425      31,012      31,485      80
## 3       4,938       5,122       5,307     290
## 4       3,170       3,294       3,375     363
## 5      43,663      45,330      46,537      58
## 6      39,967      41,301      41,970      64
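The figures in this printout carry thousands separators (for example `5,725`), which suggests they were read in as text rather than as numbers. A minimal sketch of one way to convert such values follows. The helper name `comma_to_number` is hypothetical, and the sample values are copied from the printout rather than read from the spreadsheet.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric.
# as.character() guards against the column having been read as a factor.
comma_to_number <- function(x) {
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE))
}

# Values copied from the first column of figures in the printout
sample_values <- c("5,725", "28,663", "4,795")
comma_to_number(sample_values)
```

Here the comma is assumed to be a thousands separator, unlike the decimal comma in the net-worth figures such as `$5,3 B`.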
(`\1`, `\2`, etc.)

Remember that the lines giving net worth looked like
<td class="worth">$72 B</td>
or
<td class="worth">$5,3 B</td>
One regexp which catches this:
richhtml <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/labs/03/rich.html")
worth_pattern <- "\\$[0-9,]+ B"
worth_lines <- grep(worth_pattern, richhtml)
length(worth_lines)
## [1] 100
(that last to check we have the right number of matches)
Just using this gives us strings, including the markers we used to pin down where the information was:
worth_matches <- regexpr(worth_pattern, richhtml)
worths <- regmatches(richhtml, worth_matches)
head(worths)
## [1] "$72 B" "$58,5 B" "$41 B" "$36 B" "$36 B" "$35,4 B"
Now we'd need to get rid of the anchoring `$` and `B`; we could use `substr`, but…
Adding a capture group doesn't change what we match:
worth_capture <- worth_pattern <- "\\$([0-9,]+) B"
capture_lines <- grep(worth_capture, richhtml)
identical(worth_lines, capture_lines)
## [1] TRUE
but it does have an advantage with `regexec`:
worth_matches <- regmatches(richhtml[capture_lines], regexec(worth_capture, richhtml[capture_lines]))
worth_matches[1:2]
## [[1]]
## [1] "$72 B" "72"
##
## [[2]]
## [1] "$58,5 B" "58,5"
List with 1 element per matching line, giving the whole match and then each parenthesized matching sub-expression
Functions make the remaining manipulation easier:
second_element <- function(x) { return(x[2]) }
worth_strings <- sapply(worth_matches, second_element)
comma_to_dot <- function(x) {
  return(gsub(pattern=",",replacement=".",x))
}
worths <- as.numeric(sapply(worth_strings, comma_to_dot))
head(worths)
## [1] 72.0 58.5 41.0 36.0 36.0 35.4
Exercise: Write one function which takes a single line, gets the capture group, and converts it to a number
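One possible sketch of such a function is given here as an illustration, not as the official solution. The name `worth_from_line` is hypothetical, the pattern is the capture-group regexp from above, and returning `NA` for lines without a match is an assumption about the desired behaviour.

```r
# Hypothetical name; takes a single line of HTML and returns the net worth
# in billions as a number, or NA if the line holds no net-worth figure
worth_from_line <- function(line, pattern = "\\$([0-9,]+) B") {
  # regexec() reports the whole match and then each capture group
  pieces <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(pieces) < 2) {
    return(NA_real_)                 # assumption: NA signals "no match"
  }
  # The capture group uses a decimal comma; swap it for a dot
  as.numeric(gsub(",", ".", pieces[2], fixed = TRUE))
}

# The two sample lines shown earlier, plus one line without a net worth
worth_from_line('<td class="worth">$72 B</td>')
worth_from_line('<td class="worth">$5,3 B</td>')
worth_from_line('<td class="name">no figure here</td>')
```

Because it handles one line at a time, it can be applied to a whole page with `sapply(richhtml, worth_from_line, USE.NAMES = FALSE)`.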
Take in unstructured pages, return rigidly formatted data
Famous example from Valdis Krebs
Two books are linked if they're bought together at Amazon
Amazon gives this information away (to try to drive sales)
How would we replicate this?
<div class="shoveler" id="purchaseShvl">
  <div class="shoveler-heading">
    <h2>Customers Who Bought This Item Also Bought</h2>
  </div>
  <div class="shoveler-pagination" style="display:none">
    <span> </span>
    <span>
      Page <span class="page-number"></span> of <span class="num-pages"></span>
      <span class="start-over"><span class="a-text-separator"></span><a href="#" onclick="return false;" class="start-over-link">Start over</a></span>
    </span>
  </div>
  <div class="shoveler-button-wrapper" id="purchaseButtonWrapper">
    <a class="back-button" href="#Back" style="display:none" onclick="return false;"><span class="auiTestSprite s_shvlBack"><span>Back</span></span></a>
    <div class="shoveler-content">
      <ul tabindex="-1">
Here's the first of the also-bought books:
<li>
  <div class="new-faceout p13nimp" id="purchase_0387981403" data-asin="0387981403" data-ref="pd_sim_b_1">
    <a href="/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403/ref=pd_sim_b_1?ie=UTF8&refRID=1HZ0VDHEFFX3EM2WNWRH" class="sim-img-title" >
      <div class="product-image">
        <img src="http://ecx.images-amazon.com/images/I/31I22xsT%2BXL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
      </div>
      <span title="ggplot2: Elegant Graphics for Data Analysis (Use R!)">ggplot2: Elegant Graphics for Data …</span>
    </a>
    <div class="byline">
      <span class="carat">›</span>
We could extract the ISBN from this, and then go on to the next book, and so forth…
<div id="purchaseSimsData" class="sims-data" style="display:none" data-baseAsin="0387747303" data-deviceType="desktop" data-featureId="pd_sim" data-isAUI="1" data-pageId="0387747303" data-pageRequestId="1HZ0VDHEFFX3EM2WNWRH" data-reftag="pd_sim_b" data-vt="0387747303" data-wdg="book_display_on_website" data-widgetName="purchase">0387981403,0596809158,1593273843,1449316956, 0387938362,144931208X,0387790535,0387886974,0470973927,0387759689, 1439810184,1461413648,1461471370,1782162143,1441998896,1429224622, 1612903436,1441996494,1461468485,1617291560,1439831769,0321888030,1449319793, 1119962846,0521762936,1446200469,1449358659,1935182390,0123814855,1599941651, 0387759352,1461476178,0387773169,0387922970,0073523321,141297514X,1439840954, 1612900275,1449339735,052168689X,0387781706,1584884509,0387848576,1420068725, 1441915753,1466572841,1107422221,111844714X,0716762196,0133412938,1482203537, 0963488406,1466586966,0470463635,1493909827,1420079336,0321898656,1461422981, 158488424X,1441926127,1466570229,1590475348,1430266406,0071794565,0071623663, 111866146X,1441977864,1782160604,1449340377,1449309038,0963488414,0137444265, 1461406846,0073014664,1449370780,144197864X,3642201911,0534243126,1461443423, 158488651X,1449357105,1118208781,1420099604,1107057132,1449355730,1118356853, 1449361323,0470890819,0387245448,0521518148,0521169828,1584888490,1461464455, 0387781889,0387759581,0387717617,0123748569,188652923X,0155061399,0201076160</div>
In this case there's a big block which gives us the ISBNs of all the also-bought books
Strategy:
- Check the `robots.txt` file and respect it
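The extraction step for the big block of ISBNs can be sketched offline. The example below assumes an abbreviated copy of the `purchaseSimsData` block (most attributes and most ISBNs dropped for brevity) held in a string; the variable names and the `from`/`to` column names are hypothetical, and a real page may be laid out differently. No page is fetched here, and any real fetching should happen only after checking `robots.txt`.

```r
# Abbreviated, hand-copied sample of the block; not the full page
page <- paste0(
  '<div id="purchaseSimsData" class="sims-data" data-baseAsin="0387747303">',
  '0387981403,0596809158,1593273843, 144931208X,0387790535</div>'
)

# Capture everything between the opening tag and </div>:
# digits, X (ISBN-10 check character), commas and whitespace
block_pattern <- '<div id="purchaseSimsData"[^>]*>([0-9X,[:space:]]+)</div>'
block <- regmatches(page, regexec(block_pattern, page))[[1]]

# Remove whitespace inside the list, then split on commas
isbns <- strsplit(gsub("[[:space:]]", "", block[2]), ",", fixed = TRUE)[[1]]

# The book whose page this is, taken from the data-baseAsin attribute
base_isbn <- sub('.*data-baseAsin="([0-9X]+)".*', "\\1", block[1])

# One row per "bought together" link
also_bought <- data.frame(from = base_isbn, to = isbns,
                          stringsAsFactors = FALSE)
also_bought
```

Repeating this for the page of each ISBN in `to` would grow the network one book at a time.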