Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions.
General instructions for labs. Upload an R Markdown file, named
R Markdown setup. Open a new R Markdown file; set the output to HTML mode and click “Knit HTML”. This should produce a web page showing the results of running your code blocks. You can edit this new file to produce your homework submission. Alternatively, you can use the lab’s R Markdown file, posted on the course website, as a template.
The file http://www.stat.cmu.edu/~ryantibs/statcomp-F15/labs/rich.html contains a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from webpages.
Use the readLines() function to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?
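As a minimal sketch of the counting step, the block below writes a tiny made-up file to a temporary location and reads it back. The names toy.html, n.lines and n.chars and the file contents are hypothetical; the real lab reads rich.html instead.

```r
# Write a small, made-up HTML fragment to a temporary file (not the real rich.html).
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<td>Person A</td>", "<td>$12.5 B</td>", "</html>"), tmp)

# readLines() returns a character vector with one element per line.
toy.html <- readLines(tmp)

# Number of lines is the length of the vector.
n.lines <- length(toy.html)

# nchar() counts the characters in each line (newlines are not included);
# summing gives the total across the file.
n.chars <- sum(nchar(toy.html))

n.lines
n.chars
```

Note that this total excludes the newline character at the end of each line, since readLines() strips it.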
Open the file in a text editor (not as a webpage). Find the entries for Bill Gates and for Elon Musk. Give the text of the lines from the file which record their net worths.
Write a regular expression which should capture a person’s net worth, and save this string as a variable called pattern. Use the grep() function, with pattern, to check that this has exactly 100 matches in richhtml, and that the expression you created is indeed matching the actual net worths.
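To illustrate how grep() can be used for this kind of check, here is a sketch on made-up lines. The markup, the "$12.5 B" format and the names toy.html and toy.pattern are assumptions for illustration only; inspect the real file to see how net worths are actually written there.

```r
# Made-up lines; the real file's markup and number format may differ.
toy.html <- c('<td class="name">Person A</td>',
              '<td class="worth">$12.5 B</td>',
              '<td class="name">Person B</td>',
              '<td class="worth">$7 B</td>')

# A dollar sign, digits, an optional decimal part, then " B".
# The dollar sign is escaped because it otherwise means "end of string".
toy.pattern <- "\\$[0-9]+(\\.[0-9]+)? B"

# Indices of the lines that contain a match.
match.idx <- grep(toy.pattern, toy.html)

# The matching lines themselves, for checking by eye.
match.lines <- grep(toy.pattern, toy.html, value = TRUE)

length(match.idx)
match.lines
```

Counting the matches confirms how many lines matched; printing the matching lines lets you confirm they are the lines you intended.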
The following code block
matches = regexpr(pattern,richhtml)
networths = regmatches(richhtml,matches)
when run with your pattern variable from problem 3, will extract all of the net worths in the Forbes list. Run it (i.e., remove eval=FALSE from the Rmd file for this lab!), and check the following.
(Hint: it will also be helpful here to look at the result of your grep() call from problem 3, with value=FALSE; that way, you can inspect the lines right before a matched line to see the identities of the persons…)
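The hint can be sketched on made-up lines as follows. It assumes, hypothetically, that each person's name sits on the line just before their net worth; toy.html, toy.pattern and the "$12.5 B" format are illustrative and not taken from the real file.

```r
# Made-up lines in which each name precedes its net worth (an assumption).
toy.html <- c('<td class="name">Person A</td>',
              '<td class="worth">$12.5 B</td>',
              '<td class="name">Person B</td>',
              '<td class="worth">$7 B</td>')
toy.pattern <- "\\$[0-9]+(\\.[0-9]+)? B"

# regexpr() finds the first match in each line (-1 where there is none);
# regmatches() then extracts just the matched substrings.
toy.matches <- regexpr(toy.pattern, toy.html)
toy.networths <- regmatches(toy.html, toy.matches)

# With value = FALSE (the default), grep() returns line indices,
# so subtracting 1 gives the line right before each matched line.
idx <- grep(toy.pattern, toy.html, value = FALSE)
preceding <- toy.html[idx - 1]

toy.networths
preceding
```

Pairing toy.networths with preceding makes it easy to see which extracted value belongs to which person.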
Convert networths from problem 4 to floating point numbers, and store the result in a variable called networths.num. Check the following:
networths.num is indeed a vector, of length 100 and type double.
networths.num are greater than 1 billion, and less than 100 billion.
networths.num matches the net worth of Bill Gates.
networths.num matching the net worth of Elon Musk.
networths.num vector from problem 4.
networths.num vector.
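The conversion of extracted net worth strings to floating point numbers might look like the sketch below. It assumes, hypothetically, strings of the form "$12.5 B" with "B" meaning billions; the names toy.networths and toy.networths.num are made up, and the real strings may need different cleaning.

```r
# Made-up extracted strings; the real ones may be formatted differently.
toy.networths <- c("$12.5 B", "$7 B")

# Drop everything that is not a digit or a decimal point.
stripped <- gsub("[^0-9.]", "", toy.networths)

# Convert to double and scale from billions to dollars.
toy.networths.num <- as.numeric(stripped) * 1e9

# Checks of the kind the problem asks for.
is.vector(toy.networths.num)
length(toy.networths.num)
typeof(toy.networths.num)
all(toy.networths.num > 1e9 & toy.networths.num < 100e9)
```

If as.numeric() returns NA for any entry, the cleaning step has left behind something that is not a number, which is worth checking before going further.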
Create a vector called fracheld of length 100. Populate the first entry by the fraction of total wealth held by the richest person (among the 100 total), populate the second entry with the fraction of wealth held by the richest 2 people, the third entry with the fraction of wealth held by the richest 3 people, and so forth. (Hint: you can do this with a for loop. Alternatively, you can use the cumsum() function.) Create a line plot of fracheld versus the numbers 1 through 100. Check that this line plot visually matches your answers from a through c.
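Both approaches from the hint can be sketched on a made-up vector of four net worths. The values and the names toy.num, toy.fracheld and toy.fracheld.loop are hypothetical; the lab itself uses the 100 values in networths.num.

```r
# Made-up net worths in dollars (not from the Forbes list).
toy.num <- c(30, 40, 10, 20) * 1e9

# Sort from richest to poorest so that entry i covers the richest i people.
sorted <- sort(toy.num, decreasing = TRUE)

# cumsum() version: running total divided by the grand total.
toy.fracheld <- cumsum(sorted) / sum(sorted)

# Equivalent for-loop version.
toy.fracheld.loop <- numeric(length(sorted))
for (i in seq_along(sorted)) {
  toy.fracheld.loop[i] <- sum(sorted[1:i]) / sum(sorted)
}

# Line plot of the fraction held against the number of people.
plot(seq_along(toy.fracheld), toy.fracheld, type = "l",
     xlab = "Number of richest people", ylab = "Fraction of total wealth held")
```

The last entry should always equal 1, which is a quick sanity check on either version.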