Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions.
General instructions for labs. Upload an R Markdown file, named
R Markdown setup. Open a new R Markdown file; set the output to HTML mode and click “Knit HTML”. This should produce a web page with the knitting procedure executing your code blocks. You can edit this new file to produce your homework submission. Alternatively, you can start from the lab’s R Markdown file posted on the course website, as a template.
The file http://www.stat.cmu.edu/~ryantibs/statcomp-F15/labs/rich.html contains a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from webpages.
Use the readLines() function to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?
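As a rough sketch of the mechanics, the snippet below reads a small temporary file instead of the real page, then counts lines and characters. The file contents and the names tmp, demo.lines, n.lines, and n.chars are made up for illustration; they are not part of the lab.

```r
# Hypothetical stand-in for the real page: a three-line temporary file.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<td>Example Person</td>", "</html>"), tmp)

# readLines() returns a character vector with one element per line.
demo.lines <- readLines(tmp)

# Number of lines is the length of that vector.
n.lines <- length(demo.lines)

# nchar() counts characters per line (newlines excluded); sum for the total.
n.chars <- sum(nchar(demo.lines))
```

The same calls should work when readLines() is given a URL instead of a local path.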
Open the file in a text editor (not as a webpage). Find the entries for Bill Gates and for Elon Musk. Give the text of the lines from the file which record their net worths.
Write a regular expression which should capture a person’s net worth, and save this string as a variable called pattern. Use the grep() function, with pattern, to check that this has exactly 100 matches in richhtml, and that the expression you created is indeed matching the actual net worths.
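A minimal sketch of checking a pattern with grep(), on made-up data rather than the Forbes page. The vector demo.lines and the pattern demo.pattern are hypothetical and deliberately unrelated to the net worth format you need to find.

```r
# Hypothetical toy data: weights recorded on every second line.
demo.lines <- c("item: apple", "weight: 12.5 kg", "item: pear", "weight: 3 kg")

# One or more digits or periods, followed by a space and "kg".
demo.pattern <- "[0-9.]+ kg"

# By default grep() returns the indices of the matching lines.
demo.idx <- grep(demo.pattern, demo.lines)

# Count the matches, and inspect the matching lines themselves.
n.matches <- length(demo.idx)
demo.hits <- grep(demo.pattern, demo.lines, value = TRUE)
```

Comparing n.matches against the expected count, and reading demo.hits by eye, is one way to confirm that a pattern matches what you intended and nothing else.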
The following code block
matches = regexpr(pattern,richhtml)
networths = regmatches(richhtml,matches)
when run with your pattern variable from problem 3, will extract all of the net worths in the Forbes list. Run it (i.e., remove eval=FALSE from the Rmd file for this lab!), and check the following.
(Hint: it will also be helpful here to look at the result of your grep() call from problem 3, with value=FALSE; that way, you can inspect the lines right before a matched line to see the identities of the persons…)
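The regexpr() and regmatches() pairing, together with the hint about inspecting neighboring lines, can be sketched on made-up data as follows. All names prefixed with demo. are hypothetical, and the assumption that the identifying line sits exactly one line before each match holds for this toy data only; check the actual offset in the file.

```r
# Hypothetical toy data: an identifying line, then the line with the value.
demo.lines <- c("item: apple", "weight: 12.5 kg", "item: pear", "weight: 3 kg")
demo.pattern <- "[0-9.]+ kg"

# regexpr() gives the position of the first match in each line (-1 if none).
demo.matches <- regexpr(demo.pattern, demo.lines)

# regmatches() pulls out the matched substrings, dropping non-matching lines.
demo.weights <- regmatches(demo.lines, demo.matches)

# Indices of matching lines; subtracting 1 looks at the line just before each.
demo.idx <- grep(demo.pattern, demo.lines)
demo.context <- demo.lines[demo.idx - 1]
```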
Convert networths from problem 4 to floating point numbers, and store the result in a variable called networths.num. Check the following:
networths.num is indeed a vector, of length 100 and type double.
All entries of networths.num are greater than 1 billion, and less than 100 billion.
One entry of networths.num matches the net worth of Bill Gates.
There is an entry of networths.num matching the net worth of Elon Musk.
networths.num vector from problem 4.
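Converting extracted strings to numbers and then checking the result can be sketched like this. The string format ("1.5 thousand") and the names demo.strings and demo.num are assumptions made for illustration; the real net worth strings have their own format and scale.

```r
# Hypothetical extracted strings, in a made-up "<number> thousand" format.
demo.strings <- c("1.5 thousand", "20 thousand", "7 thousand")

# Strip the unit text, convert to numeric, and rescale to plain units.
demo.num <- as.numeric(sub(" thousand", "", demo.strings, fixed = TRUE)) * 1e3

# Structural checks: vector, expected length, double storage type.
is.vector(demo.num)
length(demo.num) == length(demo.strings)
typeof(demo.num) == "double"

# Range check: every entry lies strictly between two bounds.
all(demo.num > 1e3 & demo.num < 1e5)

# Membership check: is a particular known value present?
any(demo.num == 20000)
```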
networths.num vector.
Create a vector called fracheld of length 100. Populate the first entry with the fraction of total wealth held by the richest person (among the 100 total), the second entry with the fraction of wealth held by the richest 2 people, the third entry with the fraction of wealth held by the richest 3 people, and so forth. (Hint: you can do this with a for loop. Alternatively, you can use the cumsum() function.) Create a line plot of fracheld versus the numbers 1 through 100. Check that this line plot visually matches your answers from a through c.
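Both approaches from the hint can be sketched on a made-up vector of four values. The names demo.wealth, demo.frac, and demo.frac.loop are hypothetical, and the sort() call is only needed if the values are not already in decreasing order.

```r
# Hypothetical wealth values for four people, not necessarily sorted.
demo.wealth <- c(30, 50, 5, 15)

# Sort richest first, then take running totals divided by the grand total.
demo.sorted <- sort(demo.wealth, decreasing = TRUE)
demo.frac <- cumsum(demo.sorted) / sum(demo.sorted)

# Equivalent for loop: entry k is the share held by the richest k people.
demo.frac.loop <- numeric(length(demo.sorted))
for (k in seq_along(demo.sorted)) {
  demo.frac.loop[k] <- sum(demo.sorted[1:k]) / sum(demo.sorted)
}

# Line plot of the cumulative share against the number of people included.
plot(seq_along(demo.frac), demo.frac, type = "l",
     xlab = "Number of richest people", ylab = "Fraction of total wealth")
```

With very skewed data, a curve like this rises steeply at first and then flattens, which is what the visual check against your earlier answers should reflect.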