Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions.

General instructions for labs. Upload an R Markdown file, named .Rmd, to Blackboard. You will give the commands to answer each question in its own code block, which will also produce plots that will be automatically embedded in the output file. Each answer must be supported by written statements as well as any code used. Include the name of your lab partner at the top of the file.

R Markdown setup. Open a new R Markdown file; set the output to HTML mode and click “Knit HTML”. This should produce a web page with the knitting procedure executing your code blocks. You can edit this new file to produce your homework submission. Alternatively, you can start from the lab’s R Markdown file posted on the course website, as a template.

Part I: Rich folks

The file http://www.stat.cmu.edu/~ryantibs/statcomp-F15/labs/rich.html contains a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from webpages.

  1. Use the readLines() function to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?

  2. Open the file in a text editor (not as a webpage). Find the entries for Bill Gates and for Elon Musk. Give the text of the lines from the file which record their net worths.

  3. Write a regular expression which should capture a person’s net worth, and save this string as a variable called pattern. Use the grep() function, with pattern, to check that this has exactly 100 matches in richhtml, and that the expression you created is indeed matching the actual net worths.

  4. The following code block

    matches = regexpr(pattern,richhtml)
    networths = regmatches(richhtml,matches)

    when run with your pattern variable from problem 3, will extract all of the net worths in the Forbes list. Run it (i.e., remove eval=FALSE from the Rmd file for this lab!), and check the following.

    (Hint: it will also be helpful here to look at the result of your grep() call from problem 3, with value=FALSE, which returns the indices of the matching lines; that way, you can inspect the lines right before a matched line to see the identities of the persons…)

    1. There are 100 net worths.
    2. The largest net worth is that of Bill Gates, and there is only one person worth that much.
    3. The Koch brothers have the same net worth.
    4. There are 60 people whose net worth is higher than that of Elon Musk, and 4 people total (including Elon Musk) with that same net worth.
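
The mechanics of grep(), regexpr(), and regmatches() used in problems 1 through 4 can be sketched on a toy example. Everything here is an assumption made for illustration: the vector toy_html is invented, its markup is hypothetical and need not resemble the real rich.html, and toy_pattern is only one possible expression for this toy data, not the pattern the lab expects.

```r
# Toy stand-in for a few lines of HTML (hypothetical markup; the real
# rich.html may be laid out quite differently).
toy_html <- c(
  '<td class="name">Person A</td>',
  '<td class="worth">$72 B</td>',
  '<td class="name">Person B</td>',
  '<td class="worth">$7,7 B</td>'
)

# Number of lines, and total number of characters across all lines.
length(toy_html)
sum(nchar(toy_html))

# Hypothetical pattern: a dollar sign, digits, an optional comma followed
# by more digits, a space, and a capital B.
toy_pattern <- "\\$[0-9]+(,[0-9]+)? B"

# With value=FALSE (the default), grep() returns indices of matching lines,
# so the line just before each match can be inspected.
hits <- grep(toy_pattern, toy_html)
toy_html[hits - 1]

# regexpr() locates the first match in each line; regmatches() pulls out
# the matched substrings (lines without a match are dropped).
toy_matches <- regexpr(toy_pattern, toy_html)
toy_worths <- regmatches(toy_html, toy_matches)
toy_worths
```

Note that regexpr() reports only the first match per line, so this approach assumes each line holds at most one net worth.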

Part II: Spread of wealth

  1. The Forbes website writes net worths in the form “$7,7 B” to mean \(7.7 \times {10}^{9}\) dollars. Write code to convert from the Forbes format stored in networths from problem 4, to floating point numbers, and store the result in a variable called networths.num. Check the following:
    1. networths.num is indeed a vector, of length 100 and type double.
    2. All of the entries in networths.num are greater than 1 billion, and less than 100 billion.
    3. The largest entry in networths.num matches the net worth of Bill Gates.
    4. There are 4 entries in networths.num matching the net worth of Elon Musk.
  2. Answer the following using the networths.num vector from problem 1.
    1. What is the median net worth of these 100 people?
    2. What is the mean net worth of these 100 people?
    3. How many of these 100 individuals were worth at least 4 billion dollars? 7 billion? 20 billion?
  3. Again, answer using the networths.num vector.
    1. What is the total net worth of the 100 richest people?
    2. What fraction of that total is held by the 5 richest people? 10 richest people? 20 richest people?
    3. What is the smallest number of people who together hold at least 75 percent of that total wealth?
    4. Create an empty vector called fracheld of length 100. Populate the first entry with the fraction of total wealth held by the richest person (among the 100 total), the second entry with the fraction of wealth held by the richest 2 people, the third entry with the fraction of wealth held by the richest 3 people, and so forth. (Hint: you can do this with a for loop. Alternatively, you can use the cumsum() function.) Create a line plot of fracheld versus the numbers 1 through 100. Check that this line plot visually matches your answers from parts 1 through 3.
    5. There are about 118 million households in the US, with a total net worth of about 85 trillion dollars (http://www.federalreserve.gov/releases/z1/current/z1.pdf). What fraction of that total wealth is held by the 100 richest people on the Forbes list? What is the ratio of the mean net worth of the richest 100 to the mean net worth of a US household?
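
The conversion and cumulative-share steps in Part II can be sketched on a small toy vector. The four values in toy_worths are invented for illustration and are not taken from the Forbes list; the names toy_num and toy_frac are likewise hypothetical. This is one possible approach using gsub() and cumsum(), under the assumption that every entry has the form "$" + digits (with an optional decimal comma) + " B".

```r
# Invented net worths in the Forbes-style format (not real data).
toy_worths <- c("$72 B", "$7,7 B", "$10 B", "$1,3 B")

# Strip the leading "$" and trailing " B", turn the decimal comma into a
# decimal point, then convert to numbers and scale from billions to dollars.
cleaned <- gsub(",", ".", gsub("^\\$| B$", "", toy_worths))
toy_num <- as.numeric(cleaned) * 1e9

# Basic checks and summaries.
is.vector(toy_num)
typeof(toy_num)
median(toy_num)
mean(toy_num)
sum(toy_num >= 4e9)   # how many are worth at least 4 billion dollars

# Cumulative share of the total held by the richest 1, 2, 3, ... people.
sorted <- sort(toy_num, decreasing = TRUE)
toy_frac <- cumsum(sorted) / sum(sorted)

# Smallest number of people who together hold at least 75 percent.
which(toy_frac >= 0.75)[1]

# Line plot of cumulative share against rank.
plot(seq_along(toy_frac), toy_frac, type = "l",
     xlab = "Number of richest people", ylab = "Fraction of total held")

# Mean household net worth implied by the figures quoted in the text:
# about 85 trillion dollars spread over about 118 million households.
mean_household <- 85e12 / 118e6
mean(toy_num) / mean_household
```

Sorting in decreasing order before calling cumsum() matters: without it, the cumulative shares follow the order of the input rather than the ranking by wealth.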