Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 17 of the 45 total points for Homework 2.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Quantifiers, extracting substrings

Below, we read in lines of data from the Advanced National Seismic System (ANSS), on earthquakes of magnitude 6+, between 2002 and 2016. We display the first 15 lines. (You don’t have to do anything yet.)

anss.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/anss.html")
head(anss.lines, 15)

##  [1] "<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>"     
##  [2] "<li>catalog=ANSS"                                                                                
##  [3] "<li>start_time=2002/01/01,00:00:00"                                                              
##  [4] "<li>end_time=2016/08/01,00:00:00"                                                                
##  [5] "<li>minimum_magnitude=6.0"                                                                       
##  [6] "<li>maximum_magnitude=10"                                                                        
##  [7] "<li>event_type=E"                                                                                
##  [8] "</ul>"                                                                                           
##  [9] "<PRE>"                                                                                           
## [10] "Date       Time             Lat       Lon  Depth   Mag Magt  Nst Gap  Clo  RMS  SRC   Event ID"  
## [11] "----------------------------------------------------------------------------------------------"  
## [12] "2002/01/01 10:39:06.82 -55.2140 -129.0000  10.00  6.00   Mw   78          1.07  NEI 200201014017"
## [13] "2002/01/01 11:29:22.73   6.3030  125.6500 138.10  6.30   Mw  236          0.90  NEI 200201014018"
## [14] "2002/01/02 14:50:33.49 -17.9830  178.7440 665.80  6.20   Mw  215          1.08  NEI 200201024034"
## [15] "2002/01/02 17:22:48.76 -17.6000  167.8560  21.00  7.20   Mw  427          0.90  NEI 200201024041"

This looks like webpage code mixed in with earthquake data. We don’t care about the first 11 lines, and it looks as if the data we want starts on line 12. Importantly, every line of data begins with a date, of the form YYYY/MM/DD, as in “2002/01/01”. Design a regular expression, call it date.pattern, to match these dates. (Hint: use quantifiers to make date.pattern concise.) Use this and grep() to retrieve the lines of text containing earthquake data. Call the result date.lines. How many lines of data are there? Show the first 2 and the last 2 lines. When was the last earthquake recorded and what was its magnitude?
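As a rough illustration of how quantifiers and grep() fit together, here is a minimal sketch on a small made-up vector. The names toy.lines, toy.date.pattern, toy.idx, and toy.date.lines are hypothetical and are not part of the lab; the pattern shown is one possible choice, not the only one.

```r
# Toy lines standing in for anss.lines (hypothetical, not the real data)
toy.lines = c("<PRE>",
              "Date       Time             Lat       Lon",
              "2002/01/01 10:39:06.82 -55.2140 -129.0000",
              "2002/01/02 14:50:33.49 -17.9830  178.7440")

# The quantifier {n} repeats the preceding item exactly n times
toy.date.pattern = "[0-9]{4}/[0-9]{2}/[0-9]{2}"

# grep() returns indices by default, or the matching strings with value=TRUE
toy.idx = grep(toy.date.pattern, toy.lines)
toy.date.lines = grep(toy.date.pattern, toy.lines, value=TRUE)

length(toy.date.lines)   # number of matching lines
head(toy.date.lines, 1)  # first matching line
tail(toy.date.lines, 1)  # last matching line
```

The same calls, with larger arguments to head() and tail(), show the first and last few matching lines of a longer vector.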

Hw2 Q5 (1 point). Check that all the lines in date.lines actually start with a date, of the form YYYY/MM/DD, rather than contain a date of this form somewhere in the middle of the text. (Hint: one clean way to do this is with anchoring. Also, it might help to note that you can look for non-matches using invert=TRUE.)
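To illustrate the hint, the sketch below shows anchoring with "^" combined with invert=TRUE on two made-up strings. The names toy.lines, anchored.pattern, and non.starts are hypothetical.

```r
# Two toy strings: one starts with a date, one contains a date mid-line
toy.lines = c("2002/01/01 10:39:06.82 -55.2140",
              "see also 2002/01/01 for details")

# "^" anchors the match to the start of the string
anchored.pattern = "^[0-9]{4}/[0-9]{2}/[0-9]{2}"

# invert=TRUE returns the indices of the elements that do NOT match
non.starts = grep(anchored.pattern, toy.lines, invert=TRUE)
non.starts
```

An empty result from the inverted search would mean that every string begins with a date.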

From date.lines, extract just the date strings themselves, and call the resulting vector date.str.vec. (Hint: use regexpr() and regmatches().) Check that the first three are “2002/01/01”, “2002/01/01”, “2002/01/02”, and that the length of date.str.vec matches that of date.lines.
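The pairing of regexpr() and regmatches() can be sketched on toy input as follows. The names toy.lines, toy.date.pattern, toy.match, and toy.dates are hypothetical.

```r
toy.lines = c("2002/01/01 10:39:06.82 -55.2140 -129.0000",
              "2002/01/02 14:50:33.49 -17.9830  178.7440")
toy.date.pattern = "[0-9]{4}/[0-9]{2}/[0-9]{2}"

# regexpr() records the position and length of the first match in each string
toy.match = regexpr(toy.date.pattern, toy.lines)

# regmatches() uses that match data to pull out the matched substrings
toy.dates = regmatches(toy.lines, toy.match)
toy.dates
length(toy.dates) == length(toy.lines)  # one extracted date per line
```

Note that regmatches() silently drops strings with no match, which is why comparing lengths is a useful check.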

Hw2 Q6 (2 points). Which 5 days witnessed the most earthquakes, and how many earthquakes occurred on each of these days? (Hint: use table() and sort().) Also, what happened on the day with the most earthquakes: can you find any references to this day in the news?

Hw2 Q7 (2 points). How many earthquakes were there during each year in between 2002 and 2012? What year had the most? (Hint: use substr(), then table().)
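The hints for the two questions above rely on table(), sort(), and substr(). A minimal sketch on made-up date strings (the names toy.dates, day.counts, toy.years, and year.counts are hypothetical) might look like this:

```r
# Made-up date strings, not the real data
toy.dates = c("2002/01/01", "2002/01/01", "2002/01/02",
              "2003/05/26", "2003/05/26", "2003/05/26")

# table() counts occurrences of each distinct value; sort() orders the counts
day.counts = sort(table(toy.dates), decreasing=TRUE)
head(day.counts, 2)  # the 2 most frequent days

# substr(x, 1, 4) keeps characters 1 through 4, here the year
toy.years = substr(toy.dates, 1, 4)
year.counts = table(toy.years)
year.counts
```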

Look back at the lines of earthquake data printed at the start of this lab document. The columns for Lat and Lon give the latitude and longitude, respectively, of the earthquake. Importantly, each entry takes the form X.XXXX, XX.XXXX, XXX.XXXX, or any of these forms with a leading minus sign, where X is a digit. Design a regular expression to match these entries, call it geo.pattern. Test it out on the trial string vector below, with grep(), and make sure that you match all the strings. (Hint: build the regular expression, “left to right”, following this logic: an optional minus sign, 1 to 3 digits, a period, then exactly 4 digits.)

trial.str.vec = c("-55.2140", "-129.0000", "6.3030", "125.6500", "-17.9830")
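The “left to right” logic in the hint can be sketched as below. This is one possible translation under the stated assumptions, and the names toy.geo.pattern, toy.vec, and toy.hits are hypothetical.

```r
# "-?"        an optional minus sign
# "[0-9]{1,3}" 1 to 3 digits
# "\\."        a literal period (escaped, since "." alone matches any character)
# "[0-9]{4}"   exactly 4 digits
toy.geo.pattern = "-?[0-9]{1,3}\\.[0-9]{4}"

# The last two strings are deliberate non-matches
toy.vec = c("-55.2140", "6.3030", "125.6500", "abc", "12.34")
toy.hits = grep(toy.geo.pattern, toy.vec)
toy.hits
```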

Design a regular expression geo.pattern.sp that captures not only the latitude/longitude pattern (like geo.pattern), but additionally, any number of leading spaces (1 or more). Test it out on the trial string vector below, with regexpr() and regmatches(). Make sure that you match all the strings, and in each case the extracted text is the entire string.

trial.str.vec.sp = c(" -55.2140", " -129.0000", "    6.3030", "  125.6500", "  -17.9830")
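As a sketch of the check described above, the quantifier "+" (1 or more) can be placed after a space, and the extracted text compared with the input. The names toy.sp.pattern, toy.vec.sp, and toy.extracted are hypothetical.

```r
# " +" matches 1 or more spaces, followed by the latitude/longitude form
toy.sp.pattern = " +-?[0-9]{1,3}\\.[0-9]{4}"
toy.vec.sp = c(" -55.2140", "    6.3030")

toy.extracted = regmatches(toy.vec.sp, regexpr(toy.sp.pattern, toy.vec.sp))

# TRUE only if every extracted match is the entire original string
all(toy.extracted == toy.vec.sp)
```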

Finally, design a regular expression geo.pattern.pair that captures a latitude pattern, then any number of spaces (1 or more), then a longitude pattern. Really, this is just the concatenation of the regexes you already designed, geo.pattern and geo.pattern.sp. Use geo.pattern.pair, with regexpr() and regmatches(), in order to extract the latitude/longitude pairs from each line of earthquake data in date.lines. Call the result lat.lon.pairs, and display the first 3 entries, checking visually that it matches the results printed at the top of this lab.
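Concatenating regexes is ordinary string concatenation, which paste0() handles. The sketch below applies this to a single made-up line; the names toy.geo, toy.geo.sp, toy.pair, toy.line, and toy.found are hypothetical.

```r
toy.geo = "-?[0-9]{1,3}\\.[0-9]{4}"
toy.geo.sp = paste0(" +", toy.geo)      # spaces, then the same form
toy.pair = paste0(toy.geo, toy.geo.sp)  # latitude, spaces, longitude

toy.line = "2002/01/01 10:39:06.82 -55.2140 -129.0000  10.00  6.00"
toy.found = regmatches(toy.line, regexpr(toy.pair, toy.line))
toy.found
```

The seconds field "06.82" is skipped because it has only 2 digits after the period, so the first match is the latitude/longitude pair.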

Splitting substrings

Apply strsplit() to lat.lon.pairs[1:2] with split=" ", and show the results. Explain: why do the two vectors, in the first and second elements of the returned list, have different numbers of strings?
Apply strsplit() to lat.lon.pairs[1:2] with split set equal to a regex that captures any number of spaces (1 or more), and show the results. Check that the two elements of the return list have the same number of strings. Then apply strsplit() to all of lat.lon.pairs with split set equal to the same regex, and save the result as lat.lon.split.
Define lat.lon.vec = unlist(lat.lon.split), to unravel the list into a vector. Display the first 4 elements. Then, cast lat.lon.vec to be a numeric vector, and display the first 4 elements. Lastly, use the matrix() command to turn lat.lon.vec into a matrix with 2 rows and 2346 columns, call the result lat.lon.mat.
Use lat.lon.mat to answer the following questions: what was the biggest latitude of an earthquake in the data? Smallest latitude? Biggest and smallest longitudes? Are these values surprising? (Hint: what are the maximum and minimum possible values of latitude and longitude?)
Uncomment and run the code below on lat.lon.mat to see a visualization of where the earthquakes occurred. Aside from installing the R package “maps” (in case you don’t already have this installed), you don’t have to do anything here, this is just for you to see a neat visualization of the earthquake data. (Do you see the obvious “earthquake belts”?) We’ll cover plotting tools next week.

#library(maps)
#map("world")
#points(lat.lon.mat[2,], lat.lon.mat[1,], pch=20, col="red")
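The splitting steps in this section can be sketched on two made-up pairs, as follows. The names toy.pairs, toy.naive, toy.split, toy.vec, and toy.mat are hypothetical.

```r
# Two toy pairs: the second has two spaces between the numbers
toy.pairs = c("-55.2140 -129.0000", "6.3030  125.6500")

# Splitting on a single space leaves an empty string between consecutive spaces
toy.naive = strsplit(toy.pairs, split=" ")
lengths(toy.naive)

# Splitting on " +" treats any run of spaces as one separator
toy.split = strsplit(toy.pairs, split=" +")
lengths(toy.split)

# unlist() flattens the list; as.numeric() converts the strings to numbers
toy.vec = as.numeric(unlist(toy.split))

# matrix() fills column by column, so row 1 holds latitudes, row 2 longitudes
toy.mat = matrix(toy.vec, nrow=2)
toy.mat
range(toy.mat[1,])  # smallest and largest latitude
```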

Hw2 Q8 (12 points). Go back to the data in date.lines. Following steps similar to the ones above, in which you extracted the latitude and longitude of earthquakes, extract the depth and magnitude of earthquakes.

To design the appropriate regex, you should proceed in small increments, as above. First design a regex to capture the magnitude: what you know about the magnitude is that it is a number of the form X.XX where the first X is between 6 and 9. Then design a regex to capture any number of leading spaces (1 or more). Then design a regex to capture the depth: what you know about the depth is that it is a number of the form X.XX or XX.XX or XXX.XX. Concatenate these 3 regexes, in the order depth, spaces, magnitude (the order in which the columns appear in each line), into a single regex, call it dep.mag.pattern, to catch the depth/magnitude pairs. This is almost all you need, but the problem is that sometimes (rarely) using dep.mag.pattern will pick up entries that appear “too early” in a given line. For example, uncomment and examine the following:

#date.lines[1701]

Here our regex dep.mag.pattern would pick up the 9.99 that appears in the latitude column, and improperly treat this as a magnitude measurement. Therefore, you should concatenate together the regex geo.pattern.pair you defined above in the lab, with a regex capturing any number of spaces (1 or more), with dep.mag.pattern. Call this quad.pattern: this will capture the quadruple of latitude, longitude, depth, and magnitude, in each line of data.

Once you have defined quad.pattern, you can then use strsplit(), unlist(), as.numeric(), and matrix() as above to get a matrix with 4 rows and 2346 columns. Its rows will contain the latitude, longitude, depth, and magnitude measurements, in this order.
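To see why prefixing the latitude/longitude pattern helps, consider the sketch below on a single made-up line (not taken from the data). All names here (toy.geo, toy.pair, toy.dep.mag, toy.quad, toy.line, early, quad, quad.num) are hypothetical, and the patterns are one possible reading of the descriptions above.

```r
toy.geo = "-?[0-9]{1,3}\\.[0-9]{4}"
toy.pair = paste0(toy.geo, " +", toy.geo)

# Depth (1 to 3 digits, period, 2 digits), spaces, magnitude (6 to 9, period, 2 digits)
toy.dep.mag = "[0-9]{1,3}\\.[0-9]{2} +[6-9]\\.[0-9]{2}"
toy.quad = paste0(toy.pair, " +", toy.dep.mag)

# Made-up line in which the seconds and latitude resemble a depth/magnitude pair
toy.line = "2010/01/01 10:39:16.82   6.3030  125.6500 138.10  6.30   Mw"

# Without the prefix, the match starts too early in the line
early = regmatches(toy.line, regexpr(toy.dep.mag, toy.line))
early

# With the prefix, the match is the intended quadruple
quad = regmatches(toy.line, regexpr(toy.quad, toy.line))
quad

# Split on runs of spaces and convert to numbers
quad.num = as.numeric(strsplit(quad, split=" +")[[1]])
quad.num
```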

Use the depth and magnitude measurements to answer the following questions. What is the largest magnitude of an earthquake since 2002? On what day did it occur, and roughly where did it occur (use the latitude/longitude pair to look up the location on a map)? How many earthquakes of magnitude 8.5 or larger were there since 2002? What is the largest depth of an earthquake since 2002? How many earthquakes of depth 300 (kilometers, the units used in the data table) or larger were there since 2002? What is the correlation between earthquake depth and earthquake magnitude?
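The summaries asked for above use a handful of base R functions. A minimal sketch on made-up numbers (the vectors toy.depth and toy.mag, and the other names, are hypothetical and not drawn from the data):

```r
# Made-up depth and magnitude values
toy.depth = c(10.0, 138.1, 665.8, 21.0)
toy.mag = c(6.0, 6.3, 6.2, 8.6)

max(toy.mag)                    # largest magnitude
toy.big = which.max(toy.mag)    # index of the largest magnitude
n.big = sum(toy.mag >= 8.5)     # TRUE counts as 1, so this counts the cases
n.deep = sum(toy.depth >= 300)  # same idea for depth
toy.cor = cor(toy.depth, toy.mag)  # sample correlation between the two vectors
```

The index from which.max() can be used to look up the matching entry in any vector aligned with the magnitudes, such as a vector of dates.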

Lab 3w: Quantifiers and Scope, Extractions and Replacements

Statistical Computing, 36-350

Wednesday September 14, 2016
