Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 17 of the 45 total points for Homework 2.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Quantifiers, extracting substrings

anss.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/anss.html")
head(anss.lines, 15)
##  [1] "<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>"     
##  [2] "<li>catalog=ANSS"                                                                                
##  [3] "<li>start_time=2002/01/01,00:00:00"                                                              
##  [4] "<li>end_time=2016/08/01,00:00:00"                                                                
##  [5] "<li>minimum_magnitude=6.0"                                                                       
##  [6] "<li>maximum_magnitude=10"                                                                        
##  [7] "<li>event_type=E"                                                                                
##  [8] "</ul>"                                                                                           
##  [9] "<PRE>"                                                                                           
## [10] "Date       Time             Lat       Lon  Depth   Mag Magt  Nst Gap  Clo  RMS  SRC   Event ID"  
## [11] "----------------------------------------------------------------------------------------------"  
## [12] "2002/01/01 10:39:06.82 -55.2140 -129.0000  10.00  6.00   Mw   78          1.07  NEI 200201014017"
## [13] "2002/01/01 11:29:22.73   6.3030  125.6500 138.10  6.30   Mw  236          0.90  NEI 200201014018"
## [14] "2002/01/02 14:50:33.49 -17.9830  178.7440 665.80  6.20   Mw  215          1.08  NEI 200201024034"
## [15] "2002/01/02 17:22:48.76 -17.6000  167.8560  21.00  7.20   Mw  427          0.90  NEI 200201024041"

Hw2 Q5 (1 point). Check that all the lines in date.lines actually start with a date of the form YYYY/MM/DD, rather than containing a date of this form somewhere in the middle of the text. (Hint: one clean way to do this is with anchoring. Also, it might help to note that you can look for non-matches using invert=TRUE.)
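
As a minimal sketch of the anchoring idea in the hint, the code below runs grep() with invert=TRUE on a small toy vector. The vector toy.lines and the name date.start.pattern are made up for illustration and are not part of the lab data; the same idea should carry over to date.lines.

```r
# Toy lines (hypothetical, not taken from the ANSS data)
toy.lines = c("2002/01/01 10:39:06.82 -55.2140",
              "Recorded on 2002/01/02 at noon",
              "2002/01/02 14:50:33.49 -17.9830")

# The "^" anchor forces the match to begin at the start of the line
date.start.pattern = "^[0-9]{4}/[0-9]{2}/[0-9]{2}"

# With invert=TRUE, grep() returns the indices of lines that do NOT match
bad.inds = grep(date.start.pattern, toy.lines, invert=TRUE)
bad.inds
```

An empty result (integer(0)) would mean that every line starts with a date.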

Hw2 Q6 (2 points). Which 5 days witnessed the most earthquakes, and how many earthquakes occurred on each of these days? (Hint: use table() and sort().) Also, what happened on the day with the most earthquakes: can you find any references to this day in the news?
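
The table() and sort() combination from the hint can be sketched on a toy vector of date strings. The vector toy.dates below is invented for illustration; in the homework, the dates would first have to be extracted from date.lines.

```r
# Toy date strings (hypothetical, not the real earthquake dates)
toy.dates = c("2002/01/01", "2002/01/02", "2002/01/02",
              "2002/01/05", "2002/01/02", "2002/01/05")

# Count how many times each distinct date appears
date.counts = table(toy.dates)

# Order the counts from largest to smallest, then keep the top 2
top.counts = sort(date.counts, decreasing=TRUE)
head(top.counts, 2)
```

The names of the sorted table are the dates, and its entries are the counts.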

Hw2 Q7 (2 points). How many earthquakes were there during each year in between 2002 and 2012? What year had the most? (Hint: use substr(), then table().)
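
A small sketch of the substr() then table() approach, again on invented data: since the year occupies the first 4 characters of a YYYY/MM/DD string, substr() with start 1 and stop 4 pulls it out. The names toy.dates, toy.years, and year.counts are assumptions for this example only.

```r
# Toy date strings (hypothetical)
toy.dates = c("2002/01/01", "2002/11/03", "2003/05/26",
              "2004/12/26", "2004/12/26", "2004/03/01")

# Characters 1 through 4 of each string hold the year
toy.years = substr(toy.dates, 1, 4)

# Tabulate the number of entries per year
year.counts = table(toy.years)
year.counts

# Name of the year with the largest count
top.year = names(which.max(year.counts))
top.year
```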

trial.str.vec = c("-55.2140", "-129.0000", "6.3030", "125.6500", "-17.9830")
trial.str.vec.sp = c(" -55.2140", " -129.0000", "    6.3030", "  125.6500", "  -17.9830")
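
One possible way to use these trial strings is sketched below: a single pattern built from quantifiers is checked against trial.str.vec, and then used to pull the numbers out of the space-padded strings in trial.str.vec.sp. The pattern and the name coord.pattern are assumptions made for this sketch, not necessarily the pattern intended in the lab.

```r
# Repeated from the lab so that this block runs on its own
trial.str.vec = c("-55.2140", "-129.0000", "6.3030", "125.6500", "-17.9830")
trial.str.vec.sp = c(" -55.2140", " -129.0000", "    6.3030", "  125.6500", "  -17.9830")

# Hypothetical pattern: optional minus sign, 1 to 3 digits, a literal
# period, then exactly 4 digits
coord.pattern = "-?[0-9]{1,3}\\.[0-9]{4}"

# Does the whole string match, for every trial string?
all.match = all(grepl(paste0("^", coord.pattern, "$"), trial.str.vec))
all.match

# Extract the matching substring from each space-padded string
extracted = regmatches(trial.str.vec.sp, regexpr(coord.pattern, trial.str.vec.sp))
identical(extracted, trial.str.vec)
```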

Splitting substrings

#library(maps)
#map("world")
#points(lat.lon.mat[2,], lat.lon.mat[1,], pch=20, col="red")

Hw2 Q8 (12 points). Go back to the data in date.lines. Following steps similar to the ones above, in which you extracted the latitude and longitude of earthquakes, extract the depth and magnitude of earthquakes.

To design the appropriate regex, you should proceed in small increments, as above. First design a regex to capture the magnitude: what you know about the magnitude is that it is a number of the form X.XX where the first X is between 6 and 9. Then design a regex to capture any number of leading spaces (1 or more). Then design a regex to capture the depth: what you know about the depth is that it is a number of the form X.XX or XX.XX or XXX.XX. Concatenate these 3 regexes into a single regex, in the order depth, spaces, magnitude (the order in which the columns appear in the data), and call it dep.mag.pattern, to catch the depth/magnitude pairs. This is almost all you need, but the problem is that sometimes (rarely) using dep.mag.pattern will pick up entries that appear "too early" in a given line. For example, uncomment and examine the following:

#date.lines[1701]

Here our regex dep.mag.pattern would pick up the 9.99 that appears in the latitude column, and improperly treat this as a magnitude measurement. Therefore, you should concatenate the regex geo.pattern.pair that you defined above in the lab, a regex capturing any number of spaces (1 or more), and dep.mag.pattern, in this order. Call the result quad.pattern: this will capture the quadruple of latitude, longitude, depth, and magnitude, in each line of data.
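
To illustrate why the longer pattern helps, here is a sketch on a single made-up line. Everything in it is hypothetical: toy.line is not a line from the data, and toy.coord, toy.geo.pair, toy.dep.mag, and toy.quad are stand-in patterns that may differ from the ones you define in the lab.

```r
# A made-up line whose latitude begins with 9.99 (hypothetical)
toy.line = "2010/01/01 00:00:00.00   9.9900  120.0000  35.00  6.10"

# Stand-in building blocks, joined with paste0()
toy.coord = "-?[0-9]{1,3}\\.[0-9]{4}"      # latitude or longitude
toy.depth = "[0-9]{1,3}\\.[0-9]{2}"        # depth
toy.mag = "[6-9]\\.[0-9]{2}"               # magnitude
toy.dep.mag = paste0(toy.depth, " +", toy.mag)
toy.geo.pair = paste0(toy.coord, " +", toy.coord)
toy.quad = paste0(toy.geo.pair, " +", toy.dep.mag)

# The short pattern matches too early: seconds, then part of the latitude
early = regmatches(toy.line, regexpr(toy.dep.mag, toy.line))
early

# The longer pattern can only match starting at the latitude column
quad = regmatches(toy.line, regexpr(toy.quad, toy.line))
quad
```

Requiring the latitude/longitude pair first pins down where the match may start, so the depth and magnitude pieces land on the right columns.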

Once you have defined quad.pattern, you can then use strsplit(), unlist(), as.numeric(), and matrix() as above to get a matrix with 4 rows and 2346 columns. Its rows will contain the latitude, longitude, depth, and magnitude measurements, in this order.
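
The strsplit(), unlist(), as.numeric(), matrix() chain can be sketched on three toy quadruples, copied from the first data lines shown at the top of this lab. The names toy.quads and toy.quad.mat are invented for this example, and the strings are assumed to have no leading spaces (a leading space would make strsplit() return an empty first piece).

```r
# Toy matched strings: latitude, longitude, depth, magnitude
toy.quads = c("-55.2140 -129.0000  10.00  6.00",
              "6.3030  125.6500 138.10  6.30",
              "-17.9830  178.7440 665.80  6.20")

# Split each string on runs of spaces, flatten, and convert to numbers
toy.nums = as.numeric(unlist(strsplit(toy.quads, " +")))

# matrix() fills column by column, so each column is one earthquake
toy.quad.mat = matrix(toy.nums, nrow=4)
rownames(toy.quad.mat) = c("lat", "lon", "depth", "mag")
toy.quad.mat
```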

Use the depth and magnitude measurements to answer the following questions. What is the largest magnitude of an earthquake since 2002? On what day did it occur, and roughly where did it occur (use the latitude/longitude pair to look up the location on a map)? How many earthquakes of magnitude 8.5 or larger were there since 2002? What is the largest depth of an earthquake since 2002? How many earthquakes of depth 300 (kilometers, the units used in the data table) or larger were there since 2002? What is the correlation between earthquake depth and earthquake magnitude?
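
The kinds of summaries asked for here can be sketched with max(), which.max(), sum() on a logical vector, and cor(). The vectors toy.depth and toy.mag below hold only the four earthquakes visible in the head() output at the top of this lab, so the numbers are for illustration and are not answers to the questions.

```r
# Depth and magnitude of the four earthquakes shown at the top of the lab
toy.depth = c(10.00, 138.10, 665.80, 21.00)
toy.mag = c(6.00, 6.30, 6.20, 7.20)

# Largest magnitude, and the index of the earthquake where it occurs
max.mag = max(toy.mag)
max.ind = which.max(toy.mag)

# Summing a logical vector counts how many entries are TRUE
num.deep = sum(toy.depth >= 300)

# Sample correlation between depth and magnitude
dep.mag.cor = cor(toy.depth, toy.mag)
```

The index returned by which.max() can be reused to look up the matching date and latitude/longitude for that earthquake.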