Your homework must be submitted in R Markdown format. We will not (indeed, cannot) grade homeworks in other formats. Your responses must be supported by both textual explanations and the code you generate to produce your result. (Just examining your various objects in the “Environment” section of RStudio is insufficient; you must use scripted commands.)
Background: In this homework, we’ll show you how you might become rich yourself by collecting the schedule for the upcoming 2015-2016 NHL season, including the links to Ticketmaster, so that you can corner the resale market and get super rich.
Just kidding! There’s no way with what we’ve taught you so far that you’ll be able to get past the anti-bot measures on Ticketmaster without defeating the CAPTCHA systems that were designed and built right here at CMU. What you can do, though, is reassemble this series of games in an R data frame for machine-readable use. To do so, you will use regular expressions to extract the useful information from the surrounding HTML code.
Use the readLines() command to load the file at http://www.stat.cmu.edu/~ryantibs/statcomp-F15/homework/NHL1516.html into a character vector called nhl1516. Remember to pass this URL directly to the readLines() function.
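As a minimal sketch of what readLines() returns, the snippet below reads a small temporary file rather than the course URL, so it runs offline. The file contents and the name toy_page are made up for illustration; readLines() also accepts a URL string in place of the file path.

```r
# Toy stand-in for the real page: a few hypothetical lines in a temp file.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>hello</p>", "</body>", "</html>"), tmp)

# readLines() returns a character vector with one element per line.
toy_page <- readLines(tmp)
length(toy_page)   # number of lines read
head(toy_page, 2)  # first two lines
```

Checking length() and head() right after loading is a quick way to confirm that the whole file arrived as expected.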
Take a look at the webpage NHL1516.html. You should see the game table on the screen. There are 1230 regular-season games scheduled. Who is playing in the first game? In the final game?
Now, download the file NHL1516.html to your computer and open it in a text editor. Which line in the file corresponds to game 1? Which line corresponds to game 1230? How does each of these lines begin?
Our goal is to extract useful information about the games: the date, the game time (in Eastern Time), and the away and home teams. As a first step, write a regular expression to capture the date. Use grep() to check that this has exactly 1230 matches and that the first and last locations match the first and last games (use this for question 5a).
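The grep() check described above can be sketched on toy data. The lines in toy and the date shape assumed by date_pat are hypothetical, chosen only to show the mechanics; the real file's markup and date format may differ, so the pattern will likely need adjusting.

```r
# Hypothetical lines; the real file's markup and date format may differ.
toy <- c("<table>",
         "<tr><td>1</td><td>Wed Oct 7, 2015</td></tr>",
         "<tr><td>2</td><td>Thu Oct 8, 2015</td></tr>",
         "</table>")

# Assumed date shape: weekday, month, day, comma, four-digit year.
date_pat <- "[A-Z][a-z]{2} [A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}"

hits <- grep(date_pat, toy)  # indices of the lines that contain a match
length(hits)                 # how many lines matched
range(hits)                  # first and last matching positions
```

Comparing length(hits) with the expected number of games, and range(hits) with the line numbers found in the text editor, is one way to catch a pattern that is too loose or too strict.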
Using regexpr() and regmatches(), extract all the dates from the text and create a corresponding vector date. Save this for a further step.
[...] \\. in your expression.
Bonus question: Create a regular expression to extract all the away teams coming to play the Pittsburgh Penguins (at home).
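For the regexpr() and regmatches() step, here is a rough sketch on toy data, followed by a short illustration of how an escaped period \\. behaves. The toy lines, the pattern, and the name toy_date are assumptions made for illustration, not the contents of the real file.

```r
# Hypothetical lines; the real file's markup and date format may differ.
toy <- c("<table>",
         "<tr><td>1</td><td>Wed Oct 7, 2015</td></tr>",
         "<tr><td>2</td><td>Thu Oct 8, 2015</td></tr>",
         "</table>")
date_pat <- "[A-Z][a-z]{2} [A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}"

# regexpr() gives the position and length of the first match in each element
# (-1 where there is none); regmatches() pulls out the matched substrings
# and drops the elements that did not match.
m <- regexpr(date_pat, toy)
toy_date <- regmatches(toy, m)
toy_date

# An unescaped "." matches any character; "\\." matches only a literal period.
grepl("7.00", c("7.00", "7:00"))    # both elements match
grepl("7\\.00", c("7.00", "7:00"))  # only the first element matches
```

Because regmatches() returns only the matched pieces, the result has one entry per matching line, which makes it easy to compare its length against the grep() count.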