Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 11 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 11 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 22. This document contains 15 of the 30 total points for Homework 11.

Practice with SQL data extraction

We will practice SQL commands using data from the Baseball Databank http://baseball-databank.org. Information on the attributes in the database can be found at http://baseball1.com/files/database/readme58.txt. Download the database file http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/baseball.db, and save in your working directory for this lab. Also, install the packages DBI, RSQLite if you haven’t done so already, and load them into your R session with library(). Also load the package plyr (which should already be installed) into your R session with library().
Using dbDriver(), dbConnect(), set up a connection called con the SQLite database stored in baseball.db. Then, use dbListTables() to list the tables in the database.
Using dbReadTable(), grab the table named “Batting” and save it as a data frame in your R session, called batting. Check that batting is indeed a data frame, and that it has dimension 93955 x 24.
Remove eval=FALSE from the preamble to the next R code chunk below. Then, after each SQL query (each call to dbGetQuery()), explain in words what is being extracted, and write one line of R code (sometimes you might need two lines) to do the equivalent on the batting data frame.

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "ORDER BY yearID",
                      "LIMIT 10"))

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "ORDER BY HR DESC",
                      "LIMIT 10"))

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "WHERE HR > 55",
                      "ORDER BY HR DESC"))

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "WHERE yearID >= 1990 AND yearID <= 2000",
                      "ORDER BY HR DESC",
                      "LIMIT 10"))

Practice with SQL computations

As before, remove eval=FALSE from the preamble to the next R code chunk. Then, after each SQL query, explain in words what is being extracted, and write one line of R code to do the equivalent on the batting data frame. (Hint: often you’ll have to use na.rm=TRUE to deal with NA values, for example mean(x, na.rm=TRUE) computes the mean of a vector x after removing any NA values.)

dbGetQuery(con, paste("SELECT AVG(HR)",
                      "FROM Batting"))

dbGetQuery(con, paste("SELECT SUM(HR)",
                      "FROM Batting"))

dbGetQuery(con, paste("SELECT playerID, yearID, teamID, MAX(HR)",
                      "FROM Batting"))

dbGetQuery(con, paste("SELECT AVG(HR)",
                      "FROM Batting",
                      "WHERE yearID >= 1990"))

Hw11 Q1 (7 points). Remove eval=FALSE from the preamble to the next R code chunk below. Then, after each SQL query, explain in words what is being extracted, and write one line of R code (sometimes you might need two lines) in which you call one of the d*ply() functions in order to do the equivalent on batting. (Note: the SQL query here uses GROUP BY, which is explained in Wednesday’s mini-lecture “SQL: Queries”, so you may want to work on this homework problem after Wednesday’s class, or read ahead.)

dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
                      "FROM Batting",
                      "WHERE yearID >= 1990",
                      "GROUP BY teamID",
                      "LIMIT 5"))

dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
                      "FROM Batting",
                      "WHERE yearID < 1960",
                      "GROUP BY teamID",
                      "ORDER BY AVG(HR) DESC",
                      "LIMIT 5"))

dbGetQuery(con, paste("SELECT teamID, yearID, AVG(HR)",
                      "FROM Batting",
                      "WHERE yearID == 1991 OR yearID == 1992",
                      "GROUP BY teamID, yearID",
                      "ORDER BY AVG(HR) DESC",
                      "LIMIT 15"))

Hw11 Q2 (8 points). Using dbReadTable(), grab the table named “Salaries” and save it as a data frame in your R session, called salaries. Check that salaries has dimension 21464 x 5, and display its first 5 rows.

Using the salaries data frame and ddply(), compute the payroll (total of salaries) for each team in the year 2010. Display the first 5 rows of the output. Then list the 3 teams with the highest payrolls, and the team with the lowest payroll (ouch!).

Recompute the payroll for each team in 2010, but now with dbGetQuery() and an appropriate SQL query. In particular, the output of dbGetQuery() should be a data frame with two columns, the first giving the team names, and the second the payrolls, just like your output from dadly(). (Hint: your SQL query here will have to use GROUP BY, which is explained in Wednesday’s mini-lecture “SQL: Queries”, so you may want to work on this homework problem after Wednesday’s class, or read ahead.) Display the first 5 rows of your data frame, and verify that the salaries you computed here are the same as those from ddply().

Lab 12m: SQL: Queries

Statistical Computing, 36-350

Monday November 14, 2016

Practice with SQL data extraction

Practice with SQL computations