Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas, by Thursday 10pm, this week.
This week’s agenda: practicing SQLite queries, performing simple computations and joins, and testing our understanding by writing equivalent R code for these database manipulations.
Thanks to Sean Lahman, extensive baseball data is freely available from the 1871 season all the way to the current season. We’re going ot use a SQLite version of the baseball database put together by Jeff Knecht, at https://github.com/jknecht/baseball-archive-sqlite. The most recent SQLite database was recently updated to include the 2016 season. It has been posted to the class website at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/lahman2016.sqlite. Download this file (it’s about 50 MB) and save it in the working directory for your lab.
1a. Install the packages DBI
, RSQLite
if you haven’t done so already, and load them into your R session. Also load the package plyr
(which should already be installed) into your R session.
1b. Using dbDriver()
, dbConnect()
, set up a connection called con
the SQLite database stored in lahman2016.sqlite. Then, use dbListTables()
to list the tables in the database.
1c. Using dbReadTable()
, grab the table named “Batting” and save it as a data frame in your R session, called batting
. Check that batting
is indeed a data frame, and that it has dimension 102816 x 24.
1d. Remove eval=FALSE
from the preamble in the R code chunks below. Then, after each SQL query (each call to dbGetQuery()
), explain in words what is being extracted, and write one line of R code (sometimes you might need two lines) to get the same result using the batting
data frame.
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY yearID",
"LIMIT 10"))
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY HR DESC",
"LIMIT 10"))
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"WHERE HR > 55",
"ORDER BY HR DESC"))
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"WHERE yearID >= 1990 AND yearID <= 2000",
"ORDER BY HR DESC",
"LIMIT 10"))
eval=FALSE
from the preamble in the following R code chunks. Then, after each SQL query, explain in words what is being extracted, and write one line of R code to get the same result using the batting
data frame. Hint: often you’ll have to use na.rm=TRUE
to deal with NA values, for example mean(x, na.rm=TRUE)
computes the mean of a vector x
after removing any NA values.dbGetQuery(con, paste("SELECT AVG(HR)",
"FROM Batting"))
dbGetQuery(con, paste("SELECT SUM(HR)",
"FROM Batting"))
dbGetQuery(con, paste("SELECT playerID, yearID, teamID, MAX(HR)",
"FROM Batting"))
dbGetQuery(con, paste("SELECT AVG(HR)",
"FROM Batting",
"WHERE yearID >= 1990"))
batting
data frame. Now your R code should rely on the d*ply()
functions.dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY teamID",
"LIMIT 5"))
dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
"FROM Batting",
"WHERE yearID < 1960",
"GROUP BY teamID",
"ORDER BY AVG(HR) DESC",
"LIMIT 5"))
dbGetQuery(con, paste("SELECT teamID, yearID, AVG(HR)",
"FROM Batting",
"WHERE yearID == 1991 OR yearID == 1992",
"GROUP BY teamID, yearID",
"ORDER BY AVG(HR) DESC",
"LIMIT 15"))
3a. Using dbGetQuery()
and the “Batting” table, calculate each player’s average number of hits (H
) over the seasons they played, and display the players with the 10 highest hit averages, along with their hit averages. Hint: AVG()
, GROUP BY
, ORDER BY
.
3b. Calculate the same as in the last question, but now display all players whose hit averages are above 170. Hint: HAVING
.
3c. Calculate the same as in the last question, but now display for all players with hit averages above 170—-in addition to the player’s ID and his batting average—the last year in which each player played.
4a. Using JOIN
, merge the “Batting” and “Salaries” tables based on matching the yearID
, playerID
pairs. Display the year, player, salary, and number of hits for the first 10 records.
4b. Building off of the code from the end of lecture, which does something similar, compute the average salaries for the players with the top 10 highest hit averages.
4c. Compute the hit averages for the players with the top 10 highest salaries. Hint: this should only require a very small tweak to the code you wrote for the last question.
4d. Using the “Fielding” table, list the 10 worst (highest) number of errors (E
) committed by a player in a season, only considering the year 1990 and later. In addition to the number of errors, list the year and player ID for each record.
4e. By appropriately merging the “Fielding” and “Salaries” tables, list the salaries for each record that you extracted in the last question. Then, answer the following question: what was the highest salary paid to a player who made at least 30 errors in a season, after 1990?