Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 11 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 11 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 22. This document contains 15 of the 30 total points for Homework 11.

Practice with SQL computations

As in Monday’s lab, we will practice SQL commands using data from the Baseball Databank http://baseball-databank.org. The database file up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/baseball.db should be saved in your working directory for this lab. The packages DBI, RSQLite, plyr should already be installed. Load them into your R session with library(). Using dbDriver(), dbConnect(), set up a connection called con the SQLite database stored in baseball.db.
Using dbListFields(), list the fields available in the “Batting” table.
Using dbGetQuery() and the “Batting” table, calculate each player’s average number of hits (H) per season, and display the players with the 10 highest hit averages, along with their hit averages. (Hint: AVG(), GROUP BY, ORDER BY.)
Calculate the same as in the last question, but now display all players whose hit averages are above 170. (Hint: HAVING.)
Calculate the same as in the last question, but now display for all players with hit averages above 170—-in addition to the player’s ID and his batting average—the final year in which each player played.

Hw11 Bonus. Calculate each player’s hitting percentage, which is their number of hits (H) divided by number of at bats (AB), but only over seasons in which they had at least 100 at bats. Display the players with the top 10 highest average hitting percentages, along with their average hitting percentages.

Practice with SQL join operations

Using JOIN, merge the “Batting” and “Salaries” tables based on matching the yearID, playerID pairs. Display the year, player, salary, and number of hits for the first 10 records.
Building off of the SQL code you wrote earlier in the lab, and the JOIN code you wrote in the last question, compute the average salaries for the players with the top 10 highest hit averages.
Compute the hit averages for the players with the top 10 highest salaries. (Hint: this should only require a very small tweak to the code you wrote for the last question.)

Hw11 Q3 (2 points). Using the “Fielding” table, list the 10 worst (highest) number of errors (E) committed by a player in a season, only considering the year 1990 and later. In addition to the number of errors, list the year and player ID for each record.

Hw11 Q4 (4 points). By appropriately merging the “Fielding” and “Salaries” tables, list the salaries for each record that you extracted in the last question. Then, answer the following question: what was the highest salary paid to a player who made at least 30 errors in a season, after 1990?

Back to payrolls

Hw11 Q5 (9 points). In Hw11 Q2, we investigated the payrolls of teams in 2010. Here we pick up on the same topic. Use a SQL query to compute the total payroll for each team, added up over all years between 1985 and 2010. Your output should be a data frame of dimension 34 x 2, where the columns display the team ID, and total payroll. Check that your data frame has the proper dimensions, and display its first 10 rows. What team has the highest total payroll?

Then use a SQL query to compute the payroll for each team, but now separately for each year in between 1985 and 2010. (Hint: GROUP BY can take two arguments, separated by a comma.) Your output should be a data frame of dimension 738 x 3. The 3 columns should be display the team ID, year, and payroll. Check that your data frame has the proper dimensions, and display its last 20 rows.

Finally, plot the Pittsburgh Pirates’ payroll over time, i.e., over the years 1985 to 2010, with appropriately labeled axes and an appropriate title. What is the trend that you see?

Hw11 Bonus. Pick a few other baseball teams (other than the Pittsburgh Pirates) and plot their payrolls over time, using the data frame you created in the last question. Make sure the axes and titles are set appropriately. What is the general trend that you tsee?

To make these plots more sensible, we need to adjust for inflation. The following reads in consumer price index or CPI, from the years 1985 through 2011. This has been calculated from FRED (the Federal Reserve Economic Data service), using the package fImport.

cpi = as.numeric(read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/cpi.txt")[,1])

The vector cpi is normalized so that $1=\$1$ in 2011. So an expression like y = x/cpi[1990-1985+1] will convert x = 1990 dollars into y = 2011 dollars.

Calculate the inflation-adjusted payroll of each baseball team over time, using the data frame you created in Hw11 Q5. Then reproduce the plots you made earlier, of the Pittsburgh Pirates’ payroll and other teams’ payrolls over time. How do these inflation-adjusted versions compare to the originals?

Lab 12f: SQL: Computations and Join

Statistical Computing, 36-350

Friday November 18, 2016

Practice with SQL computations

Practice with SQL join operations

Back to payrolls