Name:
Andrew ID:
Collaborated with:
On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as an knitted HTML file on Canvas, by Sunday 10pm, this week.
Thanks to Sean Lahman, extensive baseball data is freely available from the 1871 season all the way to the current season. We’re going ot use a SQLite version of the baseball database put together by Jeff Knecht, at https://github.com/jknecht/baseball-archive-sqlite. The most recent SQLite database was recently updated to include the 2016 season. It has been posted to the class website at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/lahman2016.sqlite. Download this file (it’s about 50 MB) and save it in the working directory for your lab.
1a. Load the packages DBI
, RSQLite
, plyr
into your R session. Using dbDriver()
, dbConnect()
, set up a connection called the SQLite database stored in lahman2016.sqlite. Use dbListTables()
to list the tables in the database.
1b. Using dbReadTable()
, grab the table named “Salaries” and save it as a data frame in your R session, called salaries
. Check that salaries
has dimension 26428 x 5, and display its first 5 rows.
1c. Using the salaries
data frame and ddply()
, compute the payroll (total of salaries) for each team in the year 2010. Display the first 5 rows of the output. Then list the 3 teams with the highest payrolls, and the 3 teams with the lowest payroll (ouch!).
1d. Reproduce the answers to Q1c, but now using appropriate SQL queries, using dbGetQuery()
.
2a. Use a SQL query and dbGetQuery()
to compute the total payroll for each team, added up over the years between 1985 and 2016. Hint: dbGetQuery()
actually returns a data frame. You should have a data frame of dimension 46 x 2, and the 2 columns should display the team ID and the payroll. Check that your data frame has the right dimensions and display its first 10 rows. Then, answer: what team has the highest total payroll? The lowest payroll? Where do the Pirates rank?
2b. Use a SQL query and dbGetQuery()
to compute the payroll for each team, separately for each year in between 1985 and 2016. Hint: GROUP BY
can take two arguments, separated by a comma. You should have a data frame of dimension 918 x 3, and the 3 columns should be display the team ID, year, and payroll. Check that your data frame has the proper dimensions, and display its last 10 rows.
2c. Plot the Pittsburgh Pirates’ payroll over time (i.e., over the years 1985 to 2016), with appropriately labeled axes and an appropriate title. What is the trend that you see?
2d. On a single plot, display the payrolls over time (i.e., over the years 1985 to 2016) for 8 teams of your choosing. Make sure that their payroll curves are distinguishable (by color, line type, some combo, you choose). Make sure that the y limit is properly set (so the extremes of all curves are properly contained within the plotting region). Use appropriately labeled axes, an appropriate title, and an informative legend.
Challenge. To make these plots more sensible, we need to adjust for inflation. Find data on the average consumer price index (CPI) over the years 1985 to 2016, and use this to adjust the payrolls for inflation and reproduce your plot from Q2d. Comment on the changes.
3a. Use a SQL query and dbGetQuery()
to calculate the top 10 best batting averages achieved by a player in any season after 1940. Note: batting average is the number of hits (H
) divided by number of at bats (AB
) achieved by a player in a given season, but (let’s say) it is only defined for players that have at least 400 at bats in that season. Your resulting data frame from the SQL query should be 10 x 3, with the 3 columns displaying the playerID, yearID, and batting average.
3b. Compute batting averages as described in Q3a, but now plot a histogram of all of these battings averages (aggregated over all players and all seasons after 1940), with an appropriate title. Use a large value of the breaks
argument to get a good sense of the shape of the histogram. Does this look like a normal distribution to you? What is the estimated mean and the standard deviation? Overlay the normal density curve on top of your histogram, with the appropriate mean and variance, and comment on how it fits. Challenge: perform a rigorous hypothesis test for normality of batting averages here; you might consider using ks.test()
.
3c. For the computed batting averages in the last question, separate out the batting averages before and after 1985. Plot two overlaid histograms, using transparent colors, for the batting averages before and after 1985. Set an appropriate title and informative legend. Do the distributions look different? If so, how? Challenge: perform a rigorous hypothesis test for the difference in distributions here; you might again consider using ks.test()
.
3d. Modifying your SQL query in Q3b so that you also extract, in addition to the batting averages, the number of home runs (for all players and all seasons after 1940). Produce a scatterplot of the number of home runs versus the batting average, with appropriate axes labels and an appropriate title. What does the general trend appear to be? Overlay the least squares regression line on top of your plot. What could go wrong with using this regression line to predict a player’s home run total from their batting average?