Databases versus data frames

Data frames in R are tables in database lingo

R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
collection of data frames database
conditional indexing SELECT, FROM, WHERE, HAVING
d*ply() GROUP BY
order() ORDER BY
merge() INNER JOIN or just JOIN

SELECT

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM tabA JOIN tabB USING(key)
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset,count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Importantly, as we’ll see today, in FROM we can specify more than one table with JOIN

JOIN

In all we’ve seen so far with SELECT, the FROM line has just specified one table. But sometimes we need to combine information from many tables. Use the JOIN option for this

There are 4 options for join

Fields that cannot be filled in are assigned NA values

(Continued)

It helps to visualize the join types:

Examples

First, we set up our connection to the baseball database:

library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")
dbListFields(con, "Batting") # Columns in Batting
##  [1] "playerID"  "yearID"    "stint"     "teamID"    "lgID"     
##  [6] "G"         "G_batting" "AB"        "R"         "H"        
## [11] "2B"        "3B"        "HR"        "RBI"       "SB"       
## [16] "CS"        "BB"        "SO"        "IBB"       "HBP"      
## [21] "SH"        "SF"        "GIDP"      "G_old"
dbListFields(con, "Salaries") # Columns in Salaries
## [1] "yearID"   "teamID"   "lgID"     "playerID" "salary"

(Continued)

Suppose we want to figure out the average salaries of the players with the top 10 highest homerun averages. Then we’d have to combine the two tables below

dbGetQuery(con, paste("SELECT *",
                      "FROM Salaries",
                      "ORDER BY playerID",
                      "LIMIT 8"))
##   yearID teamID lgID  playerID  salary
## 1   2004    SFN   NL aardsda01  300000
## 2   2007    CHA   AL aardsda01  387500
## 3   2008    BOS   AL aardsda01  403250
## 4   2009    SEA   AL aardsda01  419000
## 5   2010    SEA   AL aardsda01 2750000
## 6   1986    BAL   AL  aasedo01  600000
## 7   1987    BAL   AL  aasedo01  625000
## 8   1988    BAL   AL  aasedo01  675000
dbGetQuery(con, paste("SELECT yearID, teamID, lgID, playerID, HR",
                      "FROM Batting", 
                      "ORDER BY playerID",
                      "LIMIT 8"))
##   yearID teamID lgID  playerID HR
## 1   2004    SFN   NL aardsda01  0
## 2   2006    CHN   NL aardsda01  0
## 3   2007    CHA   AL aardsda01  0
## 4   2008    BOS   AL aardsda01  0
## 5   2009    SEA   AL aardsda01  0
## 6   2010    SEA   AL aardsda01  0
## 7   1954    ML1   NL aaronha01 13
## 8   1955    ML1   NL aaronha01 27

(Continued)

We can use a JOIN on the pair: yearID, playerID

dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
                      "FROM Batting JOIN Salaries USING(yearID, playerID)",
                      "ORDER BY playerID",
                      "LIMIT 8"))
##   yearID  playerID  salary HR
## 1   2004 aardsda01  300000  0
## 2   2007 aardsda01  387500  0
## 3   2008 aardsda01  403250  0
## 4   2009 aardsda01  419000  0
## 5   2010 aardsda01 2750000  0
## 6   1986  aasedo01  600000 NA
## 7   1987  aasedo01  625000 NA
## 8   1988  aasedo01  675000 NA

Note that here we’re missing of David Aardsma’s records from the Batting table (i.e., the JOIN discarded 1 record)

We can replicate this using merge() on imported data frames:

batting = dbReadTable(con, "Batting") 
salaries = dbReadTable(con, "Salaries") 
merged = merge(x=batting, y=salaries, by.x=c("yearID","playerID"), 
               by.y=c("yearID","playerID"))
merged[order(merged$playerID)[1:8], 
       c("yearID", "playerID", "salary", "HR")]
##       yearID  playerID  salary HR
## 16708   2004 aardsda01  300000  0
## 19378   2007 aardsda01  387500  0
## 20277   2008 aardsda01  403250  0
## 21164   2009 aardsda01  419000  0
## 21990   2010 aardsda01 2750000  0
## 585     1986  aasedo01  600000 NA
## 1360    1987  aasedo01  625000 NA
## 2033    1988  aasedo01  675000 NA

(Continued)

For demonstration purposes, we can use a LEFT JOIN on the pair: yearID, playerID

dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
                      "FROM Batting LEFT JOIN Salaries USING(yearID, playerID)",
                      "LIMIT 7"))
##   yearID  playerID salary HR
## 1   2004 aardsda01 300000  0
## 2   2006 aardsda01     NA  0
## 3   2007 aardsda01 387500  0
## 4   2008 aardsda01 403250  0
## 5   2009 aardsda01 419000  0
## 6   1954 aaronha01     NA 13
## 7   1955 aaronha01     NA 27

Now we can see that we have all 5 of David Aardsma’s original records from the Batting table (i.e., the LEFT JOIN used them all, and just filled in an NA value when it was missing his salary)

Currently, RIGHT JOIN and FULL JOIN are not implemented in the RSQLite package

(Continued)

Now, as to our original question (average salaries of the players with the top 10 highest homerun averages):

dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
                      "FROM Batting JOIN Salaries USING(yearID, playerID)",
                      "GROUP BY playerID",
                      "ORDER BY Avg(HR) DESC",
                      "LIMIT 10"))
##     playerID  AVG(HR) AVG(salary)
## 1  howarry01 45.80000   9051000.0
## 2  pujolal01 40.80000   8953204.1
## 3  fieldpr01 38.00000   3882900.0
## 4  rodrial01 36.05882  15553897.2
## 5  reynoma01 34.66667    550777.7
## 6  bondsba01 34.63636   8556605.5
## 7  mcgwima01 34.29412   4814020.8
## 8  gonzaca01 34.00000    406000.0
## 9   dunnad01 33.50000   6969500.0
## 10 kingmda01 32.50000    908750.0