SQL: Computations

Statistical Computing, 36-350

Wednesday November 16, 2016

Databases versus data frames

Data frames in R are tables in database lingo

R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
collection of data frames database
conditional indexing SELECT, FROM, WHERE, HAVING
d*ply() GROUP BY
order() ORDER BY

SELECT

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset,count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Importantly, as we’ll see today, in the first line of SELECT we can directly specify computations that we want performed

Examples

First, we set up our connection to the baseball database:

library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")

(Continued)

To calculate the average number of homeruns, and average number of hits:

dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
                      "FROM Batting"))
##    AVG(HR)   AVG(H)
## 1 2.970549 40.67684

We can replicate this simple command on an imported data frame:

batting = dbReadTable(con, "Batting") 
mean(batting$HR, na.rm=TRUE)
## [1] 2.970549
mean(batting$H, na.rm=TRUE)
## [1] 40.67684

GROUP BY

We can use the GROUP BY option in SELECT to define aggregation groups

(Note: the order of commands here matters; try switching the order of GROUP BY and ORDER BY below, and you’ll get an error)

dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
                      "FROM Batting",
                      "GROUP BY playerID",
                      "ORDER BY AVG(HR) DESC",
                      "LIMIT 5"))
##    playerID  AVG(HR)
## 1 pujolal01 40.80000
## 2 howarry01 36.14286
## 3 rodrial01 36.05882
## 4 bondsba01 34.63636
## 5 mcgwima01 34.29412

AS

We can use AS in the first line of SELECT to rename computed columns

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "GROUP BY yearID",
                      "ORDER BY avgHR DESC",
                      "LIMIT 5"))
##   yearID    avgHR
## 1   1987 5.300832
## 2   1996 5.073620
## 3   1986 4.730769
## 4   1999 4.692699
## 5   1977 4.601010

WHERE

We can use the WHERE option in SELECT to specify a subset of the rows to use (pre-aggregation/pre-calculation)

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "WHERE yearID >= 1990",
                      "GROUP BY yearID",
                      "ORDER BY avgHR DESC",
                      "LIMIT 5"))
##   yearID    avgHR
## 1   1996 5.073620
## 2   1999 4.692699
## 3   2000 4.525437
## 4   2004 4.490115
## 5   2001 4.412288

HAVING

We can use the HAVING option in SELECT to specify a subset of the rows to display (post-aggregation/post-calculation)

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "WHERE yearID >= 1990",
                      "GROUP BY yearID",
                      "HAVING avgHR >= 4",
                      "ORDER BY avgHR DESC"))
##    yearID    avgHR
## 1    1996 5.073620
## 2    1999 4.692699
## 3    2000 4.525437
## 4    2004 4.490115
## 5    2001 4.412288
## 6    2006 4.336554
## 7    2002 4.283658
## 8    1993 4.273595
## 9    2003 4.271534
## 10   1995 4.189938
## 11   1998 4.181668
## 12   2005 4.132619
## 13   1994 4.111940
## 14   1997 4.084507
## 15   2009 4.036829