Databases versus data frames

Data frames in R are tables in database lingo

R jargon	Database jargon
column	field
row	record
dataframe	table
types of the columns	table schema
collection of data frames	database
conditional indexing	`SELECT`, `FROM`, `WHERE`, `HAVING`
`d*ply()`	`GROUP BY`
`order()`	`ORDER BY`

`SELECT`

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset,count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Importantly, as we’ll see today, in the first line of SELECT we can directly specify computations that we want performed

Examples

First, we set up our connection to the baseball database:

library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")

(Continued)

To calculate the average number of homeruns, and average number of hits:

dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
                      "FROM Batting"))

##    AVG(HR)   AVG(H)
## 1 2.970549 40.67684

We can replicate this simple command on an imported data frame:

batting = dbReadTable(con, "Batting") 
mean(batting$HR, na.rm=TRUE)

## [1] 2.970549

mean(batting$H, na.rm=TRUE)

## [1] 40.67684

`GROUP BY`

We can use the GROUP BY option in SELECT to define aggregation groups

(Note: the order of commands here matters; try switching the order of GROUP BY and ORDER BY below, and you’ll get an error)

dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
                      "FROM Batting",
                      "GROUP BY playerID",
                      "ORDER BY AVG(HR) DESC",
                      "LIMIT 5"))

##    playerID  AVG(HR)
## 1 pujolal01 40.80000
## 2 howarry01 36.14286
## 3 rodrial01 36.05882
## 4 bondsba01 34.63636
## 5 mcgwima01 34.29412

`AS`

We can use AS in the first line of SELECT to rename computed columns

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "GROUP BY yearID",
                      "ORDER BY avgHR DESC",
                      "LIMIT 5"))

##   yearID    avgHR
## 1   1987 5.300832
## 2   1996 5.073620
## 3   1986 4.730769
## 4   1999 4.692699
## 5   1977 4.601010

`WHERE`

We can use the WHERE option in SELECT to specify a subset of the rows to use (pre-aggregation/pre-calculation)

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "WHERE yearID >= 1990",
                      "GROUP BY yearID",
                      "ORDER BY avgHR DESC",
                      "LIMIT 5"))

##   yearID    avgHR
## 1   1996 5.073620
## 2   1999 4.692699
## 3   2000 4.525437
## 4   2004 4.490115
## 5   2001 4.412288

`HAVING`

We can use the HAVING option in SELECT to specify a subset of the rows to display (post-aggregation/post-calculation)

dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
                      "FROM Batting",
                      "WHERE yearID >= 1990",
                      "GROUP BY yearID",
                      "HAVING avgHR >= 4",
                      "ORDER BY avgHR DESC"))

##    yearID    avgHR
## 1    1996 5.073620
## 2    1999 4.692699
## 3    2000 4.525437
## 4    2004 4.490115
## 5    2001 4.412288
## 6    2006 4.336554
## 7    2002 4.283658
## 8    1993 4.273595
## 9    2003 4.271534
## 10   1995 4.189938
## 11   1998 4.181668
## 12   2005 4.132619
## 13   1994 4.111940
## 14   1997 4.084507
## 15   2009 4.036829

SQL: Computations

Databases versus data frames

SELECT