Data frames in R are tables in database lingo
| R jargon | Database jargon |
|---|---|
| column | field |
| row | record |
| dataframe | table |
| types of the columns | table schema |
| collection of data frames | database |
| conditional indexing | SELECT, FROM, WHERE, HAVING |
d*ply() |
GROUP BY |
order() |
ORDER BY |
SELECTMain tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Importantly, as we’ll see today, in the first line of SELECT we can directly specify computations that we want performed
First, we set up our connection to the baseball database:
library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")
To calculate the average number of homeruns, and average number of hits:
dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
"FROM Batting"))
## AVG(HR) AVG(H)
## 1 2.970549 40.67684
We can replicate this simple command on an imported data frame:
batting = dbReadTable(con, "Batting")
mean(batting$HR, na.rm=TRUE)
## [1] 2.970549
mean(batting$H, na.rm=TRUE)
## [1] 40.67684
GROUP BYWe can use the GROUP BY option in SELECT to define aggregation groups
(Note: the order of commands here matters; try switching the order of GROUP BY and ORDER BY below, and you’ll get an error)
dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
"FROM Batting",
"GROUP BY playerID",
"ORDER BY AVG(HR) DESC",
"LIMIT 5"))
## playerID AVG(HR)
## 1 pujolal01 40.80000
## 2 howarry01 36.14286
## 3 rodrial01 36.05882
## 4 bondsba01 34.63636
## 5 mcgwima01 34.29412
ASWe can use AS in the first line of SELECT to rename computed columns
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 5"))
## yearID avgHR
## 1 1987 5.300832
## 2 1996 5.073620
## 3 1986 4.730769
## 4 1999 4.692699
## 5 1977 4.601010
WHEREWe can use the WHERE option in SELECT to specify a subset of the rows to use (pre-aggregation/pre-calculation)
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 5"))
## yearID avgHR
## 1 1996 5.073620
## 2 1999 4.692699
## 3 2000 4.525437
## 4 2004 4.490115
## 5 2001 4.412288
HAVINGWe can use the HAVING option in SELECT to specify a subset of the rows to display (post-aggregation/post-calculation)
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"HAVING avgHR >= 4",
"ORDER BY avgHR DESC"))
## yearID avgHR
## 1 1996 5.073620
## 2 1999 4.692699
## 3 2000 4.525437
## 4 2004 4.490115
## 5 2001 4.412288
## 6 2006 4.336554
## 7 2002 4.283658
## 8 1993 4.273595
## 9 2003 4.271534
## 10 1995 4.189938
## 11 1998 4.181668
## 12 2005 4.132619
## 13 1994 4.111940
## 14 1997 4.084507
## 15 2009 4.036829