Statistical Computing, 36-350
Wednesday November 16, 2016
Data frames in R are tables in database lingo
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
collection of data frames | database |
conditional indexing | SELECT , FROM , WHERE , HAVING |
d*ply() |
GROUP BY |
order() |
ORDER BY |
SELECT
Main tool in the SQL language: SELECT
, which allows you to perform queries on a particular table in a database. It has the form:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
WHERE
, GROUP BY
, HAVING
, ORDER BY
, LIMIT
are all optional. Importantly, as we’ll see today, in the first line of SELECT
we can directly specify computations that we want performed
First, we set up our connection to the baseball database:
library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")
To calculate the average number of homeruns, and average number of hits:
dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
"FROM Batting"))
## AVG(HR) AVG(H)
## 1 2.970549 40.67684
We can replicate this simple command on an imported data frame:
batting = dbReadTable(con, "Batting")
mean(batting$HR, na.rm=TRUE)
## [1] 2.970549
mean(batting$H, na.rm=TRUE)
## [1] 40.67684
GROUP BY
We can use the GROUP BY
option in SELECT
to define aggregation groups
(Note: the order of commands here matters; try switching the order of GROUP BY
and ORDER BY
below, and you’ll get an error)
dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
"FROM Batting",
"GROUP BY playerID",
"ORDER BY AVG(HR) DESC",
"LIMIT 5"))
## playerID AVG(HR)
## 1 pujolal01 40.80000
## 2 howarry01 36.14286
## 3 rodrial01 36.05882
## 4 bondsba01 34.63636
## 5 mcgwima01 34.29412
AS
We can use AS
in the first line of SELECT
to rename computed columns
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 5"))
## yearID avgHR
## 1 1987 5.300832
## 2 1996 5.073620
## 3 1986 4.730769
## 4 1999 4.692699
## 5 1977 4.601010
WHERE
We can use the WHERE
option in SELECT
to specify a subset of the rows to use (pre-aggregation/pre-calculation)
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"ORDER BY avgHR DESC",
"LIMIT 5"))
## yearID avgHR
## 1 1996 5.073620
## 2 1999 4.692699
## 3 2000 4.525437
## 4 2004 4.490115
## 5 2001 4.412288
HAVING
We can use the HAVING
option in SELECT
to specify a subset of the rows to display (post-aggregation/post-calculation)
dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
"FROM Batting",
"WHERE yearID >= 1990",
"GROUP BY yearID",
"HAVING avgHR >= 4",
"ORDER BY avgHR DESC"))
## yearID avgHR
## 1 1996 5.073620
## 2 1999 4.692699
## 3 2000 4.525437
## 4 2004 4.490115
## 5 2001 4.412288
## 6 2006 4.336554
## 7 2002 4.283658
## 8 1993 4.273595
## 9 2003 4.271534
## 10 1995 4.189938
## 11 1998 4.181668
## 12 2005 4.132619
## 13 1994 4.111940
## 14 1997 4.084507
## 15 2009 4.036829