SQL: Queries

Statistical Computing, 36-350

Monday November 14, 2016

What is a database?

It helps define things “from the ground up”:

Databases versus data frames

Data frames in R are tables in database lingo

R jargon Database jargon
column field
row record
data frame table
types of the columns table schema
collection of related data frames database

Why do we need database software?

The client-server model, SQL

Connecting R to SQL

SQL is its own language, independent of R (similar to when you learned regexes). But we’re goign to learn how to run SQL queries through R

First, we need to install the packages DBI, RSQLite, then we load them into our R session with library()

Also, we need a database file: to run the following examples, download the file up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/baseball.db, and save it in your R working directory

library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")

The object con is now a persistent connection to the database baseball.db

Listing what’s available

dbListTables(con) # List tables in our database 
##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "Fielding"           
## [10] "FieldingOF"          "FieldingPost"        "HallOfFame"         
## [13] "Managers"            "ManagersHalf"        "Master"             
## [16] "Pitching"            "PitchingPost"        "Salaries"           
## [19] "Schools"             "SchoolsPlayers"      "SeriesPost"         
## [22] "Teams"               "TeamsFranchises"     "TeamsHalf"          
## [25] "sqlite_sequence"     "xref_stats"
dbListFields(con, "Batting") # List fields in Batting table
##  [1] "playerID"  "yearID"    "stint"     "teamID"    "lgID"     
##  [6] "G"         "G_batting" "AB"        "R"         "H"        
## [11] "2B"        "3B"        "HR"        "RBI"       "SB"       
## [16] "CS"        "BB"        "SO"        "IBB"       "HBP"      
## [21] "SH"        "SF"        "GIDP"      "G_old"
dbListFields(con, "Pitching") # List fields in Pitching table
##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "W"       
##  [7] "L"        "G"        "GS"       "CG"       "SHO"      "SV"      
## [13] "IPouts"   "H"        "ER"       "HR"       "BB"       "SO"      
## [19] "BAOpp"    "ERA"      "IBB"      "WP"       "HBP"      "BK"      
## [25] "BFP"      "GF"       "R"

Importing a table as a data frame

batting = dbReadTable(con, "Batting") 
class(batting) 
## [1] "data.frame"
dim(batting)
## [1] 93955    24

Now we could go on and perform R operations on batting, since it’s a data frame

This week, we’ll use this route primarily to check our work in SQL; in general, should try to do as much in SQL as possible, since it’s more efficient and can be simpler

SELECT

Main tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset,count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional

Examples

To pick out five columns from the table “Batting”, and only look at the first 10 rows:

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "LIMIT 10"))
##     playerID yearID  AB   H HR
## 1  aardsda01   2004   0   0  0
## 2  aardsda01   2006   2   0  0
## 3  aardsda01   2007   0   0  0
## 4  aardsda01   2008   1   0  0
## 5  aardsda01   2009   0   0  0
## 6  aaronha01   1954 468 131 13
## 7  aaronha01   1955 602 189 27
## 8  aaronha01   1956 609 200 26
## 9  aaronha01   1957 615 198 44
## 10 aaronha01   1958 601 196 30

This is our very first successful SQL query (congrats!)

We can replicate this simple command on the imported data frame:

batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]
##     playerID yearID  AB   H HR
## 1  aardsda01   2004   0   0  0
## 2  aardsda01   2006   2   0  0
## 3  aardsda01   2007   0   0  0
## 4  aardsda01   2008   1   0  0
## 5  aardsda01   2009   0   0  0
## 6  aaronha01   1954 468 131 13
## 7  aaronha01   1955 602 189 27
## 8  aaronha01   1956 609 200 26
## 9  aaronha01   1957 615 198 44
## 10 aaronha01   1958 601 196 30

To reiterate: this was simply to check our work, and we wouldn’t actually want to do this on a large database, since it’d be much more inefficient to first read into an R data frame, and then call R commands)

ORDER BY

We can use the ORDER BY option in SELECT to specify an ordering for the rows

Default is ascending order; add DESC for descending

dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "ORDER BY HR DESC",
                      "LIMIT 5"))
##    playerID yearID  AB   H HR
## 1 bondsba01   2001 476 156 73
## 2 mcgwima01   1998 509 152 70
## 3  sosasa01   1998 643 198 66
## 4 mcgwima01   1999 521 145 65
## 5  sosasa01   2001 577 189 64