It helps define things “from the ground up”:
Data frames in R are tables in database lingo
| R jargon | Database jargon | 
|---|---|
| column | field | 
| row | record | 
| data frame | table | 
| types of the columns | table schema | 
| collection of related data frames | database | 
SQL is its own language, independent of R (similar to when you learned regexes). But we’re goign to learn how to run SQL queries through R
First, we need to install the packages DBI, RSQLite, then we load them into our R session with library()
Also, we need a database file: to run the following examples, download the file up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/baseball.db, and save it in your R working directory
library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")The object con is now a persistent connection to the database baseball.db
dbListTables(con) # List tables in our database ##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "Fielding"           
## [10] "FieldingOF"          "FieldingPost"        "HallOfFame"         
## [13] "Managers"            "ManagersHalf"        "Master"             
## [16] "Pitching"            "PitchingPost"        "Salaries"           
## [19] "Schools"             "SchoolsPlayers"      "SeriesPost"         
## [22] "Teams"               "TeamsFranchises"     "TeamsHalf"          
## [25] "sqlite_sequence"     "xref_stats"dbListFields(con, "Batting") # List fields in Batting table##  [1] "playerID"  "yearID"    "stint"     "teamID"    "lgID"     
##  [6] "G"         "G_batting" "AB"        "R"         "H"        
## [11] "2B"        "3B"        "HR"        "RBI"       "SB"       
## [16] "CS"        "BB"        "SO"        "IBB"       "HBP"      
## [21] "SH"        "SF"        "GIDP"      "G_old"dbListFields(con, "Pitching") # List fields in Pitching table##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "W"       
##  [7] "L"        "G"        "GS"       "CG"       "SHO"      "SV"      
## [13] "IPouts"   "H"        "ER"       "HR"       "BB"       "SO"      
## [19] "BAOpp"    "ERA"      "IBB"      "WP"       "HBP"      "BK"      
## [25] "BFP"      "GF"       "R"batting = dbReadTable(con, "Batting") 
class(batting) ## [1] "data.frame"dim(batting)## [1] 93955    24Now we could go on and perform R operations on batting, since it’s a data frame
This week, we’ll use this route primarily to check our work in SQL; in general, should try to do as much in SQL as possible, since it’s more efficient and can be simpler
SELECTMain tool in the SQL language: SELECT, which allows you to perform queries on a particular table in a database. It has the form:
SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC | DESC]
  LIMIT offset,count;WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional
To pick out five columns from the table “Batting”, and only look at the first 10 rows:
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "LIMIT 10"))##     playerID yearID  AB   H HR
## 1  aardsda01   2004   0   0  0
## 2  aardsda01   2006   2   0  0
## 3  aardsda01   2007   0   0  0
## 4  aardsda01   2008   1   0  0
## 5  aardsda01   2009   0   0  0
## 6  aaronha01   1954 468 131 13
## 7  aaronha01   1955 602 189 27
## 8  aaronha01   1956 609 200 26
## 9  aaronha01   1957 615 198 44
## 10 aaronha01   1958 601 196 30This is our very first successful SQL query (congrats!)
We can replicate this simple command on the imported data frame:
batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]##     playerID yearID  AB   H HR
## 1  aardsda01   2004   0   0  0
## 2  aardsda01   2006   2   0  0
## 3  aardsda01   2007   0   0  0
## 4  aardsda01   2008   1   0  0
## 5  aardsda01   2009   0   0  0
## 6  aaronha01   1954 468 131 13
## 7  aaronha01   1955 602 189 27
## 8  aaronha01   1956 609 200 26
## 9  aaronha01   1957 615 198 44
## 10 aaronha01   1958 601 196 30To reiterate: this was simply to check our work, and we wouldn’t actually want to do this on a large database, since it’d be much more inefficient to first read into an R data frame, and then call R commands)
ORDER BYWe can use the ORDER BY option in SELECT to specify an ordering for the rows
Default is ascending order; add DESC for descending
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
                      "FROM Batting",
                      "ORDER BY HR DESC",
                      "LIMIT 5"))##    playerID yearID  AB   H HR
## 1 bondsba01   2001 476 156 73
## 2 mcgwima01   1998 509 152 70
## 3  sosasa01   1998 643 198 66
## 4 mcgwima01   1999 521 145 65
## 5  sosasa01   2001 577 189 64