It helps define things “from the ground up”:
Data frames in R are tables in database lingo
R jargon | Database jargon |
---|---|
column | field |
row | record |
data frame | table |
types of the columns | table schema |
collection of related data frames | database |
SQL is its own language, independent of R (similar to when you learned regexes). But we’re goign to learn how to run SQL queries through R
First, we need to install the packages DBI
, RSQLite
, then we load them into our R session with library()
Also, we need a database file: to run the following examples, download the file up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/baseball.db, and save it in your R working directory
library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")
The object con
is now a persistent connection to the database baseball.db
dbListTables(con) # List tables in our database
## [1] "AllstarFull" "Appearances" "AwardsManagers"
## [4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
## [7] "Batting" "BattingPost" "Fielding"
## [10] "FieldingOF" "FieldingPost" "HallOfFame"
## [13] "Managers" "ManagersHalf" "Master"
## [16] "Pitching" "PitchingPost" "Salaries"
## [19] "Schools" "SchoolsPlayers" "SeriesPost"
## [22] "Teams" "TeamsFranchises" "TeamsHalf"
## [25] "sqlite_sequence" "xref_stats"
dbListFields(con, "Batting") # List fields in Batting table
## [1] "playerID" "yearID" "stint" "teamID" "lgID"
## [6] "G" "G_batting" "AB" "R" "H"
## [11] "2B" "3B" "HR" "RBI" "SB"
## [16] "CS" "BB" "SO" "IBB" "HBP"
## [21] "SH" "SF" "GIDP" "G_old"
dbListFields(con, "Pitching") # List fields in Pitching table
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "W"
## [7] "L" "G" "GS" "CG" "SHO" "SV"
## [13] "IPouts" "H" "ER" "HR" "BB" "SO"
## [19] "BAOpp" "ERA" "IBB" "WP" "HBP" "BK"
## [25] "BFP" "GF" "R"
batting = dbReadTable(con, "Batting")
class(batting)
## [1] "data.frame"
dim(batting)
## [1] 93955 24
Now we could go on and perform R operations on batting
, since it’s a data frame
This week, we’ll use this route primarily to check our work in SQL; in general, should try to do as much in SQL as possible, since it’s more efficient and can be simpler
SELECT
Main tool in the SQL language: SELECT
, which allows you to perform queries on a particular table in a database. It has the form:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
WHERE
, GROUP BY
, HAVING
, ORDER BY
, LIMIT
are all optional
To pick out five columns from the table “Batting”, and only look at the first 10 rows:
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"LIMIT 10"))
## playerID yearID AB H HR
## 1 aardsda01 2004 0 0 0
## 2 aardsda01 2006 2 0 0
## 3 aardsda01 2007 0 0 0
## 4 aardsda01 2008 1 0 0
## 5 aardsda01 2009 0 0 0
## 6 aaronha01 1954 468 131 13
## 7 aaronha01 1955 602 189 27
## 8 aaronha01 1956 609 200 26
## 9 aaronha01 1957 615 198 44
## 10 aaronha01 1958 601 196 30
This is our very first successful SQL query (congrats!)
We can replicate this simple command on the imported data frame:
batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]
## playerID yearID AB H HR
## 1 aardsda01 2004 0 0 0
## 2 aardsda01 2006 2 0 0
## 3 aardsda01 2007 0 0 0
## 4 aardsda01 2008 1 0 0
## 5 aardsda01 2009 0 0 0
## 6 aaronha01 1954 468 131 13
## 7 aaronha01 1955 602 189 27
## 8 aaronha01 1956 609 200 26
## 9 aaronha01 1957 615 198 44
## 10 aaronha01 1958 601 196 30
To reiterate: this was simply to check our work, and we wouldn’t actually want to do this on a large database, since it’d be much more inefficient to first read into an R data frame, and then call R commands)
ORDER BY
We can use the ORDER BY
option in SELECT
to specify an ordering for the rows
Default is ascending order; add DESC
for descending
dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
"FROM Batting",
"ORDER BY HR DESC",
"LIMIT 5"))
## playerID yearID AB H HR
## 1 bondsba01 2001 476 156 73
## 2 mcgwima01 1998 509 152 70
## 3 sosasa01 1998 643 198 66
## 4 mcgwima01 1999 521 145 65
## 5 sosasa01 2001 577 189 64