Data frames in R are tables in database lingo
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
collection of data frames | database |
conditional indexing | SELECT , FROM , WHERE , HAVING |
d*ply() |
GROUP BY |
order() |
ORDER BY |
merge() |
INNER JOIN or just JOIN |
SELECT
Main tool in the SQL language: SELECT
, which allows you to perform queries on a particular table in a database. It has the form:
SELECT columns or computations
FROM tabA JOIN tabB USING(key)
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
WHERE
, GROUP BY
, HAVING
, ORDER BY
, LIMIT
are all optional. Importantly, as we’ll see today, in FROM
we can specify more than one table with JOIN
JOIN
In all we’ve seen so far with SELECT
, the FROM
line has just specified one table. But sometimes we need to combine information from many tables. Use the JOIN
option for this
There are 4 options for join
INNER JOIN
or just JOIN
: retain just the rows each table that match the conditionLEFT OUTER JOIN
or just LEFT JOIN
: retain all rows in the first table, and just the rows in the second table that match the conditionRIGHT OUTER JOIN
or just RIGHT JOIN
: retain just the rows in the first table that match the condition, and all rows in the second tableFULL OUTER JOIN
or just FULL JOIN
: retain all rows in both tablesFields that cannot be filled in are assigned NA values
It helps to visualize the join types:
First, we set up our connection to the baseball database:
library(DBI)
library(RSQLite)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname="baseball.db")
dbListFields(con, "Batting") # Columns in Batting
## [1] "playerID" "yearID" "stint" "teamID" "lgID"
## [6] "G" "G_batting" "AB" "R" "H"
## [11] "2B" "3B" "HR" "RBI" "SB"
## [16] "CS" "BB" "SO" "IBB" "HBP"
## [21] "SH" "SF" "GIDP" "G_old"
dbListFields(con, "Salaries") # Columns in Salaries
## [1] "yearID" "teamID" "lgID" "playerID" "salary"
Suppose we want to figure out the average salaries of the players with the top 10 highest homerun averages. Then we’d have to combine the two tables below
dbGetQuery(con, paste("SELECT *",
"FROM Salaries",
"ORDER BY playerID",
"LIMIT 8"))
## yearID teamID lgID playerID salary
## 1 2004 SFN NL aardsda01 300000
## 2 2007 CHA AL aardsda01 387500
## 3 2008 BOS AL aardsda01 403250
## 4 2009 SEA AL aardsda01 419000
## 5 2010 SEA AL aardsda01 2750000
## 6 1986 BAL AL aasedo01 600000
## 7 1987 BAL AL aasedo01 625000
## 8 1988 BAL AL aasedo01 675000
dbGetQuery(con, paste("SELECT yearID, teamID, lgID, playerID, HR",
"FROM Batting",
"ORDER BY playerID",
"LIMIT 8"))
## yearID teamID lgID playerID HR
## 1 2004 SFN NL aardsda01 0
## 2 2006 CHN NL aardsda01 0
## 3 2007 CHA AL aardsda01 0
## 4 2008 BOS AL aardsda01 0
## 5 2009 SEA AL aardsda01 0
## 6 2010 SEA AL aardsda01 0
## 7 1954 ML1 NL aaronha01 13
## 8 1955 ML1 NL aaronha01 27
We can use a JOIN
on the pair: yearID
, playerID
dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
"FROM Batting JOIN Salaries USING(yearID, playerID)",
"ORDER BY playerID",
"LIMIT 8"))
## yearID playerID salary HR
## 1 2004 aardsda01 300000 0
## 2 2007 aardsda01 387500 0
## 3 2008 aardsda01 403250 0
## 4 2009 aardsda01 419000 0
## 5 2010 aardsda01 2750000 0
## 6 1986 aasedo01 600000 NA
## 7 1987 aasedo01 625000 NA
## 8 1988 aasedo01 675000 NA
Note that here we’re missing of David Aardsma’s records from the Batting table (i.e., the JOIN
discarded 1 record)
We can replicate this using merge()
on imported data frames:
batting = dbReadTable(con, "Batting")
salaries = dbReadTable(con, "Salaries")
merged = merge(x=batting, y=salaries, by.x=c("yearID","playerID"),
by.y=c("yearID","playerID"))
merged[order(merged$playerID)[1:8],
c("yearID", "playerID", "salary", "HR")]
## yearID playerID salary HR
## 16708 2004 aardsda01 300000 0
## 19378 2007 aardsda01 387500 0
## 20277 2008 aardsda01 403250 0
## 21164 2009 aardsda01 419000 0
## 21990 2010 aardsda01 2750000 0
## 585 1986 aasedo01 600000 NA
## 1360 1987 aasedo01 625000 NA
## 2033 1988 aasedo01 675000 NA
For demonstration purposes, we can use a LEFT JOIN
on the pair: yearID
, playerID
dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
"FROM Batting LEFT JOIN Salaries USING(yearID, playerID)",
"LIMIT 7"))
## yearID playerID salary HR
## 1 2004 aardsda01 300000 0
## 2 2006 aardsda01 NA 0
## 3 2007 aardsda01 387500 0
## 4 2008 aardsda01 403250 0
## 5 2009 aardsda01 419000 0
## 6 1954 aaronha01 NA 13
## 7 1955 aaronha01 NA 27
Now we can see that we have all 5 of David Aardsma’s original records from the Batting table (i.e., the LEFT JOIN
used them all, and just filled in an NA value when it was missing his salary)
Currently, RIGHT JOIN
and FULL JOIN
are not implemented in the RSQLite
package
Now, as to our original question (average salaries of the players with the top 10 highest homerun averages):
dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
"FROM Batting JOIN Salaries USING(yearID, playerID)",
"GROUP BY playerID",
"ORDER BY Avg(HR) DESC",
"LIMIT 10"))
## playerID AVG(HR) AVG(salary)
## 1 howarry01 45.80000 9051000.0
## 2 pujolal01 40.80000 8953204.1
## 3 fieldpr01 38.00000 3882900.0
## 4 rodrial01 36.05882 15553897.2
## 5 reynoma01 34.66667 550777.7
## 6 bondsba01 34.63636 8556605.5
## 7 mcgwima01 34.29412 4814020.8
## 8 gonzaca01 34.00000 406000.0
## 9 dunnad01 33.50000 6969500.0
## 10 kingmda01 32.50000 908750.0