36-350
12 November 2014
Reading: Spector, chapter 3; handout on class website
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
bunch of related dataframes | database |
debt[debt$Country=="France",c("growth","ratio")]
with(debt,debt[Country=="France",c("growth","ratio")])
subset(x=debt,subset=(Country=="France"),select=c("growth","ratio"))
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC|DESC]
LIMIT offset,count;
SELECT
is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables
The final semi-colon is obligatory
SELECT PlayerID,yearID,AB,H FROM Batting;
Four columns from table Batting
SELECT * FROM Salaries;
All columns from table Salaries
SELECT * FROM Salaries ORDER BY Salary;
As above, but by ascending value of Salary
SELECT * FROM Salaries ORDER BY Salary DESC;
Descending order
SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10;
top 10 salaries
Picking out rows meeting a condition
SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0;
vs.
Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")]
SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting;
SELECT AB,H,H/CAST(AB AS REAL) FROM Batting;
<small>Because `AB` and `H` are integers, and it won't give you a fractional part by default</small>
SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting
ORDER BY BattingAvg DESC LIMIT 10;
We can do calculations on value-grouped subsets, like in aggregate
or d*ply
SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID
WHERE
GROUP BY
HAVING
SELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID
HAVING totalSalary > 200000000
FROM
has just been one tablepatient_last |
patient_first |
physician_id |
complaint |
---|---|---|---|
Morgan | Dexter | 37010 | insomnia |
Soprano | Anthony | 79676 | malaise |
Swearengen | Albert | NA | healthy as a goddam horse |
Garrett | Alma | 90091 | nerves |
Holmes | Sherlock | 43675 | nicotine-patch addiction |
physician_last |
physician_first |
physicianID |
plan |
---|---|---|---|
Meridian | Emmett | 37010 | UPMC |
Melfi | Jennifer | 79676 | BCBS |
Cochran | Amos | 90091 | UPMC |
Watson | John | 43675 | VA |
merge
to link the tables up by physicianID
physician_id
or physicianID
is acting as the key or unique identifiermerge
, it has JOIN
as a modifier to FROM
SELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia"
Creates a (virtual) table linking records where physician_id
in one table matches physicianID
in the other
SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting
USING(playerID);
INNER JOIN ... USING
links records with the same value of playerID
LEFT OUTER JOIN
includes records from the first table which don't match any record in the 2nd
NA
in the 2nd table's fieldsRIGHT OUTER JOIN
is just what you'd think
FULL OUTER JOIN
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
bunch of dataframes | database |
selections, subset |
SELECT ... FROM ... WHERE ... HAVING |
aggregate , d*ply |
GROUP BY |
merge |
JOIN |
order |
ORDER BY |
DBI
is a unified interface to theminstall.packages("DBI", dependencies = TRUE) # Install DBI
install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite
library(RSQLite)
drv <- dbDriver('SQLite')
con <- dbConnect(drv, dbname="baseball.db")
con
is now a persistent connection to the database baseball.db
dbListTables(con) # Get tables in the database (returns vector)
dbListFields(con, name) # List fields in a table
dbReadTable(con, name) # Import a table as a data frame
dbGetQuery(conn, statement)
df <- dbGetQuery(con, paste(
"SELECT nameLast,nameFirst,yearID,salary",
"FROM Master NATURAL JOIN Salaries"))
Usual workflow:
sqldf
package lets you use SQL commands on dataframes