Our data usually come as a CSV file, and after loading a CSV file into RStudio we have a “data frame”. A data frame is a table, similar to a matrix, in which each row corresponds to an observation and each column holds a measurement or variable of interest for that observation; unlike a matrix, its columns can have different types. After loading the tidyverse suite of packages, we use the read_csv() function to load the heart_disease dataset from yesterday:
library(tidyverse)
heart_disease <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────
## cols(
##   Cost = col_double(),
##   Age = col_double(),
##   Gender = col_character(),
##   Interventions = col_double(),
##   Drugs = col_double(),
##   ERVisit = col_double(),
##   Complications = col_double(),
##   Comorbidities = col_double(),
##   Duration = col_double(),
##   id = col_double()
## )
By default, read_csv() reads in the dataset as a tbl (aka tibble) object instead of a data.frame object. You can read about the differences here, but they are not that meaningful for our purposes.
We can use the functions head() and tail() to view the first or last few rows of the data. Use the head() function to view the first 6 rows, then use the tail() function to view the last 3 rows:
# INSERT CODE HERE
View the dimensions of the data with dim():
# INSERT CODE HERE
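As a quick illustration of how these three functions behave, here is a minimal sketch on a made-up table. The name toy and its values are hypothetical and are not part of the heart_disease data.

```r
library(dplyr)

# Hypothetical toy table with 10 rows (made-up values, not heart_disease)
toy <- tibble(id = 1:10, score = id * 2)

head(toy)         # first 6 rows by default
tail(toy, n = 3)  # last 3 rows; n controls how many rows are shown
dim(toy)          # number of rows, then number of columns
```

Both head() and tail() accept the n argument, so the same pattern works for any number of rows.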
Quickly view summary statistics for all variables with the summary() function:
# Uncomment the following code by deleting the # at the front:
# summary(heart_disease)
View the structure of the data, including the type of each column, with str():
# str(heart_disease)
What’s the difference between the output from the two functions?
You can find a description of the dataset here.
dplyr
An easier way to manipulate a data frame is through the dplyr package, which is part of the tidyverse suite of packages. Its operations include selecting specific columns, filtering rows, re-ordering rows, adding new columns, and summarizing data. Together with group_by(), these let us carry out the “split-apply-combine” strategy: split the data into groups, apply a function to each group, and combine the results.
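To make “split-apply-combine” concrete, here is a minimal sketch that does the three steps by hand in base R and then with dplyr. The table visits, its columns, and its values are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  clinic   = c("north", "south", "north", "south"),
  wait_min = c(12, 30, 8, 40)
)

# Base R: split the values by group, then apply mean() to each piece;
# sapply() also combines the results into one named vector
pieces <- split(visits$wait_min, visits$clinic)
means  <- sapply(pieces, mean)

# dplyr: group_by() does the split, summarize() applies and combines
by_clinic <- visits %>%
  group_by(clinic) %>%
  summarize(mean_wait = mean(wait_min))
```

Both versions should give the same group means; the dplyr version returns them as a table with one row per group.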
select()
The function select() can be used to select certain columns by name. First create a new table called heart_disease_ad that only contains the Age and Drugs columns:
# INSERT CODE HERE
To select all the columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:
# head(select(heart_disease, -Interventions))
To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:
# head(select(heart_disease, Drugs:Duration))
To select all columns that start with a certain character string, use the function starts_with(). Other matching options are:
ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns whose names are in a given group of names (superseded by any_of() and all_of() in current versions of the tidyverse)
# Uncomment the following lines of code
# head(select(heart_disease, starts_with("Com")))
# head(select(heart_disease, contains("er")))
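The other matching helpers work the same way as starts_with(). Here is a minimal sketch on a made-up table; visits and its column names are hypothetical, and any_of() is used in place of one_of().

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  patient   = c("a", "b", "c"),
  wait_min  = c(12, 30, 5),
  visit_min = c(15, 20, 10),
  copay     = c(20, 0, 35)
)

select(visits, ends_with("_min"))  # columns whose names end in "_min"
select(visits, contains("pa"))     # columns whose names contain "pa"
select(visits, matches("^[wv]"))   # columns matching a regular expression
select(visits, any_of(c("copay", "not_a_column")))  # names that exist are kept, others are ignored
```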
filter()
We can also select the rows/observations that satisfy certain criteria with the function filter(). Try selecting the rows with Cost above 500:
# INSERT CODE HERE
We can also filter on multiple criteria. Select the rows where Age is above 60 and Gender is ‘Male’:
# INSERT CODE HERE
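As a sketch of how conditions combine in filter(), the example below uses a made-up table; visits, its columns, and its values are hypothetical. Conditions separated by commas must all hold, while | means “or”.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  patient  = c("a", "b", "c", "d"),
  clinic   = c("north", "south", "north", "south"),
  wait_min = c(12, 30, 5, 45)
)

# One condition
long_waits <- filter(visits, wait_min > 20)

# Two conditions separated by a comma: both must be TRUE
north_long <- filter(visits, wait_min > 10, clinic == "north")

# Either condition may be TRUE
either <- filter(visits, wait_min > 40 | clinic == "north")
```

Note that character values are compared with == and must be quoted.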
arrange()
To sort the rows of a data frame by a column, we use the function arrange(). The default is increasing order; wrapping the column in desc() (or, for numeric columns, prefixing it with the - operator) gives decreasing order. First arrange the heart_disease table by Duration in ascending order:
# INSERT CODE HERE
Next by descending order:
# INSERT CODE HERE
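Here is a minimal sketch of the sorting options on a made-up table; visits and its values are hypothetical. It also shows that arrange() accepts several columns, using later ones to order rows within each value of the earlier ones.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  patient  = c("a", "b", "c", "d"),
  clinic   = c("north", "south", "north", "south"),
  wait_min = c(12, 30, 5, 45)
)

ascending  <- arrange(visits, wait_min)        # smallest wait first
descending <- arrange(visits, desc(wait_min))  # largest wait first

# Sort by clinic, then by decreasing wait within each clinic
nested_sort <- arrange(visits, clinic, desc(wait_min))
```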
Try combining a pipeline of select(), filter(), and arrange() steps together with the %>% operator by:
Selecting the Age, Cost, ERVisit, and Duration columns,
Filtering to rows with Age above 60,
Arranging by Duration in descending order
# INSERT CODE HERE
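If the %>% operator is new to you: it passes the result on its left as the first argument of the function on its right, so x %>% f(y) is the same as f(x, y). The sketch below writes one small pipeline both ways on a made-up table; visits and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  patient  = c("a", "b", "c", "d"),
  clinic   = c("north", "south", "north", "south"),
  wait_min = c(12, 30, 5, 45)
)

# Nested calls: read from the inside out
nested <- arrange(
  filter(select(visits, patient, wait_min), wait_min > 10),
  desc(wait_min)
)

# The same steps with the pipe: read from top to bottom
piped <- visits %>%
  select(patient, wait_min) %>%
  filter(wait_min > 10) %>%
  arrange(desc(wait_min))
```

The two objects should contain the same rows in the same order; the piped version is usually easier to read and to extend.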
mutate()
Sometimes the data do not include the variable that we are interested in, and we need to compute new variables from the existing ones and add them to the data frame. Create a new column cost_per_day by dividing Cost by Duration (reassign this output to the heart_disease table, following the commented code chunk, so that this column is added to the table):
# heart_disease <- heart_disease %>%
# mutate(INSERT CODE HERE)
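As a sketch of the mutate() pattern on a made-up table (visits, its columns, and the new column names are all hypothetical), the example below adds two columns in one call. A column created earlier in the same mutate() call can be used by a later one.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  patient   = c("a", "b", "c"),
  wait_min  = c(12, 30, 5),
  visit_min = c(15, 20, 10)
)

# Reassign so the new columns are kept in the table
visits <- visits %>%
  mutate(
    total_min  = wait_min + visit_min,  # new column from two existing ones
    wait_share = wait_min / total_min   # uses the column created just above
  )
```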
summarize()
To create summary statistics for a given column in the data frame, we can use the summarize() function. Compute the mean, min, and max of Cost:
# INSERT CODE HERE
The advantage of summarize() is more obvious when we combine it with group_by(), which splits the table into groups so that the summaries are computed within each group. Try to group_by() the Gender column first and then compute the same summary statistics:
# INSERT CODE HERE
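To show the difference between an overall and a grouped summary, here is a minimal sketch on a made-up table; visits, its columns, and its values are hypothetical. Without group_by(), summarize() returns a single row; with it, one row per group.

```r
library(dplyr)

# Hypothetical toy table (made-up values)
visits <- tibble(
  clinic = c("north", "south", "north", "south"),
  copay  = c(20, 0, 35, 10)
)

# One row summarizing the whole table
overall <- visits %>%
  summarize(
    mean_copay = mean(copay),
    min_copay  = min(copay),
    max_copay  = max(copay)
  )

# One row per clinic; n() counts the rows in each group
by_clinic <- visits %>%
  group_by(clinic) %>%
  summarize(n = n(), mean_copay = mean(copay))
```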