Data wrangling

Read and preview data

Our data are usually stored in a csv file, and after loading a csv file into RStudio we have a “data frame”. A data frame is a table in which each row is an observation and each column is a measurement or variable of interest recorded for those observations; unlike a matrix, its columns can hold different types (for example, numbers and text). After loading the tidyverse suite of packages, we use the read_csv() function to load the heart_disease dataset from yesterday:

library(tidyverse)
heart_disease <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease.csv")

## 
## ── Column specification ───────────────────────────────────────────────────────────────
## cols(
##   Cost = col_double(),
##   Age = col_double(),
##   Gender = col_character(),
##   Interventions = col_double(),
##   Drugs = col_double(),
##   ERVisit = col_double(),
##   Complications = col_double(),
##   Comorbidities = col_double(),
##   Duration = col_double(),
##   id = col_double()
## )

By default, read_csv() reads in the dataset as a tbl (aka tibble) object instead of a plain data.frame object. You can read about the differences here, but they are not that meaningful for our purposes.

We can use the functions head() and tail() to view a sample of the data. Use the head() function to view the first 6 rows, then use the tail() function to view the last 3 rows:

# INSERT CODE HERE
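
As a minimal sketch of how the n argument of head() and tail() behaves, the following uses a small made-up table. The name visits and all of its columns and values are hypothetical and are not part of the heart_disease dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:8,
  sex     = c("Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male"),
  age     = c(52, 67, 71, 45, 63, 58, 69, 74),
  cost    = c(120, 890, 430, 60, 1500, 310, 750, 980),
  days    = c(3, 10, 5, 1, 12, 4, 8, 9)
)

first_two  <- head(visits, n = 2)  # first 2 rows
last_three <- tail(visits, n = 3)  # last 3 rows
default_n  <- head(visits)         # n defaults to 6 when omitted

print(first_two)
print(last_three)
```

The same pattern should carry over to any data frame or tibble; only n changes.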

View the dimensions of the data with dim():

# INSERT CODE HERE
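
For illustration, here is what dim() returns on a small made-up table (the table visits and its columns are hypothetical). The result is a vector with the number of rows first and the number of columns second; nrow() and ncol() return each piece on its own.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:8,
  age     = c(52, 67, 71, 45, 63, 58, 69, 74),
  days    = c(3, 10, 5, 1, 12, 4, 8, 9)
)

size <- dim(visits)  # c(number of rows, number of columns)
print(size)
print(nrow(visits))  # rows only
print(ncol(visits))  # columns only
```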

Quickly view summary statistics for all variables with the summary() function:

# Uncomment the following code by deleting the # at the front:
# summary(heart_disease)

View the data structure types with str():

# str(heart_disease)

What’s the difference between the output from summary() and str()?

You can find a description of the dataset here.

Data manipulation with `dplyr`

An easier way to manipulate the data frame is through the dplyr package, which is part of the tidyverse suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns, and summarizing data. Together with group_by(), these let us follow the “split-apply-combine” strategy: split the data into groups, apply a function to each group, and combine the results.

Selecting columns with `select()`

The function select() can be used to select certain columns by name. First create a new table called heart_disease_ad that only contains the Age and Drugs columns:

# INSERT CODE HERE
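
The general pattern is select(table, column1, column2, ...). A minimal sketch on a made-up table follows; visits, visits_small, and the column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:4,
  sex     = c("Female", "Male", "Male", "Female"),
  age     = c(52, 67, 71, 45),
  cost    = c(120, 890, 430, 60)
)

# Keep only the named columns, in the order they are listed
visits_small <- select(visits, age, cost)
print(visits_small)
```

The original table is unchanged; select() returns a new table, which is why the result is assigned to a new name.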

To select all the columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:

# head(select(heart_disease, -Interventions))

To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:

# head(select(heart_disease, Drugs:Duration))

To select all columns that start with certain character strings, use the function starts_with(). Other matching options are:

ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns whose names are in a given group of names (superseded by any_of() and all_of() in recent versions of dplyr)

# Uncomment the following lines of code
# head(select(heart_disease, starts_with("Com")))
# head(select(heart_disease, contains("er")))
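
The selection helpers listed above can be sketched on a small made-up table. The table visits and its column names are hypothetical; the helper functions themselves are the ones dplyr provides.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:3,
  sex     = c("Female", "Male", "Male"),
  age     = c(52, 67, 71),
  cost    = c(120, 890, 430),
  days    = c(3, 10, 5)
)

starts_pa  <- select(visits, starts_with("pa"))        # names beginning with "pa"
ends_e     <- select(visits, ends_with("e"))           # names ending in "e"
has_s      <- select(visits, contains("s"))            # names containing "s"
by_regex   <- select(visits, matches("^(age|days)$"))  # names matching a regular expression
from_names <- select(visits, any_of(c("cost", "not_a_column")))  # silently skips missing names

print(names(has_s))
```

Note that any_of() ignores names that are not in the table, while all_of() would raise an error for them.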

Selecting rows using `filter()`

We can also select the rows/observations that satisfy certain criteria. Try selecting the rows with a Cost greater than 500:

# INSERT CODE HERE

We can also filter on multiple criteria. Select the rows where Age is above 60 and Gender is ‘Male’:

# INSERT CODE HERE
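
A minimal sketch of filter() with one and with several conditions, on a made-up table (visits and its columns are hypothetical). Conditions separated by commas must all hold, which is the same as joining them with &.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:8,
  sex     = c("Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male"),
  age     = c(52, 67, 71, 45, 63, 58, 69, 74),
  days    = c(3, 10, 5, 1, 12, 4, 8, 9)
)

# One condition: stays longer than 8 days
long_stays <- filter(visits, days > 8)

# Two conditions, both required
older_women     <- filter(visits, age > 60, sex == "Female")
older_women_amp <- filter(visits, age > 60 & sex == "Female")  # equivalent form

print(long_stays)
print(older_women)
```

Text values are compared with == and must be quoted, and the comparison is case-sensitive.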

Arrange or re-order rows using `arrange()`

To sort the rows of the data frame we use the function arrange(). The default is increasing order; wrap the column in desc(), or put a minus sign in front of a numeric column, to get decreasing order. First arrange the heart_disease table by Duration in ascending order:

# INSERT CODE HERE

Next by descending order:

# INSERT CODE HERE
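
Both sort directions can be sketched on a made-up table (visits and its columns are hypothetical):

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:8,
  cost    = c(120, 890, 430, 60, 1500, 310, 750, 980)
)

cheapest_first   <- arrange(visits, cost)        # ascending (the default)
priciest_first   <- arrange(visits, desc(cost))  # descending with desc()
priciest_first_2 <- arrange(visits, -cost)       # same result for a numeric column

print(cheapest_first)
print(priciest_first)
```

desc() also works on character columns, where a minus sign would not, so it is usually the safer choice.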

Try combining select(), filter(), and arrange() steps into one pipeline with the %>% operator by:

Selecting the Age, Cost, ERVisit, and Duration columns,
Filtering to keep only rows with Age above 60,
Sorting by Duration in descending order

# INSERT CODE HERE
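
The shape of such a pipeline can be sketched on a made-up table (visits and its columns are hypothetical). The %>% operator passes the result on its left as the first argument of the function on its right, so the steps read top to bottom.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:8,
  sex     = c("Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male"),
  cost    = c(120, 890, 430, 60, 1500, 310, 750, 980),
  days    = c(3, 10, 5, 1, 12, 4, 8, 9)
)

result <- visits %>%
  select(patient, sex, cost) %>%  # keep three columns
  filter(sex == "Male") %>%       # keep only the matching rows
  arrange(desc(cost))             # sort the remaining rows, largest cost first

print(result)
```

The order of the steps matters: a column has to survive select() if a later filter() or arrange() step refers to it.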

Create new columns using `mutate()`

Sometimes the data do not include the variable that we are interested in, and we need to compute it from the existing variables and add it to the data frame. Create a new column cost_per_day by dividing Cost by Duration. Reassign the output to the heart_disease table, following the commented code chunk, so that this column is added to the table:

# heart_disease <- heart_disease %>%
#   mutate(INSERT CODE HERE)
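
As a sketch of the mutate(new_name = expression) pattern on a made-up table, the following adds two columns; visits, cost_thousands, and over_60 are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  patient = 1:4,
  age     = c(52, 67, 71, 45),
  cost    = c(120, 890, 430, 60)
)

# Reassign so the new columns become part of the table
visits <- visits %>%
  mutate(
    cost_thousands = cost / 1000,  # arithmetic on an existing column
    over_60        = age > 60      # logical column from a comparison
  )

print(visits)
```

Without the reassignment, mutate() would only print the extended table and visits itself would stay unchanged.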

Create summaries of the data with `summarize()`

To create summary statistics for a given column in the data frame, we can use the summarize() function. Compute the mean, min, and max of Cost:

# INSERT CODE HERE

The advantage of summarize() is more obvious when we combine it with group_by(), which splits the table into groups so that the summaries are computed separately for each group. Try to group_by() the Gender column first and then compute the same summary statistics:

# INSERT CODE HERE
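
The difference between an ungrouped and a grouped summary can be sketched on a made-up table (visits, its columns, and the summary column names are hypothetical). Without group_by() the result has one row; with it, one row per group.

```r
library(dplyr)

# Hypothetical toy data for illustration; not the heart_disease dataset
visits <- tibble(
  sex  = c("Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male"),
  days = c(3, 10, 5, 1, 12, 4, 8, 9)
)

# One row summarizing the whole table
overall <- visits %>%
  summarize(mean_days = mean(days), min_days = min(days), max_days = max(days))

# Split by sex, apply the same summaries to each group, combine into one table
by_sex <- visits %>%
  group_by(sex) %>%
  summarize(mean_days = mean(days), min_days = min(days), max_days = max(days))

print(overall)
print(by_sex)
```

This is the split-apply-combine idea in miniature: group_by() does the split, summarize() does the apply and the combine.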

June 7th, 2022
