Our data are usually presented as a csv file and after loading a csv file into the R studio, we will achieve a data frame. A data frame can be considered a special case of matrix where each column usually represent a variable and all variables have exactly the same length.
nba <- read.csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/intro_r/nba_2020_player_stats.csv")
We can use the functions head() and tail() to view a sample of the data frame. Examples are viewing the first six rows of the data frame “nba” and the last three rows.
head(nba)
## player position age team games games_started
## 1 Steven Adams C 26 OKC 58 58
## 2 Bam Adebayo PF 22 MIA 65 65
## 3 LaMarcus Aldridge C 34 SAS 53 53
## 4 Nickeil Alexander-Walker SG 21 NOP 41 0
## 5 Grayson Allen SG 24 MEM 30 0
## 6 Jarrett Allen C 21 BRK 64 58
## minutes_played field_goals field_goal_attempts three_pointers
## 1 1564 262 443 1
## 2 2235 408 719 1
## 3 1754 391 793 61
## 4 501 77 227 40
## 5 498 79 176 33
## 6 1647 267 413 0
## three_point_attempts two_pointers two_point_attempts free_throws
## 1 3 261 440 108
## 2 13 407 706 236
## 3 157 330 636 158
## 4 117 37 110 17
## 5 91 46 85 30
## 6 5 267 408 147
## free_throw_attempts offensive_rebounds defensive_rebounds assists steals
## 1 183 196 347 141 50
## 2 342 165 518 333 78
## 3 191 103 289 129 36
## 4 28 8 72 74 11
## 5 35 5 61 43 6
## 6 237 195 410 85 37
## blocks turnovers personal_fouls
## 1 65 86 111
## 2 85 185 164
## 3 87 74 128
## 4 7 40 46
## 5 1 23 36
## 6 85 72 144
tail(nba, 3)
## player position age team games games_started minutes_played
## 568 Cody Zeller C 27 CHO 58 39 1341
## 569 Ante ŽižiÄ C 23 CLE 22 0 221
## 570 Ivica Zubac C 22 LAC 64 62 1156
## field_goals field_goal_attempts three_pointers three_point_attempts
## 568 251 479 18 75
## 569 41 72 0 0
## 570 202 336 0 2
## two_pointers two_point_attempts free_throws free_throw_attempts
## 568 233 404 122 179
## 569 41 72 14 19
## 570 202 334 107 141
## offensive_rebounds defensive_rebounds assists steals blocks turnovers
## 568 160 251 88 40 25 75
## 569 18 48 6 7 5 10
## 570 166 295 69 13 60 53
## personal_fouls
## 568 140
## 569 27
## 570 146
To view the dimentions of the data frame, use the function dim()
dim(nba)
## [1] 570 22
Tow view the summary statistics for all variables and the data structure types, use summary() and str().
summary(nba)
## player position age team
## Anthony Tolliver: 3 C : 96 Min. :19.00 MIN : 24
## Jordan McRae : 3 PF:119 1st Qu.:22.00 DET : 22
## Alec Burks : 2 PG:100 Median :25.00 GSW : 22
## Alex Len : 2 SF:104 Mean :25.61 SAC : 22
## Allen Crabbe : 2 SG:151 3rd Qu.:28.00 ATL : 21
## Andre Drummond : 2 Max. :43.00 CLE : 21
## (Other) :556 (Other):438
## games games_started minutes_played field_goals
## Min. : 1.00 Min. : 0.00 Min. : 1.0 Min. : 0.0
## 1st Qu.:14.00 1st Qu.: 0.00 1st Qu.: 206.0 1st Qu.: 25.0
## Median :40.00 Median : 4.00 Median : 715.5 Median : 92.5
## Mean :35.97 Mean :17.04 Mean : 823.4 Mean :139.1
## 3rd Qu.:56.00 3rd Qu.:33.00 3rd Qu.:1396.8 3rd Qu.:216.0
## Max. :66.00 Max. :65.00 Max. :2243.0 Max. :623.0
##
## field_goal_attempts three_pointers three_point_attempts
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 54.5 1st Qu.: 2.25 1st Qu.: 9.25
## Median : 210.0 Median : 22.00 Median : 69.00
## Mean : 302.6 Mean : 41.33 Mean :115.63
## 3rd Qu.: 479.0 3rd Qu.: 65.00 3rd Qu.:186.50
## Max. :1386.0 Max. :271.00 Max. :769.00
##
## two_pointers two_point_attempts free_throws free_throw_attempts
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 16.00 1st Qu.: 30.25 1st Qu.: 7.00 1st Qu.: 10.25
## Median : 58.00 Median :114.50 Median : 32.00 Median : 44.00
## Mean : 97.78 Mean :186.94 Mean : 60.15 Mean : 78.02
## 3rd Qu.:153.00 3rd Qu.:278.75 3rd Qu.: 76.00 3rd Qu.:102.75
## Max. :540.00 Max. :998.00 Max. :619.00 Max. :719.00
##
## offensive_rebounds defensive_rebounds assists steals
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 7.00 1st Qu.: 25.0 1st Qu.: 12.00 1st Qu.: 5.00
## Median : 21.00 Median : 88.0 Median : 47.50 Median : 20.00
## Mean : 34.47 Mean :118.4 Mean : 82.95 Mean : 26.09
## 3rd Qu.: 48.75 3rd Qu.:182.5 3rd Qu.:109.00 3rd Qu.: 40.00
## Max. :244.00 Max. :653.0 Max. :636.00 Max. :115.00
##
## blocks turnovers personal_fouls
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 9.00 1st Qu.: 20.00
## Median : 9.00 Median : 33.50 Median : 64.00
## Mean : 16.85 Mean : 47.29 Mean : 70.13
## 3rd Qu.: 22.00 3rd Qu.: 70.00 3rd Qu.:113.75
## Max. :187.00 Max. :289.00 Max. :246.00
##
str(nba)
## 'data.frame': 570 obs. of 22 variables:
## $ player : Factor w/ 514 levels "Aaron Gordon",..: 446 31 311 378 181 220 277 6 268 303 ...
## $ position : Factor w/ 5 levels "C","PF","PG",..: 1 2 1 5 5 1 5 2 4 2 ...
## $ age : int 26 22 34 21 24 21 27 29 26 26 ...
## $ team : Factor w/ 30 levels "ATL","BOS","BRK",..: 21 16 27 19 15 3 20 22 3 15 ...
## $ games : int 58 65 53 41 30 64 10 18 3 59 ...
## $ games_started : int 58 65 53 0 0 58 0 2 0 20 ...
## $ minutes_played : int 1564 2235 1754 501 498 1647 117 380 17 1140 ...
## $ field_goals : int 262 408 391 77 79 267 19 25 1 138 ...
## $ field_goal_attempts : int 443 719 793 227 176 413 44 86 6 280 ...
## $ three_pointers : int 1 1 61 40 33 0 5 9 0 16 ...
## $ three_point_attempts: int 3 13 157 117 91 5 16 36 3 62 ...
## $ two_pointers : int 261 407 330 37 46 267 14 16 1 122 ...
## $ two_point_attempts : int 440 706 636 110 85 408 28 50 3 218 ...
## $ free_throws : int 108 236 158 17 30 147 7 19 1 43 ...
## $ free_throw_attempts : int 183 342 191 28 35 237 11 29 2 66 ...
## $ offensive_rebounds : int 196 165 103 8 5 195 2 24 0 54 ...
## $ defensive_rebounds : int 347 518 289 72 61 410 7 63 2 203 ...
## $ assists : int 141 333 129 74 43 85 21 21 0 132 ...
## $ steals : int 50 78 36 11 6 37 5 18 0 46 ...
## $ blocks : int 65 85 87 7 1 85 2 8 1 31 ...
## $ turnovers : int 86 185 74 40 23 72 8 17 0 55 ...
## $ personal_fouls : int 111 164 128 46 36 144 7 27 2 85 ...
An easier way to manipulate the data frame is through the package “dplyr”, which is included in the package “tidyverse”. The operation we can do includes: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. “Split-apply-combine” concept can also be achieved by “dplyr”. Before we use the package, remember to call it:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
The function select() can be use to select certain column with the column names. The first parameter is the name of the data frame. For example, if I only need to view the column “player” and “games”:
nba_pg <- select(nba, player, games)
head(nba_pg)
## player games
## 1 Steven Adams 58
## 2 Bam Adebayo 65
## 3 LaMarcus Aldridge 53
## 4 Nickeil Alexander-Walker 41
## 5 Grayson Allen 30
## 6 Jarrett Allen 64
To select all the columns except a specific column, use the -
(subtraction) operator. For example,
head(select(nba, -player))
## position age team games games_started minutes_played field_goals
## 1 C 26 OKC 58 58 1564 262
## 2 PF 22 MIA 65 65 2235 408
## 3 C 34 SAS 53 53 1754 391
## 4 SG 21 NOP 41 0 501 77
## 5 SG 24 MEM 30 0 498 79
## 6 C 21 BRK 64 58 1647 267
## field_goal_attempts three_pointers three_point_attempts two_pointers
## 1 443 1 3 261
## 2 719 1 13 407
## 3 793 61 157 330
## 4 227 40 117 37
## 5 176 33 91 46
## 6 413 0 5 267
## two_point_attempts free_throws free_throw_attempts offensive_rebounds
## 1 440 108 183 196
## 2 706 236 342 165
## 3 636 158 191 103
## 4 110 17 28 8
## 5 85 30 35 5
## 6 408 147 237 195
## defensive_rebounds assists steals blocks turnovers personal_fouls
## 1 347 141 50 65 86 111
## 2 518 333 78 85 185 164
## 3 289 129 36 87 74 128
## 4 72 74 11 7 40 46
## 5 61 43 6 1 23 36
## 6 410 85 37 85 72 144
To select a range of columns by name, use the :
(colon) operator. For example,
head(select(nba, player:games))
## player position age team games
## 1 Steven Adams C 26 OKC 58
## 2 Bam Adebayo PF 22 MIA 65
## 3 LaMarcus Aldridge C 34 SAS 53
## 4 Nickeil Alexander-Walker SG 21 NOP 41
## 5 Grayson Allen SG 24 MEM 30
## 6 Jarrett Allen C 21 BRK 64
To select all columns that start with certain character strings, use the function starts_with(). Ohter matching options are:
ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns names that are from a group of names
head(select(nba, starts_with("three")))
## three_pointers three_point_attempts
## 1 1 3
## 2 1 13
## 3 61 157
## 4 40 117
## 5 33 91
## 6 0 5
head(select(nba, contains("throw")))
## free_throws free_throw_attempts
## 1 108 183
## 2 236 342
## 3 158 191
## 4 17 28
## 5 30 35
## 6 147 237
We can also select the rows that satisfied certain criteria. For example, if I want to select the rows that the number of assists is larger than 500.
filter(nba, assists > 500)
## player position age team games games_started minutes_played
## 1 LeBron James PG 35 LAL 60 60 2094
## 2 Ricky Rubio PG 29 PHO 57 57 1802
## 3 Trae Young PG 21 ATL 60 60 2120
## field_goals field_goal_attempts three_pointers three_point_attempts
## 1 586 1176 133 381
## 2 252 611 66 188
## 3 546 1249 205 568
## two_pointers two_point_attempts free_throws free_throw_attempts
## 1 453 795 239 343
## 2 186 423 174 204
## 3 341 681 481 559
## offensive_rebounds defensive_rebounds assists steals blocks turnovers
## 1 59 414 636 74 30 239
## 2 42 223 507 88 9 155
## 3 32 223 560 65 8 289
## personal_fouls
## 1 106
## 2 143
## 3 104
We can also filter mutiple criteria.
filter(nba, age > 30, team %in% c("HOU", "GSW"))
## player position age team games games_started minutes_played
## 1 Ryan Anderson PF 31 HOU 2 0 14
## 2 DeMarre Carroll SF 33 HOU 6 0 96
## 3 Tyson Chandler C 37 HOU 26 5 219
## 4 Stephen Curry PG 31 GSW 5 5 139
## 5 Eric Gordon SG 31 HOU 34 13 972
## 6 Jeff Green PF 33 HOU 10 0 201
## 7 Jeremy Pargo PG 33 GSW 3 0 44
## 8 Thabo Sefolosha PF 35 HOU 41 0 436
## 9 P.J. Tucker PF 34 HOU 64 64 2203
## 10 Russell Westbrook PG 31 HOU 53 53 1905
## field_goals field_goal_attempts three_pointers three_point_attempts
## 1 2 7 1 5
## 2 12 25 4 14
## 3 14 18 0 0
## 4 33 82 12 49
## 5 162 438 92 288
## 6 41 66 14 34
## 7 11 22 3 7
## 8 37 91 15 54
## 9 162 374 95 257
## 10 568 1199 51 201
## two_pointers two_point_attempts free_throws free_throw_attempts
## 1 1 2 0 0
## 2 8 11 12 16
## 3 14 18 6 13
## 4 21 33 26 26
## 5 70 150 78 103
## 6 27 32 8 10
## 7 8 15 0 1
## 8 22 37 3 8
## 9 67 117 37 44
## 10 517 998 269 346
## offensive_rebounds defensive_rebounds assists steals blocks turnovers
## 1 0 7 2 1 0 1
## 2 4 12 7 2 2 6
## 3 25 41 6 6 8 8
## 4 4 22 33 5 2 16
## 5 8 57 52 22 12 36
## 6 9 23 12 9 5 9
## 7 1 2 8 1 0 4
## 8 21 72 25 24 12 15
## 9 102 339 101 70 31 67
## 10 94 332 370 88 17 236
## personal_fouls
## 1 1
## 2 8
## 3 32
## 4 11
## 5 68
## 6 19
## 7 8
## 8 43
## 9 209
## 10 187
Before we go any futher, let’s introduce the pipe operator: %>%
. This operator allows you to pipe the output from one function to the input of another function. For example, the following two statements will generate the same output:
head(select(nba, player, games))
## player games
## 1 Steven Adams 58
## 2 Bam Adebayo 65
## 3 LaMarcus Aldridge 53
## 4 Nickeil Alexander-Walker 41
## 5 Grayson Allen 30
## 6 Jarrett Allen 64
nba %>% select(player, games) %>% head
## player games
## 1 Steven Adams 58
## 2 Bam Adebayo 65
## 3 LaMarcus Aldridge 53
## 4 Nickeil Alexander-Walker 41
## 5 Grayson Allen 30
## 6 Jarrett Allen 64
The reason why we introduce the pipe operator is to easily combine multiple functions without using too many parentheses or name the variables in each step.
To arrange the data frame by a specific order we need to use the function arrange(). The default is by increasing order and a negaive operator will provide the decreasing order.
head(arrange(nba, personal_fouls))
## player position age team games games_started
## 1 Kostas Antetokounmpo PF 22 LAL 3 0
## 2 Troy Daniels SG 28 DEN 1 0
## 3 Jacob Evans SG 22 MIN 2 0
## 4 Kyle Guy SG 22 SAC 2 0
## 5 Jared Harper PG 22 PHO 3 0
## 6 Talen Horton-Tucker SG 19 LAL 2 0
## minutes_played field_goals field_goal_attempts three_pointers
## 1 5 0 0 0
## 2 1 0 1 0
## 3 4 0 1 0
## 4 4 1 2 0
## 5 8 1 4 0
## 6 5 0 1 0
## three_point_attempts two_pointers two_point_attempts free_throws
## 1 0 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 1 1 0
## 5 2 1 2 0
## 6 1 0 0 0
## free_throw_attempts offensive_rebounds defensive_rebounds assists steals
## 1 0 0 1 1 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 2 1
## blocks turnovers personal_fouls
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
head(arrange(nba, -personal_fouls))
## player position age team games games_started minutes_played
## 1 Dillon Brooks SG 24 MEM 65 65 1851
## 2 Jaren Jackson Jr. C 20 MEM 54 54 1512
## 3 P.J. Tucker PF 34 HOU 64 64 2203
## 4 James Harden SG 30 HOU 61 61 2241
## 5 Daniel Theis C 27 BOS 58 57 1382
## 6 Dwight Howard C 34 LAL 62 2 1193
## field_goals field_goal_attempts three_pointers three_point_attempts
## 1 372 925 132 358
## 2 328 701 135 340
## 3 162 374 95 257
## 4 603 1386 271 769
## 5 216 382 26 81
## 6 188 257 3 5
## two_pointers two_point_attempts free_throws free_throw_attempts
## 1 240 567 144 178
## 2 193 361 123 166
## 3 67 117 37 44
## 4 332 617 619 719
## 5 190 301 81 106
## 6 185 252 89 180
## offensive_rebounds defensive_rebounds assists steals blocks turnovers
## 1 61 152 128 57 25 108
## 2 53 201 76 36 87 94
## 3 102 339 101 70 31 67
## 4 64 324 450 106 53 273
## 5 130 250 95 35 75 48
## 6 156 301 42 27 76 73
## personal_fouls
## 1 246
## 2 219
## 3 209
## 4 206
## 5 201
## 6 200
It can also combine with the select() and filter(), with the pipe operator.
nba %>% select(player, team, age, games) %>% arrange(-age) %>% filter(games > 50) %>% head
## player team age games
## 1 Vince Carter ATL 43 60
## 2 LeBron James LAL 35 60
## 3 J.J. Redick NOP 35 54
## 4 LaMarcus Aldridge SAS 34 53
## 5 Taj Gibson NYK 34 62
## 6 Dwight Howard LAL 34 62
Sometimes the data does not provide the variable that we are interested in directly and in that case, we can manipulate the current variables and add new variables into the data frame. For example, if I am interested in number of fouls per game,
nba %>% mutate(foul_per_game = personal_fouls / games) %>% head
## player position age team games games_started
## 1 Steven Adams C 26 OKC 58 58
## 2 Bam Adebayo PF 22 MIA 65 65
## 3 LaMarcus Aldridge C 34 SAS 53 53
## 4 Nickeil Alexander-Walker SG 21 NOP 41 0
## 5 Grayson Allen SG 24 MEM 30 0
## 6 Jarrett Allen C 21 BRK 64 58
## minutes_played field_goals field_goal_attempts three_pointers
## 1 1564 262 443 1
## 2 2235 408 719 1
## 3 1754 391 793 61
## 4 501 77 227 40
## 5 498 79 176 33
## 6 1647 267 413 0
## three_point_attempts two_pointers two_point_attempts free_throws
## 1 3 261 440 108
## 2 13 407 706 236
## 3 157 330 636 158
## 4 117 37 110 17
## 5 91 46 85 30
## 6 5 267 408 147
## free_throw_attempts offensive_rebounds defensive_rebounds assists steals
## 1 183 196 347 141 50
## 2 342 165 518 333 78
## 3 191 103 289 129 36
## 4 28 8 72 74 11
## 5 35 5 61 43 6
## 6 237 195 410 85 37
## blocks turnovers personal_fouls foul_per_game
## 1 65 86 111 1.913793
## 2 85 185 164 2.523077
## 3 87 74 128 2.415094
## 4 7 40 46 1.121951
## 5 1 23 36 1.200000
## 6 85 72 144 2.250000
TO create summary statistics for a given column in the data frame, we can use summarise() function. For example, if I need to extract the mean, min, max number of assists,
nba %>% summarise(avg_assist = mean(assists),
min_assist = min(assists),
max_assist = max(assists))
## avg_assist min_assist max_assist
## 1 82.95088 0 636
The advantage of summarise is more obvious if we combine it with the group_by() , the group operators. Since players at the different position tend to have very different statistics, we can group the players by their poisition first and then generate summary statistics. For example,
nba %>% group_by(position) %>% summarise(avg_assist = mean(assists),
min_assist = min(assists),
max_assist = max(assists))
## # A tibble: 5 x 4
## position avg_assist min_assist max_assist
## <fct> <dbl> <int> <int>
## 1 C 62.4 0 446
## 2 PF 56.5 0 333
## 3 PG 165. 0 636
## 4 SF 60.8 0 340
## 5 SG 77.5 0 450
#R code here
# location of the dataset: http://www.stat.cmu.edu/cmsac/sure/materials/data/intro_r/mlb_teams_data.csv
#R code here
#R code here
#R code here
5.(option) Compare following statements and discuss the difference. Hint: read the help document of “regex”.
head(select(mlb, contains("runs")))
head(select(mlb, matches("runs")))
head(select(mlb, contains("runs.*")))
head(select(mlb, matches("runs.*")))
#R code here
#R code here
#R code here
#R code here
#R code here