The first difficulty is that the data file is not arranged in a table.
It is in four columns (Soprano, Alto, Tenor, Bass), but the columns are
not of equal length. This means that read.table will not be able to
read the data. Instead, use scan, skipping over the column labels.
> height <- scan("singers.dat",skip=1)
> height
  [1] 64 65 69 72 62 62 72 70 66 68 71 72 65 67 66 69 60 67 76 73 61 63 74 71 65
 [26] 67 71 72 66 66 66 68 65 63 68 68 63 72 67 71 67 62 70 66 65 61 65 68 62 66
 [51] 72 71 65 64 70 73 68 60 68 73 65 61 73 70 63 66 66 68 65 66 68 70 62 66 67
 [76] 75 65 62 64 68 66 70 71 62 65 70 65 64 74 63 63 70 65 65 75 66 69 75 65 61
[101] 69 62 66 72 65 65 71 66 61 70 65 63 71 61 64 68 65 67 70 66 66 75 65 68 72
[126] 62 66 72 70 69

Now the data are in S-PLUS, but are not arranged in any useful way. Analyzing this data set would be made easier if it were represented as two vectors, one for height and one for corresponding singing part. Since the height vector is jumbled, this could be tricky, but there are two approaches to the problem. The first is to sort the height vector so that the sopranos come first, with the altos, tenors, and basses to follow. The second is to leave the height vector as it is, and generate the singing part vector in a way that accounts for the ragged columns in the data.
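To see where the jumbled order comes from, here is a minimal sketch using a tiny made-up file (not the real singers.dat). It is written in S syntax and checked against R semantics; S-PLUS behaviour is assumed to match. The names toy.file and toy are hypothetical, and each toy value encodes its position: the tens digit is the row and the units digit is the column.

```r
# Toy ragged file, NOT the real singers.dat.
# Row 3 has no tenor; row 4 has only a soprano and a bass.
toy.file <- tempfile()
cat("Soprano Alto Tenor Bass",
    "11 12 13 14",
    "21 22 23 24",
    "31 32    34",
    "41       44",
    "",                      # empty last element gives the file a final newline
    file = toy.file, sep = "\n")

# scan treats any run of blanks as a separator and reads across each row,
# so the missing cells simply vanish from the result.
toy <- scan(toy.file, skip = 1)   # skip = 1 drops the label line
toy
# 11 12 13 14 21 22 23 24 31 32 34 41 44

unlink(toy.file)                  # remove the temporary file
```

The result runs row by row, which is why the part labels have to be reconstructed from the row layout rather than from column positions.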
The second way looks easier, so let's give that a shot. The first step is to generate a vector for singing part, which we can fill in as we go.
> part <- rep("", 130)

The data are in four columns for the first 20 lines, and the height vector was formed by reading across the data. This means that the first, fifth, ninth, etc. entries are sopranos, the second, sixth, tenth, etc. are altos and so on. In S-PLUS, "a:b" means "the integers from a to b", and we can use this shortcut to generate the indices which we need. The subscript 1 + 0:19*4 will pick out the first, fifth, ninth, etc. through 77th entries (which are sopranos).
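The subscript works because ":" is evaluated before "*", which in turn is evaluated before "+". A short sketch of the arithmetic, checked against R semantics with S-PLUS assumed to behave the same way (the name sop.index is hypothetical):

```r
# 0:19 is built first, then multiplied by 4, then shifted by 1.
sop.index <- 1 + 0:19*4

sop.index[1:5]     # 1 5 9 13 17
max(sop.index)     # 77
length(sop.index)  # 20, one index per row of the four-column block

# The same expression with the precedence spelled out.
explicit <- 1 + ((0:19) * 4)
all(sop.index == explicit)   # TRUE
```

Writing the parentheses explicitly, as in the last expression, is a reasonable habit when the precedence is not obvious at a glance.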
> part[1 + 0:19*4] <- "soprano"
> part[2 + 0:19*4] <- "alto"
> part[3 + 0:19*4] <- "tenor"
> part[4 + 0:19*4] <- "bass"
> part
  [1] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
  [8] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [15] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [22] "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"
 [29] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
 [36] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [43] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [50] "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"
 [57] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
 [64] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [71] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [78] "alto"    "tenor"   "bass"    ""        ""        ""        ""
 [85] ""        ""        ""        ""        ""        ""        ""
 [92] ""        ""        ""        ""        ""        ""        ""
 [99] ""        ""        ""        ""        ""        ""        ""
[106] ""        ""        ""        ""        ""        ""        ""
[113] ""        ""        ""        ""        ""        ""        ""
[120] ""        ""        ""        ""        ""        ""        ""
[127] ""        ""        ""        ""

The next 15 lines of the data file are in three columns, since there are only 20 tenors but at least 35 of every other group. The sopranos are numbers 81, 84, 87, etc.
> part[81 + 0:14*3] <- "soprano"
> part[82 + 0:14*3] <- "alto"
> part[83 + 0:14*3] <- "bass"

The next row of the data consists of one soprano and one bass, and the last three rows have basses only. That means the 126th is a soprano and the rest are basses.
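As a cross-check on the index arithmetic, the row layout described so far (20 rows of four, 15 rows of three, one row with a soprano and a bass, three rows of bass only) can also be written as a single rep-based expression. This is a sketch checked against R semantics, not the method used in the text; the name layout is hypothetical, and the counts in the comments follow from the described layout rather than from the file itself.

```r
# Build the labels block by block, in the order scan reads them.
layout <- c(rep(c("soprano", "alto", "tenor", "bass"), 20),  # entries 1-80
            rep(c("soprano", "alto", "bass"), 15),           # entries 81-125
            "soprano",                                       # entry 126
            rep("bass", 4))                                  # entries 127-130

length(layout)   # 130, matching the length of height
table(layout)    # alto 35, bass 39, soprano 36, tenor 20
```

If the finished part vector agrees with this one element for element, the subscripts were set up correctly.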
> part[126] <- "soprano"
> part[127:130] <- "bass"

Now we have the two vectors that we need for the data frame: height and corresponding singing part. Let's put these data into a 130 by 2 matrix:
> temp.matrix <- matrix(c(height, part), nrow=130, ncol=2)
> rm(height)
> rm(part)

Note that in the matrix, height has been converted to characters because singing part was a character vector. This will have to be fixed.
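The conversion happens because a matrix can hold only one mode of data. A minimal sketch with three entries, checked against R semantics and assumed to carry over to S-PLUS (the names h, p and m are hypothetical):

```r
h <- c(64, 65, 69)                    # numeric heights
p <- c("soprano", "alto", "tenor")    # character labels

# c() must pick one mode for the combined vector, so the numbers become strings.
m <- matrix(c(h, p), nrow = 3, ncol = 2)

mode(m)              # "character"
m[, 1]               # "64" "65" "69"
as.numeric(m[, 1])   # 64 65 69, the heights recovered as numbers
```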
Now that we have the data in a matrix, we can convert it to a data frame and put the two columns in the form that we will want:
> singers.frame <- data.frame(temp.matrix)
> rm(temp.matrix)
> names(singers.frame) <- c("Height","Part")
> singers.frame$Height <- as.numeric(as.character(singers.frame$Height))
> singers.frame$Part <- ordered(as.factor(singers.frame$Part), levels=c("bass","tenor", "alto","soprano"))
> singers.frame
   Height    Part
1      64 soprano
2      65    alto
3      69   tenor
4      72    bass
5      62 soprano
6      62    alto
7      72   tenor
8      70    bass
9      66 soprano
10     68    alto
11     71   tenor
12     72    bass
13     65 soprano
14     67    alto
15     66   tenor
16     69    bass
17     60 soprano
18     67    alto
19     76   tenor
20     73    bass
...

Height is passed through as.character first so that the conversion yields the heights themselves even if data.frame stored the column as a factor. Part is rebuilt from the data frame column because the part vector itself was removed above.

We could also have made the vectors into a data frame via:
> singers.frame <- data.frame(list(Height=height, Part=part))

This would have preserved height as a numeric vector, but it has to be done before height and part are removed.
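A short sketch of why the direct route avoids the repair step: a data frame keeps each column in its own mode. It is checked against R semantics and assumed to carry over to S-PLUS; the names h, p and d are hypothetical, and the three entries are only a stand-in for the full vectors.

```r
h <- c(64, 65, 69)                    # numeric heights
p <- c("soprano", "alto", "tenor")    # character labels

# Each column keeps its own mode, so no conversion back to numbers is needed.
d <- data.frame(Height = h, Part = p)
is.numeric(d$Height)   # TRUE

# Part still has to be turned into an ordered factor, as in the main text.
d$Part <- ordered(as.factor(d$Part),
                  levels = c("bass", "tenor", "alto", "soprano"))
levels(d$Part)         # "bass" "tenor" "alto" "soprano"
```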
Question: If you wanted to convert the raw data to a cleaner format
in the first way suggested (by putting the height vector in order by
singing part), how would you do it?