The first difficulty is that the data file is not arranged in a table.
It is in four columns (Soprano, Alto, Tenor, Bass), but the columns are
not of equal length. This means that read.table will not
be able to read the data. Instead, use scan, skipping over
the column labels.
> height <- scan("singers.dat",skip=1)
> height
[1] 64 65 69 72 62 62 72 70 66 68 71 72 65 67 66 69 60 67 76 73 61 63 74 71 65
[26] 67 71 72 66 66 66 68 65 63 68 68 63 72 67 71 67 62 70 66 65 61 65 68 62 66
[51] 72 71 65 64 70 73 68 60 68 73 65 61 73 70 63 66 66 68 65 66 68 70 62 66 67
[76] 75 65 62 64 68 66 70 71 62 65 70 65 64 74 63 63 70 65 65 75 66 69 75 65 61
[101] 69 62 66 72 65 65 71 66 61 70 65 63 71 61 64 68 65 67 70 66 66 75 65 68 72
[126] 62 66 72 70 69
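The following is a minimal sketch of why scan copes with ragged columns. It is written in R syntax, which should carry over to S-PLUS, though that is an assumption. The file contents and the names toy.file and toy.height are made up for illustration; they are not the real singers.dat.

```r
# Hypothetical miniature of a ragged four-column file (not the real data).
toy.file <- tempfile()
writeLines(c("Soprano Alto Tenor Bass",
             "64 65 69 72",
             "62 62 72 70",
             "66 68 71",
             "60 73"),
           toy.file)

# scan reads whitespace-separated numbers left to right, line by line,
# so the number of fields on each line does not matter.
toy.height <- scan(toy.file, skip = 1)
print(toy.height)

# Remove the temporary file again.
unlink(toy.file)
```

Note that scan keeps no record of which column each value came from, which is why the singing part has to be reconstructed separately.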
Now the data are in S-PLUS, but are not arranged in any useful way.
Analyzing this data set would be made easier if it were represented as
two vectors, one for height and one for corresponding
singing part. Since the height vector is jumbled, this could be tricky,
but there are two approaches to the problem. The first is to sort the
height vector so that the sopranos come first, with the altos, tenors, and
basses to follow. The second is to leave the height vector as it is, and
generate the singing part vector in a way that accounts for the ragged
columns in the data.
The second way looks easier, so let's give that a shot. The first step is to generate a vector for singing part, which we can fill in as we go.
> part <- rep("", 130)
The data are in four columns for the first 20 lines, and the height vector was
formed by reading across the data. This means that the first, fifth, ninth,
etc. entries are sopranos, the second, sixth, tenth, etc. are altos and
so on. In S-PLUS, "a:b" means "the integers from a to b", and we can use
this shortcut to generate the indices which we need. The subscript
1 + 0:19*4 will pick out the first, fifth, ninth, etc. through 77th
entries (which are sopranos).
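As a quick check of how that subscript is evaluated, here is a small sketch in R syntax (assumed to behave the same way in S-PLUS). The names soprano.idx and same.idx are illustrative only.

```r
# The sequence operator binds more tightly than * and +,
# so 1 + 0:19*4 is evaluated as 1 + ((0:19) * 4).
soprano.idx <- 1 + 0:19*4
print(soprano.idx)

# The same positions written with seq(), for comparison.
same.idx <- seq(from = 1, to = 77, by = 4)
print(all(soprano.idx == same.idx))
```

Changing the leading 1 to 2, 3, or 4 shifts the whole sequence by one position, which gives the alto, tenor, and bass entries.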
> part[1 + 0:19*4] <- "soprano"
> part[2 + 0:19*4] <- "alto"
> part[3 + 0:19*4] <- "tenor"
> part[4 + 0:19*4] <- "bass"
> part
[1] "soprano" "alto" "tenor" "bass" "soprano" "alto" "tenor"
[8] "bass" "soprano" "alto" "tenor" "bass" "soprano" "alto"
[15] "tenor" "bass" "soprano" "alto" "tenor" "bass" "soprano"
[22] "alto" "tenor" "bass" "soprano" "alto" "tenor" "bass"
[29] "soprano" "alto" "tenor" "bass" "soprano" "alto" "tenor"
[36] "bass" "soprano" "alto" "tenor" "bass" "soprano" "alto"
[43] "tenor" "bass" "soprano" "alto" "tenor" "bass" "soprano"
[50] "alto" "tenor" "bass" "soprano" "alto" "tenor" "bass"
[57] "soprano" "alto" "tenor" "bass" "soprano" "alto" "tenor"
[64] "bass" "soprano" "alto" "tenor" "bass" "soprano" "alto"
[71] "tenor" "bass" "soprano" "alto" "tenor" "bass" "soprano"
[78] "alto" "tenor" "bass" "" "" "" ""
[85] "" "" "" "" "" "" ""
[92] "" "" "" "" "" "" ""
[99] "" "" "" "" "" "" ""
[106] "" "" "" "" "" "" ""
[113] "" "" "" "" "" "" ""
[120] "" "" "" "" "" "" ""
[127] "" "" "" ""
The next 15 lines of the data file were in three columns, since there were only 20 tenors but at least 35 of every other group. The sopranos are numbers 81, 84, 87, etc.
> part[81 + 0:14*3] <- "soprano"
> part[82 + 0:14*3] <- "alto"
> part[83 + 0:14*3] <- "bass"
The next row of the data consists of one soprano and one bass, and the last three rows have basses only. That means the 126th is a soprano and the rest are basses.
> part[126] <- "soprano"
> part[127:130] <- "bass"
Now we have the two vectors that we need for the data frame: height and corresponding singing part. Let's put these data into a 130 by 2 matrix:
> temp.matrix <- matrix(c(height, part), nrow=130, ncol=2)
> rm(height)
> rm(part)
Note that in the matrix, height has been converted to characters because singing part was a character vector. This will have to be fixed.
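The conversion to characters can be seen on a small scale. This is a sketch in R syntax with made-up toy values (toy.height, toy.part, and toy.matrix are illustrative names, not part of the singers data), and it assumes S-PLUS coerces in the same way.

```r
# Toy values only; not the singers data.
toy.height <- c(64, 65, 69)
toy.part <- c("soprano", "alto", "tenor")

# c() must return a single type, so the numbers become character strings.
toy.matrix <- matrix(c(toy.height, toy.part), nrow = 3, ncol = 2)
print(toy.matrix)
print(mode(toy.matrix))

# Converting the first column back to numeric recovers the numbers.
recovered <- as.numeric(toy.matrix[, 1])
print(recovered)
```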
Now that we have the data in a matrix, we can convert it to a data frame and put the two columns in the form that we will want:
> singers.frame <- data.frame(temp.matrix)
> rm(temp.matrix)
> names(singers.frame) <- c("Height","Part")
> singers.frame$Height <- as.numeric(as.character(singers.frame$Height))
> singers.frame$Part <- ordered(singers.frame$Part, levels=c("bass","tenor",
"alto","soprano"))
> singers.frame
Height Part
1 64 soprano
2 65 alto
3 69 tenor
4 72 bass
5 62 soprano
6 62 alto
7 72 tenor
8 70 bass
9 66 soprano
10 68 alto
11 71 tenor
12 72 bass
13 65 soprano
14 67 alto
15 66 tenor
16 69 bass
17 60 soprano
18 67 alto
19 76 tenor
20 73 bass
...
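One caution about the numeric conversion step, sketched here in R syntax with made-up values. It assumes the Height column was stored as a factor when the data frame was built from the character matrix, which depends on the S-PLUS or R version and its settings. In that case as.numeric returns the internal level codes rather than the heights, and going through as.character avoids the problem.

```r
# Toy heights stored as a factor, as can happen when a character
# column becomes a data frame column.
toy.factor <- factor(c("72", "64", "69"))

# as.numeric on a factor returns the level codes, not the labels.
codes <- as.numeric(toy.factor)
print(codes)

# Converting to character first recovers the original numbers.
heights <- as.numeric(as.character(toy.factor))
print(heights)
```

If the column is already a character vector, the extra as.character call is harmless.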
We could also have made the vectors into a data frame directly, before
removing them with rm, via:
> singers.frame <- data.frame(list(height, part))
This would have preserved height as a numeric vector.
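A variant of the direct route is sketched below in R syntax, using toy stand-ins for the two vectors (toy.height, toy.part, and toy.frame are illustrative names). Naming the arguments to data.frame is a choice made here for the sketch, not the approach used above; it sets the column names in the same step.

```r
# Toy stand-ins for the full height and part vectors.
toy.height <- c(64, 65, 69, 72)
toy.part <- c("soprano", "alto", "tenor", "bass")

# Named arguments become the column names of the data frame.
toy.frame <- data.frame(Height = toy.height, Part = toy.part)

# Order the parts from lowest voice to highest, as in the text.
toy.frame$Part <- ordered(toy.frame$Part,
                          levels = c("bass", "tenor", "alto", "soprano"))

print(is.numeric(toy.frame$Height))
print(levels(toy.frame$Part))
```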
Question: If you wanted to convert the raw data to a cleaner format
in the first way suggested (by putting the height vector
in order by singing part), how would you do it?