Reading and Organizing the Data

The first difficulty is that the data file is not a rectangular table. It has four columns (Soprano, Alto, Tenor, Bass), but the columns are not of equal length, so read.table will not be able to read the data. Instead, use scan with skip=1 to pass over the line of column labels; scan reads the numbers across each line in turn and returns them as one long vector.

> height <- scan("singers.dat",skip=1)
> height
  [1] 64 65 69 72 62 62 72 70 66 68 71 72 65 67 66 69 60 67 76 73 61 63 74 71 65
 [26] 67 71 72 66 66 66 68 65 63 68 68 63 72 67 71 67 62 70 66 65 61 65 68 62 66
 [51] 72 71 65 64 70 73 68 60 68 73 65 61 73 70 63 66 66 68 65 66 68 70 62 66 67
 [76] 75 65 62 64 68 66 70 71 62 65 70 65 64 74 63 63 70 65 65 75 66 69 75 65 61
[101] 69 62 66 72 65 65 71 66 61 70 65 63 71 61 64 68 65 67 70 66 66 75 65 68 72
[126] 62 66 72 70 69
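
To see why scan copes with ragged columns, here is a small sketch in R syntax, which S-PLUS largely shares. The file contents and the names demo.file and demo.height are made up for illustration and are not the singers data; tempfile, writeLines and unlink are R functions and may need adapting in S-PLUS.

```r
# Hypothetical miniature of a ragged four-column file (not the real singers.dat)
demo.file <- tempfile()
writeLines(c("Soprano Alto Tenor Bass",
             "64 65 69 72",
             "62 62 72 70",
             "66 68 71",
             "65"),
           demo.file)

# skip=1 passes over the label line; the numbers are then read across each row
demo.height <- scan(demo.file, skip = 1)
demo.height          # 12 values, in row-by-row order
unlink(demo.file)    # remove the temporary file
```

The short rows cause no trouble because scan only looks for the next number, not for a fixed number of fields per line.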

Now the data are in S-PLUS, but they are not arranged in any useful way. The analysis would be easier if the data were represented as two vectors, one for height and one for the corresponding singing part. Since the height vector mixes the four parts together, this could be tricky, but there are two approaches to the problem. The first is to sort the height vector so that the sopranos come first, with the altos, tenors, and basses to follow. The second is to leave the height vector as it is, and generate the singing part vector in a way that accounts for the ragged columns in the data.

The second way looks easier, so let's give that a shot. The first step is to generate a vector for singing part, which we can fill in as we go.

> part <- rep("", 130)

The data are in four columns for the first 20 lines, and the height vector was formed by reading across the data. This means that the first, fifth, ninth, etc. entries are sopranos, the second, sixth, tenth, etc. are altos and so on. In S-PLUS, "a:b" means "the integers from a to b", and we can use this shortcut to generate the indices which we need. The subscript 1 + 0:19*4 will pick out the first, fifth, ninth, etc. through 77th entries (which are sopranos).

> part[1 + 0:19*4] <- "soprano"
> part[2 + 0:19*4] <- "alto"
> part[3 + 0:19*4] <- "tenor"
> part[4 + 0:19*4] <- "bass"
> part
  [1] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
  [8] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [15] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [22] "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"
 [29] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
 [36] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [43] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [50] "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"
 [57] "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"    "tenor"
 [64] "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano" "alto"
 [71] "tenor"   "bass"    "soprano" "alto"    "tenor"   "bass"    "soprano"
 [78] "alto"    "tenor"   "bass"    ""        ""        ""        ""
 [85] ""        ""        ""        ""        ""        ""        ""
 [92] ""        ""        ""        ""        ""        ""        ""
 [99] ""        ""        ""        ""        ""        ""        ""
[106] ""        ""        ""        ""        ""        ""        ""
[113] ""        ""        ""        ""        ""        ""        ""
[120] ""        ""        ""        ""        ""        ""        ""
[127] ""        ""        ""        ""
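
The index expression relies on operator precedence: the colon is evaluated before multiplication, so 0:19*4 means (0:19)*4 rather than 0:(19*4). A quick check, written in R syntax and using the made-up name idx:

```r
idx <- 1 + 0:19*4        # same as 1 + (0:19)*4
idx                      # 1 5 9 ... 77
length(idx)              # 20 positions, one per four-column row
range(idx)               # first and last soprano positions in this block

# Grouping the other way gives something quite different:
length(1 + 0:(19*4))     # 77 consecutive integers
```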

The next 15 lines of the data file are in three columns, since there are only 20 tenors but at least 35 singers in every other group. In this block the sopranos are entries 81, 84, 87, and so on through 123.

> part[81 + 0:14*3] <- "soprano"
> part[82 + 0:14*3] <- "alto"
> part[83 + 0:14*3] <- "bass"
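
It is easy to slip by one when the stride changes from 4 to 3, so it may be worth confirming that the three index sets cover positions 81 through 125 exactly once. A small check in R syntax; sop, alt and bas are names introduced here for illustration only:

```r
sop <- 81 + 0:14*3    # soprano positions in the three-column block
alt <- 82 + 0:14*3    # alto positions
bas <- 83 + 0:14*3    # bass positions

range(sop)                              # 81 123
all(sort(c(sop, alt, bas)) == 81:125)   # TRUE: no gaps and no overlaps
```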

The next row of the data consists of one soprano and one bass, and the last three rows have basses only. That means the 126th entry is a soprano and entries 127 through 130 are basses.

> part[126] <- "soprano"
> part[127:130] <- "bass"
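
As a cross-check on the index bookkeeping, the same vector can be built by repeating the row patterns described above. This is only a sketch in R syntax; part.check is a made-up name, and the counts in the comments follow from the row layout given in the text.

```r
# Rebuild the singing part vector by repeating each row pattern
part.check <- c(rep(c("soprano", "alto", "tenor", "bass"), 20),  # 20 four-column rows
                rep(c("soprano", "alto", "bass"), 15),           # 15 three-column rows
                "soprano", "bass",                               # one two-column row
                rep("bass", 3))                                  # three bass-only rows

length(part.check)   # 130
table(part.check)    # alto 35, bass 39, soprano 36, tenor 20
```

If part has been filled in as above, all(part == part.check) should be TRUE and no entry of part should still be "".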

Now we have the two vectors that we need for the data frame: height and corresponding singing part. Let's put these data into a 130 by 2 matrix:

> temp.matrix <- matrix(c(height, part), nrow=130, ncol=2)
> rm(height)
> rm(part)

Note that in the matrix, height has been converted to characters: a matrix can hold only one type of data, and singing part is a character vector. The heights will have to be converted back to numbers.
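
The conversion is easy to see on a tiny example. The following sketch, in R syntax, uses three made-up rows and the illustrative names h, p and m:

```r
h <- c(64, 65, 69)                     # numeric heights
p <- c("soprano", "alto", "tenor")     # character singing parts
m <- matrix(c(h, p), nrow = 3, ncol = 2)

m[, 1]               # "64" "65" "69": the heights are now strings
is.character(m)      # TRUE
is.numeric(m[, 1])   # FALSE
```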

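Now that we have the data in a matrix, we can convert it to a data frame and put the two columns in the form that we will want. Two details matter here. The columns of the new data frame may be stored as factors, so Height is passed through as.character before as.numeric; otherwise we would get factor codes instead of heights. Also, part has already been removed, so the ordered factor is built from the Part column of the data frame:

> singers.frame <- data.frame(temp.matrix)
> rm(temp.matrix)
> names(singers.frame) <- c("Height","Part")
> singers.frame$Height <- as.numeric(as.character(singers.frame$Height))
> singers.frame$Part <- ordered(as.character(singers.frame$Part),
    levels=c("bass","tenor","alto","soprano"))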
> singers.frame
   Height    Part
 1     64 soprano
 2     65    alto
 3     69   tenor
 4     72    bass
 5     62 soprano
 6     62    alto
 7     72   tenor
 8     70    bass
 9     66 soprano
10     68    alto
11     71   tenor
12     72    bass
13     65 soprano
14     67    alto
15     66   tenor
16     69    bass
17     60 soprano
18     67    alto
19     76   tenor
20     73    bass
...
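
One pitfall is worth checking whenever a column is converted back to numbers: if the column is stored as a factor, as.numeric returns the internal level codes rather than the values. The sketch below, in R syntax with made-up values and the illustrative names f and p, shows the difference and also what the ordered factor buys us:

```r
f <- factor(c("64", "65", "69", "64"))
as.numeric(f)                 # 1 2 3 1: the internal level codes
as.numeric(as.character(f))   # 64 65 69 64: the heights themselves

# An ordered factor compares by position in the levels, not alphabetically
p <- ordered(c("alto", "bass", "soprano"),
             levels = c("bass", "tenor", "alto", "soprano"))
p[1] > p[2]                   # TRUE: "alto" comes after "bass" in the levels
```

A quick look at range(singers.frame$Height) is a cheap safeguard; heights in inches should fall roughly between 60 and 76 for these data.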

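We could also have made the vectors into a data frame directly, provided we did so before removing height and part:

> singers.frame <- data.frame(Height=height, Part=part)

This would have preserved height as a numeric vector and named the columns in the same step.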
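
A minimal sketch of the direct route, in R syntax with four made-up rows. The names h, p and demo.frame are illustrative, and passing the columns as named arguments is a choice made here so that the column names are set at the same time:

```r
h <- c(64, 65, 69, 72)
p <- c("soprano", "alto", "tenor", "bass")

demo.frame <- data.frame(Height = h, Part = p)
is.numeric(demo.frame$Height)   # TRUE: no conversion was needed
dim(demo.frame)                 # 4 2
names(demo.frame)               # "Height" "Part"
```

Part would still need to be turned into an ordered factor with the desired level order, as was done above.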

Question: If you wanted to convert the raw data to a cleaner format in the first way suggested (by putting the height vector in order by singing part), how would you do it?

Pantelis Vlachos
1/15/1999