A FIRST SESSION WITH SPLUS
==========================

We will demonstrate some exploratory data analysis (EDA) techniques
as a way of learning basic manipulations in SPLUS. The "Prestige"
data come from Fox's (1997) "Applied Regression Analysis, Linear
Models, and Related Methods" textbook. There is a "prestige.dat"
file (data), the first several lines of which look like this:

GOV_ADMINISTRATORS    13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS      12.26 25879  4.02 69.1 1130 prof
ACCOUNTANTS           12.77  9271 15.70 63.4 1171 prof
PURCHASING_OFFICERS   11.42  8865  9.11 56.8 1175 prof
CHEMISTS              14.62  8403 11.68 73.5 2111 prof
PHYSICISTS            15.64 11030  5.13 77.6 2113 prof
BIOLOGISTS            15.09  8258 25.65 72.6 2133 prof
ARCHITECTS            15.44 14163  2.69 78.1 2141 prof
CIVIL_ENGINEERS       14.52 11377  1.03 73.1 2143 prof

and a "Prestige.cbk" file (codebook), which explains what the
variables are:

  1971 Canadian Occupational Prestige Data

  [1] Occupational title
  [2] Average education of incumbents, years
  [3] Average income of incumbents, dollars
  [4] Percent of incumbents who are women
  [5] Pineo-Porter prestige score for occupation
  [6] Canadian Census occupational code
  [7] Type of occupation
        prof = professional and technical
        wc   = white collar
        bc   = blue collar
        ?    = missing (not classified)

  Source: Census of Canada, 1971, Volume 3, Part 6, pp. 19-1--19-21;
  and personal communication from B. Blishen, W. Carroll, and
  C. Moore, Departments of Sociology, York University and
  University of Victoria.

We want to look at the data in various ways in SPLUS.

GETTING STARTED (AND STOPPED)
=============================

To start up SPLUS in UNIX, we type

  % Splus -e

at the UNIX prompt. The "-e" allows us to edit command lines as we
type them. (Alternatively, you could write commands in a text editor
and cut-and-paste them into the SPLUS command line.)

The two most useful functions for reading in data are

  read.table()
  scan()

If you wanted to learn more about one of these you would say

  help(read.table)
  help(scan)

at the SPLUS prompt. In most UNIX versions of SPLUS, if you wanted
the help to appear in a separate window (and also get a menu of help
topics to choose from) you would type

  help.start(gui="motif")

Oh, yes, and while I am thinking about it, the way to get out of
SPLUS is to type

  q()

READING IN THE DATA
===================

First we use read.table() to read the data into a "data frame" and
then we print the first 10 lines to compare with the raw data file.
As you can see below, the SPLUS "prompt" for input is the ">"
character. The prompt for continuation lines (for a two-line
command, say) is the "+" character. (And if a command runs away from
you, Control-C will interrupt it.)

> prestige <- read.table("prestige.dat")
> prestige[1:10,]
                    education income pct.female occ.prestige occ.code occ.type
                       V2    V3    V4   V5   V6   V7
GOV_ADMINISTRATORS  13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS    12.26 25879  4.02 69.1 1130 prof
ACCOUNTANTS         12.77  9271 15.70 63.4 1171 prof
PURCHASING_OFFICERS 11.42  8865  9.11 56.8 1175 prof
CHEMISTS            14.62  8403 11.68 73.5 2111 prof
PHYSICISTS          15.64 11030  5.13 77.6 2113 prof
BIOLOGISTS          15.09  8258 25.65 72.6 2133 prof
ARCHITECTS          15.44 14163  2.69 78.1 2141 prof
CIVIL_ENGINEERS     14.52 11377  1.03 73.1 2143 prof
MINING_ENGINEERS    14.64 11023  0.94 68.8 2153 prof

You can think of this as a matrix whose rows are labelled by the
occupations and whose columns are labelled with the generic names
V2, V3, ..., V7.
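One wrinkle worth flagging before we move on: the codebook uses "?"
for occupations whose type was not classified, and a plain
read.table() call keeps "?" as an ordinary factor level. If your
release of read.table() accepts an na.strings argument (check
help(read.table); I am not certain every S-PLUS version does), a
sketch like the following would read those entries as missing values
instead. The name prestige2 is just for illustration:

> # a sketch, assuming read.table() takes na.strings in your release
> prestige2 <- read.table("prestige.dat", na.strings="?")
> sum(is.na(prestige2$V7))   # should count the 4 unclassified jobs

For this handout we will keep the "?" level, since it shows up
informatively in the factor summaries below.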
DIGRESSION: There are many data types in SPLUS. The basic ones are:

  numeric    - a single floating-point number
  character  - an alphanumeric string enclosed in quotes
  vector     - a vector of elements of a single type (all numeric,
               or all character, say)
  matrix     - a two-dimensional array of elements of a single type
  data.frame - like a matrix, but different columns can have
               different data types
  list       - a "ragged" data frame

There are many more types than this, but these are the ones you will
encounter first and most frequently. SPLUS generally knows smart
things to do with each data type. For example, it knows how to add
and multiply vectors element-by-element:

> x _ c(1,2,3)   # Note the use of "_" as a synonym for "<-"
> y _ c(4,5,6)
> x+y
[1] 5 7 9
> x*y
[1]  4 10 18

and it knows dot products and matrix multiplication:

> X _ matrix(scan(),byrow=T,ncol=3)   # note the use of scan()
                                      # for raw input
1: 1 0 0
4: 0 2 0
7: 0 0 3
10:
> Y _ matrix(sample(1:10,9,replace=T),byrow=T,ncol=3)
> X
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> Y
     [,1] [,2] [,3]
[1,]   10   10    1
[2,]    8    8    6
[3,]   10    9    6
> X %*% Y
     [,1] [,2] [,3]
[1,]   10   10    1
[2,]   16   16   12
[3,]   30   27   18

The square bracket notation is how you index elements of vectors,
matrices and data frames: [2,] is the second row, [,3] is the third
column, [3,4] is the element in row 3, column 4, and so on.

END OF DIGRESSION

Back to the Canadian job prestige data... The variable names V2, V3,
... aren't very informative; we can do better. One way is with the
"names" function:

> names(prestige) _ c(
+ "education",
+ "income",
+ "pct.female",
+ "occ.prestige",
+ "occ.code",
+ "occ.type"
+ )
> prestige[1:10,]
                    education income pct.female occ.prestige occ.code occ.type
GOV_ADMINISTRATORS      13.11  12351      11.16         68.8     1113     prof
GENERAL_MANAGERS        12.26  25879       4.02         69.1     1130     prof
ACCOUNTANTS             12.77   9271      15.70         63.4     1171     prof
PURCHASING_OFFICERS     11.42   8865       9.11         56.8     1175     prof
CHEMISTS                14.62   8403      11.68         73.5     2111     prof
PHYSICISTS              15.64  11030       5.13         77.6     2113     prof
BIOLOGISTS              15.09   8258      25.65         72.6     2133     prof
ARCHITECTS              15.44  14163       2.69         78.1     2141     prof
CIVIL_ENGINEERS         14.52  11377       1.03         73.1     2143     prof
MINING_ENGINEERS        14.64  11023       0.94         68.8     2153     prof

Why is the "prestige" data frame listing names instead of numbers
for the rows and the columns? Because there are names to use. We
could assign names to the rows and columns of a matrix as well:

> dimnames(X) _ list(c("first row","second row","third row"),c("fred",
+ "barney","wilma"))
> X
           fred barney wilma
first row     1      0     0
second row    0      2     0
third row     0      0     3

What does the "dimnames" structure look like? For a matrix with no
dimnames it is NULL. For a matrix with dimnames, like X, it is a
list consisting of two character vectors: the first vector contains
the row names, the second contains the column names.

> dimnames(Y)
NULL
> dimnames(X)
[[1]]:
[1] "first row"  "second row" "third row"

[[2]]:
[1] "fred"   "barney" "wilma"
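Once the rows and columns have names, the names themselves can be
used as subscripts, and logical conditions work as subscripts too. A
few illustrative one-liners (a sketch; output not shown):

> X["second row","barney"]             # name-based indexing of a matrix
> prestige["CHEMISTS","income"]        # the same idea for a data frame
> prestige[prestige$income > 15000,]   # logical subscript: high-income rows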
SOME SIMPLE SUMMARIES
=====================

Well, that was exciting, and a bit more "under the hood" than I had
thought I would go at first. Let's do some summaries of the data.
First, we'll get summaries of all of the columns of the data frame.

> dim(prestige)
[1] 102   6
> for (i in 1:6) print(summary(prestige[,i]))
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 6.38   8.445  10.54 10.74   12.65 15.97
 Min. 1st Qu. Median Mean 3rd Qu.  Max.
  611    4106   5930 6798    8187 25880
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    0   3.592   13.6 28.98    52.2 97.51
 Min. 1st Qu. Median  Mean 3rd Qu. Max.
 14.8   35.22   43.6 46.83   59.28 87.2
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 1113    3120   5135 5402    8312 9517
 ? bc prof wc
 4 45   30 23

That wasn't very informative. Let's try again.

> prestige[1,]
                   education income pct.female occ.prestige occ.code occ.type
GOV_ADMINISTRATORS     13.11  12351      11.16         68.8     1113     prof
> summary(prestige$education)
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 6.38   8.445  10.54 10.74   12.65 15.97
> summary(prestige$income)
 Min. 1st Qu. Median Mean 3rd Qu.  Max.
  611    4106   5930 6798    8187 25880
> summary(prestige$pct.female)
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    0   3.592   13.6 28.98    52.2 97.51
> summary(prestige$occ.prestige)
 Min. 1st Qu. Median  Mean 3rd Qu. Max.
 14.8   35.22   43.6 46.83   59.28 87.2
> summary(prestige$occ.code)
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 1113    3120   5135 5402    8312 9517
> summary(prestige$occ.type)
 ? bc prof wc
 4 45   30 23

That was better. Note the use of the "$" notation as an alternative
way of specifying the columns, or variables, in a data frame. Also
note that the summary for occ.type is different from the others.
Why? What did we get instead?

Weirdly, standard deviation is not among the summaries that SPLUS
gives for numerical data. We have to do that by hand. For example,

> sqrt(var(prestige$income))
[1] 4245.922

If we wanted them all (or all that make sense) we could say

> sqrt(diag(var(prestige[,1:5])))
[1]    2.728444 4245.922227   31.724931   17.204486 2644.993215

What was that??? (The variance matrix of the five numeric columns,
with its diagonal of variances extracted and square-rooted into
standard deviations.)

A better set of summaries might be stem-and-leaf plots. Before
producing them, I will show you a trick that saves typing. If you
"attach" a data frame, like this:

> attach(prestige)

the columns become individual variables (vectors) that you can
access directly. When you are done, "detach()":

> detach()

So...

> attach(prestige)
> stem(education)

N = 102   Median = 10.54
Quartiles = 8.43, 12.71

Decimal point is at the colon

 6 : 4677789
 7 : 1345566667899
 8 : 12344455568889
 9 : 0224556899
10 : 0011135669
11 : 0001111234444566
12 : 133345788
13 : 1668
14 : 124455667
15 : 11224689
16 : 00

> stem(income)

N = 102   Median = 5930.5
Quartiles = 4075, 8206

Decimal point is 3 places to the right of the colon

 0 : 69
 1 : 79
 2 : 44689
 3 : 001125556667999
 4 : 01223345677788
 5 : 1111234566889
 6 : 012335566799
 7 : 01145679
 8 : 0000123344889999
 9 : 36
10 : 4
11 : 004
12 : 45
13 :
14 : 026

High: 17498 19263 25308 25879

> stem(pct.female)

N = 102   Median = 13.6
Quartiles = 3.59, 52.27

Decimal point is 1 place to the right of the colon

0 : zzzzz111111111111122223334444444555667788899
1 : 11112344566777
2 : 0144568
3 : 0134599
4 : 778
5 : 22567
6 : 3889
7 : 1256667
8 : 334
9 : 12366678

> stem(occ.code)

N = 102   Median = 5135
Quartiles = 3117, 8313

Decimal point is 3 places to the right of the colon

1 : 1122
2 : 1111122223333445777
3 : 11111122334
4 : 111112222222222
5 : 11111222
6 : 111111222
7 : 127
8 : 22223333335555666777888888
9 : 1122355

> stem(occ.type)
Error in Summary.factor(x): A factor is not a numeric object
Dumped

What went wrong with the last stem-and-leaf plot?

Note that there are several high-income jobs that fell off the
income stem-and-leaf plot. A cheap way to figure out what they are
is as follows:

> dimnames(prestige)[[1]][order(-income)][1:4]
[1] "GENERAL_MANAGERS"         "PHYSICIANS"
[3] "LAWYERS"                  "OSTEOPATHS_CHIROPRACTORS"

Maybe we will decode this bit of magic in class.

> detach()
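In the meantime, here is the same magic unpacked one step at a time
(a sketch; each line does one piece of the work):

> rows _ dimnames(prestige)[[1]]   # the character vector of occupation names
> o _ order(-prestige$income)      # sorting -income ascending = income descending
> rows[o][1:4]                     # reorder the names by income; keep the top four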
A stem-and-leaf plot is a special case of a histogram. The
histogram, though, is a high-resolution plot in SPLUS, so we also
have to set up a graphics window:

> motif()   # or win.graph() or graphsheet() on a PC

Occupational code is probably not interesting as a quantitative
variable, so we will concentrate on the other four: "education",
"income", "pct.female", "occ.prestige". To get all four histograms
on one graph, we split the graph into pieces:

> par(mfrow=c(2,2))
> for (i in c("education","income","pct.female","occ.prestige")) {
+ hist(prestige[,i])
+ title(i)
+ }

The resulting graph can be printed using a pulldown menu on the
graphics window. The graph can also be saved as a postscript (or
encapsulated postscript) document, which is convenient for printing
later or including in a LaTeX document. This graph is saved as
"prestige-hists.ps".

You can bypass the graphics window and "print directly to
postscript" as follows:

> postscript("mygraph.ps")
> par(mfrow=c(2,2))
> for (i in c("education","income","pct.female","occ.prestige")) {
+ hist(prestige[,i])
+ title(i)
+ }
> dev.off()

Now the graph "mygraph.ps" is complete, and can be printed to one of
our printers in the department, say, with the command
"lpr mygraph.ps".

The histogram estimates the density of a random variable by
summarizing what range of values occurred frequently or
infrequently. A smoother estimate of the density is the so-called
"kernel density estimate" (kde). The basic idea of a kde is to smear
each data point into a little probability density function (called
the "kernel" or the "window"), and then average these little density
functions to produce an overall, smooth density estimate. The wider
the individual "smears", the smoother the overall estimate (the
width of the kernel is usually measured by something like the SD,
which is called the "bandwidth" or the "window width" in this
setting). Inevitably there is a tradeoff to be made between having a
smooth estimate and obscuring the features of the data. There are
some automatic rules for balancing these two goals, but a generally
good approach is to make several different kde's of the same data
and let your eye be the judge.

There are two functions in SPLUS that are convenient for making
kde's:

  ksmooth()
  density()

You should explore both. I find density() a little easier to work
with. In the next graph, I produce two sets of density estimates for
the four quantitative variables in the Canadian job prestige data.
The "default" smoothing parameters for density() are used in the top
row of plots; somewhat smoother estimates (with bandwidths about
twice as wide as the default) are in the bottom row.

> par(mfrow=c(2,4))
> for (i in names(prestige)[1:4]) {
+ plot(density(prestige[,i]),type="l")
+ title(i)
+ }
> for (i in names(prestige)[1:4]) {
+ plot(density(prestige[,i],width=diff(range(prestige[,i]))/4),type="l")
+ title(i)
+ }

This graph is saved as "prestige-kdes.ps".
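If you would rather compare window widths on a single set of axes
instead of across panels, you can overlay the estimates with
lines(). A sketch (the widths 1000 and 5000 are illustrative
guesses for the income scale, not recommendations):

> plot(density(prestige$income), type="l")            # default window width
> lines(density(prestige$income, width=5000), lty=2)  # wider: smoother
> lines(density(prestige$income, width=1000), lty=3)  # narrower: rougher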
To finish off our initial summaries, let's look at some boxplots.
You may remember that the basic idea of a boxplot is something like
this:

                +---+-----+
 * * *  --------|   |     |-------------  *         o
                +---+-----+
OUTLIERS       Q1  MED   Q3   WHISKER   OUTLIER  EXTREME OUTLIER

The simplest way to make a boxplot by hand is to:

 * sort the data using a stem-and-leaf plot;
 * compute the first and third quartiles and the median, and plot
   the box;
 * compute the value IQR = Q3 - Q1 (this is the width of the box);
 * compute the
     - inner fences at Q1 - (1.5)*IQR and Q3 + (1.5)*IQR
     - outer fences at Q1 - (3.0)*IQR and Q3 + (3.0)*IQR
 * plot the whiskers out to the lowest and highest values inside the
   inner fences;
 * decorate the outliers according to whether they fall inside or
   outside each set of fences.

Tukey, who invented boxplots (and stem-and-leaf diagrams, and many
other useful EDA plots), has rather folksy names for the versions of
the quartiles he likes to calculate: he calls Q1 and Q3 the lower
and upper "hinges", and the IQR the "hinge spread". The fence
arithmetic is spelled out in the short sketch below.
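Here is that fence arithmetic carried out on the income variable (a
sketch; quantile() computes the quartiles, and the cutoffs follow
the rules above):

> q _ quantile(prestige$income, c(.25,.75))   # Q1 and Q3
> iqr _ q[2] - q[1]                           # the hinge spread
> inner _ c(q[1] - 1.5*iqr, q[2] + 1.5*iqr)   # inner fences
> outer _ c(q[1] - 3.0*iqr, q[2] + 3.0*iqr)   # outer fences
> prestige$income[prestige$income < inner[1] |
+                 prestige$income > inner[2]] # the values flagged as outliers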
We will make two sets of boxplots for the prestige data. The first
set is for each quantitative variable separately, to try to learn
more about the skewness of these variables.

> par(mfrow=c(2,4))
> for (i in names(prestige)[1:4]) {
+ boxplot(prestige[,i])
+ title(i)
+ }
> for (i in names(prestige)[1:4]) {
+ boxplot(prestige[,i],style.bxp="old")
+ title(i)
+ }

(I have made the boxplots in two styles. I find the second style
easier to look at.) This graph is saved as "prestige-boxes1.ps".

The second set of boxplots is aimed at looking at the relationship
between occupation type ("occ.type") and income.

> par(mfrow=c(1,1))
> boxplot(split(prestige$income,prestige$occ.type),style.bxp="old")

We can find out what the outliers were using a trick like before:

> attach(prestige)
> dimnames(prestige)[[1]][occ.type=="bc"][order(-income[occ.type=="bc"])][1:3]
[1] "PILOTS"       "FIREFIGHTERS" "POLICEMEN"
> dimnames(prestige)[[1]][occ.type=="prof"][order(-income[occ.type=="prof"])][1:3]
[1] "GENERAL_MANAGERS" "PHYSICIANS"       "LAWYERS"
> detach()

This graph is saved as "prestige-boxes2.ps".

TRANSFORMATIONS
===============

There are many reasons to transform data. Power transformations

  x^p    or    (x^p - 1)/p      (p=0 <==> log)

are often used:

 - to reduce skewness in X and/or Y
 - to reduce nonlinearity (in Y)
 - to correct nonconstant spread

The "prestige" data gives examples of each.

Skewness
--------

To assess skewness and other tail problems it is often useful to
think about qq plots, also known as normal quantile plots. The idea
is to plot the quantiles of the observed data against the
corresponding quantiles of a standard normal distribution.
Deviations from a straight line may indicate skewness and other
departures from normality. The help for qqplot says:

  "QQplots are used to assess whether data have a particular
  distribution, or whether two datasets have the same distribution.
  If the distributions are the same, then the plot will be
  approximately a straight line. The extreme points have more
  variability than points toward the center. A plot with a "U" shape
  means that one distribution is skewed relative to the other. An
  "S" shape implies that one distribution has longer tails than the
  other. In the default configuration a plot from qqnorm that is
  bent down on the left and bent up on the right means that the data
  have longer tails than the Gaussian."

It takes some practice to get good at interpreting QQ plots. It is a
good idea to include confidence bands in the plot, to help your eye
see the outliers, but this is not done automatically by most
software. Another approach is simply to look at lots of qq plots of
normal data of the same sample size, to get a feel for the amount of
"insignificant" variability in one's data.

Consider the income data again. Before looking at it, we will look
at some randomly generated normal data to get an idea how much
variability there is in the qqplots.

> par(mfrow=c(3,2))
> for (i in 1:6) {
+ zz _ qqnorm(rnorm(102))
+ abline(lmsreg(zz$x,zz$y))
+ }
> mtext("Some qqplots of Normal data", outer=T,cex=1.5)

This graph is saved as "qqplots-rnorms.ps".

> attach(prestige)
> par(mfrow=c(3,2))
> plot(density(income),type="l")
> points(income,rep(0,102))
> zz _ qqnorm(income)
> abline(lmsreg(zz$x,zz$y))
> plot(density(sqrt(income)),type="l")
> points(sqrt(income),rep(0,102))
> zz _ qqnorm(sqrt(income))
> abline(lmsreg(zz$x,zz$y))
> plot(density(log(income)),type="l")
> points(log(income),rep(0,102))
> zz _ qqnorm(log(income))
> abline(lmsreg(zz$x,zz$y))

I have added plots of the original data so you can see how the data
move as the transformation changes. After trying several other
power transforms, I like the log transform best. This graph is
saved as "income-xforms.ps".

Improving Linearity
-------------------

I'll talk about improving linearity with transformations after we've
discussed regression a little bit.

Nonconstant Spread
------------------

Again, let's see what this is like with the income data, split up
according to occupation type.

> par(mfrow=c(3,1))
> boxplot(split(income,occ.type),style.bxp="old")
> title("income")
> boxplot(split((income^.25-1)/.25,occ.type),style.bxp="old")
> title("(income^.25-1)/.25")
> boxplot(split(log(income),occ.type),style.bxp="old")
> title("log(income)")

The nonconstant spread problem can't be completely fixed with this
data, but I think I like the effects of the 4th-root transformation
better than the log (or the several other powers that I tried).
This graph is saved as "income-fixvar.ps".

The problem of nonconstant variance or nonconstant spread usually
arises when spread increases with mean or median. In nice cases one
can eyeball the power transformation one needs by plotting
x = log(medians) vs y = log(IQR's), for example.
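Here is what such a spread-versus-level plot might look like for
income within occupation type (a sketch, assuming your release has
sapply(); lapply() plus unlist() would do the same job):

> meds _ sapply(split(prestige$income, prestige$occ.type), median)
> iqrs _ sapply(split(prestige$income, prestige$occ.type),
+               function(x) diff(quantile(x, c(.25,.75))))
> plot(log(meds), log(iqrs))   # slope b suggests trying power p = 1 - b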
Transforming Proportions
------------------------

Proportions, which are bounded between 0 and 1, and other bounded
quantities (e.g. SAT scores are bounded between 0 [or really 200]
and 800) do not respond well to simple power transforms. Instead,
common transforms to try are

 - logit, probit and arcsine-sqrt
 - Tukey's folded-power transformations,
     p^q - (1-p)^q     (q=0 <==> logit)
 - when p=0 or 1, try p' = (f+1/2)/(N+1), or an expedient linear
   transform like p' = 0.005 + 0.99*p

Let's look at the raw data again:

> stem(pct.female)

N = 102   Median = 13.6
Quartiles = 3.59, 52.27

Decimal point is 1 place to the right of the colon

0 : zzzzz111111111111122223334444444555667788899
1 : 11112344566777
2 : 0144568
3 : 0134599
4 : 778
5 : 22567
6 : 3889
7 : 1256667
8 : 334
9 : 12366678

We see that we cannot apply most of the standard transformations
directly because there are 0's in the data. So first we "shrink" the
range of pct.female a little bit

> range(pct.female/100)
[1] 0.0000 0.9751
> newpct _ 0.001 + .998*pct.female/100
> range(newpct)
[1] 0.0010000 0.9741498

and now we are looking at

> stem(log(newpct/(1-newpct)))

N = 102   Median = -1.842737
Quartiles = -3.263968, 0.0906805

Decimal point is at the colon

-6 : 99999
-5 : 100
-4 : 99776554210
-3 : 9766543321111
-2 : 99987665543211100
-1 : 9887776664321110
-0 : 8877644111
 0 : 1122357889
 1 : 011222666
 2 : 346
 3 : 12236

> stem(qnorm(newpct))

N = 102   Median = -1.095139
Quartiles = -1.788742, 0.0568172

Decimal point is at the colon

-3 : 11111
-2 : 5555
-2 : 4443333211100
-1 : 99888877776666555
-1 : 44433222221110000
-0 : 99987777655
-0 : 44433110
 0 : 011223
 0 : 55556777779
 1 : 0034
 1 : 578889

> stem(asin(sqrt(newpct)))

N = 102   Median = 0.3787587
Quartiles = 0.1931046, 0.8080605

Decimal point is 1 place to the left of the colon

 0 : 33333888899
 1 : 0001123346677899
 2 : 001113334577899
 3 : 1134455788
 4 : 01133368
 5 : 1223689
 6 : 22388
 7 : 567
 8 : 11356
 9 : 2778
10 : 0246667
11 : 456
12 : 68
13 : 07778
14 : 1

SCATTERPLOT MATRIX, SPIN AND BRUSH
==================================

[ see class demo, I hope ]

TO LEARN MORE
=============

Try to work through some of the tutorials at

  http://www.stat.cmu.edu/~brian/707/S-Tutorial

and other links to SPLUS tips and tricks, especially those
maintained at Statlib:

  http://lib.stat.cmu.edu

and on Howard Seltman's homepage:

  http://www.stat.cmu.edu/~hseltman

or look at

  Krause and Olson, "The Basics of S and S-PLUS", Chapters 3, 4 and 5
  Venables and Ripley, "Modern Applied Statistics with S-PLUS",
    Chapters 1 and 2