A FIRST SESSION WITH SPLUS
==========================

We will demonstrate some exploratory data analysis (EDA) techniques
as a way of learning basic manipulations in SPLUS. The "Prestige"
data come from Fox's (1997) "Applied Regression Analysis, Linear
Models, and Related Methods" textbook. There is a "prestige.dat"
file (data), the first several lines of which look like this:

GOV_ADMINISTRATORS    13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS      12.26 25879  4.02 69.1 1130 prof
ACCOUNTANTS           12.77  9271 15.70 63.4 1171 prof
PURCHASING_OFFICERS   11.42  8865  9.11 56.8 1175 prof
CHEMISTS              14.62  8403 11.68 73.5 2111 prof
PHYSICISTS            15.64 11030  5.13 77.6 2113 prof
BIOLOGISTS            15.09  8258 25.65 72.6 2133 prof
ARCHITECTS            15.44 14163  2.69 78.1 2141 prof
CIVIL_ENGINEERS       14.52 11377  1.03 73.1 2143 prof

and a "Prestige.cbk" file (codebook), which explains what the
variables are:

  1971 Canadian Occupational Prestige Data

  [1] Occupational title
  [2] Average education of incumbents, years
  [3] Average income of incumbents, dollars
  [4] Percent of incumbents who are women
  [5] Pineo-Porter prestige score for occupation
  [6] Canadian Census occupational code
  [7] Type of occupation
        prof = professional and technical
        wc   = white collar
        bc   = blue collar
        ?    = missing (not classified)

  Source: Census of Canada, 1971, Volume 3, Part 6, pp. 19-1--19-21;
  and personal communication from B. Blishen, W. Carroll, and
  C. Moore, Departments of Sociology, York University and
  University of Victoria.

We want to look at the data in various ways in SPLUS.

GETTING STARTED (AND STOPPED)
=============================

To start up SPLUS in UNIX, we type

  % Splus -e

at the UNIX prompt. The "-e" allows us to edit command lines as we
type them. (Alternatively, you could write commands in a text editor
and cut-and-paste them into the SPLUS command line.)

The two most useful functions for reading in data are

  read.table()
  scan()

If you wanted to learn more about one of these you would say

  help(read.table)
  help(scan)

at the SPLUS prompt. In most UNIX versions of SPLUS, if you wanted
the help to appear in a separate window (and also get a menu of help
topics to choose from) you would type

  help.start(gui="motif")

Oh, yes, and while I am thinking about it, the way to get out of
SPLUS is to type

  q()

READING IN THE DATA
===================

First we use read.table() to read the data into a "data frame" and
then we print the first 10 lines to compare with the raw data file.
As you can see below, the SPLUS "prompt" for input is the ">"
character. The prompt for continuation lines (for a two-line
command, say) is the "+" character. (And if a command runs away from
you, Control-C will interrupt it.)

> prestige <- read.table("prestige.dat")
> prestige[1:10,]
                    education income pct.female occ.prestige occ.code occ.type
                       V2    V3    V4   V5   V6   V7
GOV_ADMINISTRATORS  13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS    12.26 25879  4.02 69.1 1130 prof
ACCOUNTANTS         12.77  9271 15.70 63.4 1171 prof
PURCHASING_OFFICERS 11.42  8865  9.11 56.8 1175 prof
CHEMISTS            14.62  8403 11.68 73.5 2111 prof
PHYSICISTS          15.64 11030  5.13 77.6 2113 prof
BIOLOGISTS          15.09  8258 25.65 72.6 2133 prof
ARCHITECTS          15.44 14163  2.69 78.1 2141 prof
CIVIL_ENGINEERS     14.52 11377  1.03 73.1 2143 prof
MINING_ENGINEERS    14.64 11023  0.94 68.8 2153 prof

You can think of this as a matrix whose rows are labelled by the
occupations and whose columns are labelled with the generic names
V2, V3, ..., V7.
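One wrinkle worth flagging before we move on: the codebook uses "?"
for occupations whose type was not classified, and a plain
read.table() call keeps "?" as an ordinary factor level. If your
release of read.table() accepts an na.strings argument (check
help(read.table); I am not certain every S-PLUS version does), a
sketch like the following would read those entries as missing values
instead. The name prestige2 is just for illustration:

> # a sketch, assuming read.table() takes na.strings in your release
> prestige2 <- read.table("prestige.dat", na.strings="?")
> sum(is.na(prestige2$V7))   # should count the 4 unclassified jobs

For this handout we will keep the "?" level, since it shows up
informatively in the factor summaries below.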
DIGRESSION: There are many data types in SPLUS. The basic ones are:

  numeric    - a single floating-point number
  character  - an alphanumeric string enclosed in quotes
  vector     - a vector of elements of a single type (all numeric,
               or all character, say)
  matrix     - a two-dimensional array of elements of a single type
  data.frame - like a matrix, but different columns can have
               different data types
  list       - a "ragged" data frame

There are many more types than this, but these are the ones you will
encounter first and most frequently. SPLUS generally knows smart
things to do with each data type. For example, it knows how to add
and multiply vectors element-by-element:

> x _ c(1,2,3)   # Note the use of "_" as a synonym for "<-"
> y _ c(4,5,6)
> x+y
[1] 5 7 9
> x*y
[1]  4 10 18

and it knows dot products and matrix multiplication:

> X _ matrix(scan(),byrow=T,ncol=3)   # note the use of scan()
                                      # for raw input
1: 1 0 0
4: 0 2 0
7: 0 0 3
10:
> Y _ matrix(sample(1:10,9,replace=T),byrow=T,ncol=3)
> X
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> Y
     [,1] [,2] [,3]
[1,]   10   10    1
[2,]    8    8    6
[3,]   10    9    6
> X %*% Y
     [,1] [,2] [,3]
[1,]   10   10    1
[2,]   16   16   12
[3,]   30   27   18

The square bracket notation is how you index elements of vectors,
matrices and data frames: [2,] is the second row, [,3] is the third
column, [3,4] is the element in row 3, column 4, and so on.

END OF DIGRESSION

Back to the Canadian job prestige data... The variable names V2, V3,
... aren't very informative; we can do better. One way is with the
"names" function:

> names(prestige) _ c(
+ "education",
+ "income",
+ "pct.female",
+ "occ.prestige",
+ "occ.code",
+ "occ.type"
+ )
> prestige[1:10,]
                    education income pct.female occ.prestige occ.code occ.type
GOV_ADMINISTRATORS      13.11  12351      11.16         68.8     1113     prof
GENERAL_MANAGERS        12.26  25879       4.02         69.1     1130     prof
ACCOUNTANTS             12.77   9271      15.70         63.4     1171     prof
PURCHASING_OFFICERS     11.42   8865       9.11         56.8     1175     prof
CHEMISTS                14.62   8403      11.68         73.5     2111     prof
PHYSICISTS              15.64  11030       5.13         77.6     2113     prof
BIOLOGISTS              15.09   8258      25.65         72.6     2133     prof
ARCHITECTS              15.44  14163       2.69         78.1     2141     prof
CIVIL_ENGINEERS         14.52  11377       1.03         73.1     2143     prof
MINING_ENGINEERS        14.64  11023       0.94         68.8     2153     prof

Why is the "prestige" data frame listing names instead of numbers
for the rows and the columns? Because there are names to use. We
could assign names to the rows and columns of a matrix as well:

> dimnames(X) _ list(c("first row","second row","third row"),c("fred",
+ "barney","wilma"))
> X
           fred barney wilma
first row     1      0     0
second row    0      2     0
third row     0      0     3

What does the "dimnames" structure look like? For a matrix with no
dimnames it is NULL. For a matrix with dimnames, like X, it is a
list consisting of two character vectors: the first vector contains
the row names, the second contains the column names.

> dimnames(Y)
NULL
> dimnames(X)
[[1]]:
[1] "first row"  "second row" "third row"

[[2]]:
[1] "fred"   "barney" "wilma"
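Once the rows and columns have names, the names themselves can be
used as subscripts, and logical conditions work as subscripts too. A
few illustrative one-liners (a sketch; output not shown):

> X["second row","barney"]             # name-based indexing of a matrix
> prestige["CHEMISTS","income"]        # the same idea for a data frame
> prestige[prestige$income > 15000,]   # logical subscript: high-income rows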
SOME SIMPLE SUMMARIES
=====================

Well, that was exciting, and a bit more "under the hood" than I had
thought I would go at first. Let's do some summaries of the data.
First, we'll get summaries of all of the columns of the data frame.

> dim(prestige)
[1] 102   6
> for (i in 1:6) print(summary(prestige[,i]))
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 6.38   8.445  10.54 10.74   12.65 15.97
 Min. 1st Qu. Median Mean 3rd Qu.  Max.
  611    4106   5930 6798    8187 25880
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    0   3.592   13.6 28.98    52.2 97.51
 Min. 1st Qu. Median  Mean 3rd Qu. Max.
 14.8   35.22   43.6 46.83   59.28 87.2
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 1113    3120   5135 5402    8312 9517
 ? bc prof wc
 4 45   30 23

That wasn't very informative. Let's try again.

> prestige[1,]
                   education income pct.female occ.prestige occ.code occ.type
GOV_ADMINISTRATORS     13.11  12351      11.16         68.8     1113     prof
> summary(prestige$education)
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 6.38   8.445  10.54 10.74   12.65 15.97
> summary(prestige$income)
 Min. 1st Qu. Median Mean 3rd Qu.  Max.
  611    4106   5930 6798    8187 25880
> summary(prestige$pct.female)
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    0   3.592   13.6 28.98    52.2 97.51
> summary(prestige$occ.prestige)
 Min. 1st Qu. Median  Mean 3rd Qu. Max.
 14.8   35.22   43.6 46.83   59.28 87.2
> summary(prestige$occ.code)
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 1113    3120   5135 5402    8312 9517
> summary(prestige$occ.type)
 ? bc prof wc
 4 45   30 23

That was better. Note the use of the "$" notation as an alternative
way of specifying the columns, or variables, in a data frame. Also
note that the summary for occ.type is different from the others.
Why? What did we get instead?

Weirdly, standard deviation is not among the summaries that SPLUS
gives for numerical data. We have to do that by hand. For example,

> sqrt(var(prestige$income))
[1] 4245.922

If we wanted them all (or all that make sense) we could say

> sqrt(diag(var(prestige[,1:5])))
[1]    2.728444 4245.922227   31.724931   17.204486 2644.993215

What was that??? (The variance matrix of the five numeric columns,
with its diagonal of variances extracted and square-rooted into
standard deviations.)

A better set of summaries might be stem-and-leaf plots. Before
producing them, I will show you a trick that saves typing. If you
"attach" a data frame, like this:

> attach(prestige)

the columns become individual variables (vectors) that you can
access directly. When you are done, "detach()":

> detach()

So...

> attach(prestige)
> stem(education)

N = 102   Median = 10.54
Quartiles = 8.43, 12.71

Decimal point is at the colon

 6 : 4677789
 7 : 1345566667899
 8 : 12344455568889
 9 : 0224556899
10 : 0011135669
11 : 0001111234444566
12 : 133345788
13 : 1668
14 : 124455667
15 : 11224689
16 : 00

> stem(income)

N = 102   Median = 5930.5
Quartiles = 4075, 8206

Decimal point is 3 places to the right of the colon

 0 : 69
 1 : 79
 2 : 44689
 3 : 001125556667999
 4 : 01223345677788
 5 : 1111234566889
 6 : 012335566799
 7 : 01145679
 8 : 0000123344889999
 9 : 36
10 : 4
11 : 004
12 : 45
13 :
14 : 026

High: 17498 19263 25308 25879

> stem(pct.female)

N = 102   Median = 13.6
Quartiles = 3.59, 52.27

Decimal point is 1 place to the right of the colon

0 : zzzzz111111111111122223334444444555667788899
1 : 11112344566777
2 : 0144568
3 : 0134599
4 : 778
5 : 22567
6 : 3889
7 : 1256667
8 : 334
9 : 12366678

> stem(occ.code)

N = 102   Median = 5135
Quartiles = 3117, 8313

Decimal point is 3 places to the right of the colon

1 : 1122
2 : 1111122223333445777
3 : 11111122334
4 : 111112222222222
5 : 11111222
6 : 111111222
7 : 127
8 : 22223333335555666777888888
9 : 1122355

> stem(occ.type)
Error in Summary.factor(x): A factor is not a numeric object
Dumped

What went wrong with the last stem-and-leaf plot?

Note that there are several high-income jobs that fell off the
income stem-and-leaf plot. A cheap way to figure out what they are
is as follows:

> dimnames(prestige)[[1]][order(-income)][1:4]
[1] "GENERAL_MANAGERS"         "PHYSICIANS"
[3] "LAWYERS"                  "OSTEOPATHS_CHIROPRACTORS"

Maybe we will decode this bit of magic in class.

> detach()
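In the meantime, here is the same magic unpacked one step at a time
(a sketch; each line does one piece of the work):

> rows _ dimnames(prestige)[[1]]   # the character vector of occupation names
> o _ order(-prestige$income)      # sorting -income ascending = income descending
> rows[o][1:4]                     # reorder the names by income; keep the top four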
A stem-and-leaf plot is a special case of a histogram. The
histogram, though, is a high-resolution plot in SPLUS, so we also
have to set up a graphics window:

> motif()   # or win.graph() or graphsheet() on a PC

Occupational code is probably not interesting as a quantitative
variable, so we will concentrate on the other four: "education",
"income", "pct.female", "occ.prestige". To get all four histograms
on one graph, we split the graph into pieces:

> par(mfrow=c(2,2))
> for (i in c("education","income","pct.female","occ.prestige")) {
+ hist(prestige[,i])
+ title(i)
+ }

The resulting graph can be printed using a pulldown menu on the
graphics window. The graph can also be saved as a postscript (or
encapsulated postscript) document, which is convenient for printing
later or including in a LaTeX document. This graph is saved as
"prestige-hists.ps".

You can bypass the graphics window and "print directly to
postscript" as follows:

> postscript("mygraph.ps")
> par(mfrow=c(2,2))
> for (i in c("education","income","pct.female","occ.prestige")) {
+ hist(prestige[,i])
+ title(i)
+ }
> dev.off()

Now the graph "mygraph.ps" is complete, and can be printed to one of
our printers in the department, say, with the command
"lpr mygraph.ps".

The histogram estimates the density of a random variable by
summarizing what range of values occurred frequently or
infrequently. A smoother estimate of the density is the so-called
"kernel density estimate" (kde). The basic idea of a kde is to smear
each data point into a little probability density function (called
the "kernel" or the "window"), and then average these little density
functions to produce an overall, smooth density estimate. The wider
the individual "smears", the smoother the overall estimate (the
width of the kernel is usually measured by something like the SD,
which is called the "bandwidth" or the "window width" in this
setting). Inevitably there is a tradeoff to be made between having a
smooth estimate and obscuring the features of the data. There are
some automatic rules for balancing these two goals, but a generally
good approach is to make several different kde's of the same data
and let your eye be the judge.

There are two functions in SPLUS that are convenient for making
kde's:

  ksmooth()
  density()

You should explore both. I find density() a little easier to work
with. In the next graph, I produce two sets of density estimates for
the four quantitative variables in the Canadian job prestige data.
The "default" smoothing parameters for density() are used in the top
row of plots; somewhat smoother estimates (with bandwidths about
twice as wide as the default) are in the bottom row.

> par(mfrow=c(2,4))
> for (i in names(prestige)[1:4]) {
+ plot(density(prestige[,i]),type="l")
+ title(i)
+ }
> for (i in names(prestige)[1:4]) {
+ plot(density(prestige[,i],width=diff(range(prestige[,i]))/4),type="l")
+ title(i)
+ }

This graph is saved as "prestige-kdes.ps".
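If you would rather compare window widths on a single set of axes
instead of across panels, you can overlay the estimates with
lines(). A sketch (the widths 1000 and 5000 are illustrative
guesses for the income scale, not recommendations):

> plot(density(prestige$income), type="l")            # default window width
> lines(density(prestige$income, width=5000), lty=2)  # wider: smoother
> lines(density(prestige$income, width=1000), lty=3)  # narrower: rougher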
To finish off our initial summaries, let's look at some boxplots.
You may remember that the basic idea of a boxplot is something like
this:

                +---+-----+
 * * *  --------|   |     |-------------  *         o
                +---+-----+
OUTLIERS       Q1  MED   Q3   WHISKER   OUTLIER  EXTREME OUTLIER

The simplest way to make a boxplot by hand is to:

 * sort the data using a stem-and-leaf plot;
 * compute the first and third quartiles and the median, and plot
   the box;
 * compute the value IQR = Q3 - Q1 (this is the width of the box);
 * compute the
     - inner fences at Q1 - (1.5)*IQR and Q3 + (1.5)*IQR
     - outer fences at Q1 - (3.0)*IQR and Q3 + (3.0)*IQR
 * plot the whiskers out to the lowest and highest values inside the
   inner fences;
 * decorate the outliers according to whether they fall inside or
   outside each set of fences.

Tukey, who invented boxplots (and stem-and-leaf diagrams, and many
other useful EDA plots), has rather folksy names for the versions of
the quartiles he likes to calculate: he calls Q1 and Q3 the lower
and upper "hinges", and the IQR the "hinge spread". The fence
arithmetic is spelled out in the short sketch below.
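Here is that fence arithmetic carried out on the income variable (a
sketch; quantile() computes the quartiles, and the cutoffs follow
the rules above):

> q _ quantile(prestige$income, c(.25,.75))   # Q1 and Q3
> iqr _ q[2] - q[1]                           # the hinge spread
> inner _ c(q[1] - 1.5*iqr, q[2] + 1.5*iqr)   # inner fences
> outer _ c(q[1] - 3.0*iqr, q[2] + 3.0*iqr)   # outer fences
> prestige$income[prestige$income < inner[1] |
+                 prestige$income > inner[2]] # the values flagged as outliers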
We will make two sets of boxplots for the prestige data. The first
set is for each quantitative variable separately, to try to learn
more about the skewness of these variables.

> par(mfrow=c(2,4))
> for (i in names(prestige)[1:4]) {
+ boxplot(prestige[,i])
+ title(i)
+ }
> for (i in names(prestige)[1:4]) {
+ boxplot(prestige[,i],style.bxp="old")
+ title(i)
+ }

(I have made the boxplots in two styles. I find the second style
easier to look at.) This graph is saved as "prestige-boxes1.ps".

The second set of boxplots is aimed at looking at the relationship
between occupation type ("occ.type") and income.

> par(mfrow=c(1,1))
> boxplot(split(prestige$income,prestige$occ.type),style.bxp="old")

We can find out what the outliers were using a trick like before:

> attach(prestige)
> dimnames(prestige)[[1]][occ.type=="bc"][order(-income[occ.type=="bc"])][1:3]
[1] "PILOTS"       "FIREFIGHTERS" "POLICEMEN"
> dimnames(prestige)[[1]][occ.type=="prof"][order(-income[occ.type=="prof"])][1:3]
[1] "GENERAL_MANAGERS" "PHYSICIANS"       "LAWYERS"
> detach()

This graph is saved as "prestige-boxes2.ps".

TRANSFORMATIONS
===============

There are many reasons to transform data. Power transformations

  x^p    or    (x^p - 1)/p      (p=0 <==> log)

are often used:

 - to reduce skewness in X and/or Y
 - to reduce nonlinearity (in Y)
 - to correct nonconstant spread

The "prestige" data gives examples of each.

Skewness
--------

To assess skewness and other tail problems it is often useful to
think about qq plots, also known as normal quantile plots. The idea
is to plot the quantiles of the observed data against the
corresponding quantiles of a standard normal distribution.
Deviations from a straight line may indicate skewness and other
departures from normality. The help for qqplot says:

  "QQplots are used to assess whether data have a particular
  distribution, or whether two datasets have the same distribution.
  If the distributions are the same, then the plot will be
  approximately a straight line. The extreme points have more
  variability than points toward the center. A plot with a "U" shape
  means that one distribution is skewed relative to the other. An
  "S" shape implies that one distribution has longer tails than the
  other. In the default configuration a plot from qqnorm that is
  bent down on the left and bent up on the right means that the data
  have longer tails than the Gaussian."

It takes some practice to get good at interpreting QQ plots. It is a
good idea to include confidence bands in the plot, to help your eye
see the outliers, but this is not done automatically by most
software. Another approach is simply to look at lots of qq plots of
normal data of the same sample size, to get a feel for the amount of
"insignificant" variability in one's data.

Consider the income data again. Before looking at it, we will look
at some randomly generated normal data to get an idea how much
variability there is in the qqplots.

> par(mfrow=c(3,2))
> for (i in 1:6) {
+ zz _ qqnorm(rnorm(102))
+ abline(lmsreg(zz$x,zz$y))
+ }
> mtext("Some qqplots of Normal data", outer=T,cex=1.5)

This graph is saved as "qqplots-rnorms.ps".

> attach(prestige)
> par(mfrow=c(3,2))
> plot(density(income),type="l")
> points(income,rep(0,102))
> zz _ qqnorm(income)
> abline(lmsreg(zz$x,zz$y))
> plot(density(sqrt(income)),type="l")
> points(sqrt(income),rep(0,102))
> zz _ qqnorm(sqrt(income))
> abline(lmsreg(zz$x,zz$y))
> plot(density(log(income)),type="l")
> points(log(income),rep(0,102))
> zz _ qqnorm(log(income))
> abline(lmsreg(zz$x,zz$y))

I have added plots of the original data so you can see how the data
move as the transformation changes. After trying several other
power transforms, I like the log transform best. This graph is
saved as "income-xforms.ps".

Improving Linearity
-------------------

I'll talk about improving linearity with transformations after we've
discussed regression a little bit.

Nonconstant Spread
------------------

Again, let's see what this is like with the income data, split up
according to occupation type.

> par(mfrow=c(3,1))
> boxplot(split(income,occ.type),style.bxp="old")
> title("income")
> boxplot(split((income^.25-1)/.25,occ.type),style.bxp="old")
> title("(income^.25-1)/.25")
> boxplot(split(log(income),occ.type),style.bxp="old")
> title("log(income)")

The nonconstant spread problem can't be completely fixed with this
data, but I think I like the effects of the 4th-root transformation
better than the log (or the several other powers that I tried).
This graph is saved as "income-fixvar.ps".

The problem of nonconstant variance or nonconstant spread usually
arises when spread increases with mean or median. In nice cases one
can eyeball the power transformation one needs by plotting
x = log(medians) vs y = log(IQR's), for example.
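Here is what such a spread-versus-level plot might look like for
income within occupation type (a sketch, assuming your release has
sapply(); lapply() plus unlist() would do the same job):

> meds _ sapply(split(prestige$income, prestige$occ.type), median)
> iqrs _ sapply(split(prestige$income, prestige$occ.type),
+               function(x) diff(quantile(x, c(.25,.75))))
> plot(log(meds), log(iqrs))   # slope b suggests trying power p = 1 - b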
Transforming Proportions
------------------------

Proportions, which are bounded between 0 and 1, and other bounded
quantities (e.g. SAT scores are bounded between 0 [or really 200]
and 800) do not respond well to simple power transforms. Instead,
common transforms to try are

 - logit, probit and arcsine-sqrt
 - Tukey's folded-power transformations,
     p^q - (1-p)^q     (q=0 <==> logit)
 - when p=0 or 1, try p' = (f+1/2)/(N+1), or an expedient linear
   transform like p' = 0.005 + 0.99*p

Let's look at the raw data again:

> stem(pct.female)

N = 102   Median = 13.6
Quartiles = 3.59, 52.27

Decimal point is 1 place to the right of the colon

0 : zzzzz111111111111122223334444444555667788899
1 : 11112344566777
2 : 0144568
3 : 0134599
4 : 778
5 : 22567
6 : 3889
7 : 1256667
8 : 334
9 : 12366678

We see that we cannot apply most of the standard transformations
directly because there are 0's in the data. So first we "shrink" the
range of pct.female a little bit

> range(pct.female/100)
[1] 0.0000 0.9751
> newpct _ 0.001 + .998*pct.female/100
> range(newpct)
[1] 0.0010000 0.9741498

and now we are looking at

> stem(log(newpct/(1-newpct)))

N = 102   Median = -1.842737
Quartiles = -3.263968, 0.0906805

Decimal point is at the colon

-6 : 99999
-5 : 100
-4 : 99776554210
-3 : 9766543321111
-2 : 99987665543211100
-1 : 9887776664321110
-0 : 8877644111
 0 : 1122357889
 1 : 011222666
 2 : 346
 3 : 12236

> stem(qnorm(newpct))

N = 102   Median = -1.095139
Quartiles = -1.788742, 0.0568172

Decimal point is at the colon

-3 : 11111
-2 : 5555
-2 : 4443333211100
-1 : 99888877776666555
-1 : 44433222221110000
-0 : 99987777655
-0 : 44433110
 0 : 011223
 0 : 55556777779
 1 : 0034
 1 : 578889

> stem(asin(sqrt(newpct)))

N = 102   Median = 0.3787587
Quartiles = 0.1931046, 0.8080605

Decimal point is 1 place to the left of the colon

 0 : 33333888899
 1 : 0001123346677899
 2 : 001113334577899
 3 : 1134455788
 4 : 01133368
 5 : 1223689
 6 : 22388
 7 : 567
 8 : 11356
 9 : 2778
10 : 0246667
11 : 456
12 : 68
13 : 07778
14 : 1

SCATTERPLOT MATRIX, SPIN AND BRUSH
==================================

[ see class demo, I hope ]

TO LEARN MORE
=============

Try to work through some of the tutorials at

  http://www.stat.cmu.edu/~brian/707/S-Tutorial

and other links to SPLUS tips and tricks, especially those
maintained at Statlib:

  http://lib.stat.cmu.edu

and on Howard Seltman's homepage:

  http://www.stat.cmu.edu/~hseltman

or look at

  Krause and Olson, "The Basics of S and S-PLUS", Chapters 3, 4 and 5
  Venables and Ripley, "Modern Applied Statistics with S-PLUS",
    Chapters 1 and 2