Factors in Models

When a factor is used as a predictor in a model, S-PLUS by default codes the levels of the factor using Helmert contrasts and estimates one parameter for each contrast. Helmert contrasts are a real nuisance when it comes to interpreting the coefficients, so use treatment contrasts instead. Type:

> options(contrasts=c("contr.treatment","contr.poly"))

This tells S-PLUS to use treatment contrasts for regular (unordered) factors and to keep the default polynomial contrasts for ordered factors. Since you probably don't want to type this every time you run S-PLUS, put this line in your .First function. The .First function is executed every time S-PLUS begins.

> .First <- function()
+ {
+ options(contrasts=c("contr.treatment","contr.poly"))
+ }

When treatment contrasts of a factor are used in a model, the first level of that factor (alphabetically) is taken as the baseline. The coefficients of the other levels represent the difference between their effect and the baseline effect.
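
To see how the two coding schemes differ, it can help to print the contrast matrices themselves. The following is a minimal sketch written for R, which shares the S language with S-PLUS; the choice of three levels is arbitrary, and the printed layout in S-PLUS may differ slightly.

```r
# Contrast matrices for a factor with three levels.
# Rows are factor levels, columns are the coded variables used in the model.
helmert   <- contr.helmert(3)    # Helmert coding: every column sums to zero
treatment <- contr.treatment(3)  # treatment coding: first level is a row of zeros

print(helmert)
print(treatment)
```

In the treatment matrix the baseline level is coded as all zeros, which is why each remaining coefficient can be read directly as a difference from the baseline.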

Suppose we had a factor "Animal" with three levels: Birds, Cats, and Dogs, and we were using this factor in a model. We are interested in the differences between the three types of animals. S-PLUS would create two dummy vectors, "AnimalCats" and "AnimalDogs". For birds, the entry in both dummy vectors would be 0. For cats, AnimalCats would be 1 and AnimalDogs would be 0. Dogs would be coded by AnimalCats = 0 and AnimalDogs = 1. The two dummy vectors AnimalCats and AnimalDogs would then be used in the model. The resulting coefficient of AnimalCats would represent the difference between cats and birds, after accounting for other variables, and the coefficient of AnimalDogs would represent the difference between dogs and birds, after accounting for other variables.
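
The dummy coding described above can be inspected directly. This sketch is written for R and uses six made-up observations. The call to model.matrix with the contrasts.arg argument is the R form, and the S-PLUS equivalent may differ in detail.

```r
# Hypothetical data: six observations of an unordered factor.
Animal <- factor(c("Birds", "Cats", "Dogs", "Cats", "Birds", "Dogs"))

# Design matrix under treatment contrasts; Birds (first level) is the baseline.
X <- model.matrix(~ Animal, contrasts.arg = list(Animal = "contr.treatment"))
print(X)
```

Each row for a bird has zeros in both the AnimalCats and AnimalDogs columns, so the intercept carries the baseline effect.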

Let's try using treatment contrasts to add a term for location to the education model. There are two factors we could use: Region and Locale. We can't use both, because each level of Region is made up of several levels of Locale, so Region is completely determined by Locale. If we include Locale in the model, Region would provide no additional information (S-PLUS would give a "computed fit is singular" warning about it). Perhaps we should use Locale, since it is more specific.
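
The nesting problem can be reproduced on a small made-up example. This sketch is written for R with hypothetical factors locale and region (not the education data). R reports the redundant coefficient as NA rather than printing the S-PLUS warning.

```r
# Hypothetical toy data: four locales nested in two regions.
options(contrasts = c("contr.treatment", "contr.poly"))
locale <- factor(rep(c("A1", "A2", "B1", "B2"), each = 3))
region <- factor(rep(c("A", "B"), each = 6))
set.seed(1)
y <- rnorm(12)   # arbitrary response, only needed so lm() can run

# The design matrix has 5 columns but only rank 4:
# the regionB column equals localeB1 + localeB2.
X <- model.matrix(~ locale + region)
print(ncol(X))
print(qr(X)$rank)

# R drops the redundant column and reports its coefficient as NA.
fit <- lm(y ~ locale + region)
print(coef(fit))
```

Dropping region from the formula leaves a full-rank design, which is the reason for fitting Locale alone here.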

> summary(lm(SE70 ~ PI68 + Y69 + Locale))

Call: lm(formula = SE70 ~ PI68 + Y69 + Locale)
Residuals:
    Min     1Q Median    3Q   Max
 -50.97 -14.97 -3.194 15.71 54.13

Coefficients:
                   Value Std. Error   t value  Pr(>|t|)
   (Intercept) -234.4908   79.5045    -2.9494    0.0053
          PI68    0.0449    0.0092     4.8930    0.0000
           Y69    0.7608    0.1944     3.9142    0.0003
  LocaleESTCNT  -17.8049   20.7650    -0.8575    0.3963
  LocaleMIDATL   37.2958   20.2641     1.8405    0.0731
LocaleMOUNTAIN   17.1916   16.4521     1.0450    0.3023
  LocaleNEWENG    7.2652   16.4033     0.4429    0.6602
 LocalePACIFIC   47.4117   16.9029     2.8049    0.0077
  LocaleSTHATL   11.9626   15.2510     0.7844    0.4374
  LocaleWNRCNT   21.7977   16.0552     1.3577    0.1822
  LocaleWSTCNT  -15.0130   19.4544    -0.7717    0.4448

Residual standard error: 26.44 on 40 degrees of freedom
Multiple R-Squared: 0.7408
F-statistic: 11.43 on 10 and 40 degrees of freedom, the p-value is 6.413e-09

(S-PLUS also prints a large matrix of correlations between the coefficient estimates, which has been omitted to save space.)

Locale has nine levels, the first of which, ENRCNT (East North-Central), is taken as the baseline. The coefficients of the others represent their difference from the East North-Central, after accounting for income (PI68) and school-aged population (Y69). Judging by the p-values, only the Pacific locale differs significantly from the baseline at the 5% level (p = 0.0077); the Mid-Atlantic comes closest among the rest (p = 0.0731). Specifically, for the same income and school-aged population, mean Pacific spending on education is estimated to be 47.4117 higher than in the East North-Central.
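
As a check on how to read one row of the coefficient table, the t value and p-value for LocalePACIFIC can be recomputed from the estimate and its standard error. This sketch is written for R and copies the numbers from the output above. The 95% confidence interval is not part of the original output and is added only for illustration.

```r
# Values copied from the LocalePACIFIC row and the residual degrees of freedom.
estimate <- 47.4117
std.err  <- 16.9029
df.resid <- 40

t.value <- estimate / std.err                # ratio of estimate to standard error
p.value <- 2 * pt(-abs(t.value), df.resid)   # two-sided p-value from the t distribution

# Illustrative 95% confidence interval for the Pacific vs. baseline difference.
ci95 <- estimate + c(-1, 1) * qt(0.975, df.resid) * std.err

print(round(c(t = t.value, p = p.value), 4))
print(ci95)
```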

Brian Junker 2002-08-26