When a factor is used as a predictor in a model, S-PLUS calculates parameter estimates for each level of the factor by using Helmert contrasts. Helmert contrasts are a real nuisance when it comes to interpreting the coefficients, so use treatment contrasts instead. Type:
> options(contrasts=c("contr.treatment","contr.poly"))
This means use treatment contrasts for regular (unordered) factors, and the default polynomial contrasts for ordered factors. Since you probably don't want to type this every time you run S-PLUS, put this line in your .First function. The .First function is executed every time S-PLUS begins.
> .First <- function()
+ {
+ options(contrasts=c("contr.treatment","contr.poly"))
+ }
When treatment contrasts of a factor are used in a model, the first level of that factor (alphabetically) is taken as the baseline. The coefficients of the other levels represent the difference between their effect and the baseline effect.
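To see why treatment contrasts are easier to read than Helmert contrasts, here is a small sketch in plain Python (not S-PLUS, purely as an illustration). It converts three group means into the coefficients each coding would give in a model with one factor and no other predictors. The group means are made-up numbers, the function names are hypothetical, and the Helmert columns are assumed to be the usual (-1, 1, 0) and (-1, -1, 2).

```python
def treatment_coefs(means):
    """Intercept is the baseline mean; every other coefficient is
    that level's mean minus the baseline mean."""
    baseline = means[0]
    return [baseline] + [m - baseline for m in means[1:]]


def helmert_coefs(means):
    """Intercept is the average of the level means; coefficient j compares
    level j+1 with the average of the levels before it, scaled by 1/(j+1)."""
    k = len(means)
    coefs = [sum(means) / k]
    for j in range(1, k):
        previous_average = sum(means[:j]) / j
        coefs.append((means[j] - previous_average) / (j + 1))
    return coefs


# Hypothetical group means for a three-level factor (illustration only).
group_means = [10.0, 14.0, 21.0]

treatment = treatment_coefs(group_means)
helmert = helmert_coefs(group_means)

print("treatment:", treatment)  # [10.0, 4.0, 11.0]
print("helmert:  ", helmert)    # [15.0, 2.0, 3.0]
```

Under treatment coding the second and third numbers are simply the gaps to the baseline (4 and 11). Under Helmert coding the same means give 2 and 3, which have to be rescaled and combined before they say anything about individual levels.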
Suppose we had a factor Animal with three levels: Birds, Cats, and Dogs, and we were using it in a model. We are interested in the differences among the three types of animals. S-PLUS would create two dummy vectors, AnimalCats and AnimalDogs. For birds, the entry in both dummy vectors would be 0. For cats, AnimalCats would be 1 and AnimalDogs would be 0. Dogs would be coded by AnimalCats = 0 and AnimalDogs = 1. The two dummy vectors AnimalCats and AnimalDogs would then be used in the model. The resulting coefficient of AnimalCats would represent the difference between cats and birds (the baseline), after accounting for other variables, and the coefficient of AnimalDogs would represent the difference between dogs and birds, after accounting for other variables.
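The dummy coding just described can be sketched in a few lines of plain Python (again an illustration, not what S-PLUS runs internally). The helper name treatment_dummies and the five sample observations are hypothetical.

```python
def treatment_dummies(values, prefix):
    """Return the baseline level and one 0/1 column per non-baseline level.

    The baseline is the first level in alphabetical order, which gets
    no column of its own (all of its dummy entries are 0).
    """
    levels = sorted(set(values))
    baseline, others = levels[0], levels[1:]
    columns = {
        prefix + level: [1 if v == level else 0 for v in values]
        for level in others
    }
    return baseline, columns


# Hypothetical observations of the Animal factor.
animal = ["Dogs", "Birds", "Cats", "Birds", "Dogs"]

baseline, dummies = treatment_dummies(animal, "Animal")
print("baseline:", baseline)  # Birds
for name, column in dummies.items():
    print(name, column)
# AnimalCats [0, 0, 1, 0, 0]
# AnimalDogs [1, 0, 0, 0, 1]
```

Every Birds observation has 0 in both columns, so the intercept absorbs the bird effect and the two coefficients measure departures from it.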
Let's try using treatment contrasts in fitting terms for location to the education model. There are two factors we could use: Region and Locale. We can't use both, because each unit of Region consists of several units of Locale. If we include Locale in the model, Region would provide no useful information (S-PLUS would give a "computed fit is singular" warning about it). Perhaps we should use Locale, since it is more specific.
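The reason Region adds nothing once Locale is in the model is that each Region indicator is an exact sum of Locale indicators. A short Python sketch of that dependence follows. The grouping of locales into regions "NE" and "W" is a hypothetical stand-in, since the actual levels of Region are not listed here, and the six observations are invented for illustration.

```python
# Hypothetical nesting of a few locales inside two regions.
locale_to_region = {
    "NEWENG": "NE",
    "MIDATL": "NE",
    "PACIFIC": "W",
    "MOUNTAIN": "W",
}

# Invented observations; Region is fully determined by Locale.
locales = ["NEWENG", "MIDATL", "PACIFIC", "MOUNTAIN", "PACIFIC", "NEWENG"]
regions = [locale_to_region[loc] for loc in locales]


def indicator(values, level):
    """0/1 column marking the observations that belong to `level`."""
    return [1 if v == level else 0 for v in values]


region_w = indicator(regions, "W")
pacific = indicator(locales, "PACIFIC")
mountain = indicator(locales, "MOUNTAIN")
sum_of_locales = [p + m for p, m in zip(pacific, mountain)]

# The Region column is reproduced exactly by its Locale columns,
# so it carries no extra information for the fit.
print(region_w == sum_of_locales)  # True
```

Because of this exact linear dependence the design matrix loses rank, which is the kind of situation the singular-fit warning signals.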
> summary(lm(SE70 ~ PI68 + Y69 + Locale))

Call: lm(formula = SE70 ~ PI68 + Y69 + Locale)
Residuals:
    Min     1Q Median    3Q   Max
 -50.97 -14.97 -3.194 15.71 54.13

Coefficients:
                    Value Std. Error t value Pr(>|t|)
(Intercept)     -234.4908    79.5045 -2.9494   0.0053
PI68               0.0449     0.0092  4.8930   0.0000
Y69                0.7608     0.1944  3.9142   0.0003
LocaleESTCNT     -17.8049    20.7650 -0.8575   0.3963
LocaleMIDATL      37.2958    20.2641  1.8405   0.0731
LocaleMOUNTAIN    17.1916    16.4521  1.0450   0.3023
LocaleNEWENG       7.2652    16.4033  0.4429   0.6602
LocalePACIFIC     47.4117    16.9029  2.8049   0.0077
LocaleSTHATL      11.9626    15.2510  0.7844   0.4374
LocaleWNRCNT      21.7977    16.0552  1.3577   0.1822
LocaleWSTCNT     -15.0130    19.4544 -0.7717   0.4448

Residual standard error: 26.44 on 40 degrees of freedom
Multiple R-Squared: 0.7408
F-statistic: 11.43 on 10 and 40 degrees of freedom, the p-value is 6.413e-09

(S-PLUS also produces a huge matrix of correlations, which has been omitted to save space.)
Locale has nine levels, the first of which, ENRCNT (East North-Central), is taken as the baseline. The coefficients of the others represent their difference from the East North-Central. Judging by the p-values, only the Pacific locale differs significantly (at the 5% level) from the baseline once income and school-aged population per capita are taken into account. Specifically, Pacific spending on education is estimated to be 47.4117 higher than East North-Central spending for the same values of the other predictors.
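The same reading of the table can be done mechanically. The Python sketch below copies the Locale rows from the summary output above, recomputes each t value as Value divided by Std. Error, and lists the levels whose p-value falls below a cutoff. The 0.05 cutoff is an assumption (a conventional choice, not something S-PLUS reports), and the variable names are hypothetical.

```python
# (term, Value, Std. Error, Pr(>|t|)) copied from the summary output.
locale_terms = [
    ("LocaleESTCNT", -17.8049, 20.7650, 0.3963),
    ("LocaleMIDATL", 37.2958, 20.2641, 0.0731),
    ("LocaleMOUNTAIN", 17.1916, 16.4521, 0.3023),
    ("LocaleNEWENG", 7.2652, 16.4033, 0.6602),
    ("LocalePACIFIC", 47.4117, 16.9029, 0.0077),
    ("LocaleSTHATL", 11.9626, 15.2510, 0.4374),
    ("LocaleWNRCNT", 21.7977, 16.0552, 0.1822),
    ("LocaleWSTCNT", -15.0130, 19.4544, 0.4448),
]

ALPHA = 0.05  # assumed significance level

# Each t value is the estimate divided by its standard error.
t_values = {name: value / se for name, value, se, _ in locale_terms}

# Levels that differ from the ENRCNT baseline at the assumed level.
significant = [name for name, _, _, p in locale_terms if p < ALPHA]

for name, t in t_values.items():
    print(name, round(t, 2))
print("significant:", significant)  # ['LocalePACIFIC']
```

The recomputed ratios agree with the printed t values up to the rounding of the inputs, and Pacific is the only locale that clears the assumed cutoff.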