Read the data in, get a general sense of the variables, and make a “pairs” plot (scatterplot matrix) of the numerical variables. Note that “Price” is the response variable.
nyc <- read.csv("nyc.csv")
str(nyc)
## 'data.frame': 168 obs. of 7 variables:
## $ Case : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Restaurant: chr "Daniella Ristorante" "Tello's Ristorante" "Biricchino" "Bottino" ...
## $ Price : int 43 32 34 41 54 52 34 34 39 44 ...
## $ Food : int 22 20 21 20 24 22 22 20 22 21 ...
## $ Decor : int 18 19 13 20 19 22 16 18 19 17 ...
## $ Service : int 20 19 18 17 21 21 21 21 22 19 ...
## $ East : int 0 0 0 0 0 0 0 1 1 1 ...
summary(nyc)
## Case Restaurant Price Food
## Min. : 1.00 Length:168 Min. :19.0 Min. :16.0
## 1st Qu.: 42.75 Class :character 1st Qu.:36.0 1st Qu.:19.0
## Median : 84.50 Mode :character Median :43.0 Median :20.5
## Mean : 84.50 Mean :42.7 Mean :20.6
## 3rd Qu.:126.25 3rd Qu.:50.0 3rd Qu.:22.0
## Max. :168.00 Max. :65.0 Max. :25.0
## Decor Service East
## Min. : 6.00 Min. :14.0 Min. :0.000
## 1st Qu.:16.00 1st Qu.:18.0 1st Qu.:0.000
## Median :18.00 Median :20.0 Median :1.000
## Mean :17.69 Mean :19.4 Mean :0.631
## 3rd Qu.:19.00 3rd Qu.:21.0 3rd Qu.:1.000
## Max. :25.00 Max. :24.0 Max. :1.000
pairs(nyc[,-c(1:2,7)])
We can get a more refined look at the variables with a scatterplot matrix that also includes histograms for each variable. If the histograms revealed especially long tails or wierd outliers, we might want to transform the data, recode or delete outliers, etc.
library(psych)
## Warning: package 'psych' was built under R version 4.0.2
## you would need to install the "psych" package
## one time before using this library() command...
pairs.panels(nyc[,-c(1:2,7)],
method = "pearson", ## correlation method
hist.col = "#00AFBB", ## a pretty color for histogram bars...
density = TRUE, ## show density plots
ellipses = TRUE ## show correlation ellipses
)
(There are lots of other packages, including ggplot, that can produce similar plots. This is really just a convenient illustration.)
The histograms don’t suggest any special processing (transformations, etc.) will be needed for the variables, so we can proceed. Note that all the variables seem fairly highly correlated with one another, which makes sense, but also can affect regression results, as we’ll learn later in the semester.
One of the main questions for this study is whether restaurants shoudl locate east or west of Fifth Avenue. A pair of boxplots give us a first look at this question:
with(nyc,boxplot(Price ~ East, xlab="East (1 = East of Fifth Avenue)"))
To find the two restaurants with modest Service ratings and maximal dinner Prices…
nyc[nyc$Price==max(nyc$Price),]
To find the restaurant with Service = 15….
nyc[nyc$Service==15,]
Here’s a very light regression analysis to see how the variables work with one another.
summary(lm.0 <- lm(Price ~ . , data=nyc[,-c(1,2)]))
##
## Call:
## lm(formula = Price ~ ., data = nyc[, -c(1, 2)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0465 -3.8837 0.0373 3.3942 17.7491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.023800 4.708359 -5.102 9.24e-07 ***
## Food 1.538120 0.368951 4.169 4.96e-05 ***
## Decor 1.910087 0.217005 8.802 1.87e-15 ***
## Service -0.002727 0.396232 -0.007 0.9945
## East 2.068050 0.946739 2.184 0.0304 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6187
## F-statistic: 68.76 on 4 and 163 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.0)
This model, which just has main effects for each of the quantitative predictor variables, suggests some interesting effects on menu prices, and the residual (casewise) diagnostic plots don’t show any dramatic misfit, outliers, influential observations, etc.
From the table of coefficients, it looks like Food and Decor matter a lot for Price, but Service does not. This may be because Service is highly correlated with Food (and with Decor for that matter…).
There is also an effect for being East of Fifth Avenue; this is different from the result we got using only boxplots, because boxplots compare the whole distribution (and so differences have to be true across the distribution of prices) whereas regression analysis basically just looks at means, adjusted for the other variables in the model. Generally when you concentrate inference on the means, you get more dramatic results (because, roughly speaking, \(SE_{mean} = SD_{population}/\sqrt{sample~size}\)).
If you are a policy maker (say, you have a lot of money and you are going to open several restaurants), you may care more about the fact that the mean price can be higher East of Fifth Avenue, since on average you can charge a bit more in your restaurants.
On the other hand if you are considering opening just one restaurant, the story of the boxplots may be more important: the price distributions for East vs West restaurants greatly overlap, there’s little reason to make a location East of Fifth Avenue a primary concern.
Just for fun, we’ll also try the model that has all main effects and two-way interactions, and we’ll compare the two models with likelihood ratio test.
summary(lm.1 <- lm(Price ~ .^2 , data=nyc[,-c(1,2)]))
##
## Call:
## lm(formula = Price ~ .^2, data = nyc[, -c(1, 2)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.7758 -3.5519 0.3466 3.3383 17.2584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.32976 42.34684 -0.929 0.35444
## Food 2.61252 2.34496 1.114 0.26694
## Decor 7.26725 2.43591 2.983 0.00331 **
## Service -4.68620 3.27542 -1.431 0.15450
## East 6.69634 10.55070 0.635 0.52656
## Food:Decor -0.35758 0.13716 -2.607 0.01001 *
## Food:Service 0.20733 0.15317 1.354 0.17782
## Food:East 1.87559 0.89562 2.094 0.03785 *
## Decor:Service 0.10665 0.09193 1.160 0.24777
## Decor:East -0.34309 0.46090 -0.744 0.45775
## Service:East -1.90937 0.87262 -2.188 0.03014 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.645 on 157 degrees of freedom
## Multiple R-squared: 0.6531, Adjusted R-squared: 0.631
## F-statistic: 29.55 on 10 and 157 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.1)
This model is intersting in that several interactions seem to have coefficients significantly different from zero, and some of the main effects no longer do. The residual plots do not look much better (or worse) than the plots for the main-effects-only model.
As a rule, unless you have a VERY VERY VERY VERY good reason for doing otherwise, when you want to keep an interaction in a model you should also keep the main effects. Thus, if we wanted to keep the Service:East interaction, we should also keep the main variables Service and East, even though neither main effect is significantly different from zero.
However, in this analysis we do not need to worry so much about that, since the likelihood ratio test does not strongly favor the model with interactions; it appears we can “get away” with just the main effects models.
anova(lm.0,lm.1,test="LRT")
So, we can just stick with the simpler model, lm.0.
low.premium <- data.frame(Case=1000,Restaurant="The Ritz!",Price=NA,
Food=25,Decor=25,Service=25,East=0)
predict(lm.0,low.premium,interval="prediction")
## fit lwr upr
## 1 62.11319 50.35648 73.8699
high.premium <- data.frame(Case=1000,Restaurant="The Ritz!",Price=NA,
Food=30,Decor=30,Service=30,East=1)
predict(lm.0,high.premium,interval="prediction")
## fit lwr upr
## 1 81.40864 69.07858 93.73869
round(summary(lm.0)$coefficients,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.024 4.708 -5.102 0.000
## Food 1.538 0.369 4.169 0.000
## Decor 1.910 0.217 8.802 0.000
## Service -0.003 0.396 -0.007 0.995
## East 2.068 0.947 2.184 0.030