Data on businesses are often difficult to collect and can provide crucial insight into machine learning, data mining, and revenue management. The dataset that we are examining in this report is the combination of two real data sets on hotel demand. The two datasets contained the same variables and were collected from a resort hotel (labeled as H1 in our dataset) and a city hotel (labeled as H2 in our dataset). Overall, there are 32 variables and 119,390 observations, with each observation representing a single hotel booking. (Exact variables definitions can be found at: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11) In this report, our overarching goal is to discover the factors that influence hotel booking cancellations.
Through this report, we aim to answer three main questions:
is_repeated_guest
and previous_cancellations
actually have a larger effect on booking cancellations?When first examining the hotels
dataset, one will notice that exact times of hotel booking cancelations are provided in the form of variables such as arrival_date_year
, arrival_date_month
, etc. We decided to look into these variables a bit further to discover if there are seasonal trends to booking cancellations. This information can be useful for consumers to determine appropriate times to book hotel rooms as well as for businesses to learn how to accommodate for cancellations during certain times of the year.
To do this, we will first create some simple exploratory graphs to understand the data better.
In this stacked bar graph, we have utilized the arrival year, canceled variable, and the total number of bookings. We see that over the years 2015-2017, there has been a increase in hotel bookings and the majority within these years have not been canceled. The year 2016 can be seen as having the most canceled reservations despite these numbers soon dropping in the year to come. From this plot, we can infer there is no relationship between the arrival_date_year
variable and is_canceled
variable.
In this graph, we are now using a faceted bar graph to compare the relationships between the arrival month variable and the is canceled variable. As seen in the graphs, it can be said that there continues to be more non-canceled reservations than those canceled each month. Both graphs resemble a unimodal graph with a peak being in August and January being the least for both graphs. With the clarity portrayed we too can see that there are no specific trends being emphasized between the two variables.
Now, we will test to see if the time of the year and room cancellation are independent from each other.
##
## Pearson's Chi-squared test
##
## data: table_by_month_cancellation
## X-squared = 588.69, df = 11, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: table_by_year_cancellation
## X-squared = 81.626, df = 2, p-value < 2.2e-16
The first test is a chi-squared test for arrival_date_month
and is_canceled
while the second test is a chi-squared test for arrival_date_year
and is_canceled
. In order to test the relationship between the variables is canceled and both arrival year and arrival month, we have relied on a chi-squared test. Given the output of both tests having p-values of less than 2.2e-16 and less than the significance level of 0.05, we can reject our null hypothesis that time of year and month with respect to cancellation of room booking are independent. This means we can assume that time of year and month and cancellation of room booking are dependent.
We can specifically look for which month and years hotel cancellations were statistically significant with absolute pearson residuals greater than 2 from the mosaic plots. For example, months July, August, and October are not statistically significant. Also, while months April, May, June, and September have positive association with canceled hotels and negative associated with not-canceled hotels, the otherwise is true for months between November to March. One can infer from the mosaic plot that we tend to have more cancellation during colder months (November ~ March), and we tend to have less cancellations during warmer months (May ~ Sep).
Next, after looking specifically at how the time of the year can affect booking cancellations, we were curious about what other variables may also affect booking cancellations. To do this, we created a logistic regression model with is_canceled
as the response variable. The independent variables included were selected based on general intuition of factors that may affect booking cancellation (including hotel
, previous_booking_not_canceled
, reserved_room_type
, lead_time
, etc.).
##
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + is_repeated_guest +
## previous_cancellations + previous_bookings_not_canceled +
## reserved_room_type + booking_changes + days_in_waiting_list +
## adr + arrival_date_month, family = "binomial", data = hotels)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.8923 -0.6364 1.1088 6.6776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.342e+00 3.047e-02 -44.028 < 2e-16 ***
## hotelResort Hotel -3.411e-01 1.594e-02 -21.395 < 2e-16 ***
## lead_time 5.803e-03 7.288e-05 79.628 < 2e-16 ***
## is_repeated_guest -1.042e+00 7.995e-02 -13.033 < 2e-16 ***
## previous_cancellations 2.977e+00 5.373e-02 55.404 < 2e-16 ***
## previous_bookings_not_canceled -6.454e-01 2.662e-02 -24.241 < 2e-16 ***
## reserved_room_typeB -2.006e-01 7.027e-02 -2.854 0.004317 **
## reserved_room_typeC 7.384e-02 7.706e-02 0.958 0.337933
## reserved_room_typeD -2.401e-01 1.897e-02 -12.660 < 2e-16 ***
## reserved_room_typeE -2.507e-01 3.186e-02 -7.869 3.58e-15 ***
## reserved_room_typeF -4.761e-01 4.628e-02 -10.288 < 2e-16 ***
## reserved_room_typeG -1.307e-01 5.335e-02 -2.451 0.014258 *
## reserved_room_typeH 1.122e-01 9.253e-02 1.213 0.225099
## reserved_room_typeL 4.878e-01 8.795e-01 0.555 0.579166
## reserved_room_typeP 1.215e+01 3.414e+01 0.356 0.722069
## booking_changes -6.972e-01 1.551e-02 -44.968 < 2e-16 ***
## days_in_waiting_list -8.551e-04 3.622e-04 -2.361 0.018223 *
## adr 6.916e-03 1.923e-04 35.960 < 2e-16 ***
## arrival_date_monthAugust -6.303e-01 2.966e-02 -21.252 < 2e-16 ***
## arrival_date_monthDecember -1.408e-01 3.576e-02 -3.937 8.26e-05 ***
## arrival_date_monthFebruary 3.528e-02 3.379e-02 1.044 0.296497
## arrival_date_monthJanuary 2.628e-03 3.786e-02 0.069 0.944668
## arrival_date_monthJuly -6.815e-01 2.983e-02 -22.850 < 2e-16 ***
## arrival_date_monthJune -3.047e-01 2.958e-02 -10.301 < 2e-16 ***
## arrival_date_monthMarch -1.092e-01 3.136e-02 -3.483 0.000496 ***
## arrival_date_monthMay -2.440e-01 2.905e-02 -8.400 < 2e-16 ***
## arrival_date_monthNovember -2.062e-01 3.588e-02 -5.748 9.01e-09 ***
## arrival_date_monthOctober -4.031e-01 3.057e-02 -13.184 < 2e-16 ***
## arrival_date_monthSeptember -6.298e-01 3.144e-02 -20.033 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157398 on 119389 degrees of freedom
## Residual deviance: 133752 on 119361 degrees of freedom
## AIC: 133810
##
## Number of Fisher Scoring iterations: 9
## [1] 0.6830974
##
## pred 0 1
## 0 74512 37181
## 1 654 7043
## [1] 0.9150318
We can see that the model overall does an excellent job in predicting hotel cancellations. We found that the accuracy of the model to be approximately 0.6831 and the precision of the model to be approximately 0.9150. Rather than using accuracy as performance metric, we thought precision would be a better performance metric in evaluating models as it is more costly for businesses to predict reservations that have been canceled. Thus, in terms of precision, we can say that our model performance is pretty optimal. Now, lets direct our attention to model summary.
Looking at the months variable, we can see that month April
, which we previously found from mosaic plot to have positive association with canceled hotels and negative associated with not-canceled hotels, is our base-line variable against other month variables. This explains why most of the month variables are statistically significant with negative coefficient, as the month that is statistically significant in the “opposite manner” is serving as a base-line comparison. Thus, most of the month variables are decreasing the odds ratio of hotel cancellation.
Since there were too many unique values for categorical variables countries, hence lots of dummy variables, we could not include them in the model as it was computationally expensive. However, we added variables that intuitively would be tied to cancellations such as previous_cancellations
, is_repeated_guest
, lead_time
, and so on.
One interesting variable we observed was room_type F. Among several specific room_types that were statistically significant, when compared to the baseline room_type A, room_type F reduced the odds ratio the most by about 47.6%, or exp(-0.4761) = 0.6212, of hotel cancellation.
For this graph, we have estimated probability of reservation cancellation on the x-axis and the true value of ‘is_canceled’ on the y-axis. We approximated 0.75 as a cut-off since we start to see less of is_canceled
= 0 after 0.75.
From this graph, we can observe that as the number of booking_changes
increased, the probability of cancellation decreased. One possible explanation for this phenomenon is that people who really want to go to a certain hotel flexibly change their schedule without canceling their hotel.
In this graph, we can see that as the number of previous_cancellations increased, the probability of cancellation increased along with it, which intuitively makes sense as people who canceled more number of hotels in the past are more prone to make cancellations in the future.
Lastly, since our dataset contains a large amount of variables, we will be using PCA to try to simplify the dimensions of our dataset while retaining possible patterns of interest.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.373 1.2163 1.1870 1.11252 1.03046 0.99323 0.96284
## Proportion of Variance 0.145 0.1138 0.1084 0.09521 0.08168 0.07589 0.07131
## Cumulative Proportion 0.145 0.2588 0.3672 0.46238 0.54406 0.61995 0.69126
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.94139 0.90850 0.81935 0.76950 0.7416 0.69901
## Proportion of Variance 0.06817 0.06349 0.05164 0.04555 0.0423 0.03759
## Cumulative Proportion 0.75943 0.82292 0.87456 0.92011 0.9624 1.00000
We run the PCA on is_canceled
, lead_time
, stays_in_weekend_nights
, stays_in_week_nights
, adults
, children
, babies
, is_repeated_guest
, previous_cancellations
, previous_bookings_not_canceled
, booking_changes
, days_in_waiting_list
, adr
to get 13 principal components.
The Scree plot shows us that there is a steady decrease in the percentage of explained variances as the number of dimensions increases. Around a quarter of the variance is explained by the first two principal components, so we will proceed with our analysis using the first two principal components for simplicity’s sake.
From this visualization, we see that bookings that were canceled have lower values of PC2 and slightly higher values of PC1.
## PC1 PC2
## is_canceled 0.19220472 -0.467777922
## lead_time 0.28158819 -0.504621788
## stays_in_weekend_nights 0.38367319 0.105688981
## stays_in_week_nights 0.41395617 0.048976299
## adults 0.33361277 0.002163769
## children 0.17958177 0.350064980
## babies 0.03594308 0.154892667
## is_repeated_guest -0.43123658 0.047487599
## previous_cancellations -0.09496985 -0.275854321
## previous_bookings_not_canceled -0.36676224 0.015199708
## booking_changes 0.03336186 0.265764042
## days_in_waiting_list 0.03109482 -0.313486499
## adr 0.29796462 0.344882438
Each column in the rotation matrix represents the linear combination of variables that each principal component represents. According to the matrix, PC1 is highly positively correlated with lead_time
, stays in weekend nights, stays in week nights, and number of adults. PC1 is negatively correlated with is_repeated_guest
and previous_bookings_not_canceled
. PC2 is positively correlated with number of children
, adr
, and booking_changes
. PC2 is negatively correlated with lead_time
, previous_cancellations
, and days_in_waiting_list
. Since bookings that are canceled have higher values of PC1, we can say these bookings have longer lead time, higher stays in week nights, more adults, less likely to be a repeated guest and are less likely to have previous bookings canceled. Since bookings that are canceled have lower values of PC2, we can say these bookings have less children, lower average daily rate, fewer booking changes, higher lead time, more previous cancellations, and more days on the waiting list.
Overall, we have found that many factors can influence hotel booking cancellations.
From our original EDA of seasonal trends in hotel booking cancellations, we originally believed that hotel booking cancellations were independent from the time of the year. However, when performing a chi-squared test for independence, we found that hotel booking cancellations and month/year were actually dependent.
When taking a step back to examine more variables that may affect hotel booking cancellations, we found that certain room types seemed to be more significant as explanatory variables than others. Another interesting phenomenon that we discovered was that as the number of booking changes increased, the probability of canceling a booking decreased.
From our PCA, we found from scree plot that we should retain 2 PCs. We also found that PC1 is negatively correlated with is_repeated_guest
and previous_bookings_not_canceled
while PC2 is positively correlated with number of children
, adr
, and booking_changes
.
While we have been able to uncover many interesting patterns within our data, there is still room for improvement and future research. For example, while we found that seasonal trends were dependent on booking cancellations, our linear regression model indicated that variables representing time were not statistically significant. This is an occurrence that we thought was very interesting and worth looking into.
Furthermore, due to time constraints, most of our analysis was only able to briefly touch the surface of possible relationships between the variables. For example, while we found that reserved_room_type
may influence is_canceled
, we were not able to test this or conduct deeper analysis. If we were to continue the project, we would like to spend more time looking at how specific variables other than time may influence is_canceled
.
As an even more broad direction, we could also possibly branch out from looking at just is_canceled
as the response variable. While the main focus of this report was to study hotel booking cancellations, other variables in the hotels
dataset may also be worth looking into as a response variable. For example, examining how time of the year affects average daily rates (adr
) could be useful for an average consumer.