1. Introduction

Data on businesses are often difficult to collect and can provide crucial insight into machine learning, data mining, and revenue management. The dataset that we are examining in this report is the combination of two real data sets on hotel demand. The two datasets contained the same variables and were collected from a resort hotel (labeled as H1 in our dataset) and a city hotel (labeled as H2 in our dataset). Overall, there are 32 variables and 119,390 observations, with each observation representing a single hotel booking. (Exact variables definitions can be found at: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11) In this report, our overarching goal is to discover the factors that influence hotel booking cancellations.

2. Research Questions

Through this report, we aim to answer three main questions:

  1. Is there a seasonal trend to booking cancellations?
  2. What variables are indicative of booking cancellation? Do some intuitive variables like is_repeated_guest and previous_cancellations actually have a larger effect on booking cancellations?
  3. How can we understand the difference between canceled and non-canceled bookings using fewer dimensions?

4. Logistic Regression on Booking Cancellation

Next, after looking specifically at how the time of the year can affect booking cancellations, we were curious about what other variables may also affect booking cancellations. To do this, we created a logistic regression model with is_canceled as the response variable. The independent variables included were selected based on general intuition of factors that may affect booking cancellation (including hotel, previous_booking_not_canceled, reserved_room_type, lead_time, etc.).

## 
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + is_repeated_guest + 
##     previous_cancellations + previous_bookings_not_canceled + 
##     reserved_room_type + booking_changes + days_in_waiting_list + 
##     adr + arrival_date_month, family = "binomial", data = hotels)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.8923  -0.6364   1.1088   6.6776  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.342e+00  3.047e-02 -44.028  < 2e-16 ***
## hotelResort Hotel              -3.411e-01  1.594e-02 -21.395  < 2e-16 ***
## lead_time                       5.803e-03  7.288e-05  79.628  < 2e-16 ***
## is_repeated_guest              -1.042e+00  7.995e-02 -13.033  < 2e-16 ***
## previous_cancellations          2.977e+00  5.373e-02  55.404  < 2e-16 ***
## previous_bookings_not_canceled -6.454e-01  2.662e-02 -24.241  < 2e-16 ***
## reserved_room_typeB            -2.006e-01  7.027e-02  -2.854 0.004317 ** 
## reserved_room_typeC             7.384e-02  7.706e-02   0.958 0.337933    
## reserved_room_typeD            -2.401e-01  1.897e-02 -12.660  < 2e-16 ***
## reserved_room_typeE            -2.507e-01  3.186e-02  -7.869 3.58e-15 ***
## reserved_room_typeF            -4.761e-01  4.628e-02 -10.288  < 2e-16 ***
## reserved_room_typeG            -1.307e-01  5.335e-02  -2.451 0.014258 *  
## reserved_room_typeH             1.122e-01  9.253e-02   1.213 0.225099    
## reserved_room_typeL             4.878e-01  8.795e-01   0.555 0.579166    
## reserved_room_typeP             1.215e+01  3.414e+01   0.356 0.722069    
## booking_changes                -6.972e-01  1.551e-02 -44.968  < 2e-16 ***
## days_in_waiting_list           -8.551e-04  3.622e-04  -2.361 0.018223 *  
## adr                             6.916e-03  1.923e-04  35.960  < 2e-16 ***
## arrival_date_monthAugust       -6.303e-01  2.966e-02 -21.252  < 2e-16 ***
## arrival_date_monthDecember     -1.408e-01  3.576e-02  -3.937 8.26e-05 ***
## arrival_date_monthFebruary      3.528e-02  3.379e-02   1.044 0.296497    
## arrival_date_monthJanuary       2.628e-03  3.786e-02   0.069 0.944668    
## arrival_date_monthJuly         -6.815e-01  2.983e-02 -22.850  < 2e-16 ***
## arrival_date_monthJune         -3.047e-01  2.958e-02 -10.301  < 2e-16 ***
## arrival_date_monthMarch        -1.092e-01  3.136e-02  -3.483 0.000496 ***
## arrival_date_monthMay          -2.440e-01  2.905e-02  -8.400  < 2e-16 ***
## arrival_date_monthNovember     -2.062e-01  3.588e-02  -5.748 9.01e-09 ***
## arrival_date_monthOctober      -4.031e-01  3.057e-02 -13.184  < 2e-16 ***
## arrival_date_monthSeptember    -6.298e-01  3.144e-02 -20.033  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157398  on 119389  degrees of freedom
## Residual deviance: 133752  on 119361  degrees of freedom
## AIC: 133810
## 
## Number of Fisher Scoring iterations: 9
## [1] 0.6830974
##     
## pred     0     1
##    0 74512 37181
##    1   654  7043
## [1] 0.9150318

We can see that the model overall does an excellent job in predicting hotel cancellations. We found that the accuracy of the model to be approximately 0.6831 and the precision of the model to be approximately 0.9150. Rather than using accuracy as performance metric, we thought precision would be a better performance metric in evaluating models as it is more costly for businesses to predict reservations that have been canceled. Thus, in terms of precision, we can say that our model performance is pretty optimal. Now, lets direct our attention to model summary.

Looking at the months variable, we can see that month April, which we previously found from mosaic plot to have positive association with canceled hotels and negative associated with not-canceled hotels, is our base-line variable against other month variables. This explains why most of the month variables are statistically significant with negative coefficient, as the month that is statistically significant in the “opposite manner” is serving as a base-line comparison. Thus, most of the month variables are decreasing the odds ratio of hotel cancellation.

Since there were too many unique values for categorical variables countries, hence lots of dummy variables, we could not include them in the model as it was computationally expensive. However, we added variables that intuitively would be tied to cancellations such as previous_cancellations, is_repeated_guest, lead_time, and so on.

One interesting variable we observed was room_type F. Among several specific room_types that were statistically significant, when compared to the baseline room_type A, room_type F reduced the odds ratio the most by about 47.6%, or exp(-0.4761) = 0.6212, of hotel cancellation.

4.1 Probability Graphs for Booking Cancellations

For this graph, we have estimated probability of reservation cancellation on the x-axis and the true value of ‘is_canceled’ on the y-axis. We approximated 0.75 as a cut-off since we start to see less of is_canceled = 0 after 0.75.

From this graph, we can observe that as the number of booking_changes increased, the probability of cancellation decreased. One possible explanation for this phenomenon is that people who really want to go to a certain hotel flexibly change their schedule without canceling their hotel.

In this graph, we can see that as the number of previous_cancellations increased, the probability of cancellation increased along with it, which intuitively makes sense as people who canceled more number of hotels in the past are more prone to make cancellations in the future.

5. Dimensionality Reduction using PCA

Lastly, since our dataset contains a large amount of variables, we will be using PCA to try to simplify the dimensions of our dataset while retaining possible patterns of interest.

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.373 1.2163 1.1870 1.11252 1.03046 0.99323 0.96284
## Proportion of Variance 0.145 0.1138 0.1084 0.09521 0.08168 0.07589 0.07131
## Cumulative Proportion  0.145 0.2588 0.3672 0.46238 0.54406 0.61995 0.69126
##                            PC8     PC9    PC10    PC11   PC12    PC13
## Standard deviation     0.94139 0.90850 0.81935 0.76950 0.7416 0.69901
## Proportion of Variance 0.06817 0.06349 0.05164 0.04555 0.0423 0.03759
## Cumulative Proportion  0.75943 0.82292 0.87456 0.92011 0.9624 1.00000

We run the PCA on is_canceled, lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr to get 13 principal components.

Scree Plot of Principal Components

Scree Plot of Principal Components

The Scree plot shows us that there is a steady decrease in the percentage of explained variances as the number of dimensions increases. Around a quarter of the variance is explained by the first two principal components, so we will proceed with our analysis using the first two principal components for simplicity’s sake.

From this visualization, we see that bookings that were canceled have lower values of PC2 and slightly higher values of PC1.

##                                        PC1          PC2
## is_canceled                     0.19220472 -0.467777922
## lead_time                       0.28158819 -0.504621788
## stays_in_weekend_nights         0.38367319  0.105688981
## stays_in_week_nights            0.41395617  0.048976299
## adults                          0.33361277  0.002163769
## children                        0.17958177  0.350064980
## babies                          0.03594308  0.154892667
## is_repeated_guest              -0.43123658  0.047487599
## previous_cancellations         -0.09496985 -0.275854321
## previous_bookings_not_canceled -0.36676224  0.015199708
## booking_changes                 0.03336186  0.265764042
## days_in_waiting_list            0.03109482 -0.313486499
## adr                             0.29796462  0.344882438

Each column in the rotation matrix represents the linear combination of variables that each principal component represents. According to the matrix, PC1 is highly positively correlated with lead_time, stays in weekend nights, stays in week nights, and number of adults. PC1 is negatively correlated with is_repeated_guest and previous_bookings_not_canceled. PC2 is positively correlated with number of children, adr, and booking_changes. PC2 is negatively correlated with lead_time, previous_cancellations, and days_in_waiting_list. Since bookings that are canceled have higher values of PC1, we can say these bookings have longer lead time, higher stays in week nights, more adults, less likely to be a repeated guest and are less likely to have previous bookings canceled. Since bookings that are canceled have lower values of PC2, we can say these bookings have less children, lower average daily rate, fewer booking changes, higher lead time, more previous cancellations, and more days on the waiting list.

6. Conclusion

6.1 Takeaways

Overall, we have found that many factors can influence hotel booking cancellations.

From our original EDA of seasonal trends in hotel booking cancellations, we originally believed that hotel booking cancellations were independent from the time of the year. However, when performing a chi-squared test for independence, we found that hotel booking cancellations and month/year were actually dependent.

When taking a step back to examine more variables that may affect hotel booking cancellations, we found that certain room types seemed to be more significant as explanatory variables than others. Another interesting phenomenon that we discovered was that as the number of booking changes increased, the probability of canceling a booking decreased.

From our PCA, we found from scree plot that we should retain 2 PCs. We also found that PC1 is negatively correlated with is_repeated_guest and previous_bookings_not_canceled while PC2 is positively correlated with number of children, adr, and booking_changes.

6.2 Limitations and Next Steps

While we have been able to uncover many interesting patterns within our data, there is still room for improvement and future research. For example, while we found that seasonal trends were dependent on booking cancellations, our linear regression model indicated that variables representing time were not statistically significant. This is an occurrence that we thought was very interesting and worth looking into.

Furthermore, due to time constraints, most of our analysis was only able to briefly touch the surface of possible relationships between the variables. For example, while we found that reserved_room_type may influence is_canceled, we were not able to test this or conduct deeper analysis. If we were to continue the project, we would like to spend more time looking at how specific variables other than time may influence is_canceled.

As an even more broad direction, we could also possibly branch out from looking at just is_canceled as the response variable. While the main focus of this report was to study hotel booking cancellations, other variables in the hotels dataset may also be worth looking into as a response variable. For example, examining how time of the year affects average daily rates (adr) could be useful for an average consumer.