Final Project Team 25

The data set contains booking information for a city hotel and a resort hotel. This information includes when the booking was made, the length of stay, the number of guests, the number of required parking spaces, the arrival date, the average daily rate, the cancellation status of the booking, and so on. There are 32 variables and 119391 rows: 18 categorical variables (4 of which relate to dates), 13 quantitative variables, and one date variable, reservation_status_date. We focus mainly on time-series analysis, average daily rate analysis, and whether the variables allow a good prediction of whether a reservation will be cancelled.

Categorical: hotel, is_canceled, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, meal, country, market_segment, distribution_channel, is_repeated_guest, reserved_room_type, assigned_room_type, deposit_type, agent, company, customer_type, reservation_status

Quantitative: lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr, required_car_parking_spaces, total_of_special_requests

Date: reservation_status_date

We first wanted to learn how time of year and season affect hotel reservation retention and cancellation probabilities, so we grouped reservations by month and examined the overall cancellation probability in each group.

The time series graphs above suggest a clear dip in reservation retention in November and a corresponding rise the following March, with the same reduction in reservation frequency appearing in both years of data collection. We also notice that the time series for non-cancelled reservation frequency has a similar shape to the time series for cancelled reservation frequency. In support of this second observation, the first dendrogram, which is clustered by cancellation probability, is not very useful because all the months are quite similar. Between the minimum probability of cancellation in April and the maximum probability in March, we only see about a 10% difference in probabilities. The second dendrogram, which is clustered by the total number of reservations, better supports conclusions about the shape of the time series graphs. It captures both the first dip in reservation frequency from late fall through the winter months and the following rise from March through early to mid fall. The late-spring through summer cluster corresponds to a less obvious feature of the time series, but it is still captured and has the highest reservation count by far.
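The two monthly quantities compared above, total reservation count and cancellation probability, could be computed along the following lines. This is a minimal Python sketch for illustration only, not the code used for the report; the toy records and the function name `cancellation_by_month` are assumptions, while the field names mirror the data set's arrival_date_month and is_canceled columns.

```python
from collections import defaultdict

# Toy records with hypothetical values (NOT taken from the hotel data set).
# Field names mirror the columns arrival_date_month and is_canceled.
bookings = [
    {"arrival_date_month": "July", "is_canceled": 1},
    {"arrival_date_month": "July", "is_canceled": 0},
    {"arrival_date_month": "July", "is_canceled": 0},
    {"arrival_date_month": "July", "is_canceled": 1},
    {"arrival_date_month": "November", "is_canceled": 0},
    {"arrival_date_month": "November", "is_canceled": 1},
]


def cancellation_by_month(rows):
    """Return {month: (total reservations, cancellation probability)}."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for row in rows:
        month = row["arrival_date_month"]
        totals[month] += 1
        cancelled[month] += row["is_canceled"]
    return {m: (totals[m], cancelled[m] / totals[m]) for m in totals}


summary = cancellation_by_month(bookings)
for month, (count, prob) in summary.items():
    print(f"{month}: {count} reservations, P(cancel) = {prob:.2f}")
```

In this toy input the two months differ in reservation count but not in cancellation probability, which is the kind of situation where clustering months by count is more informative than clustering them by probability.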

Next, we wanted to learn how hotel type affects the Average Daily Rate over the course of the year. One point in the data set has an Average Daily Rate of 5400, which has a large impact on our visualization, so we filtered it out as an outlier.

This box plot suggests that the Average Daily Rate does differ between the Resort Hotel and the City Hotel. The Resort Hotel has a lower average and a lower bound, with a similar upper bound. We then wanted to see whether this difference in Average Daily Rate between the Resort Hotel and the City Hotel is consistent over the years.
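As a rough illustration of the outlier filter and the per-hotel comparison described above, the following Python sketch drops the extreme point and computes the quartiles that a box plot summarizes. It is not the report's own code: the cutoff `ADR_CUTOFF`, the helper `box_stats`, and all values except the 5400 outlier are assumptions, and assigning the outlier to a particular hotel is arbitrary here.

```python
from statistics import mean, quantiles

# Toy (hotel, adr) pairs. All values are hypothetical except the 5400
# outlier mentioned in the text; which hotel it belongs to is arbitrary here.
rows = [
    ("City Hotel", 95.0), ("City Hotel", 110.0), ("City Hotel", 120.0),
    ("City Hotel", 130.0), ("City Hotel", 5400.0),
    ("Resort Hotel", 50.0), ("Resort Hotel", 75.0), ("Resort Hotel", 90.0),
    ("Resort Hotel", 160.0),
]

# Assumed threshold: any value that isolates the single 5400 point would do.
ADR_CUTOFF = 5000.0

kept = [(hotel, adr) for hotel, adr in rows if adr < ADR_CUTOFF]


def box_stats(values):
    """Quartiles and mean of a list of rates."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "mean": mean(values)}


stats = {}
for hotel in sorted({h for h, _ in kept}):
    stats[hotel] = box_stats([adr for h, adr in kept if h == hotel])
    print(hotel, stats[hotel])
```

Filtering before summarizing matters because a single extreme value inflates the mean and stretches the plot axis, even though it barely moves the quartiles.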

There is a clear trend in average daily rate by arrival week. During the summer months, the average rate is much higher than in the winter months. The one exception is right before the end of the year, when many people travel for the holidays, which causes a spike in hotel prices in the last week. People who book with a longer lead time tend to book for the summer and tend to get lower prices than those who book at the last minute. This suggests that people plan their summer vacations further in advance and get lower prices because of it. In the first graph it is hard to see the breakdown of resort versus city hotels, so the other graphs split them up. Interestingly, resort hotels account for almost all of the seasonal variation in prices, while city hotel prices stay relatively stable all year. A plausible explanation is that city hotels are used for business travel, which has stable demand all year, while resorts are mostly used for vacations, for which demand is very high during the summer months and holidays.
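The comparison of seasonal variation between the two hotel types can be sketched as a group-by on hotel and arrival week, followed by the gap between the most and least expensive week. The Python below is a hedged illustration with hypothetical values, not the report's code; `weekly_mean_adr` and `seasonal_range` are names introduced here.

```python
from collections import defaultdict
from statistics import mean

# Toy (hotel, arrival_date_week_number, adr) records with hypothetical values.
rows = [
    ("City Hotel", 2, 100.0), ("City Hotel", 2, 104.0),
    ("City Hotel", 30, 108.0), ("City Hotel", 30, 112.0),
    ("Resort Hotel", 2, 50.0), ("Resort Hotel", 2, 54.0),
    ("Resort Hotel", 30, 170.0), ("Resort Hotel", 30, 190.0),
]


def weekly_mean_adr(records):
    """Return {hotel: {week: mean adr}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for hotel, week, adr in records:
        grouped[hotel][week].append(adr)
    return {h: {w: mean(v) for w, v in weeks.items()}
            for h, weeks in grouped.items()}


def seasonal_range(weekly):
    """Gap between the most and least expensive week for each hotel."""
    return {h: max(w.values()) - min(w.values()) for h, w in weekly.items()}


weekly = weekly_mean_adr(rows)
ranges = seasonal_range(weekly)
print(weekly)
print(ranges)
```

A much larger range for one hotel type than the other is what "accounts for almost all of the seasonal variation" would look like in numbers.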

Finally, we wanted to learn about predicting the probability of cancellation from the quantitative variables, so we ran a principal component analysis (PCA) and visualized a dendrogram.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3264 1.2286 1.11641 1.05526 1.03305 1.01037 0.97432
## Proportion of Variance 0.1353 0.1161 0.09588 0.08566 0.08209 0.07853 0.07302
## Cumulative Proportion  0.1353 0.2515 0.34734 0.43300 0.51509 0.59361 0.66664
##                            PC8    PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.94302 0.9298 0.87010 0.85601 0.77479 0.69982
## Proportion of Variance 0.06841 0.0665 0.05824 0.05637 0.04618 0.03767
## Cumulative Proportion  0.73504 0.8015 0.85979 0.91615 0.96233 1.00000

The above graph suggests that PC1 accounts for a low proportion of the variance, about 13.5%, and the slope of the elbow plot flattens at PC4, so we chose k = 4. Together, the first 4 components account for about 43% of the variance. There is no clear grouping in the PCA graph, which suggests that the quantitative variables do not clearly predict whether a booking is cancelled.
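The proportion-of-variance rows in the summary table follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The short Python sketch below redoes that arithmetic with the values copied from the table. It is only a check on the reported numbers, not the original analysis code, and the variable names are ours.

```python
from itertools import accumulate

# Standard deviations of PC1..PC13, copied from the summary table above.
sdev = [1.3264, 1.2286, 1.11641, 1.05526, 1.03305, 1.01037, 0.97432,
        0.94302, 0.9298, 0.87010, 0.85601, 0.77479, 0.69982]

variances = [s ** 2 for s in sdev]          # variance of each component
total = sum(variances)                      # total variance across components
proportion = [v / total for v in variances]
cumulative = list(accumulate(proportion))   # running sum of the proportions

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i:<2} proportion={p:.4f} cumulative={c:.4f}")

k = 4  # number of components kept after reading the elbow plot
print(f"First {k} components explain {cumulative[k - 1]:.1%} of the variance")
```

The total variance comes out very close to 13, the number of quantitative variables, which is what one would expect if the variables were standardized before the PCA.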

We then wanted to learn how a specific country of origin would affect our conclusion from the PCA graph. Based on the PCA elbow analysis, we used the first 4 principal components to build a dendrogram on the USA portion of the data set.

The above graph suggests that customers originating from the USA cannot be split correctly into two groups using a complete-linkage dendrogram. Even with complete linkage, the graph looks more like a single-linkage dendrogram and does not show a correct grouping with the current information.
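To make the difference between the two linkage rules concrete, here is a small, naive agglomerative clustering sketch in Python. It is an illustration under stated assumptions rather than the method used for the report: the function `agglomerate` and the toy 2-D scores are hypothetical, and a real analysis would normally use a library routine. Complete linkage measures the distance between two clusters by their farthest pair of points, single linkage by their closest pair.

```python
from math import dist


def agglomerate(points, k, linkage="complete"):
    """Naive agglomerative clustering; returns k clusters as lists of indices."""
    # Complete linkage uses the farthest pair, single linkage the closest pair.
    combine = max if linkage == "complete" else min
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Toy 2-D scores (hypothetical): one dense cloud plus a single distant point.
scores = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.0), (1.5, 0.2), (2.0, 0.0), (9.0, 9.0)]

for method in ("single", "complete"):
    sizes = sorted(len(c) for c in agglomerate(scores, k=2, linkage=method))
    print(method, sizes)
```

When the data contain no real two-group structure, cutting the tree at two clusters tends to give one large group and one very small one under either rule, so the resulting split is not a meaningful grouping.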

In conclusion, there is a clear trend in reservation arrival dates throughout the year. It is easier to infer when a reservation was made from its month, average daily rate, and hotel type than to determine whether a reservation will be cancelled. We do not see much of a relationship between the variables we tested and the probability that a reservation is cancelled, but we did see a relationship between the Average Daily Rate and both hotel type and arrival time of year.

Xiao Shen, Andrew Butler, Graham Eversden, Keaton Tam

5/5/2021