Introduction

With e-commerce growing exponentially, analyzing detailed browsing and purchasing data helps in optimizing website design, personalizing marketing efforts, and ultimately enhancing the consumer’s shopping experience. Understanding the factors that influence online shopping behaviors can lead to improved sales strategies and customer engagement. Our goal is to explore how different variables in the dataset relate to purchasing intention and identify patterns that could help businesses optimize their online shopping experience.

Data Description

The Online Shoppers Purchasing Intention Dataset from the UCI Machine Learning Repository contains data belonging to 12,330 sessions from an online retailer over a one-year period, where each row corresponds to a different user. It features 18 attributes (10 numerical, 8 categorical) related to visitor behavior on an e-commerce website, as follows:

Numeric Variables:

  • Administrative: Number of administrative pages visited
  • Administrative_Duration: Total time (in seconds) spent on administrative pages
  • Informational: Number of informational pages (e.g., about page, address page) visited
  • Informational_Duration: Total time (in seconds) spent on informational pages
  • ProductRelated: Number of product-related pages visited
  • ProductRelated_Duration: Total time (in seconds) spent on product-related pages
  • BounceRates: Percentage of visitors who enter the site from that page and then leave without triggering any other requests
  • ExitRates: The percentage of pageviews that were the final ones in the session
  • PageValues: Average value of the page visited before completing a transaction
  • SpecialDay: Proximity of site visiting time to special days (e.g., Mother’s Day, Valentine’s Day)

Categorical Variables:

  • Month: Month of the visit
  • OperatingSystems: Operating system used
  • Browser: Browser used
  • Region: Geographic region of the visitor
  • TrafficType: Traffic source type
  • VisitorType: Visitor type (e.g., returning, new)
  • Weekend: Whether the visit occurred on a weekend
  • Revenue: Whether a purchase was made

With this information, we intend to answer the following research questions:

1. How does the type of page visited (administrative, informational, product-related) influence the purchasing intention of customers across different months?

2. How does the site performance (bounce rates and exit rates) correlate with purchasing decisions?

3. Is there a correlation between visitor type (new, returning, other) and the final purchase decision?

Research Question 1

We first focus on three variables: the type of page visited (Administrative, Informational, or ProductRelated), the month of the visit, and the purchasing intention (whether a purchase was made or not).

We created a faceted grouped bar plot for EDA, which shows the average number of visits to Administrative, Informational, and ProductRelated pages for each month, split by whether a purchase was made. This visualization helps us see whether there is a seasonal pattern in how different types of page visits relate to purchasing intention. For all page types and for most months, sessions that ended in a purchase have a higher average number of visits. This suggests that page visits may be a good indicator of purchasing intention, with more visits associated with a higher likelihood of purchase. The gap is largest in the late autumn, winter, and early spring months (Nov, Dec, Feb, Mar), where sessions that ended in a purchase tend to include more page visits.
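The averages behind such a grouped bar plot are group means of page counts by month and purchase outcome. The sketch below illustrates that aggregation in plain Python; the sessions are invented toy values, not rows from the dataset, and the helper name `mean_visits_by_group` is hypothetical.

```python
from collections import defaultdict

# Toy sessions (invented for illustration): (Month, Revenue, ProductRelated page count).
sessions = [
    ("Nov", True, 40),
    ("Nov", True, 60),
    ("Nov", False, 20),
    ("Nov", False, 30),
    ("Mar", True, 35),
    ("Mar", False, 15),
    ("Mar", False, 25),
]

def mean_visits_by_group(rows):
    """Return {(month, revenue): mean page count} for the given rows."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of visits, number of sessions]
    for month, revenue, visits in rows:
        totals[(month, revenue)][0] += visits
        totals[(month, revenue)][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

means = mean_visits_by_group(sessions)
for (month, revenue), avg in sorted(means.items()):
    print(f"{month}  Revenue={revenue}: {avg:.1f} pages on average")
```

Each bar in one facet corresponds to one such (month, purchase outcome) mean for a single page type.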

The faceted time series plot above shows the purchase probability over different months, with lines indicating whether or not a session included a visit to each page type. For Administrative pages, the probability of making a purchase is slightly higher when there are visits (TRUE) than when there are none (FALSE). This trend is fairly consistent across all months except October, which suggests that visits to Administrative pages have a modest, positive association with purchase probability. Informational pages likewise show a generally higher purchase probability when visited (TRUE), though the difference between visited and not visited is not as large, and it appears relatively stable throughout the year. For ProductRelated pages, the purchase probability is generally higher for sessions that visited ProductRelated pages than for those that did not (except for the spike in August), indicating a stronger relationship between visiting product-related pages and the likelihood of making a purchase. There seem to be outlying points in August and September for ProductRelated pages, which could be due to external factors that influenced purchasing behavior during those months.

                              PC1        PC2        PC3
Administrative_Duration 0.5580442  0.6918356  0.4582032
Informational_Duration  0.5522949 -0.7217704  0.4171544
ProductRelated_Duration 0.6193198  0.0202727 -0.7848771

To further explore the relationship between different types of page visits and purchasing intention, we conducted a Principal Component Analysis (PCA) on the duration variables (Administrative_Duration, Informational_Duration, ProductRelated_Duration); the loadings are shown above. For PC1, all three variables have fairly strong positive loadings (0.55 to 0.62), so sessions with longer durations on Administrative, Informational, and ProductRelated pages score higher on this component, which can be read as overall time spent on the site. ProductRelated_Duration has the largest loading on PC1 (0.62) and a strong negative loading on PC3 (-0.78), suggesting that time spent on product-related pages is a major source of variation between sessions and may be particularly indicative of purchasing behavior. For PC2, Administrative_Duration loads strongly positively (0.69) and Informational_Duration strongly negatively (-0.72), while ProductRelated_Duration is near zero (0.02). PC2 therefore contrasts time on administrative pages with time on informational pages, a distinction that may not be directly related to purchasing.
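To make the loadings concrete, the sketch below projects a session onto the components by taking the dot product of each loading vector with the session's durations. It assumes the PCA was run on standardized (z-scored) durations, which the report does not state, and the two example sessions are hypothetical z-scores rather than observations from the data.

```python
# Loadings copied from the table above, in the order
# (Administrative_Duration, Informational_Duration, ProductRelated_Duration).
loadings = {
    "PC1": (0.5580442, 0.5522949, 0.6193198),
    "PC2": (0.6918356, -0.7217704, 0.0202727),
    "PC3": (0.4582032, 0.4171544, -0.7848771),
}

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def scores(z):
    """Project one session's standardized durations onto each component."""
    return {pc: dot(w, z) for pc, w in loadings.items()}

# Hypothetical standardized sessions (z-scores; NOT taken from the data).
long_everywhere = (1.0, 1.0, 1.0)        # above average on all three page types
mostly_informational = (-0.5, 2.0, 0.0)  # long informational time, short administrative time

print(scores(long_everywhere))
print(scores(mostly_informational))
```

A session that is above average on all three durations gets a positive PC1 score, while a session dominated by informational time gets a negative PC2 score, matching the signs in the loading table.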

The biplot shows the variables as vectors and the observations as points, with color indicating whether a purchase was made. The first two components account for around 80% of the total variance in the data. All arrows point toward the right, which indicates that all three variables correlate positively with the first principal component (sessions with longer durations tend to have higher PC1 values); this could be an indicator of a higher inclination to purchase. However, purchasing and non-purchasing sessions are not clearly separated in the biplot, implying that other factors are also influential. The three arrows do not point in exactly the same direction: while all are positively associated with the first principal component, they differ in their loadings on the second component, so there may be some distinctions in how they relate to purchasing intention.

Research Question 2

Next, we explore the relationship between site performance metrics, specifically bounce rates and exit rates, and purchasing decisions. Bounce rate is the percentage of visitors who enter the site on a page and leave without viewing any other page, while exit rate is the percentage of views of a page that were the last in their session, regardless of how many pages were viewed before. Understanding how these metrics relate to consumer behavior can help in enhancing site engagement and optimizing conversion rates.
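The difference between the two metrics is easiest to see on a small example. The sketch below computes both for a page from toy session logs, assuming the usual web-analytics definitions; the page names and sessions are hypothetical, and the dataset itself stores averaged values per session rather than raw page sequences.

```python
# Toy sessions: each is the ordered list of pages viewed (hypothetical page names).
sessions = [
    ["home"],                     # bounce on "home"
    ["home", "product", "cart"],
    ["product"],                  # bounce on "product"
    ["home", "product"],
    ["product", "home"],
]

def bounce_rate(page, sessions):
    """Share of sessions entering on `page` that viewed no other page."""
    entries = [s for s in sessions if s[0] == page]
    if not entries:
        return 0.0
    return sum(len(s) == 1 for s in entries) / len(entries)

def exit_rate(page, sessions):
    """Share of all views of `page` that were the last view in their session."""
    views = sum(s.count(page) for s in sessions)
    if views == 0:
        return 0.0
    exits = sum(s[-1] == page for s in sessions)
    return exits / views

for page in ("home", "product", "cart"):
    print(page, round(bounce_rate(page, sessions), 3), round(exit_rate(page, sessions), 3))
```

Every bounce is also an exit, but an exit can follow several page views, which is why the two rates are related yet not interchangeable.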

We begin by analyzing the correlation between the site performance metrics (BounceRates, ExitRates) and the purchase decision itself (Revenue), together with several factors that may influence it (PageValues, Administrative_Duration, Informational_Duration, ProductRelated_Duration). A heat map is used because its color gradients give a clear representation of the strength and direction of the correlations. The general pattern makes sense: BounceRates and ExitRates are strongly positively correlated with each other, and each is negatively correlated with Revenue. Looking more closely, ExitRates is more negatively correlated with Revenue than BounceRates is, suggesting that higher exit rates are associated with a lower likelihood of purchasing, and that this association is stronger for exit rates than for bounce rates. In addition, although Administrative_Duration, Informational_Duration, and ProductRelated_Duration show varying correlations with BounceRates and ExitRates, there appears to be a relatively strong negative correlation between ProductRelated_Duration and ExitRates.
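Each cell of such a heat map is a pairwise Pearson correlation coefficient. The sketch below shows the calculation for a few columns, with Revenue coded as 0/1; the values are invented toy data chosen only to mirror the signs described above, and `pearson` is a hand-written helper rather than the function used in the report.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy columns (invented values), with Revenue coded as 0 = no purchase, 1 = purchase.
bounce = [0.00, 0.02, 0.00, 0.20, 0.05, 0.01]
exit_ = [0.01, 0.05, 0.02, 0.20, 0.10, 0.03]
revenue = [1, 0, 1, 0, 0, 1]

print("BounceRates vs ExitRates:", round(pearson(bounce, exit_), 3))
print("BounceRates vs Revenue:  ", round(pearson(bounce, revenue), 3))
print("ExitRates vs Revenue:    ", round(pearson(exit_, revenue), 3))
```

With a binary variable such as Revenue, the Pearson coefficient is the point-biserial correlation, so its sign shows whether purchasing sessions tend to have higher or lower rates.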

To better answer our question, we take a closer look at how BounceRates and ExitRates relate to Revenue. We fit two separate logistic regression models using the glm function with the logit link (one for BounceRates and one for ExitRates), each with Revenue as the response variable. We then calculate the predicted probability of a purchase (yes = 1) for each observation in the dataset from each model and plot these probabilities against the corresponding BounceRates and ExitRates. In the plot, both curves slope downward as the rates increase, which indicates a negative relationship between each rate and the likelihood of making a purchase, as expected. The red curve (exit rates) starts higher than the blue curve (bounce rates): at rates near zero, the exit-rate model predicts a higher purchase probability than the bounce-rate model. The red curve is also less steep than the blue curve, which implies that an increase in the bounce rate lowers the predicted probability of a purchase more than the same increase in the exit rate. Exits also occur after a visitor has viewed multiple pages, so a high exit rate can still reflect some engagement with the site even when no purchase results. In contrast, a bounce (leaving the site after viewing only one page) is often a stronger indicator that no purchase will be made.
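The predicted probabilities in such a plot come from the inverse logit of a linear function of the rate. The sketch below shows that mapping for two one-predictor models. The coefficients are hypothetical values chosen only to mimic the shapes described above (a higher starting point and a flatter slope for exit rates); they are not the fitted values from the report.

```python
from math import exp

def purchase_probability(rate, intercept, slope):
    """Inverse logit: P(Revenue = TRUE) for a given rate under a one-predictor model."""
    return 1.0 / (1.0 + exp(-(intercept + slope * rate)))

# Hypothetical coefficients, NOT the fitted values from the report.
bounce_model = {"intercept": -1.3, "slope": -60.0}
exit_model = {"intercept": -0.7, "slope": -35.0}

for rate in (0.0, 0.05, 0.10, 0.20):
    p_bounce = purchase_probability(rate, **bounce_model)
    p_exit = purchase_probability(rate, **exit_model)
    print(f"rate={rate:.2f}  bounce model: {p_bounce:.3f}  exit model: {p_exit:.3f}")
```

A negative slope makes the curve fall as the rate increases, and a larger slope in absolute value makes it fall faster on the log-odds scale, which is the sense in which one curve is steeper than the other.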

After exploring the strength and direction of the correlations and how the predicted purchase probability changes with each rate, we turn to the distribution of BounceRates and ExitRates by Revenue. A box plot is used to illustrate the spread and median of the rates for sessions with and without a purchase. It shows that sessions with a purchase tend to have lower median bounce and exit rates, and the spread is tighter, especially for bounce rates. There are also outliers, particularly for sessions without a purchase, indicating that some sessions had unusually high bounce and exit rates.

Since there is a considerable number of outliers, we break the box plot down by VisitorType to better understand where conversion rates might be improved. According to the plot, Returning_Visitor has the widest range and the most outliers for both BounceRates and ExitRates among sessions with no purchase, which suggests that while returning visitors may browse more, they are not necessarily more likely to make a purchase unless engaged effectively. New visitors have a lower median ExitRates when they make a purchase than when they do not, which could imply that if new visitors are kept engaged past the initial pages they land on, they may have a better chance of converting into purchasers.

The above analysis suggests several key takeaways for optimizing online platforms: (1) lowering bounce rates: since high bounce rates strongly correlate with reduced likelihood of purchase, efforts should focus on engaging users upon their initial landing page to encourage deeper site exploration. (2) managing exit rates: although exit rates have a less drastic impact than bounce rates, minimizing unnecessary navigation that doesn’t lead to purchases remains important. (3) engagement strategies: different strategies may be required for new versus returning visitors, as shown by the differing patterns in bounce and exit rates across these groups.

Research Question 3

For the final research question, we explore whether the final purchase decision differs significantly across visitor types. We focus on two variables, Revenue and VisitorType, both of which are categorical. ‘Revenue’ is binary (FALSE, TRUE), and ‘VisitorType’ has three categories (New_Visitor, Returning_Visitor, Other).

We begin by examining the conditional distribution of ‘Revenue’ given ‘VisitorType’, so that we can evaluate how the likelihood of generating revenue varies across different types of visitors.

##        
##         New_Visitor     Other Returning_Visitor
##   FALSE   0.7508855 0.8117647         0.8606767
##   TRUE    0.2491145 0.1882353         0.1393233

According to the table, new visitors have the highest likelihood of generating revenue at 24.91%, followed by other types of visitors at 18.82%, and returning visitors at 13.93%. This suggests that visitor type is associated with revenue generation, with new visitors being the most likely to make purchases.
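A conditional distribution of this kind divides each cell count by its column total. The sketch below reproduces the proportions in the table; the counts are an assumption, chosen to be consistent with the reported proportions and the 12,330 sessions, since the report itself shows only proportions.

```python
# Counts consistent with the proportions above and the 12,330 sessions.
# ASSUMPTION: the report shows only proportions, not these counts.
counts = {
    "New_Visitor": {"FALSE": 1272, "TRUE": 422},
    "Other": {"FALSE": 69, "TRUE": 16},
    "Returning_Visitor": {"FALSE": 9081, "TRUE": 1470},
}

def column_proportions(table):
    """Divide each cell by its column (visitor type) total."""
    return {
        visitor: {level: n / sum(cells.values()) for level, n in cells.items()}
        for visitor, cells in table.items()
    }

props = column_proportions(counts)
for visitor, cells in props.items():
    print(f"{visitor:18s} FALSE={cells['FALSE']:.7f}  TRUE={cells['TRUE']:.7f}")
```

Each column sums to 1, so the TRUE row can be read directly as the purchase rate within each visitor type.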

We also want to present these numbers more directly, so we create a proportional bar plot. It shows, for each visitor type, the proportion of sessions in which Revenue is TRUE or FALSE (that is, whether a purchase was ultimately made).

The stacked bar plot indicates that new visitors make a final purchase proportionately more often than returning visitors and other visitor types. This visual pattern aligns with the numbers in the table, showing that a new visitor is more likely to make a final purchase than a visitor of any other type.

We also wanted to better understand how the distribution of revenue generation varied by the type of visitor. To do this, we plotted the relationship between ‘Revenue’ and ‘VisitorType’ using a Mosaic plot, which allows us to observe the proportional representation and interaction between these categorical variables.

The mosaic plot above suggests notable differences in revenue (the final purchase decision) between visitor types. The larger segments for ‘FALSE’ than for ‘TRUE’ across all visitor types indicate that sessions without a purchase predominate. The cells for new visitors who did not make a purchase and for returning visitors who did are colored red, meaning their counts are significantly lower than expected under independence. The cell for new visitors who did make a purchase is colored blue, meaning its count is significantly higher than expected under independence. These observations suggest a significant association between visitor type and the likelihood of making a purchase, with new visitors more likely than expected to make a final purchase and returning visitors less likely than expected to do so.

The patterns observed in the previous table and plots led us to test the statistical significance of the relationship between visitor type and purchase decision. To assess whether the deviations from the expected counts under independence were due to chance or indicated a genuine association, we conducted a Chi-squared test of independence. This test evaluates whether the observed frequency distribution differs significantly from what would be expected if the two variables were independent of each other.

## 
##  Pearson's Chi-squared test
## 
## data:  table_visitors
## X-squared = 135.25, df = 2, p-value < 2.2e-16

The results of our Chi-squared test provide strong statistical evidence for the patterns observed in the mosaic plot. With a p-value of less than 2.2e-16, the test indicates that the association between ‘VisitorType’ and ‘Revenue’ is highly significant. In practical terms, the likelihood that a visitor makes a purchase is not the same across the ‘New_Visitor’, ‘Other’, and ‘Returning_Visitor’ categories. We can reject the null hypothesis of independence and conclude that there is a statistically significant relationship between the type of visitor to the e-commerce website and the probability of a purchase being made.
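The test statistic can be checked by hand from a 2 x 3 contingency table. The sketch below computes Pearson's Chi-squared statistic in plain Python. The counts are an assumption, chosen to be consistent with the proportions reported earlier and the 12,330 sessions, since the report does not print the table itself. Under that assumption the result agrees with the reported X-squared of 135.25.

```python
from math import exp

# ASSUMPTION: counts consistent with the reported proportions; not printed in the report.
# Rows: Revenue FALSE, TRUE. Columns: New_Visitor, Other, Returning_Visitor.
observed = [
    [1272, 69, 9081],
    [422, 16, 1470],
]

def chi_square(table):
    """Pearson's chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square(observed)
# For df = 2 the chi-squared upper-tail probability has the closed form exp(-x / 2).
p_value = exp(-stat / 2)
print(f"X-squared = {stat:.2f}, df = {df}, p-value = {p_value:.3g}")
```

The closed-form p-value applies only because df = 2 here; other degrees of freedom need the general chi-squared distribution function.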

Conclusion

Throughout this project, we explored various aspects of online shopper behavior and their relationships with purchasing decisions. Our analyses revealed several key insights that could assist e-commerce businesses in optimizing their platforms. We found that higher average page visits correlate with increased purchasing likelihood, a trend that is particularly pronounced in the late autumn, winter, and early spring months (Nov, Dec, Feb, Mar). Durations on all page types load positively on the first principal component, suggesting that overall engagement is a key factor in purchasing behavior. The biplot, however, shows that although longer sessions generally align with a higher probability of purchase, the data points do not form distinct clusters by purchasing behavior, suggesting that other unexamined factors play a role.

We also found that both bounce and exit rates are informative indicators of purchasing behavior, with higher rates generally associated with lower purchase probabilities. Additionally, we observed that visitor type is significantly associated with purchasing decisions, with new visitors more likely to make purchases than returning ones.

The findings suggest practical strategies for e-commerce sites, such as enhancing user engagement to reduce bounce rates and tailoring marketing strategies to convert new visitors into customers. Our research also highlights the importance of continuously analyzing visitor behavior and site performance to adapt to changing consumer preferences and technological advancements.

For future work, we recommend expanding the scope of the data to include interactions with specific elements on the site (e.g., reviews or search sections) and the impact of mobile versus desktop browsing. Understanding how these interactions correlate with purchasing decisions could help in further optimizing user engagement strategies.