Customer Personality Analysis is a comprehensive study aimed at
understanding a company’s ideal customers. It enables businesses to
delve deeper into the characteristics of their consumers, thereby
enhancing their ability to strategize and increase sales effectively. In
this project, we analyze a dataset sourced from Kaggle and provided by
Dr. Omar Romero-Hernandez, a professor at U.C. Berkeley. The dataset
comprises 15 quantitative variables, including customer demographic
information like Year_Birth and Income, and spending details such as
MntWines and NumWebPurchases. It also contains 13 categorical
variables related to customer profiles and marketing responses, such as
Education, Marital_Status, and AcceptedCmp1 through AcceptedCmp5.
Our primary interest lies in identifying potential factors that influence purchasing decisions. To this end, we have reclassified certain expenditures into three categories:
- Primary Sector: necessities (fish, meat, and fruits)
- Secondary Sector: extra goods (sweets and wines)
- Tertiary Sector: luxury items (such as gold)
From the figure above, we can observe that the Primary and Secondary Sectors display similar distributions, as indicated by the nearly equal bar lengths for each Spending Amount. Specifically, their ranges span from 1 to 1727 and 0 to 1549, respectively. The Primary Sector (necessities) is more right-skewed: its median of 90 is well below the Secondary Sector’s median of 199.5 (extra goods), even though its maximum is higher. In contrast, spending in the Tertiary Sector is generally lower, with values clustering on the left and a maximum value of only 362. This suggests that most customers spend little on luxury items.
Additionally, we have consolidated these three variables into a new
composite variable: Total Money Spent (MntTotal).
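The sector reclassification described above can be sketched as follows. This is a minimal illustration, not the authors’ code: the rows are invented toy values, and every column name other than MntWines (MntFishProducts, MntMeatProducts, MntFruits, MntSweetProducts, MntGoldProds) is an assumption about how the Kaggle file labels these expenditures.

```python
from statistics import median

# Toy rows for illustration only; these are NOT values from the dataset.
# Column names other than MntWines are assumed, not confirmed by the text.
customers = [
    {"MntFishProducts": 10, "MntMeatProducts": 50, "MntFruits": 5,
     "MntSweetProducts": 3, "MntWines": 100, "MntGoldProds": 7},
    {"MntFishProducts": 0, "MntMeatProducts": 20, "MntFruits": 2,
     "MntSweetProducts": 1, "MntWines": 15, "MntGoldProds": 0},
    {"MntFishProducts": 40, "MntMeatProducts": 300, "MntFruits": 30,
     "MntSweetProducts": 25, "MntWines": 600, "MntGoldProds": 60},
]

# Mapping from sector name to the product columns it aggregates.
SECTORS = {
    "Primary": ("MntFishProducts", "MntMeatProducts", "MntFruits"),
    "Secondary": ("MntSweetProducts", "MntWines"),
    "Tertiary": ("MntGoldProds",),
}

def add_sector_totals(row):
    """Return a copy of row with one total per sector plus MntTotal."""
    out = dict(row)
    for sector, columns in SECTORS.items():
        out[sector] = sum(row[c] for c in columns)
    out["MntTotal"] = sum(out[sector] for sector in SECTORS)
    return out

enriched = [add_sector_totals(r) for r in customers]

# Summaries of the kind quoted in the text: range and median per sector.
for sector in SECTORS:
    values = [r[sector] for r in enriched]
    print(sector, min(values), median(values), max(values))
```

Keeping the sector definitions in one mapping makes it easy to move a product from one sector to another and recompute MntTotal consistently.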
In this research, we are particularly interested in addressing the following three research questions:
What are the key quantitative predictors of Total Money Spent?
How do some of the categorical variables (namely, people’s education levels and marital statuses) affect their spending habits, and how are individuals’ spending habits in one sector (necessities, extra goods, luxury) correlated with those in the others?
What factors influence the number of deal purchases a customer makes, and whom should the company target when advertising these deals?
By delving into these questions, we plan to identify a range of digital and traditional marketing opportunities that stakeholders can leverage to increase customer engagement and boost sales.
What Are the Key Quantitative Predictors of Total Money Spent (MntTotal)?
Our study aims to ascertain whether the total money spent by an individual can be predicted using their demographic details. To explore this, we focus initially on the impact of quantitative variables on our outcome variable, MntTotal. Our quantitative variables are Year_Birth, Income, Recency, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, and Dt_Customer. Among these, Dt_Customer, which records the date of a customer’s enrollment, differs in nature from the others because it is date-based. This distinction prompts us to start our investigation with a time series plot to explore whether Dt_Customer could be a viable predictor.
In the provided figure, we present a time series analysis of Total Money Spent against the Date of Enrollment, categorized by educational attainment levels. This stratification by Education level serves to reduce the amount of data within each subgroup for better visualization. The graph reveals no evident cyclical patterns or persistent trends in average spending, suggesting an absence of significant long-term variations in expenditure. Notably, the observable inconsistencies in data from individuals with bachelor’s or master’s degrees can be explained by their predominant numbers in the dataset, which results in a higher concentration of data points and potential anomalies.
Consequently, based on these findings, the variable Dt_Customer will be excluded from potential predictors for Total Money Spent.
Next, we run a principal component analysis (PCA) on the other quantitative variables of interest to identify potential key quantitative predictors. Note that 24 participants have been temporarily excluded because their Income entries are missing.
From the above elbow plot, it is evident that the first three principal components account for 66.3% of the variation, which represents a moderate level of data simplification. Additionally, the clear ‘elbow’ observed in the scree plot suggests that these components capture the most significant patterns within the data, justifying their selection for further analysis.
In the PCA biplot, we initially identify which variables
significantly contribute to the first principal component (PC1), which
accounts for the most variation. Observing horizontally, we find that
Income has the highest negative loading on PC1 at -0.488311229,
followed by NumCatalogPurchases at -0.477381621, and
NumStorePurchases at -0.467808801. NumWebVisitsMonth shows a notable
positive loading of 0.409829056. Looking vertically for contributions
to the second principal component (PC2), NumDealsPurchases displays the
highest loading of 0.70847325, with NumWebPurchases following at a
loading of 0.48715615. Consequently, all quantitative variables, except
for Recency and Year_Birth, are potential candidates as quantitative
predictors.
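To make the notions of loading and explained variance concrete, here is a small sketch of how the first principal component can be obtained from standardized variables. It is not the analysis used in this report: the three columns hold invented toy values, and power iteration stands in for whatever PCA routine produced the loadings quoted above.

```python
import math
from statistics import mean, pstdev

def standardize(columns):
    """Center each column and scale it to unit (population) variance."""
    result = []
    for col in columns:
        mu, sd = mean(col), pstdev(col)
        result.append([(x - mu) / sd for x in col])
    return result

def correlation_matrix(z_columns):
    """Correlation matrix of already standardized columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in z_columns]
            for ci in z_columns]

def first_component(matrix, iterations=500):
    """Power iteration: returns (largest eigenvalue, unit loading vector)."""
    p = len(matrix)
    v = [1.0] + [0.0] * (p - 1)  # simple starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v

# Toy values for illustration only; NOT taken from the dataset.
names = ["Income", "NumCatalogPurchases", "NumWebVisitsMonth"]
columns = [
    [20000, 35000, 50000, 65000, 80000, 95000],
    [0, 1, 2, 4, 5, 7],
    [9, 8, 6, 5, 3, 2],
]

corr = correlation_matrix(standardize(columns))
eigenvalue, loadings = first_component(corr)

# For a correlation matrix the eigenvalues sum to the number of variables,
# so eigenvalue / p is the share of variance explained by PC1.
explained_ratio = eigenvalue / len(columns)
for name, loading in zip(names, loadings):
    print(f"{name}: {loading:+.3f}")
print(f"PC1 explains {explained_ratio:.1%} of the variance")
```

The overall sign of a principal component is arbitrary, so only the relative signs of the loadings carry meaning: variables with the same sign move together along the component, and variables with opposite signs move against each other.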
How do education levels and marital statuses influence spending across sectors, and how are spending habits in one sector correlated with others?
In this project, we are primarily interested in the categorical variables of education and marital status. We explore how they might affect how much people spend on different types of products: necessities (fish, meat, and fruits), extra goods (sweets and wines), and luxury items (like gold). By exploring these relationships, we want to understand whether these personal factors play a role in the amount of money people spend on these items, which could help us figure out how people’s backgrounds relate to their shopping habits.
To better understand the distribution of spending in each sector by education and marital status, we use a boxplot to look at the median spending and the variability in each group.
The graph is divided into three panels, one for each sector. Each panel compares the distribution of spending across education levels and marital statuses, with different colors representing different marital statuses. The y-axis indicates the education level, and the x-axis shows the spending amount; each sector has its own scale. For the Primary sector, regardless of marital status, the median spending of Bachelor’s degree holders is slightly higher than that of people who hold a Master’s or PhD degree. Within the Bachelor group, widows have an even higher median spending. For the Secondary sector, the median is higher for people who hold a PhD, and across all education levels widows have a higher median spending. For the Tertiary sector, the median spending of people who hold a PhD tends to be lower. As in the Secondary sector, people who are widowed tend to spend the most. Outliers are present in all sectors, which suggests that some individuals have spending habits that differ significantly from the median of their group.
Furthermore, given that Income might also influence people’s spending habits, as the PCA suggests, we plot a correlation matrix to see the relationships between Income and the products in all sectors.
Each cell in the matrix represents the correlation coefficient between two variables, ranging from -1 to 1: 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. On the graph, bigger and darker circles correspond to stronger relationships, and the color represents the direction of the correlation, with blue for positive and red for negative correlations. From the graph, we can tell that income has the strongest positive correlation with wines and meat products, which implies that as people earn more, they tend to spend more on these items. Spending on gold products is less correlated with spending on the other products, which suggests that gold belongs to a different category. This makes sense given that it is the only Tertiary product in the project. People who buy fruits have a lower tendency to purchase wines, but they tend to buy more meat and fish, which are all necessities (Primary sector) as defined in this project. Fruits, meat, and fish products are all positively associated with each other, while wine, which is in the Secondary sector, is strongly positively correlated only with income and meat products. Overall, income is most strongly associated with meat and wine, possibly because people are willing to pay more for these items as they earn more. Gold, by contrast, does not strongly correlate with any of them, which is consistent with its being a luxury product.
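For reference, each cell of such a matrix is a Pearson correlation coefficient, which can be computed as in the sketch below. The numbers are invented toy values chosen only to mimic the pattern described (income moving with wine and meat, gold largely unrelated), and the column names MntMeatProducts and MntGoldProds are assumptions about the dataset’s labels.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy values for illustration only; NOT taken from the dataset.
data = {
    "Income": [20000, 35000, 50000, 65000, 80000],
    "MntWines": [10, 60, 150, 400, 700],
    "MntMeatProducts": [15, 40, 120, 300, 500],
    "MntGoldProds": [30, 5, 40, 10, 25],
}

names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the matrix row by row, two decimals per cell.
for a in names:
    print(a.ljust(16), " ".join(f"{corr[a][b]:6.2f}" for b in names))
```

A coefficient near zero, as for the gold column in this toy example, only rules out a linear relationship; it says nothing about other kinds of dependence.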
What factors influence the number of deal purchases? Whom should the company target when advertising deals?
The graph above shows the densities of web purchases and store purchases. From the two density curves, we can tell that the most common number of purchases lies in the range of 0-3 for both channels. Specifically, from 0 to 2, web purchases consistently have a higher density than store purchases. Store purchases then spike around 3-4 and from there consistently show a greater density than web purchases.
To formally test whether the means of NumWebPurchases and NumStorePurchases differ, we applied a two-sample t-test, as shown below:
##
## Welch Two Sample t-test
##
## data: web_purchases and store_purchases
## t = -32.693, df = 13120, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.807603 -1.603111
## sample estimates:
## mean of x mean of y
## 4.084821 5.790179
The null hypothesis for this t-test is that the means of NumWebPurchases and NumStorePurchases are the same. As shown in the output, the p-value is less than 2.2e-16, which lets us formally reject the null hypothesis and conclude that the means of NumWebPurchases and NumStorePurchases are statistically different from each other. Moreover, the mean of NumStorePurchases is larger than the mean of NumWebPurchases. Thus, we focus on NumStorePurchases below.
We want to learn which factors influence the number of deal purchases customers make, which can help the company optimize its marketing strategies and improve customer engagement.
In the heat map above, the darker spots are clustered where the number of deals purchased and the number of web purchases both lie between 0 and 10. Customers with fewer web purchases and a low to moderate number of deal purchases are more common, while higher numbers of web purchases and deal purchases are less common. The plot appears to show that as the number of web purchases increases, the number of deals purchased also increases up to a point, which is around 10 on both axes. However, there are fewer data points towards the higher end of both axes, which suggests that fewer customers fall into this category. Most people make between 0 and 10 web purchases and are most likely to make a single deal purchase. Therefore, increasing the number of deal purchases may not help boost people’s web purchases.
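The quantities reported in the output above (the t statistic and the fractional degrees of freedom) follow from Welch’s formulas, which can be sketched as below. The two samples are invented toy values, not the dataset’s columns, and the function name welch_t is a hypothetical helper introduced for illustration.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's two-sample t statistic and Welch-Satterthwaite df."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Toy samples for illustration only; NOT taken from the dataset.
web_purchases = [2, 3, 4, 5, 6]
store_purchases = [5, 6, 7, 8, 9]

t_stat, df = welch_t(web_purchases, store_purchases)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A negative t statistic, as in the report’s output, means the first sample’s mean is smaller than the second’s. Because both counts are recorded for the same customers, a paired test would be a natural alternative to consider; that is a suggestion, not something the analysis above does.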
Next, we examine income and total spending to get a macro view of the relationship between how much people spend and their number of deal purchases.
After eliminating some outliers, the above graph suggests that people with mid-range incomes (40000-75000) tend to make more deal purchases, which suggests that people at this income level are more inclined to take advantage of deals and discounts. Among customers with similar incomes in the range 1-60000, those with more deal purchases seem to have higher total spending. This finding suggests that offering attractive deals and discounts could potentially encourage higher spending among customers with moderate incomes. As income increases beyond 75000, the number of deal purchases decreases while total spending remains high. This implies that deal-focused marketing campaigns need not target individuals with high incomes.
After seeing the general distribution of the number of deals purchased against numeric income, we delve into the distribution across education backgrounds and income groups.
Within each education level, there are differences in the number of deals purchased across different income levels. It appears that for PhDs, the high-income category has a notably higher median number of deals purchased compared to the other income levels. For Bachelor and Master levels, the median number of deals purchased is relatively consistent across income levels, although the high-income group tends to have a slightly higher median in the Bachelor category. There are several outliers, particularly in the Bachelor and Others categories, where some individuals with low and medium income have purchased a high number of deals. These could be anomalies or may indicate specific segments that are highly responsive to deals despite their income.
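The grouped comparison described above amounts to computing a median per (education, income level) cell, which can be sketched as follows. The rows are invented toy values, and the income cut points are an assumption borrowed from the 40000-75000 mid-range mentioned earlier; the bins actually used for the plot are not stated in the text.

```python
from collections import defaultdict
from statistics import median

def income_bracket(income, low_cut=40000, high_cut=75000):
    """Assign an income level; the cut points are assumed, not the report's."""
    if income < low_cut:
        return "low"
    if income <= high_cut:
        return "medium"
    return "high"

# Toy rows (Education, Income, NumDealsPurchases); NOT from the dataset.
rows = [
    ("PhD", 82000, 4), ("PhD", 90000, 6), ("PhD", 52000, 2),
    ("Bachelor", 30000, 1), ("Bachelor", 35000, 3), ("Bachelor", 60000, 2),
    ("Master", 45000, 2), ("Master", 78000, 2),
]

# Collect deal counts per (education, income level) group.
groups = defaultdict(list)
for education, income, deals in rows:
    groups[(education, income_bracket(income))].append(deals)

median_deals = {key: median(values) for key, values in groups.items()}
for (education, level), value in sorted(median_deals.items()):
    print(f"{education:<9} {level:<7} median deals = {value}")
```

Medians are used here because, as noted above, several groups contain outliers that would pull a mean upward.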
Our study focused on identifying key factors that influence customer spending behavior, using a combination of data visualizations and statistical analyses. An initial time series analysis showed that the date of customer enrollment (Dt_Customer) did not demonstrate a significant trend in spending, leading to its exclusion from further analysis. A subsequent PCA indicated that Income, NumCatalogPurchases, NumStorePurchases, and NumWebVisitsMonth load most heavily on the first principal component, making them candidate predictors of total money spent.
After selecting the quantitative variables, we turned to two categorical variables: education level and marital status. To further explore purchasing behavior, we categorized the goods purchased into three sectors: primary, secondary, and tertiary. Spending in the three sectors differs across the groups defined by these two traits. Additionally, we examined the spending correlation between products to see whether bundles of goods could be created to promote purchases. We noticed that while all goods are positively correlated, wine is strongly positively correlated only with income and meat products, indicating a possible advertising direction.
To understand consumer spending behavior on promotions, we analyzed the key variable “Number of Deals Purchased,” which indicates how promotions influence profits. Because of the rise of e-commerce, we were interested in whether online and in-store promotions differ, so that the difference could be used to increase profits. We found, however, that when people make bulk purchases, they tend to buy in store. Also, increasing the number of deal purchases may not help boost people’s web purchases, leading us to focus on in-store advertising strategies.
Lastly, looking at the relationship between income, total spending, and the number of deal purchases, we observed that offering attractive deals and discounts could potentially encourage higher spending among customers with moderate incomes.
Proposal of Advertising Strategies:
Based on our analysis, we propose the following advertising strategies that companies can use to promote consumption:
- Food sector targeted selling: design campaigns that foster a sense of community among widows, encouraging them to engage with the brand and products across all sectors.
- Bundle selling: marketing campaigns for wine and meat should target higher-income customers; create bundles of wines and meats, and offer discounts on bundles of fruits, meats, and fish products.
- Digital vs in-store selling: focusing on in-store promotions for bulk purchases could be an effective strategy.
- Targeted Income Group: offer attractive deals and discounts to customers with moderate incomes.
- PhDs with High Income: this group shows a higher median number of deals purchased, suggesting that they might be more inclined to take advantage of deals. Therefore, they could be a prime target for advertising deals.
- Bachelor’s Degree Holders: this group, particularly in the high-income bracket, also shows potential for being targeted, as indicated by a relatively higher number of deals purchased and the presence of outliers who purchase many deals.
- We could also investigate the outliers in the low and medium income groups across all education levels to understand why they are purchasing more deals.
Questions That Need Further Research:
After the analysis of question 3, despite the patterns we observe, we also notice that there are several outliers, particularly in the Bachelor and Others categories, where some individuals with low and medium income have purchased a high number of deals. These could be anomalies or may indicate specific segments that are highly responsive to deals despite their income, which need more investigation and analysis.
Our study doesn’t analyze factors related to the customer loyalty program. Given the significance of customer loyalty for business growth, further research could focus on these questions: what factors contribute to higher customer loyalty among different education levels and marital status? Are there particular customer segments that demonstrate higher retention rates, and what strategies can increase loyalty across other segments?