Data Description

Businesses rely on customer data to understand the target clients for their various products. Customer personality analysis helps a business uncover hidden relationships between customers and products, which can inform decisions about building more targeted commercial campaigns and adjusting products to meet clients’ needs.

In this project, we deal with a customer personality dataset that contains 2240 subjects and four types of customer attributes, which are people, products, promotion, and place. The list of attributes is indicated below.

People	Variable Explanation
ID	Customer’s unique identifier
Year_Birth	Customer’s birth year
Education	Customer’s education level
Marital_Status	Customer’s marital status
Income	Customer’s yearly household income
Kidhome	Number of children in customer’s household
Teenhome	Number of teenagers in customer’s household
Dt_Customer	Date of customer’s enrollment with the company
Recency	Number of days since customer’s last purchase
Complain	1 if customer complained in the last 2 years, 0 otherwise

Products	Variable Explanation
MntWines	Amount spent on wine in last 2 years
MntFruits	Amount spent on fruits in last 2 years
MntMeatProducts	Amount spent on meat in last 2 years
MntFishProducts	Amount spent on fish in last 2 years
MntSweetProducts	Amount spent on sweets in last 2 years
MntGoldProds	Amount spent on gold in last 2 years

Promotion	Variable Explanation
NumDealsPurchases	Number of purchases made with a discount
AcceptedCmp1	1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2	1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3	1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4	1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5	1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response	1 if customer accepted the offer in the last campaign, 0 otherwise

Place	Variable Explanation
NumWebPurchases	Number of purchases made through the company’s web site
NumCatalogPurchases	Number of purchases made using a catalogue
NumStorePurchases	Number of purchases made directly in stores
NumWebVisitsMonth	Number of visits to company’s web site in the last month

Research Purpose

As a data analytics team, we were invited by the shopping mall to perform a basic customer classification as a prerequisite for further marketing campaign plans and personal services.

Customer classification is the act of seeking out and identifying common traits in a group of customers. It answers a broad question: what is similar about these people and their purchasing habits?

Some benefits of customer segmentation include improving products and services, generating better-quality revenue, increasing sales, and focusing customized marketing messages.

In this research paper, we raise questions about each section of the data provided: the individual customers, the products they purchased, whether they accepted marketing campaigns, and where they prefer to make their purchases.

People

Research Question 1

Can we classify customers based on their personal information?

First, we used K-means clustering together with pairs plots to get an intuitive picture of how the clusters differ across variables.

K-means clustering is a technique that starts from K ‘centroids’, initialized at K randomly chosen points in the data, and assigns every data point to the nearest centroid. After every point has been assigned, each centroid is moved to the average of all the points assigned to it. The process then repeats: every point is assigned to its nearest centroid, and each centroid is moved to the average of its assigned points. The algorithm stops when no point changes its assigned centroid.
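The assign-then-update loop described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the analysis; the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy data are all assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialise the centroids at K distinct randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed its centroid, so the algorithm is done
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Toy one-dimensional data (made up for illustration, not taken from the dataset).
toy = [(20.0,), (22.0,), (25.0,), (50.0,), (52.0,), (55.0,), (90.0,), (95.0,)]
labels, centroids = kmeans(toy, k=3)
```

Because the starting centroids are random, different seeds can produce different final clusters, which is why the result is usually checked visually or rerun from several starts.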

Some advantages of K-means are that it is relatively simple to implement, scales to large data sets, and is guaranteed to converge (although possibly to a local rather than a global optimum).

From the pairs plot, we can see that for most variables the smoothed density plots, conditioned on cluster membership, do not vary much across clusters and largely overlap.

However, in the second row, second column of the chart, customer income differs greatly across clusters: the three modes are nearly distinct from each other.

The previous graph gives us important information about the characteristics of the individual clusters: Income and Age are the first and second most identifiable differences, so we take a closer look at these two variables.

## # A tibble: 3 × 4
##   center    Max Median   Min
##    <int>  <int>  <dbl> <int>
## 1      1 113734 74805  63564
## 2      2  39791 29672   1730
## 3      3  63516 51382. 39858

The summary of Income by cluster suggests that, if we want to classify customers based on their primary differences, income is a good basis: customers with income above $63,500 form the high income group, customers with income between $40,000 and $63,500 form the middle income group, and customers with income below $40,000 form the low income group.
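As a small illustration, the income rule stated above can be written as a function. The name `income_group` and the handling of values exactly equal to a cutoff are assumptions; the text only gives the rounded cutoffs of $40,000 and $63,500.

```python
def income_group(income, low_cut=40_000, high_cut=63_500):
    """Map a yearly income to the low / middle / high group described in the text."""
    if income < low_cut:
        return "low"
    if income <= high_cut:  # boundary handling is an assumption
        return "middle"
    return "high"

# Cluster medians taken from the summary table above.
cluster_medians = {1: 74805, 2: 29672, 3: 51382}
groups = {center: income_group(m) for center, m in cluster_medians.items()}
```

Applied to the three cluster medians, the rule places cluster 1 in the high group, cluster 3 in the middle group, and cluster 2 in the low group.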

People and Products

Research Question 2.1

What kind of people spend more on wine/fruit/meat/fish/sweet/gold?

To answer this question, we created subsets that contain the most relevant customer attributes: Education, Marital_Status, Income, Kidhome, Teenhome, and Recency. We also decided to classify the purchase amounts as low or high, so the quantitative variables for wine, fruit, meat, fish, sweets, and gold purchases were transformed into binary outcomes. Then, for each purchase category, we built a decision tree to classify customers into those with more or fewer purchases. This process yields one decision tree visualization per purchase category.

The first thing we notice is that for all purchase categories, income is at the root of the decision tree, which means it is the most dominant classifying attribute: its first split gives the largest reduction in Gini impurity.
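To make the Gini impurity criterion concrete, the sketch below computes the impurity of a set of low/high labels and searches for the single income threshold that minimises the weighted impurity of the two resulting branches. It illustrates the general technique only; the function names and the toy incomes are hypothetical and are not taken from the fitted trees.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value < threshold' split."""
    best = (None, gini(labels))  # start from the impurity with no split
    xs = sorted(set(values))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between neighbouring values
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy incomes in thousands with made-up low/high purchase labels.
incomes = [30, 35, 45, 48, 52, 60, 75, 90]
purchase = ["low", "low", "low", "low", "high", "high", "high", "high"]
threshold, score = best_threshold(incomes, purchase)
```

In this toy example the labels separate perfectly at an income of 50, so the weighted impurity drops from 0.5 to 0. A tree-building algorithm repeats the same search over every candidate variable and places the best one at the root.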

From the wine purchase decision tree plot, the right branch of the first income split is labeled low, 0.9, and 46%. This means that if a customer’s income is less than 49k, the chance that they are in the lower range of wine purchases is 90%, regardless of their other attributes. Kidhome is another splitting criterion: if income is between 49k and 55k and the customer has no kid at home, there is a 76% chance that they purchase more wine. An interesting but understandable observation is that if income is greater than 110k, there is only a 9% chance that the customer purchases less wine, regardless of whether a kid is present at home.

From the fruit purchase decision tree plot, customers with an income lower than 59k have a 29% likelihood of purchasing more fruit, while customers with an income higher than that have an 85% likelihood. Furthermore, we conclude that customers who buy more fruit are those with an income higher than 59k, or those with no kid and an income between 46k and 59k.

From the meat purchase decision tree plot, customers with an income greater than 55k purchase more meat with a likelihood of 89%. Moreover, a customer whose income is below 42k has a 92% chance of buying less meat.

Next, in the fish purchase decision tree, customers with an income greater than 58k purchase more fish with a likelihood of 85%. Among customers below this boundary, those with no kid or teen at home, an income greater than 15k, and a last purchase more than 13 days ago appear more likely to purchase more fish.

From the sweets purchase decision tree plot, customers with an income greater than 59k purchase more sweets with a likelihood of 85%.

Lastly, in the gold purchase decision tree, customers with an income above 51k have a high chance of purchasing more gold. Although Kidhome is a common splitting criterion for all other purchase categories, it does not appear in the gold tree, which suggests that the presence of a kid at home does little to distinguish customers’ gold purchases. Education, on the other hand, is an important splitting criterion for gold: customers with a 2nd cycle education have an 83% chance of purchasing more gold.

Research Question 2.2

What do customers prefer to purchase in each income group?

The original income variable is a discrete quantitative variable, and we transformed it into a factor with five income levels: “lowest”, “second lowest”, “medium”, “second highest”, and “highest”. We then generated the following side-by-side bar chart, which visualizes the distribution of customer purchases within each of the five income groups.
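The text does not say how the five income levels were cut. The sketch below assumes five roughly equal-sized groups by income rank (quintiles) and then averages spending per person within each group; the cutting rule, the function names, and the toy numbers are all assumptions made for illustration.

```python
LEVELS = ["lowest", "second lowest", "medium", "second highest", "highest"]

def income_levels(incomes):
    """Assign each income to one of five rank-based groups (assumed quintile cut)."""
    order = sorted(range(len(incomes)), key=lambda i: incomes[i])
    levels = [None] * len(incomes)
    for rank, i in enumerate(order):
        levels[i] = LEVELS[rank * len(LEVELS) // len(incomes)]
    return levels

def mean_spend_by_level(levels, spend):
    """Average amount spent per person within each income level."""
    totals, counts = {}, {}
    for lev, s in zip(levels, spend):
        totals[lev] = totals.get(lev, 0) + s
        counts[lev] = counts.get(lev, 0) + 1
    return {lev: totals[lev] / counts[lev] for lev in totals}

# Toy data: ten made-up customers with yearly income and amount spent on wine.
incomes = [10_000, 20_000, 30_000, 40_000, 50_000,
           60_000, 70_000, 80_000, 90_000, 100_000]
wine = [5, 7, 20, 30, 100, 120, 300, 320, 500, 600]
wine_by_level = mean_spend_by_level(income_levels(incomes), wine)
```

Repeating `mean_spend_by_level` for each product column gives the per-group averages that a side-by-side bar chart would display.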

We can easily see that the average purchase amount per person is smallest in the lowest income group and increases across the higher income groups. This means that people with higher incomes buy more products overall. In addition, the two lowest income groups spend roughly equal amounts on each product type. As income rises, however, the amount spent per person on wine and meat increases drastically.

Furthermore, customers in the higher income groups spend far more on wine and meat than on other product types, and on average they spend similar amounts on each of the remaining product types.

People and Promotions

In this section, we are going to analyze the relationship between customers and product campaigns.

Research Question 3.1

Is there an association between a customer’s income and whether they accept a product campaign?

The purpose of this part is to find out whether there is any association between a customer’s income and whether they accept a product campaign. The findings from this section would be particularly useful for companies assessing target customers, and the results could inform further marketing strategies. Our assumption is that customers with higher incomes are more likely to accept a product campaign.

We first used subsetting to separate the customers into two groups: those who never accepted any campaign and those who accepted at least one campaign.

According to this boxplot, customers who accepted at least one campaign have a higher median annual income than customers who never accepted any campaign. This finding agrees with our assumption. Therefore, we conclude that customers with higher incomes are more likely to accept product campaigns than those with lower incomes. Companies should take target customers’ income and product prices into consideration when building future campaigns.
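A minimal sketch of this two-group split follows, comparing the median income of each group. Treating `Response` (the last campaign) as one of the campaigns is an assumption, as are the helper names and the toy customers.

```python
from statistics import median

# Campaign indicator columns from the variable list; including Response is an assumption.
CAMPAIGN_COLUMNS = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                    "AcceptedCmp4", "AcceptedCmp5", "Response"]

def accepted_any(customer):
    """True if the customer accepted at least one campaign."""
    return any(customer[col] == 1 for col in CAMPAIGN_COLUMNS)

def median_income_by_acceptance(customers):
    """Median income for the 'never accepted' and 'accepted at least one' groups."""
    groups = {"never accepted": [], "accepted at least one": []}
    for c in customers:
        key = "accepted at least one" if accepted_any(c) else "never accepted"
        groups[key].append(c["Income"])
    return {key: median(vals) for key, vals in groups.items() if vals}

def make_customer(income, **accepted):
    """Hypothetical helper that builds one toy customer record."""
    record = {col: 0 for col in CAMPAIGN_COLUMNS}
    record.update(accepted)
    record["Income"] = income
    return record

toy_customers = [
    make_customer(30_000),
    make_customer(42_000),
    make_customer(61_000),
    make_customer(55_000, AcceptedCmp3=1),
    make_customer(80_000, Response=1),
]
medians = median_income_by_acceptance(toy_customers)
```

The two resulting groups are the ones whose income distributions a boxplot would compare side by side.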

Research Question 3.2

Is there an association between customers accepting marketing campaigns and the number of discounted purchases they make?

Most companies have sales and discounts during holidays, and knowing the statistics on the number of deal purchases could help them maximize revenue. We expect an association between the number of deal purchases and customers’ acceptance of campaigns. According to these two histograms, among customers who have accepted a campaign before, nearly 50% made one purchase with a discount, fewer than 20% made two purchases with a discount, and fewer still made more than two. Among customers who have never accepted any campaign, around 40% made one purchase with a discount and over 20% made two.

We noticed that the proportion of customers with exactly one discounted purchase is higher among those who accepted campaigns, while the proportion with more than one discounted purchase is higher among those who did not. These findings suggest that customers who favor campaigns tend to make fewer discounted purchases, whereas customers who rejected campaigns tend to make multiple discounted purchases.

People and Places

Research Question 4.1

What is the user profile for web/catalog/store purchases?

The boxplot above illustrates the distribution of customers’ Income based on the place where they make purchases most frequently. We observe that customers who purchase most frequently by “catalog” tend to have the highest income, while customers who purchase most frequently in “store” or on the “web” have similar incomes.

We also created five mosaic plots to assess the relationship between the most frequent purchase place and the categorical user attributes (i.e. Education, Marital_Status, Kidhome, Teenhome, and Complain).

From the “marital status” plot, it appears that there are significantly more purchases using “catalog” for customers who are widows. From the “number of kids at home” plot, it appears that there are significantly more purchases using “catalog” for customers who have no kids at home, and significantly fewer purchases using “catalog” for customers who have 1 kid at home. From the “number of teens at home” plot, it appears that there are significantly more purchases using “catalog” for customers who have no teens at home and significantly more purchases using “web” for customers who have 1 teen at home. Additionally, there are significantly fewer purchases using “web” for customers who have no teens at home and significantly fewer purchases using “catalog” for customers who have 1 teen at home. Thus we conclude that marital status, number of kids at home, and number of teens at home are associated with the place where these customers purchase most frequently.
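One way to derive the "most frequent place" for each customer is to compare the three purchase counts and take the largest. The sketch below does this; how ties were handled in the analysis is not stated, so returning `None` for a tie is purely an assumption, as are the names used.

```python
# Purchase-count columns from the variable list, keyed by place.
PLACE_COLUMNS = {
    "web": "NumWebPurchases",
    "catalog": "NumCatalogPurchases",
    "store": "NumStorePurchases",
}

def most_frequent_place(customer):
    """Return the place with the most purchases, or None on a tie (assumed rule)."""
    counts = {place: customer[col] for place, col in PLACE_COLUMNS.items()}
    top = max(counts.values())
    winners = [place for place, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else None

# Toy customers (made up for illustration).
a = {"NumWebPurchases": 2, "NumCatalogPurchases": 7, "NumStorePurchases": 4}
b = {"NumWebPurchases": 5, "NumCatalogPurchases": 1, "NumStorePurchases": 5}
place_a = most_frequent_place(a)
place_b = most_frequent_place(b)
```

Here the first customer is labeled "catalog", while the second has a tie between web and store and is left unlabeled under the assumed rule.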

Meanwhile, the “education” and “complain” plots do not indicate any substantially large Pearson residuals, thereby not providing any evidence to reject the null hypothesis of independence between “education” and “place” and between “complain” and “place”, respectively.
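For reference, the Pearson residual of a cell in a contingency table is (observed − expected) / √expected, where the expected count under independence is (row total × column total) / grand total. The sketch below computes these residuals for a small table; the function name and the toy counts are made up for illustration.

```python
from math import sqrt

def pearson_residuals(table):
    """Pearson residuals of a two-way contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            out.append((observed - expected) / sqrt(expected))
        residuals.append(out)
    return residuals

# Toy tables (made-up counts): one with a strong association, one independent.
associated = pearson_residuals([[30, 10], [10, 30]])
independent = pearson_residuals([[10, 20], [20, 40]])
```

In the associated table every residual has magnitude √5 ≈ 2.24, while in the independent table all residuals are zero. The sum of the squared residuals equals the chi-squared statistic of the test of independence, and a common convention is to treat cells with an absolute residual above about 2 as notable.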

Research Question 4.2

Is the number of website visits in the past month indicative of a customer’s total web purchases?

The above graph suggests that the number of web visits in the past month is not indicative of the total number of web purchases. We observe that the low number of web visits in the past month (0, or 1) matches the highest number of web purchases (over 20), while the high number of web visits in the past month matches the lowest number of web purchases (below 5). The majority of data are in the range [0, 9], but there’s no clear pattern for the relationship between number of web visits in the past month and the total number of web purchases. However, we cannot strictly conclude that they have no relationship because NumWebPurchases means the total number of web purchases instead of the number of web purchases in the past month.

Conclusion

In our research, we investigated customer demographics, the types of products customers purchased, the relationship between marketing campaigns and number of purchases, and customers’ preferences for different shopping methods.

From our research, we found that income is the most important factor in customers’ purchase decisions. In particular, higher income is associated with more purchases of wine and meat products: customers with higher annual income tend to spend more on wine and meat.

For marketing promotions, we found that customers with higher incomes are more likely to accept product campaigns than those with lower incomes. In addition, customers who favor campaigns tend to make fewer discounted purchases, whereas customers who rejected campaigns tend to make multiple discounted purchases.

From the analysis of customers’ preferred shopping methods, we observed that customers who purchase most frequently by “catalog” tend to have the highest income, while customers who purchase most frequently in “store” or on the “web” have similar incomes. We also produced five mosaic plots assessing the relationship between the most frequent purchase place and categorical user attributes, including education, marital status, and number of kids at home.

In the last section, we analyzed whether website visits are indicative of online purchases. Our conclusions are limited by the available data, and we believe further research is needed to address this question.

With all these analyses of customers and their consumption preferences, we believe this report will be useful to the shopping mall in conducting future product campaigns. The shopping mall should adjust its current marketing strategies and reassess its target consumers based on customer demographics and preferences in order to increase revenue.

36315 Final Project

Yifan Wang, Yilin Wang, Lingjie Feng, Xuyang Wu

12/6/2021