Dataset Description

The dataset we chose is a customer personality analysis dataset, which records a range of attributes for each of a store's customers. Such data lets stores better understand who is buying their products and when, so that they can adapt to those patterns, retain customers, and make more sales.

The data include demographic variables such as birth year, income, marital status, and education, along with the number of children and teenagers in each household. A second category of variables records the amount spent on different product types. A third records how many purchases a customer made with a discount and which promotional campaigns they accepted. The last category breaks down the number of purchases by channel: online, in store, or through the catalog.

The dataset offers many candidate predictor variables, which we grouped into overarching themes.

One grouping was based on demographic information¹:

ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise

We also had access to amounts of different goods customers purchased:

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

And information on how customers utilized store deals:

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Finally, we knew where customers made their purchases:

NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month

Research Questions

Can we predict someone’s income given their purchasing habits?
1. Faceted boxplots
2. MDS plot
How does the number of kids/teenagers in a household affect how that household shops?
1. Scatter plot
2. Dendrogram
3. PCA
What are the differences in the demographic groups of customers based on where they make their purchases?
1. EDA density curve

Graphs

Research Q1 - Income and Purchasing Habits

For our first research question, we wanted to examine the relationship between income and customer purchasing habits more broadly. We used multidimensional scaling to reduce all of the quantitative predictors in our dataset to two dimensions. The color of each point represents Income.
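As a rough illustration of the technique, classical (metric) MDS can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code behind our plot: it assumes Euclidean distances on predictors that have already been put on comparable scales, it finds the top two eigenvectors by power iteration, and the function names and the four toy points are hypothetical.

```python
import random
from math import dist, sqrt


def double_center(points):
    """Convert squared Euclidean distances into the centered inner-product matrix."""
    n = len(points)
    sq = [[dist(p, q) ** 2 for q in points] for p in points]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # sq is symmetric, so the row means double as column means.
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv)), v


def classical_mds_2d(points):
    """Return one (mds1, mds2) coordinate pair per input point."""
    b = double_center(points)
    n = len(b)
    l1, v1 = top_eigenpair(b)
    # Deflate: remove the first component so the next pass finds the second.
    deflated = [[b[i][j] - l1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
    l2, v2 = top_eigenpair(deflated)
    s1, s2 = sqrt(max(l1, 0.0)), sqrt(max(l2, 0.0))
    return [(s1 * a, s2 * c) for a, c in zip(v1, v2)]


# Hypothetical toy points (corners of a 4-by-3 rectangle), not customer data.
toy_points = [(0, 0), (4, 0), (0, 3), (4, 3)]
embedding = classical_mds_2d(toy_points)
```

Because the toy points already lie in two dimensions, the embedding reproduces their pairwise distances up to rotation and reflection. With many predictors, two dimensions can only approximate those distances.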

It seems that customers with incomes below $25,000 display very similar purchasing behaviors. There also seems to be a broader cluster for customers with incomes over $75,000.

Specifically, we examined how income relates to the number of discounted purchases and to whether customers accepted the first campaign offer.

The plot shows that, for customers who accepted the first promotional campaign, income does not seem to be related to the number of discounted purchases made: while the medians for 0 and 15 discounted purchases differ, their spreads overlap. Among customers who did not accept the offer, however, those who made more discounted purchases tended to have lower incomes. The plot therefore suggests that the relationship between income and discounted purchases depends on whether the customer accepted the promotional campaign.
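The medians compared in the faceted boxplots can also be tabulated directly. The following is a minimal Python sketch, not the code behind the plot: the function name `median_income_by_group` and the toy records are hypothetical, only the column names come from the dataset description, and skipping rows with a missing Income is an assumption.

```python
from collections import defaultdict
from statistics import median


def median_income_by_group(customers):
    """Median Income for each (AcceptedCmp1, NumDealsPurchases) combination."""
    incomes = defaultdict(list)
    for row in customers:
        # Assumption: rows with a missing Income are skipped rather than imputed.
        if row["Income"] is None:
            continue
        key = (row["AcceptedCmp1"], row["NumDealsPurchases"])
        incomes[key].append(row["Income"])
    return {group: median(values) for group, values in incomes.items()}


# Hypothetical toy records, not taken from the dataset.
toy_customers = [
    {"Income": 30000, "AcceptedCmp1": 0, "NumDealsPurchases": 4},
    {"Income": 40000, "AcceptedCmp1": 0, "NumDealsPurchases": 4},
    {"Income": 70000, "AcceptedCmp1": 0, "NumDealsPurchases": 1},
    {"Income": 80000, "AcceptedCmp1": 1, "NumDealsPurchases": 1},
    {"Income": None, "AcceptedCmp1": 1, "NumDealsPurchases": 1},
]
medians = median_income_by_group(toy_customers)
```

Medians alone hide the overlap in spread that the boxplots show, so a table like this complements the plot rather than replacing it.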

Research Q2 - Kids and Household Shopping

For question 2, we take a closer look at how customers with different numbers of children shop: specifically, where they shop the most, how recently they made a purchase, and whether they bought products with discounts.

We ran a principal component analysis on Recency, the number of purchases made with discounts, web purchases, web visits, catalog purchases, and store purchases. We then plotted the first two principal components and colored each point by the number of kids the customer has at home.
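For readers unfamiliar with the method, the projection onto the first two principal components can be sketched as follows. This is a minimal Python sketch under stated assumptions, not our actual analysis code: it assumes the variables are standardized before the decomposition (so the correlation matrix is used), it assumes no column is constant, it extracts the components by power iteration, and the function names and toy rows are hypothetical.

```python
import random
from math import sqrt


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z_rows):
    """Correlation matrix of already standardized rows."""
    n, p = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    p = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v


def pca_first_two(rows):
    """Scores on the first two components and their shares of total variance."""
    z = standardize(rows)
    c = correlation_matrix(z)
    p = len(c)
    var1, axis1 = top_eigenpair(c)
    # Deflate: remove the first component so the next pass finds the second.
    deflated = [[c[i][j] - var1 * axis1[i] * axis1[j] for j in range(p)]
                for i in range(p)]
    var2, axis2 = top_eigenpair(deflated)
    scores = [(sum(a * x for a, x in zip(axis1, row)),
               sum(a * x for a, x in zip(axis2, row))) for row in z]
    return scores, (var1 / p, var2 / p)


# Hypothetical toy rows with three made-up columns, not customer data.
toy_rows = [(1, 1, 1), (2, 2, -1), (3, 3, -1), (4, 4, 1)]
scores, shares = pca_first_two(toy_rows)
```

In the toy rows the first two columns are identical and the third is uncorrelated with them, so the first component carries two thirds of the variance and the second carries the remaining third.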

The red zeroes dominate the left side of the graph, the green ones are more common on the right side, and zeroes, ones, and twos are mixed in the middle.

This suggests that people with no children at home tend to make a similar number of purchases across the different locations (web, store, catalog), as well as a similar number of purchases with discounts. The green cluster suggests that customers with 1 child shop at the same locations and make roughly the same number of purchases. Customers with 2 children are more spread out, so their shopping habits are less alike than those of customers with 0 or 1 children.

Overall, there seems to be more variation in how customers with no children shop, although a good number of them do behave similarly. Customers with 1 child seem to vary a little less, possibly because providing for a child requires certain purchases.

Next, we look at two indicative variables, MntWines and NumDealsPurchases, both in their separate distributions and in their relationship, to gain a better picture of how these distributions and relationships might vary with the number of children in a household.

We see that consumers with no kids tend to spend more on wine, with a larger spread in the amount spent, while they tend to make fewer purchases with a discount: their distribution along the y-axis is centered around lower values. There is also little to no correlation between wine expenditure and discounted purchases for households with no children. Families with 1 or 2 kids, on the other hand, spend less on wine and make more purchases with a discount, as shown by the narrower, left-centered spread of blue and green values along the x-axis and their higher values along the y-axis. For these families we also see a strong positive relationship between wine expenditure and discounted purchases, which suggests they spend more on wine when discounts are available.

Additionally, we check whether households with the same number of children are inherently similar to each other. To do so, we use hierarchical clustering with Euclidean distance to see how customers group in this high-dimensional space.

With both complete linkage and average linkage, customers with at least one kid are more similar to each other in their quantitative shopping habits. In particular, in terms of how much of each product they buy, shoppers with children have relatively small Euclidean distances from each other. Distances among shoppers with no kids are more variable. This makes sense: having children plausibly gives households common needs, whereas households without children have no comparable reason to shop alike. Although there are fewer data points for shoppers with 2 children, their black labels are spread throughout the blue cluster of labels, so there might not be a big difference in shopping habits between households with 1 child and households with 2 children.
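The difference between the two linkage rules can be made concrete with a small sketch. The Python code below is a naive agglomerative procedure written for illustration only, not the routine used to draw our dendrograms: the function names and toy points are hypothetical, and the cubic-time search is only practical for small inputs.

```python
from math import dist


def linkage_distance(a, b, points, method):
    """Distance between clusters a and b (lists of point indices)."""
    pairwise = [dist(points[i], points[j]) for i in a for j in b]
    if method == "complete":
        return max(pairwise)  # farthest pair of members
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="complete"):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y], points, method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical toy points forming two tight pairs, not customer data.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6)]
complete_merges = agglomerate(toy_points, "complete")
average_merges = agglomerate(toy_points, "average")
```

Each entry of the merge history holds the two clusters that were joined and the height at which they merged, which is the information a dendrogram displays. Complete linkage joins the two pairs at the distance of their farthest members, while average linkage joins them at a lower height.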

Research Q3 - Purchase Location and Customer Demographics

First, we look at density plots for the number of store purchases and number of web purchases, faceted by whether the customer was younger or older than the median age.

We suspected that age would be one of the main factors in predicting whether a customer shopped more online or in person, presuming that younger customers would take more advantage of online shopping while older customers would prefer shopping in person. However, the plots are mostly inconclusive: they indicate that younger customers purchase more products overall, but show little difference or association between age and the likelihood of shopping online or in store.
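The median split used for faceting can be sketched as follows. This is a minimal Python illustration with hypothetical names and toy values. It works directly on Year_Birth, since a later birth year means a younger customer, and it assumes that customers exactly at the median are placed in the older group.

```python
from statistics import median


def age_group_labels(birth_years):
    """Label each customer as younger or older than the median age."""
    cutoff = median(birth_years)
    # Born after the median birth year means younger than the median age.
    # Assumption: ties at the median are labeled "older".
    return ["younger" if year > cutoff else "older" for year in birth_years]


# Hypothetical toy birth years, not taken from the dataset.
toy_birth_years = [1950, 1960, 1970, 1980]
labels = age_group_labels(toy_birth_years)
```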

To try to incorporate more variables into our analysis, we created an MDS plot that factors in most of the quantitative data in the dataset.

These variables were income, year of birth, and the amounts spent on wine, fruits, meat, fish, sweets, and gold products in the previous two years. We created two MDS variables, mds1 and mds2. Customers are plotted by their values of mds1 and mds2 and colored by the location where they made the most purchases (or “equal” if there was a tie for the top location). As it turns out, there are no noticeable clusters or trends in the data.
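The location label used for coloring can be sketched as a small rule. The Python function below is a hypothetical illustration rather than our original code. It assumes the three counts are NumWebPurchases, NumCatalogPurchases, and NumStorePurchases.

```python
def dominant_location(web, catalog, store):
    """Location with the most purchases, or "equal" if the top count is tied."""
    counts = {"web": web, "catalog": catalog, "store": store}
    top = max(counts.values())
    leaders = [name for name, count in counts.items() if count == top]
    return leaders[0] if len(leaders) == 1 else "equal"


# Hypothetical toy counts, not taken from the dataset.
example_labels = [
    dominant_location(web=3, catalog=1, store=2),
    dominant_location(web=2, catalog=2, store=1),
    dominant_location(web=0, catalog=1, store=5),
]
```

A tie only matters when it involves the largest count, so a customer with counts of 1, 1, and 5 is still labeled "store".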

Finally, we used a multiple correspondence analysis (MCA) package (FactoMineR) to incorporate the categorical data (education, number of kids at home, and whether someone is older or younger than the median age).

MCA is “a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative” (Abdi and Williams 2010). Much like in PCA, each MCA dimension accounts for a certain percentage of the variance in the data. We plotted the first two dimensions, where each point represents many customers (because the data are categorical and there are only so many combinations of factor levels, dozens of rows share exactly the same dimension values). The first two dimensions account for the most variance, ~34% together. The points are colored by the most prevalent location among the rows represented by that point. Confidence intervals for the mean dimension values of each location group are displayed as circles around four of the points. Once again, we can draw no real conclusion from this graph, as the means are all very close to each other and there are no clusters.

Conclusion

For our first research question, we found that lower-income customers (incomes below $25,000) exhibited similar purchasing behavior, and that higher-income customers (incomes above $75,000) formed a looser cluster. Specifically, among customers who did not accept the first campaign offer, those with higher incomes tended to make fewer discounted purchases, while income showed no clear relationship with discounted purchases among customers who accepted it.

From our second research question, we found that households without children spend more on wine and make fewer purchases with discounts. In particular, households with children shop more similarly to one another than households without children do, both in what they buy and in how they purchase these goods.

For the relationship between different customers and where they shop (research question 3), the techniques we employed (density plots, MDS, and MCA) do not reveal a particular trend. There seems to be considerable randomness in where people buy products, perhaps because people simply choose whichever location is easiest to access at that moment, which may change from time to time. We do not have a definitive predictor of where people shop.

¹ Kaggle Dataset: https://www.kaggle.com/imakash3011/customer-personality-analysis?select=marketing_campaign.csv

Customer Personality Analysis

Janice Lee, Jonathan Huang, Meera Ray, Philip Kaufholz

36-315 Final Project