About the Dataset

The dataset we are using contains customer personality analysis data from Kaggle (https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis). It contains 2,240 rows, one per customer, and 29 columns of variables characterizing those customers. We used four main groups of variables in our analyses: demographic features of a customer, amount of money spent on products, consumer behavior with promotions, and consumer purchasing behavior. All the variables in the original dataset are included below:

Demographic Features:

Product Spending:

Promotion Behavior:

Purchasing Behavior:

To explore the data further and make some of the analyses easier, we also created and recoded a few variables that were used in our later plots. The recoded variables are education and marital status, which were changed to create clearer categories. The variables that were created are:

One benefit of this dataset is that it makes it easy for businesses to better understand their customers, see which products might sell best depending on the characteristics of certain consumers, and even create products to target certain customers. This has many positive implications for creating new marketing strategies and analyzing which customers are most likely to buy a certain product. It could give companies a chance to bring in more business and earn a higher profit while also benefiting consumers, who would be offered more of the products that they like and want.

Overall, we were interested in exploring four main research questions with this dataset.

1: Do certain groups of consumers have different spending habits?

We were first interested in learning more about consumer spending behaviors in relation to products, specifically whether certain demographics bought more or less of a product. To explore this, we looked at the relationship between demographic factors (education, marital status, child status, age, and income) and the amount of money spent on different products (wine, meat, fruit, fish, sweets, and gold).

We can see from the boxplot above that there is a clear trend in how much people spend on fruit depending on the children in their household. People with no children spend the most, at around $35 on average, followed by households with teenagers only, then households with kids only, while households with both teenagers and kids spend the least. This partly makes sense, since people with no children have more money to spend on themselves. However, we might have expected households with children to spend more on fruit, not less, in order to give their children a healthy diet and the vitamins they need as they grow. Possible explanations are that households with children are busy and hectic, so parents choose groceries that are easier, cheaper, and longer lasting than fruit, and that children eat meals at school, so parents need to buy fewer groceries for home. Overall, this shows that households without children spend more money on products, which is relevant information for companies that want to target more ads towards them.
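The group comparison behind this boxplot can be sketched in a few lines of Python. This is only a minimal illustration and not the code used for our plots: the rows are made up, and the keys `Kidhome`, `Teenhome`, and `MntFruits` are assumed column names for the number of kids, the number of teenagers, and fruit spending.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; these are NOT values from the dataset.
# The keys are assumed column names.
customers = [
    {"Kidhome": 0, "Teenhome": 0, "MntFruits": 40},
    {"Kidhome": 0, "Teenhome": 0, "MntFruits": 30},
    {"Kidhome": 0, "Teenhome": 1, "MntFruits": 22},
    {"Kidhome": 0, "Teenhome": 2, "MntFruits": 18},
    {"Kidhome": 1, "Teenhome": 0, "MntFruits": 12},
    {"Kidhome": 1, "Teenhome": 1, "MntFruits": 6},
]


def child_status(row):
    """Collapse the two count columns into one four-level label."""
    has_kid = row["Kidhome"] > 0
    has_teen = row["Teenhome"] > 0
    if has_kid and has_teen:
        return "kids and teens"
    if has_teen:
        return "teens only"
    if has_kid:
        return "kids only"
    return "no children"


def mean_spend_by_group(rows, value_key):
    """Average the chosen spending column within each child-status group."""
    groups = defaultdict(list)
    for row in rows:
        groups[child_status(row)].append(row[value_key])
    return {label: mean(values) for label, values in groups.items()}


fruit_means = mean_spend_by_group(customers, "MntFruits")
```

Swapping `"MntFruits"` for another spending column would give the same comparison for a different product.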

Additionally, across many products the amount spent increases as income increases, which makes sense because people with higher incomes have more disposable income to spend on groceries. This is seen in the contour plot above, where many of the points are concentrated around high income and high spending on wine, which hints at a strong association between the two. However, the trend is weaker at lower amounts spent on wine because of the many outliers there. This suggests that other factors, such as preferences and special occasions, explain why people spend small amounts on wine, which would be useful for companies to know when they’re trying to sell more luxury items like wine to different income brackets.
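One simple way to put a number on an association like this is the Pearson correlation coefficient. The sketch below is a hedged illustration rather than part of our analysis: the income and wine-spending values are made up, and `pearson_r` is a helper name we introduce here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only; NOT taken from the dataset.
income = [20000, 35000, 50000, 65000, 80000]
wine_spend = [10, 150, 120, 480, 700]

r = pearson_r(income, wine_spend)
```

A value of `r` close to 1 would be consistent with the strong positive association seen in the contour plot, although a single coefficient cannot show where along the income range the trend weakens.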

Besides demographic features, we wanted to see if there were any trends and correlations among spending on different products, which we can see in the PCA biplot of the first two principal components above. Overall, spending on different products is either strongly positively correlated or has little to no correlation. This makes sense because someone who spends more on one product is generally able to spend more on other products, due to either the quantity they need or the price range they are shopping in. In the PCA biplot, spending on fish, fruits, and sweets is strongly positively correlated, and spending on wine and gold is strongly positively correlated. While more spending in one group could be slightly correlated with more spending in the other, the link is not strong, since purchases of luxury goods like wine and gold are based on preferences and special occasions while common groceries like fish, fruits, and sweets are always needed. This could be useful for companies, since they can categorize consumers based on what they buy and focus on advertising other products highly correlated with those purchases.
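To make the idea behind the biplot more concrete, the sketch below computes the first principal component of a small correlation matrix using power iteration. It is an illustration under stated assumptions, not the procedure used for our plot: the spending values are made up, only three products are included, and power iteration stands in for whatever PCA routine a statistics package would normally use.

```python
from math import sqrt


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    m = sum(column) / n
    sd = sqrt(sum((x - m) ** 2 for x in column) / (n - 1))
    return [(x - m) / sd for x in column]


def correlation_matrix(columns):
    """Correlation matrix of a list of equally long columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix))
    return v, eigenvalue


# Made-up spending values for illustration only; NOT taken from the dataset.
fish = [10, 20, 30, 40, 50]
fruit = [12, 18, 33, 41, 48]
sweets = [8, 25, 28, 38, 55]

corr = correlation_matrix([fish, fruit, sweets])
loadings, eigenvalue = first_component(corr)
explained_share = eigenvalue / len(corr)  # share of total variance on PC1
```

When all three loadings have the same sign and similar size, the corresponding arrows in a biplot point in nearly the same direction, which is how strongly positively correlated products appear.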

2: Is there a difference between the types of purchases consumers make?

After exploring spending habits across demographics, we were interested in how the ways consumers purchased products differed. Specifically, we wanted to discover whether there were differences in the number of purchases made using a catalog, directly in store, through the company’s website, or using a discount.

Based on the barplot above comparing the frequencies of the different types of purchases, store purchases were the most common, with an average of approximately 6 in-store purchases per consumer. This was followed by almost 4 web purchases and roughly 2.5 catalog purchases per consumer on average. The average number of purchases made using a discount was also around 2.5 per consumer. However, because the error bars on the graph overlap, these differences between purchase types may not be statistically significant. Nevertheless, this comparison helps us better understand consumer behavior, as it suggests that store purchases are the most frequent and should be considered in future marketing decisions.
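The bars and error bars in a plot like this can be summarized as a mean with an approximate 95% interval per purchase type. The following is a minimal sketch using made-up purchase counts; the helper names and the normal-approximation multiplier of 1.96 are our assumptions, and the error bars in our plot may have been defined differently.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using mean +/- z * standard error."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))
    return m, m - z * se, m + z * se


def intervals_overlap(a, b):
    """True when two (mean, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]


# Made-up purchase counts per customer for illustration only.
purchases = {
    "store": [6, 5, 7, 8, 4, 6],
    "catalog": [2, 3, 2, 4, 1, 3],
    "deals": [2, 3, 3, 2, 2, 3],
}

summary = {channel: mean_with_ci(counts) for channel, counts in purchases.items()}

# Overlap is only a rough visual check, not a formal significance test.
store_vs_catalog = intervals_overlap(summary["store"], summary["catalog"])
catalog_vs_deals = intervals_overlap(summary["catalog"], summary["deals"])
```

Checking overlap this way mirrors reading the error bars by eye; a formal test would be needed to claim that two purchase types truly differ.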

To further investigate the differences between the types of purchases, we used PCA to see the relationships among the variables. In the PCA biplot, the arrows for the number of store purchases and the number of catalog purchases both point towards the right, signaling that customers with a high first principal component tend to have higher values of these variables. However, the more relevant observation is that the arrows for store and catalog purchases are almost perfectly orthogonal to the arrow for deals purchases, which indicates little correlation between them. Since one of our key questions is whether the types of purchases differ, this PCA biplot suggests that deal usage is largely unrelated to how often customers shop in store or via a catalog, so frequent store and catalog shoppers may not be using deals any more often than other customers.

3: How do consumer behaviors change over time with aspects like joining the program and frequency of shopping?

In addition to exploring spending habits across different demographics and differences in purchasing behavior, we also wanted to see if there were any noticeable patterns or behaviors over time in terms of how frequently customers were shopping or how many customers were joining the program for their respective company that they shop through.

The graph below shows time series data spanning from August 2012 to July 2014. The y-axis on the top graph shows the number of days since the customer last made a purchase, while the y-axis on the bottom graph shows the total number of customers that joined in each month of our time frame. This is informative because it relates the dates customers joined the program to both the number of customers joining and the days since their last purchase. It gives a picture of any seasonality in when customers join as well as information on frequency of shopping.

We can see slight patterns in our graph in terms of the days since last purchase. For example, we see dips in the number of days since last purchase around September, December, February, and July, which seem to be rather consistent across years. This could possibly be explained by the expected increase in consumer spending around popular events such as Back to School shopping, Christmas, Valentine’s Day, and the Fourth of July (plus other summer vacationing), which fall in September, December, February, and July respectively.

When looking at the number of customers that enroll with the company over time, we can see slight peaks scattered throughout the months, but no specific pattern that is consistent across the years. In general, an average of about three new customers enroll every day.
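The monthly aggregation behind the two panels can be sketched as follows. This is a hedged illustration only: the records are made up, each record is assumed to hold a customer's enrollment date and their days since last purchase, and we assume the top panel averages that recency value by enrollment month.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up (enrollment date, days since last purchase) pairs; NOT real data.
records = [
    (date(2012, 8, 4), 58),
    (date(2012, 8, 21), 38),
    (date(2012, 9, 9), 26),
    (date(2013, 12, 1), 10),
    (date(2013, 12, 15), 30),
    (date(2014, 7, 2), 94),
]


def monthly_summary(rows):
    """Count enrollments and average recency for each (year, month)."""
    by_month = defaultdict(list)
    for enrolled, recency in rows:
        by_month[(enrolled.year, enrolled.month)].append(recency)
    return {
        month: {"joined": len(values), "mean_recency": mean(values)}
        for month, values in sorted(by_month.items())
    }


summary = monthly_summary(records)
```

Plotting `joined` and `mean_recency` against the month keys would give two time series comparable in spirit to the panels described above.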

4: Do different groups of customers vary in the complaints they have on products?

One interesting variable in the dataset is “complain”, which states whether or not a customer has complained about a product within the last two years. We use mosaic plots to see whether there are significantly more complaints than we would expect under the null hypothesis of independence between complaints and several demographic variables.

In the plot showing complaints broken down by the generation that the customer belongs to, we find that there are significantly more complaints among people born before 1901, leading us to conclude that complaints and generation are not independent of each other for the pre-1901 generations.

The other plots, which show complaints broken down by marital status, children, and education, do not show any large Pearson residuals, so they provide no evidence to reject the null hypothesis of independence between these variables and complaints for this dataset.
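The Pearson residuals that drive the shading in a mosaic plot compare each observed count with the count expected under independence. The sketch below shows the calculation on a made-up two-way table; the counts and group labels are hypothetical, and the threshold of 2 is a common rule of thumb rather than the exact setting of our plots.

```python
from math import sqrt


def pearson_residuals(table):
    """Return (residuals, chi-square statistic) for a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    chi_square = 0.0
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / sqrt(expected)
            residual_row.append(r)
            chi_square += r * r
        residuals.append(residual_row)
    return residuals, chi_square


# Hypothetical counts: rows are two generation groups,
# columns are (no complaint, complaint). NOT taken from the dataset.
table = [
    [1, 2],     # a very small group with relatively many complaints
    [980, 17],  # everyone else
]

residuals, chi_square = pearson_residuals(table)

# Rule of thumb: |residual| > 2 marks a cell that departs from independence.
# Very small expected counts make such residuals unstable.
flagged = [(i, j) for i, row in enumerate(residuals)
           for j, r in enumerate(row) if abs(r) > 2]
```

In this made-up table only the complaint cell of the small group is flagged, which is the kind of pattern a mosaic plot highlights with shading.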

Conclusions:

Overall, our first three visualizations show some common trends and correlations between a person’s demographic and spending characteristics and the amount they spend on different products: households without children spend more, people with higher income spend more, and spending on certain products is highly correlated. We can use this information to implement better marketing strategies by targeting ads, deals, and promotions at the groups of consumers who care most about particular products, which will help companies make more sales and help consumers buy the products they want more easily and cheaply.

The following two visualizations indicate that there is a difference between the types of purchases consumers make, in that in-store and web purchases are more common than catalog purchases. We also learned that, surprisingly, consumers may not be using discounts in stores very often, which suggests that companies might want to focus on discount marketing and promote discounts more, especially in stores, to generate higher sales and more purchases.

The next visualization shows slight seasonality in how frequently consumers shopped depending on the month of the year. We found that certain months commonly associated with holidays and events (August/September, December, February, and July specifically) were associated with peaks in the frequency of shopping. For the number of customers that joined the program within the last two years, we did not find any distinct pattern in when in the year they were more likely to join. These results can inform companies about when they should stock more products to prepare for higher sales, and about which months, when consumers are less likely to make purchases, they should push ad campaigns.

The last visualization we created was a set of mosaic plots of whether the customer had any complaints about products within the last two years, split by demographics such as education, marital status, and the generation in which they were born. We found no significant results except that there are significantly more complaints among people born before 1901, meaning complaints and generation are not independent of each other for the pre-1901 generations. It is unclear whether the birth years before 1901 are data entry mistakes, since it is unlikely that a respondent would be over a hundred years old. These results can help companies better prepare their customer service departments to handle larger numbers of complaints from specific groups of customers.

Despite these findings, we acknowledge that our visualizations and research are not fully robust. Our dataset was limited to information about a few customer demographics, the amount spent on different products, behaviors with promotions, and the mode by which customers buy products. Information that would be important to collect and analyze in the future includes the amount of time spent in stores, the location of the stores, and the employment status of customers. Additionally, our research only looks at the relationships between the variables visually to find trends. It would be beneficial to build on our findings with a regression analysis to predict consumer behavior and spending trends. Specifically, we could use linear regression to predict the amount that consumers would spend on products, and logistic regression to predict whether certain marketing and promotional techniques succeeded or failed in their goals. These analyses would give stores more detail, which could better inform advertisement decisions and store layouts. Overall, while our findings could be improved in the future, our research still highlights major trends in consumer behavior that would be important for companies to take into consideration.