Introduction

Credit card customer behavior plays a critical role in shaping banking strategies and product offerings. Understanding why customers select specific card categories, the financial differences between genders, and the factors influencing customer attrition can provide valuable insights for improving customer satisfaction and retention.

Data Description

This project uses a Kaggle dataset of 10,000 customers with 20 features encompassing demographic, behavioral, and financial variables. The rows in the dataset represent individual customers, while the columns (18 variables) include a variety of features that can be grouped into three main categories:

Demographic Variables:

Customer_Age, Gender, Education_Level, Marital_Status, Income_Category (the annual income category of the customer, e.g. <40K, 40K-60K, 60K-120K, >120K).

Behavioral and Engagement Variables:

Months_on_book: The number of months the customer has maintained a relationship with the bank
Total_Relationship_Count: The total number of financial products the customer holds with the bank
Months_Inactive_12_mon: The number of months the customer was inactive in the last 12 months
Contacts_Count_12_mon: The number of contacts made with the customer in the last 12 months

Financial Variables:

Credit_Limit: The maximum amount of credit the customer can access
Total_Trans_Amt: The total transaction amount in the last 12 months
Total_Trans_Ct: The total transaction count in the last 12 months
Avg_Utilization_Ratio: The average utilization ratio of the credit card

Target Variable:

Attrition_Flag: Indicates whether the account is active or has been closed.

Research Questions

This dataset provides a robust foundation for analyzing customer behavior and predicting outcomes such as credit card preferences, financial differences between genders, and customer attrition. Through comprehensive exploratory data analysis, statistical modeling, and visualization, this report aims to uncover patterns and relationships that address three key questions:

  1. What factors influence the card category a person gets?
  2. How do male and female customers differ in their financial behaviors?
  3. What are the key factors influencing customer attrition?

By analyzing these aspects, we aim to provide actionable insights that banks can use to better cater to their customers and reduce attrition rates.

Question 1: What are the important factors that influence the card category that a person gets?

This question aims to identify the key variables that determine the type of credit card a person is assigned. Understanding these factors can provide valuable insights for both financial institutions and consumers.

To start, we look at age and gender to see whether these basic demographic variables are related to the card category that the bank's customers hold.

Barplot with Gender

From the bar plot, we see that females hold more Blue cards than males. For the other card categories, males seem to hold a larger number of cards.

The fact that more females have Blue cards might suggest that women are more likely than men to hold a Blue card. The Blue card is often an entry-level or basic card with fewer requirements. This could imply that women, on average, are more likely to apply for or be assigned a basic card.

Males holding more cards in the other categories (Silver, Gold, Platinum) could imply that men in this dataset generally have higher income, higher credit scores, or a higher propensity for spending, which makes them more eligible for these types of cards. Thus, gender could be an important factor in determining the card category.
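Raw counts can mislead when the two gender groups differ in size, so it can help to compare within-gender shares as well. The following is a minimal sketch using the Python standard library; the records are hypothetical toy values, not rows from the dataset, and `card_share_by_gender` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical toy records (gender, card category); NOT taken from the dataset.
records = [
    ("F", "Blue"), ("F", "Blue"), ("F", "Blue"), ("F", "Silver"),
    ("M", "Blue"), ("M", "Blue"), ("M", "Silver"), ("M", "Gold"),
]

def card_share_by_gender(rows):
    """Return {gender: {card: share of that gender's customers}}."""
    pair_counts = Counter(rows)
    gender_totals = Counter(gender for gender, _ in rows)
    shares = {}
    for (gender, card), n in pair_counts.items():
        shares.setdefault(gender, {})[card] = n / gender_totals[gender]
    return shares

shares = card_share_by_gender(records)
for gender in sorted(shares):
    # Print the share of each card category within one gender.
    print(gender, {card: round(s, 2) for card, s in sorted(shares[gender].items())})
```

Comparing shares rather than counts answers the question "given a customer's gender, how likely is each card category?", which is closer to the claim being made in the text.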

Density plot on Age

Blue and Silver cardholders are most concentrated in the 40s to early 50s, with the highest density near ages 40–45. This suggests that Blue and Silver cardholders are most likely to fall in this age group.

The Gold and Platinum cardholders are shifted towards slightly older customers, peaking around the 45–50 age range. This suggests that higher card categories (Gold, Platinum) might attract older customers than the Blue and Silver categories.

Looking at the overall distribution, all categories show similar general trends, but Gold cardholders have a broader age range, with densities extending more evenly across ages 30–65. This could mean Gold cardholders come from more varied age groups. Platinum cardholders show a relatively tight distribution around 40–50, indicating that they are more likely to be within a specific age range. Blue and Silver cards have a sharper peak, suggesting that customers in these categories are less age-diverse than Gold cardholders. Overall, the density plot indicates that Customer_Age varies somewhat across card categories.
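For readers unfamiliar with how a density curve is produced, the sketch below implements a basic Gaussian kernel density estimate. The ages and the bandwidth of 3.0 are hypothetical choices made for illustration; the report's plotting software would normally pick its own bandwidth.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical ages for one card category; NOT taken from the dataset.
ages = [38, 41, 43, 44, 45, 46, 47, 49, 52, 57]
density = gaussian_kde(ages, bandwidth=3.0)

# Evaluate on an integer age grid and find where the curve peaks.
grid = range(25, 76)
peak_age = max(grid, key=density)
print("estimated peak age:", peak_age)
```

A larger bandwidth gives a smoother, flatter curve, so apparent differences in "sharpness" between card categories depend partly on this setting and on group size.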

Boxplot on Months on book by Income category vs Card category

Furthermore, we hypothesize that months on book (Months_on_book) and income category could be related to the card category as well. We can construct a boxplot to take a closer look.

The box plot shows the distribution of months on book for each combination of income category and card category. The horizontal axis represents income categories, ranging from less than $40K to more than $120K, with an additional category for unknown income. The vertical axis shows the duration in months on book. Each income category contains color-coded box plots for the four card categories: Blue, Gold, Platinum, and Silver. The box plots display the median, interquartile range (IQR), and potential outliers for each combination.

From the plot, we observe that higher income groups, particularly those earning $120K+, tend to have more consistent and longer durations of cardholding, especially for Platinum and Silver cards. This is indicated by the narrower IQRs and higher medians around 35-40 months. Mid-income groups ($40K-$120K) also exhibit consistent durations, with Platinum cardholders having slightly longer durations (medians around 38-42 months). Lower income groups (less than $40K and Unknown) show greater variability and shorter durations, with wider IQRs and more outliers. Blue cardholders across all income categories show a broad range of durations, with medians in the mid-30s to 40s.

Comparing the medians and IQRs suggests that income level is related to cardholding behavior, with higher-income clients maintaining their cards longer, particularly premium cards like Platinum and Silver. Because this comparison is visual only, it should be read as a pattern rather than a tested effect.
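The quantities read off a box plot (median and IQR per group) can also be computed directly. Below is a minimal sketch with the standard library; the rows, the category labels, and the `box_stats` helper are hypothetical and only illustrate the calculation.

```python
import statistics
from collections import defaultdict

# Hypothetical rows (income category, card category, months on book);
# NOT taken from the dataset.
rows = [
    ("$120K +", "Platinum", 36), ("$120K +", "Platinum", 37),
    ("$120K +", "Platinum", 38), ("$120K +", "Platinum", 39),
    ("$120K +", "Platinum", 40),
    ("Less than $40K", "Blue", 20), ("Less than $40K", "Blue", 30),
    ("Less than $40K", "Blue", 36), ("Less than $40K", "Blue", 42),
    ("Less than $40K", "Blue", 52),
]

def box_stats(values):
    """Quartiles and IQR, i.e. the numbers that define the box in a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

groups = defaultdict(list)
for income, card, months in rows:
    groups[(income, card)].append(months)

stats_by_group = {key: box_stats(vals) for key, vals in groups.items()}
for key, s in sorted(stats_by_group.items()):
    print(key, s)
```

A narrower IQR for one group, as in the toy Platinum rows here, is what the text describes as a "more consistent" duration.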

Random Forest Model

The random forest model provides insights into the factors influencing the Card_Category by evaluating feature importance using two metrics: Mean Decrease Accuracy and Mean Decrease Gini. Both metrics allow us to identify the variables most critical in predicting a person’s card category.

The Mean Decrease Accuracy plot shows how much the model's accuracy drops when the values of a feature are randomly permuted. The feature Credit_Limit emerged as the most important predictor, with its permutation causing the largest reduction in model accuracy. This result indicates that a person's credit limit is a key determinant of the card category they are assigned to, reflecting the role of financial creditworthiness in such classifications.
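The idea behind an accuracy-based importance score can be illustrated with a simplified permutation test. The sketch below is not the random forest procedure itself (which works per tree on out-of-bag samples); it uses a hypothetical threshold rule as the model, toy rows, and a pooled "Premium" label, all of which are assumptions made for illustration.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, repeats=50, seed=0):
    """Mean drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats

# Hypothetical toy data; NOT taken from the dataset.
rows = [
    {"Credit_Limit": 2500, "Customer_Age": 44},
    {"Credit_Limit": 3000, "Customer_Age": 51},
    {"Credit_Limit": 4200, "Customer_Age": 39},
    {"Credit_Limit": 5000, "Customer_Age": 47},
    {"Credit_Limit": 22000, "Customer_Age": 45},
    {"Credit_Limit": 27000, "Customer_Age": 50},
    {"Credit_Limit": 30000, "Customer_Age": 42},
    {"Credit_Limit": 34000, "Customer_Age": 48},
]
labels = ["Blue"] * 4 + ["Premium"] * 4

def toy_model(row):
    # Hypothetical stand-in for a fitted classifier: a single threshold rule.
    return "Premium" if row["Credit_Limit"] > 15000 else "Blue"

importance = {
    f: permutation_importance(toy_model, rows, labels, f)
    for f in ("Credit_Limit", "Customer_Age")
}
print(importance)
```

A feature the model never relies on (Customer_Age in this toy rule) shows no accuracy drop, which mirrors how low-importance features appear near zero in the plot.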

Following Credit_Limit, Income_Category was the second most influential feature. This shows that an individual's income bracket provides additional, valuable information for distinguishing between card categories, further emphasizing the importance of financial standing. Total_Trans_Amt (the total transaction amount) showed moderate importance, suggesting that spending behavior contributes to the prediction but is not as impactful as the credit limit. Other features, such as Gender, Customer_Age, Months_on_book, and Total_Revolving_Bal, show minimal influence.

The Mean Decrease Gini plot focuses on how well each feature contributes to splitting the data into distinct categories (in this case, Blue, Silver, Gold, and Platinum). As with the accuracy metric, Credit_Limit had the highest importance, showing that it effectively separates card categories by reducing node impurity. Income_Category again emerged as the second most important variable, confirming its consistent role in improving model classification. Transactional variables such as Total_Trans_Amt and Total_Trans_Ct (the count of transactions) showed moderate importance. Features like Customer_Age, Months_on_book, Total_Revolving_Bal, and Gender were less impactful in this metric as well, suggesting that these factors do not provide meaningful splits in the trees and have limited predictive power.
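To make "reducing node impurity" concrete, the sketch below computes the Gini impurity of a node and the decrease obtained from a single split. The labels and the two candidate splits are hypothetical toy values chosen so the arithmetic is easy to follow.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical node with 10 customers; NOT taken from the dataset.
parent = ["Blue"] * 6 + ["Silver"] * 2 + ["Gold"] * 2

# Split A (e.g. on a credit-limit threshold) isolates the Blue cards.
split_a = gini_decrease(parent, ["Blue"] * 6, ["Silver"] * 2 + ["Gold"] * 2)

# Split B (e.g. on gender) leaves both children with the parent's class mix.
half = ["Blue"] * 3 + ["Silver"] + ["Gold"]
split_b = gini_decrease(parent, half, half)

print(round(split_a, 3), round(split_b, 3))
```

Mean Decrease Gini accumulates decreases like `split_a` over every split that uses a feature across all trees, so a feature that repeatedly produces purer child nodes ranks higher.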

Overall Summary

The analysis identifies several key factors that influence the card category a person receives. Credit limit stands out as the most important predictor in the random forest model, followed by income category, with total transaction amount and count showing moderate importance; higher credit limits, higher income brackets, and increased spending are associated with higher-tier cards. The boxplot results are consistent with this, as higher-income customers show longer and more consistent tenure, particularly with premium cards. In terms of age, Gold and Platinum cardholders peak at slightly older ages (around 45–50) than Blue and Silver cardholders (around 40–45). The gender analysis shows that females hold more Blue cards, while males hold more of the higher-tier cards (Silver, Gold, and Platinum), although gender and age had minimal importance in the random forest model. Overall, credit limit, income, and transaction history are the primary factors influencing card category, with age and gender playing secondary roles.

Question 2: How do males and females differ in their financial behavior, or are they not statistically different?

Whether behavior differs between genders is a common question across disciplines, from physical attributes and psychology to text and sentiment analysis. Here we explore whether male and female customers differ in their banking and financial behaviors and, if so, what those differences are. This analysis can give us a better understanding of customer behavior and might help banks attract and retain customers of different genders with different strategies.

Bar chart of Months inactive (past 12 months) by gender

Before diving into the potential differences between genders in financial measures, we visualize the gender distribution by the number of months the customer has been inactive in the past 12 months. As we investigate more thoroughly later, banks want to retain customers and keep them active as long as possible, so learning about the gender distribution might help banks tailor retention methods to customers with different lengths of inactivity.

The bar chart shows the proportion of male and female customers at each length of inactivity, from 0 to 6 months. For 0 months of inactivity, which we take to mean the customer is active, the female proportion is higher than the male proportion, which could suggest that females are more likely to be active in credit card usage. For 1, 2, and 3 months of inactivity, the proportions of male customers are higher than those of female customers, while the proportions of female customers inactive for 4, 5, and 6 months are higher than those of male customers. Therefore, female customers appear more likely to be at the two ends: either active or inactive for a long time (4 to 6 months), while male customers are more likely to be inactive for a shorter period (1 to 3 months). This might help banks decide when to contact customers of each gender to bring them back.

Boxplot of Average transaction amount of male and female customers

One of the financial behaviors that we can look into is the transaction behavior of customers. As the dataset contains each customer's total transaction amount and total transaction count for the past 12 months, we can derive the average amount per transaction (total amount divided by total count) and use data visualization and a statistical test to see if differences exist between genders.
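As a rough sketch of this derived variable, the code below computes the average amount per transaction for each customer and summarizes it by gender. The customer tuples are hypothetical toy values, and `avg_trans_amt` is an illustrative helper name; the zero-count guard is an added assumption.

```python
import statistics

# Hypothetical (Gender, Total_Trans_Amt, Total_Trans_Ct); NOT from the dataset.
customers = [
    ("F", 3000, 50), ("F", 3720, 60), ("F", 4550, 70),
    ("M", 2000, 40), ("M", 3300, 60), ("M", 9000, 90),
]

def avg_trans_amt(total_amt, total_ct):
    """Average amount per transaction; None when there are no transactions."""
    return total_amt / total_ct if total_ct > 0 else None

by_gender = {}
for gender, amt, ct in customers:
    value = avg_trans_amt(amt, ct)
    if value is not None:
        by_gender.setdefault(gender, []).append(value)

summary = {
    g: {"mean": statistics.mean(v), "median": statistics.median(v)}
    for g, v in by_gender.items()
}
for g in sorted(summary):
    print(g, {k: round(x, 2) for k, x in summary[g].items()})
```

In this toy data one large value pulls the male mean above the female mean even though the male median is lower, which shows why the mean and the median of a skewed variable can rank two groups differently.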

In this box plot, the x-axis is the gender of the customer and the y-axis is the customer's average transaction amount over the past 12 months. The box plots show the median, interquartile range (IQR), and outliers for each gender. Although there are many outliers, we can still observe the main box that represents the IQR. For female customers, the median is higher than that of male customers, which might suggest that they tend to have higher transaction amounts than male customers. However, the IQR box is wider for male customers, suggesting that they have higher variability in average amount per transaction with their credit card. This is consistent with the whiskers of the two categories, as the spread of average amount per transaction is wider for male customers than for female customers.

To compare the central tendency of the two gender groups formally, we can perform a t-test to investigate whether the mean average amount per transaction differs between the two groups, which would indicate that male and female customers have at least partially different spending patterns.

## 
##  Welch Two Sample t-test
## 
## data:  Avg_Trans_Amt by Gender
## t = -4.3912, df = 8940, p-value = 1.14e-05
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -3.384722 -1.295487
## sample estimates:
## mean in group F mean in group M 
##        61.51072        63.85082

The null hypothesis of the t-test is that the mean average amount per transaction is equal for the two groups. We obtained a p-value of 1.14e-05, which is far smaller than the significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the mean average amount per transaction differs between male and female customers. The sample estimates are 61.51 for female customers and 63.85 for male customers. Combining the boxplot and the t-test, we see that female customers have a higher median than male customers, but male customers have a higher mean than female customers. This is likely because of the wider spread of transaction amounts among males. In other words, female customers tend to be concentrated around a higher amount per transaction, while male customers are more spread out, so that even though their median is lower than that of female customers, their mean is higher.
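The Welch statistic and the reported interval can be checked with a short sketch. The `welch_t` function below follows the standard Welch formulas and is run on hypothetical toy samples; the second part only re-uses the numbers printed in the test output above and substitutes a normal approximation for the t distribution, which is an assumption that is reasonable here because the degrees of freedom are large.

```python
import math
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic, degrees of freedom, and standard error."""
    sa = statistics.variance(a) / len(a)
    sb = statistics.variance(b) / len(b)
    se = math.sqrt(sa + sb)
    t = (statistics.mean(a) - statistics.mean(b)) / se
    df = (sa + sb) ** 2 / (sa ** 2 / (len(a) - 1) + sb ** 2 / (len(b) - 1))
    return t, df, se

# Hypothetical toy samples, used only to exercise the function.
t_toy, df_toy, se_toy = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Consistency check using only the values printed in the test output.
mean_f, mean_m, t_reported = 61.51072, 63.85082, -4.3912
diff = mean_f - mean_m
se_implied = diff / t_reported          # standard error implied by t
z = statistics.NormalDist().inv_cdf(0.975)  # normal approximation (assumption)
ci = (diff - z * se_implied, diff + z * se_implied)
p_approx = 2 * statistics.NormalDist().cdf(-abs(t_reported))

print("difference in means:", round(diff, 4))
print("approximate 95% CI:", tuple(round(x, 3) for x in ci))
print("approximate p-value:", f"{p_approx:.2e}")
```

The difference of about 2.34 is statistically clear but small relative to means near 62, so its practical importance should be judged separately from the p-value.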

Density plot of Average Card Utilization Ratio by Gender

Another financial behavior that we can discuss here is the average card utilization ratio, which happens to be available in our dataset. The average card utilization ratio is an important metric that can provide valuable insights into a customer’s financial behavior, especially for banks. The utilization ratio is the ratio of a customer’s outstanding credit card balance to their credit limit. This metric is calculated by \(\text{Average Card Utilization Ratio} = \frac{\text{Total Revolving Balance}}{\text{Credit Limit}}\). In other words, it measures how much of the available credit a customer is using on their card. We would want to see if the distribution of this metric is different for male and female customers.
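The ratio defined above is a one-line calculation. The sketch below uses hypothetical balances and limits, and the handling of a zero credit limit is an added assumption rather than something specified by the dataset.

```python
def utilization_ratio(total_revolving_bal, credit_limit):
    """Share of the available credit currently in use (0 means unused)."""
    if credit_limit <= 0:
        # Assumption: treat a missing or zero limit as undefined.
        return None
    return total_revolving_bal / credit_limit

# Hypothetical customers (Total_Revolving_Bal, Credit_Limit); NOT from the dataset.
examples = [(0, 12000), (1500, 6000), (6200, 10000)]
ratios = [utilization_ratio(bal, limit) for bal, limit in examples]
print([round(r, 2) for r in ratios])
```

Because the ratio divides by the credit limit, two customers with the same balance can have very different ratios, so differences between groups may reflect differences in credit limits as much as in spending.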

With this histogram, we can see that the two distributions seem to be somewhat different between the two genders: male customers have a distribution that decreases gradually as the ratio increases, while female customers have a very high peak at the lowest bin followed by a somewhat bimodal distribution. We can switch to a density graph to inspect the distribution in a smoothed form:

After smoothing, the bimodal shape seen in the histogram for female customers is less pronounced. Still, the density graph suggests that male and female customers have different distributions of average card utilization ratio: instead of the gradual decrease seen for male customers, female customers show another local peak at a ratio of about 0.62. If we link this to the common insights banks gain from this metric, it might suggest that female customers have higher credit risk, as a high average card utilization ratio can be a sign of financial stress or poor money management. But it could also indicate high reliance on the credit card, which might be linked to the probability of retaining the customer. To test the claim that the distributions differ, we can use a two-sample Kolmogorov–Smirnov test.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  male_ratio and female_ratio
## D = 0.25858, p-value < 2.2e-16
## alternative hypothesis: two-sided

The null hypothesis of a two-sample Kolmogorov–Smirnov test is that the two samples come from the same continuous distribution. Although the test issued a warning about the presence of ties, the p-value is less than 2.2e-16, approximately 0, and well below the significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the distributions of average card utilization ratio for male and female customers are different.
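The statistic D is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes D and an asymptotic p-value from the Kolmogorov series; the samples are hypothetical, and the report's own result is not reproduced because the group sizes are not stated in the output above.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

def ks_asymptotic_p(d, n, m, terms=100):
    """Asymptotic two-sided p-value; intended for large samples only."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam == 0:
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    # Clamp because the truncated series can stray slightly outside [0, 1].
    return max(0.0, min(1.0, 2.0 * series))

# Hypothetical utilization ratios; NOT taken from the dataset.
male_toy = [0.00, 0.05, 0.10, 0.20, 0.30, 0.45]
female_toy = [0.00, 0.10, 0.40, 0.60, 0.62, 0.70]
d_toy = ks_statistic(male_toy, female_toy)
print("D =", round(d_toy, 3))
```

Ties, such as the many customers with a ratio of exactly 0, do not change D itself, but they do mean the p-value is approximate, which is what the warning refers to.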

Overall Summary

The analysis above shows that there are differences between male and female customers in credit card usage and in their relationship with the bank. Female customers are more likely to be either active or inactive for longer periods, while male customers tend to have shorter inactivity spans. Female customers show higher median transaction amounts, indicating consistent spending, whereas male customers display greater variability and a higher mean transaction amount. In terms of card utilization, female customers exhibit peaks at low and moderate levels, while male customers show a gradual decline. These findings suggest that banks can benefit from gender-specific strategies for customer retention, personalized marketing, and credit management. Further study would be needed to answer broader questions about financial behavior, such as potential differences in credit risk, spending patterns, or saving patterns across genders.

Question 3: What are Some Key Factors Influencing Customer Attrition?

The purpose of this research question is to identify key factors that influence whether a customer is likely to be classified as an “Attrited Customer” or “Existing Customer.” By exploring variables that represent customer behavior and the customer's relationship with the bank, we aim to discern which features play a significant role in determining attrition. These insights can guide the bank in implementing strategies to reduce customer attrition.

Scatter Plot: Total Transactions Count vs. Total Transactions Amount

First, we are interested in how customers' regular spending behavior relates to whether they keep the card. Therefore, we select the variables Total_Trans_Amt, Total_Trans_Ct, and Attrition_Flag to make a scatter plot with density contours.

As expected, there is a clear positive relationship between the number of transactions and the total transaction amount. This suggests that customers who transact more frequently also tend to have higher transaction values.

Attrited customers, represented by the red contours and dots, are concentrated in the lower-left region of the plot, indicating fewer transactions (fewer than 50) and lower transaction amounts (below $5,000). Existing customers, represented by the blue contours and dots, dominate the upper-right region, with significantly higher transaction counts and amounts. They also exhibit a broader spread, suggesting greater diversity in transaction behaviors.

In addition, the density contours highlight that attrited customers and existing customers form distinct clusters, with little overlap at higher transaction counts and amounts. This separation may partly reflect underlying groupings by income or card type.

Box Plot: Relationship, Contact Count, and Months Inactive

The relationship between customers and the bank could also play an important role in preventing customers from ending their membership. Therefore, we select the variables Total_Relationship_Count, Contacts_Count_12_mon, Months_Inactive_12_mon, and Attrition_Flag to make a box plot.

Contact Count: The median contact count is notably higher for attrited customers compared to existing customers. Attrited customers have a median contact count centered around 3, while existing customers show a lower median, approximately 2. The wider range of contact counts for attrited customers suggests they might have had more frequent interactions with the bank, potentially indicating dissatisfaction or unresolved issues.

Months Inactive: Attrited customers exhibit higher inactivity levels, with a median of 3 months inactive compared to 2 months for existing customers. This finding highlights the importance of customer engagement and activity in predicting attrition. Prolonged periods of inactivity might signify reduced interest or dissatisfaction.

Relationship Count: In contrast to the other variables, existing customers tend to have higher relationship counts, with a median of 4, compared to around 3 for attrited customers. This suggests that stronger and more sustained relationships with the bank could be a protective factor against attrition. A wider distribution among attrited customers indicates variability in their relationship strength, suggesting that some may have weak ties to the bank, making them more likely to leave.

Principal Component Analysis (PCA)

In the context of analyzing customer attrition, PCA is essential for uncovering the underlying structure of the data and simplifying the analysis of multiple variables. Variables such as transaction counts, credit limits, and relationship counts are often correlated, which can make it challenging to interpret their individual contributions to attrition. PCA reduces these correlated variables into uncorrelated principal components, allowing us to focus on the most influential factors driving attrition. By visualizing these components, we can better understand the relationships between variables and identify patterns, such as clusters of attrited and non-attrited customers.

The first principal component accounts for 34.3% of the total variance, capturing the most significant variation in the data. The second component explains an additional 17.2%, bringing the cumulative explained variance to approximately 51.5%. Together, the first four components explain 83.3% of the total variance, indicating they effectively summarize the majority of the data’s variability. In contrast, the fifth and sixth components contribute only 13.7% and 3%, respectively, providing diminishing returns. This suggests that focusing on the first four components is sufficient for dimensionality reduction, retaining most of the essential information while simplifying the dataset for further analysis.
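The percentages quoted above can be cross-checked with simple arithmetic. The sketch below does this using only the reported figures, and also shows how explained-variance ratios are generally obtained from component eigenvalues; the eigenvalues used are hypothetical.

```python
from itertools import accumulate

def explained_variance_ratio(eigenvalues):
    """Share of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Hypothetical eigenvalues, only to show the calculation.
toy_ratios = explained_variance_ratio([2.0, 1.0, 1.0])
toy_cumulative = list(accumulate(toy_ratios))

# Figures quoted in the text, in percent.
pc1, pc2, first_four, pc5, pc6 = 34.3, 17.2, 83.3, 13.7, 3.0

first_two = pc1 + pc2                 # about 51.5
pc3_plus_pc4 = first_four - first_two  # about 31.8, shared by PC3 and PC4
total = first_four + pc5 + pc6         # about 100

print(round(first_two, 1), round(pc3_plus_pc4, 1), round(total, 1))
```

The individual shares of the third and fourth components are not given in the text; only their combined share of roughly 31.8% can be recovered, so each is on the order of the fifth component's 13.7%.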

This PCA biplot visually represents the relationships between the principal components (Dim1 and Dim2), the original variables, and the two categories of customers: “Attrited Customers” (red points) and “Existing Customers” (green points).

The PCA biplot highlights key patterns in customer behavior related to attrition. Attrited customers are concentrated in regions with lower transaction-related values (Total_Trans_Amt, Total_Trans_Ct), more interactions (Contacts_Count_12_mon), and fewer products held with the bank (Total_Relationship_Count). The arrow for Months_Inactive_12_mon points toward the cluster of attrited customers, indicating that these individuals have had recent inactivity despite their higher frequency of interactions. This suggests that while attrited customers may contact the bank more often, their card usage lacks long-term consistency.

In contrast, existing customers cluster on the side of higher transactional activity, fewer contacts, and more products held, as indicated by the alignment of Total_Trans_Amt, Total_Trans_Ct, and Total_Relationship_Count with the existing customer group. The 95% confidence ellipses emphasize the broader behavioral variability in existing customers compared to the tighter, more constrained behavior patterns seen in attrited customers.

Recommendations for Reducing Customer Attrition

Based on the analysis, the bank can adopt the following strategies:

  1. Engagement Programs: Focus on increasing transaction frequency and amount through loyalty programs or promotional offers targeting less engaged customers.
  2. Proactive Contact: Enhance contact strategies, particularly for customers showing signs of inactivity (e.g., “Months Inactive” > 4).
  3. Strengthening Relationships: Personalize banking experiences for customers with low “Total Relationship Count.”
  4. Credit Utilization Monitoring: Monitor customers with low “Credit Limit” utilization for targeted support or product offerings.
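As a rough illustration of how the first three recommendations could be turned into a screening rule, the sketch below flags customers for follow-up. The inactivity threshold (more than 4 months) and the transaction threshold (fewer than 50) come from the analysis above; the relationship-count cutoff of 2 and the function name `retention_flags` are hypothetical choices that a bank would need to tune.

```python
def retention_flags(customer, low_relationship_max=2):
    """Return the list of reasons a customer might warrant proactive contact."""
    reasons = []
    if customer["Months_Inactive_12_mon"] > 4:
        reasons.append("inactive more than 4 months")
    if customer["Total_Trans_Ct"] < 50:
        reasons.append("fewer than 50 transactions")
    # Hypothetical cutoff: the text says only "low" relationship count.
    if customer["Total_Relationship_Count"] <= low_relationship_max:
        reasons.append("few products held")
    return reasons

# Hypothetical customers; NOT taken from the dataset.
at_risk = {"Months_Inactive_12_mon": 5, "Total_Trans_Ct": 32,
           "Total_Relationship_Count": 2}
engaged = {"Months_Inactive_12_mon": 1, "Total_Trans_Ct": 85,
           "Total_Relationship_Count": 5}

print(retention_flags(at_risk))
print(retention_flags(engaged))
```

A rule like this is only a starting point for prioritizing outreach; it does not replace a fitted attrition model and has not been validated against the data.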

Conclusion

This project has demonstrated the importance of financial and behavioral factors in understanding customer dynamics in the banking industry. Key findings reveal that credit limit, income category, and transaction behavior are critical determinants of card categories. Gender differences were observed, with female customers showing more consistent spending and males displaying higher variability. Attrition analysis highlighted that inactivity, contact frequency, and weaker relationships with the bank are strong predictors of customer churn.

These insights suggest that banks can improve customer retention by personalizing their engagement strategies, tailoring marketing campaigns to gender-specific behaviors, and proactively addressing inactivity patterns. Future work could include integrating external data, such as credit scores or spending patterns across industries, to refine predictions further. By building on these findings, banks can strengthen customer relationships and enhance their financial services.