Overall Theme/Introduction

Analyzing the characteristics of credit card customers can help the bank reduce the number of churned customers by suggesting how to improve its service and how to tailor that service to different customers. In this study, we examine the relationships between categorical and quantitative variables to better understand the customers and to identify which types of customers the bank should target to reduce the churn rate.

Description of the dataset

The dataset we examined in this report describes the behavior of credit card customers. It covers roughly 10,000 customers and records their age, income, marital status, credit card limit, transaction amounts, etc. It includes 21 variables:

Variables

CLIENTNUM: Client number. Unique identifier for the customer holding the account.
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0.
Customer_Age: Demographic variable - Customer’s Age in Years.
Gender: Demographic variable - M=Male, F=Female.
Dependent_count: Demographic variable - Number of dependents.
Education_Level: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.).
Marital_Status: Demographic variable - Married, Single, Divorced, Unknown.
Income_Category: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown).
Card_Category: Product Variable - Type of Card (Blue, Silver, Gold, Platinum).
Months_on_book: Period of relationship with bank.
Total_Relationship_Count: Total number of products held by the customer.
Months_Inactive_12_mon: Number of months inactive in the last 12 months.
Contacts_Count_12_mon: Number of contacts in the last 12 months.
Credit_Limit: Credit limit on the credit card.
Total_Revolving_Bal: Total revolving balance on the credit card.
Avg_Open_To_Buy: Open-to-buy credit line (average of last 12 months).
Total_Amt_Chng_Q4_Q1: Change in transaction amount (Q4 over Q1).
Total_Trans_Amt: Total transaction amount (last 12 months).
Total_Trans_Ct: Total transaction count (last 12 months).
Total_Ct_Chng_Q4_Q1: Change in transaction count (Q4 over Q1).
Avg_Utilization_Ratio: Average card utilization ratio.

Research Questions

Throughout this report, we aimed to answer three questions:

Are there any relationships between the quantitative variables, and how are the quantitative variables total transaction amount, credit limit, and open-to-buy credit line related to female and male customers?
Are there gender differences in the relationship between income category and credit limit?
For cardholders with higher credit limits, and thus higher purchasing power, what do their consumption patterns look like in the data, and are there specific patterns or preferences that banks can capture and market to?

Question 1

Are there any relationships between the quantitative variables, and how are the quantitative variables total transaction amount, credit limit, and open-to-buy credit line related to female and male customers?

1.1 PCA

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.6044 1.4312 1.3426 1.1989 1.1125 1.00726 0.99112
## Proportion of Variance 0.1839 0.1463 0.1288 0.1027 0.0884 0.07247 0.07017
## Cumulative Proportion  0.1839 0.3302 0.4589 0.5616 0.6500 0.72247 0.79264
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.94447 0.90136 0.77278 0.47260 0.45769 0.41057
## Proportion of Variance 0.06372 0.05803 0.04266 0.01595 0.01496 0.01204
## Cumulative Proportion  0.85635 0.91439 0.95704 0.97300 0.98796 1.00000
##                             PC14
## Standard deviation     6.287e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

With so many variables in the data set, we used a dimension-reduction technique, principal component analysis (PCA), together with an elbow plot to determine whether all the quantitative variables provide meaningful information and whether the dimension of the data can be reduced while preserving as much information as possible. Based on the PCA, the first ten PCs capture about 95.7% of the variation in the data set, and each PC beyond the tenth adds less than 2%. It is therefore reasonable to keep only the first 10 principal components.
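The cumulative proportions can be rechecked directly from the standard deviations in the summary table. The following is a minimal Python sketch, not the R code that produced the output above; the function names `variance_proportions` and `components_needed` are illustrative and do not come from the report.

```python
from itertools import accumulate

# Standard deviations of PC1..PC14, copied from the summary table above.
SDEV = [1.6044, 1.4312, 1.3426, 1.1989, 1.1125, 1.00726, 0.99112,
        0.94447, 0.90136, 0.77278, 0.47260, 0.45769, 0.41057, 6.287e-16]


def variance_proportions(sdev):
    """Share of the total variance carried by each component (variance = sdev squared)."""
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]


def components_needed(sdev, threshold):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    for k, cumulative in enumerate(accumulate(variance_proportions(sdev)), start=1):
        if cumulative >= threshold:
            return k
    return len(sdev)


if __name__ == "__main__":
    props = variance_proportions(SDEV)
    for i, (p, c) in enumerate(zip(props, accumulate(props)), start=1):
        print(f"PC{i:<2} proportion={p:.4f} cumulative={c:.4f}")
    print("components needed for 95%:", components_needed(SDEV, 0.95))
```

The standard deviation of PC14 is numerically zero, which indicates that one of the 14 standardized variables is a linear combination of the others and contributes no additional variance.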

Next, we drew a biplot of PC1 and PC2 to explore the relationship between the quantitative variables and the categorical variable gender. These two components together capture only about 33% of the variance, so the biplot gives an approximate picture. First, the ellipses for the female group and the male group overlap substantially, so the two gender groups are very similar in terms of their principal components. Because many of the male points lie on the left of the plot, we concluded that males tend to have a higher credit limit and a higher open-to-buy credit line. In addition, the angle between the credit limit arrow and the total transaction amount arrow is slightly smaller than, but close to, 90 degrees, which indicates approximately zero correlation between the two variables. In other words, the amount a customer spends is roughly uncorrelated with his or her credit limit. Overall, there is no significant association between the total transaction amount and other quantitative variables besides credit limit and open-to-buy credit line. Those quantitative variables have similar patterns under the two gender groups.
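The reading of the arrow angle rests on the usual biplot rule of thumb that the cosine of the angle between two variable arrows approximates the correlation between those variables. A minimal Python sketch follows; the arrow coordinates are hypothetical values chosen for illustration, not loadings read from the actual biplot.

```python
import math


def angle_degrees(u, v):
    """Angle between two 2-D loading arrows, in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against floating-point overshoot outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def implied_correlation(angle_deg):
    """Approximate correlation implied by the angle between two arrows."""
    return math.cos(math.radians(angle_deg))


# Hypothetical (PC1, PC2) arrow coordinates, for illustration only.
credit_limit_arrow = (-0.55, 0.10)
total_trans_amt_arrow = (0.05, 0.40)

if __name__ == "__main__":
    angle = angle_degrees(credit_limit_arrow, total_trans_amt_arrow)
    print(f"angle = {angle:.1f} degrees, "
          f"implied correlation = {implied_correlation(angle):.3f}")
```

The approximation is only as good as the share of variance the two plotted components capture, so a near-90-degree angle is suggestive rather than conclusive.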

1.2 Facetted Smoothed Density Plot with a Rug

We wanted to learn which client profiles are likely to have large transaction amounts so that we can determine which groups of people are stable credit card users who will pay their credit card bills. Thus, we examined the total transaction amount together with demographic variables such as gender, and compared these variables between attrited customers and existing customers.

The above graph suggests that among both the attrited and existing customers, there are more females than males. More female users have transaction amounts below 6000, and more male users have transaction amounts above 12000. However, the female and male distributions have a similar shape for both attrited and existing customers, and they differ mostly in the number of customers in each gender group. The rug plot shows no attrited users with transaction amounts above roughly 11000, whereas many existing customers have transaction amounts above 12000. Customers with transaction amounts above 12000 therefore appear less likely to quit the credit card service, possibly because they have a greater and more stable ability to pay their credit card bills. Although females outnumber males among both existing and attrited customers, few females have transaction amounts above 12000. Hence, the bank should target users whose total transaction amount exceeds 12000 and should investigate how to retain more female users.
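The comparison across transaction-amount ranges can also be expressed as an attrition rate per band. The sketch below is a minimal Python illustration under stated assumptions: the records are hypothetical toy values, not rows from the dataset, and the band edges simply reuse the 6000 and 12000 cut-offs discussed above.

```python
from collections import defaultdict

# Hypothetical toy records: (gender, attrited?, total transaction amount).
RECORDS = [
    ("F", True, 2100), ("F", False, 4300), ("F", False, 5200),
    ("M", True, 2500), ("M", False, 4800), ("M", False, 13500),
    ("F", True, 9800), ("M", False, 14200), ("F", False, 15100),
]


def band(amount):
    """Assign a total transaction amount to one of three bands."""
    if amount < 6000:
        return "<6000"
    if amount <= 12000:
        return "6000-12000"
    return ">12000"


def attrition_by_band(records):
    """Fraction of attrited customers within each transaction-amount band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [attrited, total]
    for _gender, attrited, amount in records:
        entry = counts[band(amount)]
        entry[0] += int(attrited)
        entry[1] += 1
    return {b: attrited / total for b, (attrited, total) in counts.items()}


if __name__ == "__main__":
    for name, rate in attrition_by_band(RECORDS).items():
        print(f"{name:>10}: attrition rate {rate:.2f}")
```

Grouping by `(band, gender)` instead of `band` alone would give the gender breakdown that the faceted density plot displays.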

Based on the two graphs, the quantitative variables other than credit limit and open-to-buy credit line have similar patterns in the two gender groups. Hence, to explore the overall effect of gender on the quantitative variables, we decided to examine the credit limit, which does differ by gender, and the total transaction amount, whose pattern across genders resembles that of the other quantitative variables.

Question 2

Are there gender differences in the relationship between income category and credit limit?

We would like to explore gender differences so that we can determine whether either gender is more likely to pay their credit card bills. We treat a higher credit limit as an indication of a more stable, lower-risk consumer.

2.1 Boxplot

To gain more insights into the consumers, we decided to find any potential associations between consumers’ income category, credit limit, and gender. Therefore, we created a boxplot with income categories on the x-axis and credit limit on the y-axis, colored by gender, where “female” is represented by a red boxplot and “male” is represented by a blue boxplot.

The first thing we noticed is that female consumers are absent from half of the income categories: $60K-$80K, $80K-$120K, and $120K+. Females tend to have lower incomes than males, with all of their known incomes below $60K. In the three income categories where females are present, the median, Q1, and Q3 of their credit limits are all lower than those of males. Ignoring the unknown income group, the higher a consumer’s income, the higher the credit limit. The “unknown”, “less than $40K”, and “$40K-$60K” groups also have many outliers. The maximum credit limit for incomes below $60K shows no gender difference, whereas in the unknown income group, females have many outliers with very high credit limits.
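The quantities read off the boxplot (Q1, median, Q3, and outliers) can be computed per income category and gender. The sketch below is a minimal Python illustration on hypothetical toy rows; it flags outliers with the common 1.5 × IQR rule, and quartile conventions differ slightly between tools, so the values need not match a given plotting library exactly.

```python
import statistics
from collections import defaultdict

# Hypothetical toy rows: (income category, gender, credit limit).
ROWS = [
    ("Less than $40K", "F", 1500), ("Less than $40K", "F", 2000),
    ("Less than $40K", "F", 2500), ("Less than $40K", "F", 3000),
    ("Less than $40K", "F", 20000),
    ("Less than $40K", "M", 2000), ("Less than $40K", "M", 3000),
    ("Less than $40K", "M", 4000), ("Less than $40K", "M", 5000),
    ("Less than $40K", "M", 6000),
]


def boxplot_summary(rows):
    """Quartiles and 1.5*IQR outliers of credit limit per (income, gender) group."""
    groups = defaultdict(list)
    for income, gender, limit in rows:
        groups[(income, gender)].append(limit)

    summary = {}
    for key, limits in groups.items():
        q1, median, q3 = statistics.quantiles(limits, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        summary[key] = {
            "q1": q1, "median": median, "q3": q3,
            "outliers": [x for x in limits if x < low or x > high],
        }
    return summary


if __name__ == "__main__":
    for key, stats in boxplot_summary(ROWS).items():
        print(key, stats)
```

In this toy example the female group has the lower median but also the only high outlier, which is the kind of pattern the paragraph above describes.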

Income category and credit limit do not seem to be strongly associated for males, yet in the three income categories that contain females, the credit limit can be as low as 2500. Females do have lower incomes than males, which makes them more likely to have very low credit limits. At the same time, the number of outliers suggests that there are perhaps more females with high credit limits.

Therefore, the bank may not need to set different standards for female and male consumers, as they seem to be equally low-risk. In fact, females might be more stable than males.

Question 3

For cardholders with higher credit limits, and thus higher purchasing power, what do their consumption patterns look like in the data, and are there specific patterns or preferences that banks can capture and market to?

3.1 Heat Map

The above graph is a heat map of the credit limit against the total transaction amount for customers whose total transaction amount exceeds 12000. The densest area lies at a total transaction amount of around 14000 to 15000 and a credit limit below 10000. Many points also cluster around 25000 to 50000, with total transaction amounts ranging from 13000 to 16500. It is also worth noting that at the highest credit limit value, the total transaction amount again ranges from about 13000 to 17000; though not tightly clustered, the points in this range show higher density. As the heat map suggests, credit limits are mostly either below 10000 or at their maximum. Overall, the heat map shows the bank where most of these high-spending clients fall in terms of credit limit and transaction amount, and the bank might want to consider targeting this group.
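A heat map of this kind is a count of points per rectangular cell. The sketch below shows that binning step in Python under stated assumptions: the points are hypothetical toy values, and the cell widths (1000 for transaction amount, 10000 for credit limit) are illustrative choices, not the bin sizes used for the figure.

```python
from collections import Counter

# Hypothetical toy points: (total transaction amount, credit limit),
# already restricted to total transaction amounts above 12000.
POINTS = [
    (14200, 3000), (14800, 7500), (14500, 9000),
    (13500, 34000), (16000, 34500), (15200, 2500),
]


def bin_counts(points, x_width, y_width):
    """Count points per rectangular cell; each cell is keyed by its lower-left corner."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // x_width) * x_width, int(y // y_width) * y_width)
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    counts = bin_counts(POINTS, x_width=1000, y_width=10000)
    for (x0, y0), n in counts.most_common():
        print(f"amount {x0}-{x0 + 1000}, limit {y0}-{y0 + 10000}: {n} customers")
```

The cell with the highest count corresponds to the darkest (or brightest) tile in the plot, which is how the densest region is identified.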

3.2 Dendrogram of Income Category for Credit Card users with Transaction Amount greater than $12,000

We subsetted the data to customers whose total transaction amount (over the last 12 months) is greater than 12,000 dollars. We then standardized the quantitative variables and color-coded the entries by their known income categories. The six clusters identified by the dendrogram do not align well with the observations’ colors by Income_Category. The pink labels dominate almost every colored branch, which means that consumption patterns are most varied among those whose income falls between 60K and 80K dollars. Notably, the deep blue entries, indicating people with incomes above 120K dollars, are concentrated in the indigo cluster, which could signal to banks that high-income customers share similar preferences.
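The standardize-then-cluster procedure can be sketched in a few lines. The Python code below is an illustration under stated assumptions, not the analysis that produced the dendrogram: the rows are hypothetical toy values, and complete linkage with Euclidean distance is assumed because the report does not state which linkage was used.

```python
import math
import statistics

# Hypothetical toy rows: (credit limit, total transaction amount).
ROWS = [
    (3000, 13000), (3500, 13500), (4000, 12800),
    (30000, 16000), (32000, 15500), (34000, 16500),
]


def standardize(rows):
    """Convert each column to z-scores (mean 0, sample standard deviation 1)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def complete_linkage_clusters(points, k):
    """Agglomerative clustering: merge the closest pair of clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    print(complete_linkage_clusters(standardize(ROWS), k=2))
```

Cutting the tree at `k` clusters and then tabulating the income category within each cluster is one way to quantify how well the clusters and the income labels agree.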

3.3 Density Diagram of Log of Credit Card Limits and Transaction Amount sorted by Income Category

It is evident that credit limits and actual transaction amounts differ drastically. Even after a logarithmic transformation, we can still see that higher-income people, who by default might have been given higher credit limits, have total transaction amounts that do not deviate much from those of customers at other income levels. However, there is an interesting overlap of the three highest income categories around Total_Trans_Amt = 15,000, and the green curve even surpasses the blue one (the 60K-80K group has a higher density at that amount than the 80K-120K group). People with relatively lower, but still considerable, credit limits spend a comparable amount of money, which is worth further discussion and investigation. At present, we cannot conclude whether comparable spending among mid-to-high income groups is a recurring pattern, but commercial banks can analyze their users’ profiles and habits to customize subscriptions and investments.
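A density curve on the log scale can be sketched as a Gaussian kernel density estimate of the log-transformed values. The Python code below is a minimal illustration: the credit limits are hypothetical toy values, and the base-10 logarithm and the bandwidth of 0.2 are assumptions, since the report does not state which base or bandwidth was used.

```python
import math
from statistics import NormalDist

# Hypothetical toy credit limits for one income category.
CREDIT_LIMITS = [1500, 3000, 8000, 20000, 34000]


def log_kde(values, bandwidth):
    """Return a function that estimates the density of log10(values)."""
    logs = [math.log10(v) for v in values]
    kernels = [NormalDist(mu=x, sigma=bandwidth) for x in logs]

    def density(t):
        # Average of one Gaussian bump centred on each observation.
        return sum(k.pdf(t) for k in kernels) / len(kernels)

    return density


if __name__ == "__main__":
    density = log_kde(CREDIT_LIMITS, bandwidth=0.2)
    for t in (3.0, 3.5, 4.0, 4.5):
        print(f"log10(limit) = {t:.1f} -> density {density(t):.3f}")
```

Evaluating one such function per income category on a common grid gives curves that can be overlaid and compared, as in the density diagram.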

Main conclusions and takeaways

Through our analysis of the quantitative variables across the two gender groups, we found that males and females do not differ significantly except in credit limit and average open-to-buy credit. Since credit limit and average open-to-buy credit do vary across the two gender groups, we used the total transaction amount, which is uncorrelated with those two variables, to check whether the two gender groups are indeed similar in terms of the quantitative variables. This proved to be the case: females and males have similar distribution patterns when we look closely at the distribution curve of total transaction amount across gender groups and customer types.

After analyzing the association between gender and the quantitative variables, we examined the relationships among gender, income, and credit limit. There are no females in the high-income groups: we see no data points for females when the income is equal to or above $60K. Males therefore tend to have higher incomes than females, and in the income groups where both genders appear, the median credit limit for males is higher than that for females. Higher incomes are associated with higher credit limits, which matches our conclusion in Question 1 that males tend to have higher credit limits than females.

Based on the findings for Question 1, we decided to further investigate the relationship between the credit limit and transaction amount, as well as the relationships between those two variables and other categorical variables. There is no clear relationship between the total transaction amount and the credit limit. The credit limit varies drastically across income groups, and people in the high-income groups tend to have higher credit limits. The distribution of the total transaction amount is approximately the same across all income groups, so the transaction amount does not appear to depend on income. Based on the dendrogram, we concluded that the association between income groups and the quantitative variables is relatively weak, since the income group labels do not align well with the clusters.

Future Study

In this study, we mainly focused on the relationships between credit limit, total transaction amount, and other categorical variables. Although we were able to draw meaningful conclusions from these variables, we could reach a more comprehensive result by incorporating more variables into the analysis. For example, we could study the relationships between all quantitative variables and education level, and between the quantitative variables and card category.

In addition to expanding our analysis, future studies should consider adding more variables to the dataset. The bank can evaluate the information provided in a credit card application to determine which new variables would help assess the credibility of a customer. For example, we could add customers’ occupations, since those with reliable income may be less likely to churn.

36-315 Final Project, Fall 2022

Yiting Chen, Yiyang Wei, Judy Xu, Iris Wang