Overall Theme/Introduction
Analyzing the characteristics of credit card customers can help the
bank reduce the number of churned customers by suggesting how to
improve its service and how to tailor that service to different
customers. Throughout our study, we examine the relationships between
the categorical and quantitative variables to better understand the
nature of the customers and to predict which types of customers we
should target to reduce the churn rate.
Description of the dataset
The dataset we examined in this report describes the credit card
customers’ behaviors and consists of 10,000 customers with their age,
income, marital status, credit card limit, amount of transactions, etc.
There are 21 features included:
Variables
CLIENTNUM: Client number. Unique identifier for the customer holding the account.
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0.
Customer_Age: Demographic variable - Customer’s Age in Years.
Gender: Demographic variable - M=Male, F=Female.
Dependent_count: Demographic variable - Number of dependents.
Education_Level: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.).
Marital_Status: Demographic variable - Married, Single, Divorced, Unknown.
Income_Category: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown).
Card_Category: Product Variable - Type of Card (Blue, Silver, Gold, Platinum).
Months_on_book: Period of relationship with bank.
Total_Relationship_Count: Total no. of products held by the customer.
Months_Inactive_12_mon: No. of months inactive in the last 12 months.
Contacts_Count_12_mon: No. of Contacts in the last 12 months.
Credit_Limit: Credit Limit on the Credit Card.
Total_Revolving_Bal: Total Revolving Balance on the Credit Card.
Avg_Open_To_Buy: Open-to-Buy Credit Line (Average of last 12 months).
Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1).
Total_Trans_Amt: Total Transaction Amount (Last 12 months).
Total_Trans_Ct: Total Transaction Count (Last 12 months).
Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1).
Avg_Utilization_Ratio: Average Card Utilization Ratio.
Research Questions
Throughout this report, we aimed to answer three questions:
- Are there any relationships between the quantitative variables, and
how are the quantitative variables total transaction amount, credit
limit, and open-to-buy credit line related to female and male
customers?
- Are there any gender discrepancies between the income category and
Credit Limit on the credit card?
- For cardholders with higher credit limits, and thus higher spending
power, what do their consumption patterns look like in the data, and
are there specific patterns or preferences that banks can identify and
market to?
Question 1
Are there any relationships between the quantitative variables, and
how are the quantitative variables total transaction amount, credit
limit, and open-to-buy credit line related to female and male
customers?
1.1 PCA
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.6044 1.4312 1.3426 1.1989 1.1125 1.00726 0.99112
## Proportion of Variance 0.1839 0.1463 0.1288 0.1027 0.0884 0.07247 0.07017
## Cumulative Proportion  0.1839 0.3302 0.4589 0.5616 0.6500 0.72247 0.79264
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.94447 0.90136 0.77278 0.47260 0.45769 0.41057
## Proportion of Variance 0.06372 0.05803 0.04266 0.01595 0.01496 0.01204
## Cumulative Proportion  0.85635 0.91439 0.95704 0.97300 0.98796 1.00000
##                             PC14
## Standard deviation     6.287e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00


Because the data set contains many variables, we used a
dimension-reduction technique, principal components analysis (PCA),
together with an elbow plot to determine whether all the quantitative
variables provide meaningful information and whether the dimension of
the data can be reduced while preserving as much information as
possible. Based on the PCA, the first ten PCs capture about 95% of the
variation in the data set (cumulative proportion 0.957), so adding PCs
beyond the tenth contributes little. It is therefore reasonable to keep
only 10 principal components.
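The cumulative proportions in the PCA output can be rechecked directly from the reported standard deviations. The sketch below is in Python rather than the report's own tooling, uses only the numbers printed above, and the function names `cumulative_variance` and `components_needed` are ours, introduced for illustration.

```python
# Standard deviations of the 14 principal components, copied from the
# "Importance of components" output above.
PC_STD_DEVS = [
    1.6044, 1.4312, 1.3426, 1.1989, 1.1125, 1.00726, 0.99112,
    0.94447, 0.90136, 0.77278, 0.47260, 0.45769, 0.41057, 6.287e-16,
]


def cumulative_variance(std_devs):
    """Return the cumulative proportion of variance after each component."""
    variances = [sd ** 2 for sd in std_devs]  # variance = sd squared
    total = sum(variances)
    running, cumulative = 0.0, []
    for var in variances:
        running += var
        cumulative.append(running / total)
    return cumulative


def components_needed(std_devs, threshold):
    """Smallest number of leading components whose cumulative share of
    variance reaches `threshold` (a fraction between 0 and 1)."""
    for k, share in enumerate(cumulative_variance(std_devs), start=1):
        if share >= threshold:
            return k
    return len(std_devs)


if __name__ == "__main__":
    for k, share in enumerate(cumulative_variance(PC_STD_DEVS), start=1):
        print(f"PC{k:<2} cumulative proportion = {share:.4f}")
    print("Components for 95%:", components_needed(PC_STD_DEVS, 0.95))
```

The standard deviation of PC14 is zero up to rounding error, which indicates that one of the 14 input variables is an exact linear combination of the others and therefore adds no independent information.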
Then, we drew the biplot of PC1 and PC2 to explore the relationship
between the quantitative variables and the categorical variable gender.
First, the two ellipses for the female group and the male group overlap
substantially, so the two gender groups are very similar in terms of
their principal components and show similar patterns. Since many of the
male points lie on the left of the plot, we concluded that males tend to
have a higher credit limit and a higher open-to-buy credit line. Also,
the angle between the arrow for credit limit and the arrow for total
transaction amount is slightly smaller than, but close to, 90 degrees.
This angle indicates that the correlation between the credit limit and
the total transaction amount is approximately zero. In other words, the
amount a customer spends is essentially uncorrelated with the limit on
their credit card. Overall, apart from credit limit and open-to-buy
credit line, there is no notable association between the total
transaction amount and the other quantitative variables, and those
variables show similar patterns in the two gender groups.
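The reading of the biplot angle rests on the usual rule of thumb that the cosine of the angle between two variable arrows approximates the correlation between those variables. A minimal Python sketch of that rule follows; the two loading vectors are hypothetical values chosen only to give an angle a little under 90 degrees, not the loadings from our PCA.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two 2-D arrows (loading vectors)."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm_u = math.hypot(u[0], u[1])
    norm_v = math.hypot(v[0], v[1])
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def implied_correlation(u, v):
    """Correlation suggested by the angle between two biplot arrows."""
    return math.cos(math.radians(angle_between(u, v)))


# Hypothetical (PC1, PC2) loadings, for illustration only.
credit_limit_arrow = (-0.50, 0.10)
total_trans_amt_arrow = (0.02, 0.40)

if __name__ == "__main__":
    angle = angle_between(credit_limit_arrow, total_trans_amt_arrow)
    corr = implied_correlation(credit_limit_arrow, total_trans_amt_arrow)
    print(f"angle = {angle:.1f} degrees, implied correlation = {corr:.2f}")
```

This approximation is only as good as the share of variance the two plotted components capture. Here PC1 and PC2 together account for about 33% of the variance, so the angle should be treated as a rough indication and confirmed against the sample correlation.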
1.2 Facetted Smoothed Density Plot with a Rug
We wanted to learn which client profiles are likely to have a large
transaction amount so that we can determine which groups of people are
the stable credit card users who will pay their credit card bills. Thus,
we examined the total transaction amount together with variables
describing users’ demographic characteristics, such as gender, and
compared these variables between attrited customers and existing
customers.

The above graph suggests that, among both the attrited and the existing
customers, there are more females than males. More female users have
transaction amounts below 6000, and more male users have transaction
amounts above 12000. However, the female and male groups show a similar
pattern for both attrited and existing customers, and they differ mostly
in the number of customers in each gender group. From the rug plot, no
attrited users have transaction amounts greater than around 11000,
whereas many existing customers have transaction amounts greater than
12000. Customers with transaction amounts greater than 12000 are less
likely to quit using the credit card service, presumably because they
have a greater and more stable ability to pay their credit card bills.
Also, although females outnumber males among both the existing and the
attrited customers, few females have transaction amounts greater than
12000. Hence, the bank should target users who have a total transaction
amount greater than 12000 and should investigate how to retain more
female users.
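The comparison around the 12000 threshold can be expressed as an attrition rate on each side of the cut-off. The following Python sketch assumes records keyed by the dataset's column names with `Attrition_Flag` coded 1 for closed accounts; the ten sample rows are made up for illustration and are not drawn from the data.

```python
# Hypothetical sample rows, for illustration only.
SAMPLE = [
    {"Attrition_Flag": 0, "Gender": "F", "Total_Trans_Amt": 4200},
    {"Attrition_Flag": 0, "Gender": "F", "Total_Trans_Amt": 5100},
    {"Attrition_Flag": 0, "Gender": "F", "Total_Trans_Amt": 13500},
    {"Attrition_Flag": 0, "Gender": "M", "Total_Trans_Amt": 4800},
    {"Attrition_Flag": 0, "Gender": "M", "Total_Trans_Amt": 14200},
    {"Attrition_Flag": 0, "Gender": "M", "Total_Trans_Amt": 15100},
    {"Attrition_Flag": 1, "Gender": "F", "Total_Trans_Amt": 2300},
    {"Attrition_Flag": 1, "Gender": "F", "Total_Trans_Amt": 2600},
    {"Attrition_Flag": 1, "Gender": "M", "Total_Trans_Amt": 2400},
    {"Attrition_Flag": 1, "Gender": "M", "Total_Trans_Amt": 9800},
]


def attrition_rate(records, threshold=12000):
    """Share of attrited customers above and at-or-below the threshold.

    Returns None for a band that contains no customers.
    """
    counts = {"above": [0, 0], "at_or_below": [0, 0]}  # [attrited, total]
    for row in records:
        band = "above" if row["Total_Trans_Amt"] > threshold else "at_or_below"
        counts[band][0] += row["Attrition_Flag"]
        counts[band][1] += 1
    return {band: (a / n if n else None) for band, (a, n) in counts.items()}


if __name__ == "__main__":
    print(attrition_rate(SAMPLE))
```

Running the same function separately on the female and male rows would show whether the gap between the two bands is similar for both genders.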
Based on the results of the two graphs, the quantitative variables
other than credit limit and open-to-buy credit line have similar
patterns in the two gender groups. Hence, to explore the overall effect
of gender on the quantitative variables, we decided to examine the
credit limit, which does differ by gender, and the total transaction
amount, which behaves like all the other quantitative variables with
respect to gender.
Question 2
Are there any gender discrepancies between the income category and
Credit Limit on the credit card?
We would like to explore the effect of any gender differences so that
we can determine if any gender is more likely to pay for their credit
bills. A higher credit limit indicates a more stable, low-risk
consumer.
2.1 Boxplot

To gain more insights into the consumers, we decided to find any
potential associations between consumers’ income category, credit limit,
and gender. Therefore, we created a boxplot with income categories on
the x-axis and credit limit on the y-axis, colored by gender, where
“female” is represented by a red boxplot and “male” is represented by a
blue boxplot.
The first thing we noticed is that female consumers are absent from
half of the income categories: $60K-$80K, $80K-$120K, and $120K+.
Females tend to have lower incomes than males, with all of their known
incomes below $60K. The median, Q1, and Q3 of females’ credit limits in
the three income categories where they are present are all lower than
those of males. Ignoring the unknown income group, the higher a
consumer’s income, the higher the credit limit they hold. The “unknown”,
“less than $40K”, and “$40K-$60K” groups also have many outliers. The
maximum credit limit for incomes below $60K shows no gender difference,
whereas in the unknown income group, females have many outliers with
very high credit limits.
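The quantities read off the boxplot (Q1, median, Q3, and outliers) can be computed per income category and gender. This Python sketch uses the standard library's `statistics.quantiles` and the common 1.5 x IQR fence for outliers; both are assumptions on our part, and the quartile convention may differ slightly from the one used by the plotting software.

```python
import statistics


def box_stats(values):
    """Quartiles and 1.5 * IQR outliers for one group (needs >= 2 values)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


def box_stats_by_group(records):
    """Box statistics of Credit_Limit per (Income_Category, Gender)."""
    groups = {}
    for row in records:
        key = (row["Income_Category"], row["Gender"])
        groups.setdefault(key, []).append(row["Credit_Limit"])
    # Groups with a single observation have no quartiles, so skip them.
    return {k: box_stats(v) for k, v in groups.items() if len(v) >= 2}


if __name__ == "__main__":
    # Made-up credit limits; the last value is flagged as an outlier.
    print(box_stats([1500, 2500, 3000, 4200, 5000, 34000]))
```

Comparing the `q1`, `median`, and `q3` entries for the female and male groups within the same income category reproduces the side-by-side comparison made in the paragraph above.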
The income categories and credit limit do not seem to have a strong
association for males, yet in the three income categories that contain
females, the credit limit can be as low as 2500. This suggests that
females do have lower incomes than males, which leads to the pattern
that females are more likely to have very low credit limits. At the
same time, the number of outliers suggests that there are perhaps more
females with high credit limits.
Therefore, the bank might not want to set different standards for
female and male consumers, as they seem to be equally low-risk. In
fact, females might be more stable than males.
Question 3
For cardholders with higher credit limits, and thus higher spending
power, what do their consumption patterns look like in the data, and are
there specific patterns or preferences that banks can identify and
market to?
3.1 Heat Map

The above graph is a heat map of the credit limit against the total
transaction amount for customers whose total transaction amount is
larger than 12000. The densest area is around 14000 to 15000 in total
transaction amount and below 10000 in credit limit. Many points also
cluster around 25000 to 50000, with total transaction amounts ranging
from 13000 to 16500. It is also worth noting that at the highest credit
limit value, the total transaction amount also ranges from around 13000
to 17000. Though not tightly clustered, the points that fall into this
range have a higher density on the heat map. As the heat map suggests,
the credit limits are mainly either below 10000 or at their maximum.
Overall, the heat map shows the bank the value ranges where most of
these clients fall, and the bank might want to consider targeting this
group of people.
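A heat map of this kind is a two-dimensional count of customers per cell, so the densest region can also be located numerically. The Python sketch below bins (Total_Trans_Amt, Credit_Limit) pairs on a grid; the bin widths and the six sample points are hypothetical choices for illustration, not the settings behind the figure.

```python
from collections import Counter


def bin_counts(points, x_width, y_width):
    """Count points per grid cell; each cell is keyed by its lower-left corner."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // x_width) * x_width, int(y // y_width) * y_width)
        counts[cell] += 1
    return counts


# Hypothetical (Total_Trans_Amt, Credit_Limit) pairs, for illustration only.
POINTS = [
    (14200, 3500), (14800, 7200), (14500, 9100),
    (13400, 30000), (16100, 34000), (12500, 8000),
]

if __name__ == "__main__":
    counts = bin_counts(POINTS, x_width=1000, y_width=5000)
    for cell, n in counts.most_common(3):
        print(f"amount >= {cell[0]}, limit >= {cell[1]}: {n} customers")
```

Narrower bins sharpen the location of a cluster at the cost of noisier counts, which matters here because the subset above 12000 is small.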
3.2 Dendrogram of Income Category for Credit Card users with
Transaction Amount greater than $12,000

We subsetted the data to customers whose total transaction amount is
greater than 12,000 dollars. We then standardized the quantitative
variables and color-coded the entries by the known income categories.
The six clusters identified by the dendrogram do not align well with the
observations’ colors by Income_Category. The pink labels are dominant in
almost every colored branch, which means that the consumption patterns
of customers whose income falls between 60k and 80k dollars are the most
varied. Notably, the deep blue entries, which indicate customers with an
income above 120k dollars, concentrate in the indigo cluster, which
could be a signal for banks that high-income customers might share
similar preferences.
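The procedure behind the dendrogram (standardize, cluster hierarchically, then compare clusters with income labels) can be sketched in a few lines of Python. Complete linkage on Euclidean distance is an assumption here, since the linkage method is not stated above, and the helper names are ours. The naive merge loop is only suitable for small subsets.

```python
import math
import statistics
from collections import Counter


def standardize(rows):
    """Z-score each column (assumes no column is constant)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]


def complete_linkage_clusters(points, k):
    """Merge the two closest clusters until k remain (complete linkage)."""
    clusters = [[i] for i in range(len(points))]

    def distance(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        _, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def cluster_composition(clusters, labels):
    """Count income labels inside each cluster to judge alignment."""
    return [Counter(labels[i] for i in members) for members in clusters]
```

If the clusters aligned with income, each `Counter` returned by `cluster_composition` would be dominated by a single category; a mix of categories in every cluster corresponds to the weak alignment described above.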
3.3 Density Diagram of Log of Credit Card Limits and Transaction
Amount sorted by Income Category

It is evident that the credit limits and the actual transaction amounts
differ drastically. Even after a logarithmic transformation, it is still
observable that for higher-income customers, who by default might have
been given higher credit limits, total transaction amounts do not
deviate much from those of customers at other income levels. However,
there is an interesting overlap of the three highest income categories
around Total_Trans_Amt = 15,000, and the green curve even surpasses the
blue one (the 60k-80k group has more transactions at that amount than
the 80k-120k group). People with relatively lower, but still
considerable, credit limits are spending a comparable amount of money,
which is behaviorally and statistically worthy of discussion and
investigation. Currently, we cannot conclude whether this comparable
spending power among mid-to-high income groups is a recurring pattern,
but commercial banks can analyze their users’ profiles and habits to
customize subscriptions and investments.
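The density curves compared above are kernel density estimates of log-transformed amounts, one per income category. A minimal Python sketch of a Gaussian kernel density estimate follows; the base-10 logarithm, the bandwidth of 0.05, and the sample amounts are all assumptions made for illustration, not the settings used for the figure.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical transaction amounts for one income category.
AMOUNTS = [12500, 13200, 14100, 14800, 15000, 15300, 16000]
LOG_AMOUNTS = [math.log10(a) for a in AMOUNTS]

if __name__ == "__main__":
    density = gaussian_kde(LOG_AMOUNTS, bandwidth=0.05)
    print(f"density at 15,000: {density(math.log10(15000)):.2f}")
```

Evaluating one such `density` function per income category at the same point, for example log10(15,000), gives the numeric counterpart of comparing the heights of the curves in the plot.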
Main conclusions and takeaways
Through our analysis of the quantitative variables across the two
gender groups, we found that males and females do not differ noticeably
from each other except in credit limit and average open-to-buy credit.
Since we knew that credit limit and average open-to-buy credit do vary
across the two gender groups, we used the total transaction amount,
which is uncorrelated with those two variables, to check whether the two
gender groups are indeed similar in terms of the quantitative variables.
This proved to be the case: females and males have similar distribution
patterns when we look closely at the distribution curve of total
transaction amount across the gender groups and the customer types.
After we analyzed the association between gender and quantitative
variables, we delved further into the relationships among gender,
income, and credit limit. There are no females in the high-income
groups, since we do not see any data points corresponding to females
when the income is equal to or above $60K. Males tend to have a higher
income than females since the median income for males is higher than the
median income for females across all income groups. Higher incomes
correlate with a higher credit limit, and this correlation matches our
conclusion in Question 1 that males tend to have higher credit limits
than females.
Based on the findings for Question 1, we decided to further
investigate the relationship between the credit limit and transaction
amount, as well as the relationships between those two variables and
other categorical variables. There is no clear relationship between the
total transaction amount and the credit limit. The credit limit varies
drastically across different income groups, and more people in the
high-income group tend to have a higher credit limit. The distribution
of the total transaction amount is approximately the same across all
income groups, so the transaction amount does not appear to be affected
by income. Based on the dendrogram, we concluded that the association
between income groups and the quantitative variables is relatively weak
since the income group labels do not align well with the clusters.
Future Study
In this study, we mainly focused on the relationships between credit
limits, total transaction amount, and other categorical variables.
Although we were able to draw meaningful conclusions from these
variables, we could reach a more comprehensive result by expanding the
analysis to incorporate more variables. We could study the relationships
between all quantitative variables and the different education levels,
as well as the relationships between the quantitative variables and the
different card categories.
In addition to the expansion of our analysis, future studies should
consider adding more variables to the dataset. The bank can evaluate the
information provided in a credit card application to determine which new
variables would help assess the credibility of a customer. For example,
we could add customers’ occupations, since those with reliable income
may be less likely to churn.