The Data

Our data is the Bank Customers Churn dataset, which contains 13 variables and 10,000 observations.

  1. CustomerID: Identity number of each customer - Categorical
  2. Surname: Last name of customer - Categorical
  3. CreditScore: Credit Score of customer - Quantitative
  4. Geography: Country of Residence - Categorical
  5. Gender: Gender - Categorical
  6. Age: Age - Quantitative
  7. Tenure: How long the customer has held the account - Quantitative
  8. Balance: Amount in account - Quantitative
  9. NumOfProducts: The number of bank products used - Quantitative
  10. HasCrCard: Whether the customer has a credit card - Categorical
  11. IsActiveMember: Whether the customer is active with different functionalities in the bank (e.g. programs, bonds, insurance, etc.) - Categorical
  12. EstimatedSalary: Salary estimated by the bank - Quantitative
  13. Exited: Whether the customer withdrew from the bank - Categorical

Of these thirteen variables, CustomerID and Surname are simply identifiers and are not usually useful in analysis. This leaves 11 variables to work with, of which Exited is our variable of interest.

We do not know which bank this data comes from, only that each row represents a customer at The Bank.

Research Questions

Our variable of interest is Exited, a binary response. We can therefore use the covariates to visualize, for example, which country's customers are more likely to exit the bank, or which gender is more likely to exit (the dataset records only two genders).

We have come up with six research questions that we think are interesting:

  • Does higher credit score lead to a lower probability of churning?
  • Which age group is more likely to churn?
  • Is there a difference in customer churn between male and female customers?
  • From which of the countries are individuals most likely to churn?
  • Can we use clustering algorithms to recover the two groups?
  • Is there a relationship between a customer’s account balance and their decision to churn?

These questions will help us answer the overarching research question:

How are these covariates related to whether a customer exits the bank?

Visualizations

Viz 1: Does higher credit score lead to a lower probability of churning?

To learn about the relationship between credit score and exiting, we created a side-by-side boxplot.

The graph above suggests that people with lower credit scores are slightly more likely to withdraw from their bank, which is to be expected.

Viz 2: Which age group is more likely to churn?

To help answer this question, we grouped the customers into three age groups. Customers younger than 35 were labeled young, customers between the ages of 35 and 65 were labeled middle-aged, and customers older than 65 were labeled old. We then created a faceted bar plot based on Exited, AgeGroup, and FICO credit score.

From this data visualization, we can see that the middle-aged group has the highest number of churned customers among the three age groups. The young group follows, with less than half as many churned customers as the middle-aged group. We can also see that among the middle-aged customers who churned, the most frequent credit score categories are Fair and Very poor, which are also the two lowest categories.
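The age grouping described above can be sketched in Python with pandas and NumPy. This is only an illustration, not the code used for the report. The function name age_group and the toy ages are hypothetical, and treating ages 35 and 65 as middle-aged is an assumption about how "between 35 and 65" was applied.

```python
import numpy as np
import pandas as pd

AGE_LEVELS = ["young", "middle-aged", "old"]


def age_group(age: pd.Series) -> pd.Series:
    """Label each age as young (< 35), middle-aged (35 to 65), or old (> 65).

    Assumption: the boundaries 35 and 65 both fall in the middle-aged group.
    """
    values = age.to_numpy()
    # np.select picks the first matching condition, so the second
    # condition only applies to ages that are already >= 35.
    labels = np.select(
        [values < 35, values <= 65],
        ["young", "middle-aged"],
        default="old",
    )
    return pd.Series(
        pd.Categorical(labels, categories=AGE_LEVELS, ordered=True),
        index=age.index,
        name="AgeGroup",
    )


# Made-up ages chosen to exercise the boundaries; not taken from the dataset.
toy_ages = pd.Series([20, 34, 35, 65, 66], name="Age")
toy_groups = age_group(toy_ages)
print(toy_groups.tolist())
```

Storing the result as an ordered categorical keeps the groups in a sensible order when they are later used as facets or bar positions.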

Viz 3: Is there a difference in customer churn between male and female customers?

To learn more about this question, we used a stacked bar plot to visualize the gender distribution for both retained and churned customers across countries.

From the plot, we can see that regardless of country and gender, the majority of customers (~8000) were retained while far fewer (~2000) churned. For retained customers, shown on the left, the number of French customers is the highest while the number of German customers is the lowest. The male-to-female ratios are similar across the three countries: the proportion of male customers is slightly over half. For churned customers, shown on the right, the numbers of customers from France and Germany are about the same, while the number from Spain is the lowest. Across the three countries, the proportion of female customers among those who churned is slightly larger than that of male customers. Overall, it appears that female customers are slightly more likely to churn from the bank across all three countries.

Regardless of gender, we also notice that Germany has the lowest number of retained customers and the highest number of churned customers, which may indicate that it has the highest churn rate among all countries.

Viz 4: From which of the countries are customers most likely to churn?

In order to further investigate and compare the churn rate in each country, we decided to visualize the churn rate in a map:

From this map, we can see that customers are distributed over three adjacent countries in Europe: Spain, France, and Germany. The churn rate, the percentage of customers who exited, is around 16% for both France and Spain, whose circles are about the same size. Germany, on the other hand, has the highest churn rate at around 32%, about twice that of France and Spain. This result confirms our observation at the end of Viz 3.
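Because Exited is coded as 0 or 1, the churn rate defined above is the mean of Exited within each country, expressed as a percentage. A minimal pandas sketch follows. The helper name churn_rate_by and the toy rows are made up for illustration and do not reproduce the dataset's figures.

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of customers with Exited == 1 within each level of `column`."""
    return (
        df.groupby(column)["Exited"]
        .mean()          # share of 1s in a 0/1 column
        .mul(100)        # convert to a percentage
        .sort_values(ascending=False)
    )


# Made-up rows for illustration only; not the real data.
toy = pd.DataFrame(
    {
        "Geography": ["France"] * 4 + ["Germany"] * 4 + ["Spain"] * 5,
        "Exited": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    }
)
toy_rates = churn_rate_by(toy, "Geography")
print(toy_rates)
```

The same helper could be called with "Gender" instead of "Geography" to compare churn rates rather than raw counts for Viz 3.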

Viz 5: Can we use clustering algorithms to recover the two groups?

Based on the correlation of each predictor with Exited, we used the four most correlated quantitative variables (CreditScore, Age, Balance, and NumOfProducts) to cluster our data and obtained the following dendrogram. We want to see if we can use the dendrogram to recover the two groups of people - those who withdrew from the bank and those who did not, as identified by the Exited variable.

When we split the dendrogram into 2 different colors, we see that the left branch is much smaller than the right branch. This makes sense because the group of people who withdrew is also much smaller than the group of people who stayed with the bank in our data set.

When we color the leaves by the variable Exited, we see that there are considerably more blue leaves on the left branch than the right branch, and considerably more red leaves on the right branch than the left branch. Although the match is not perfect, we can conclude that we have recovered the two groups of people - those who exited the bank and those who did not.

We also wanted to know how well the clustering recovered the two groups of people, so we computed the misclassification rate, which is 21.75%. This indicates that the two clusters created by the 4 quantitative variables - CreditScore, Age, Balance and NumOfProducts - are reasonably similar to the two groups of people identified by the Exited variable.
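The comparison between a two-cluster cut of a dendrogram and the Exited labels can be sketched with SciPy. This is a sketch under stated assumptions, not the procedure behind the 21.75% figure: the report does not state the linkage method or whether the variables were scaled, so complete linkage and standardization are assumptions here. The function name and the toy data are hypothetical. Cluster labels are arbitrary, so the rate is taken as the smaller of the two possible label matchings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def two_cluster_misclassification(X, exited, method="complete"):
    """Cut a hierarchical clustering into 2 clusters and compare with `exited`."""
    X = np.asarray(X, dtype=float)
    exited = np.asarray(exited)
    # Assumption: standardize so no single variable dominates the distances.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    tree = linkage(Z, method=method)  # assumption: complete linkage
    clusters = fcluster(tree, t=2, criterion="maxclust") - 1  # labels 0 / 1
    mismatch = np.mean(clusters != exited)
    # Cluster numbering is arbitrary, so also consider the swapped labels.
    return float(min(mismatch, 1.0 - mismatch))


# Made-up, well-separated toy data standing in for the four variables.
rng = np.random.default_rng(0)
stayed = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
left = rng.normal(loc=10.0, scale=1.0, size=(10, 4))
X_toy = np.vstack([stayed, left])
exited_toy = np.array([0] * 30 + [1] * 10)
exited_toy[:3] = 1  # three customers whose label disagrees with their cluster
toy_rate = two_cluster_misclassification(X_toy, exited_toy)
print(toy_rate)
```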

Viz 6: Can we use clustering algorithms to recover the two groups?

To further investigate how these two groups are clustered, we conducted a Principal Component Analysis (PCA) using the same four quantitative variables used in the dendrogram and obtained the following biplot.

In the plot of Principal Component 2 (PC2) against Principal Component 1 (PC1), we see that most of the data points with Exited = 0, representing customers who stayed with the bank, are concentrated in the right half of the plot. Data points with Exited = 1, representing customers who withdrew from the bank, are spread across the whole plot but appear to be the majority in the left half. In general, it seems that customers with a high PC1 value tend to stay with the bank, while customers with a low PC1 value are more likely to withdraw.

PC1 is strongly related to NumOfProducts and Balance, while PC2 is strongly related to Age and CreditScore. From the directions of the arrows in the biplot, we see that customers with a higher balance tend to stay with the bank, whereas customers with a higher number of products tend to withdraw. Older customers are also more likely to stay with the bank, but this effect is much smaller than those of NumOfProducts and Balance. CreditScore, on the other hand, is nearly orthogonal to NumOfProducts and Balance, so its effect on the clustering is inconclusive.
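The scores plotted in a biplot and the arrow directions (loadings) can be obtained from a singular value decomposition of the standardized data. The NumPy sketch below uses made-up random data in place of the four variables. The function name pca is hypothetical, and standardizing before the decomposition is an assumption, since the report does not say how the variables were scaled.

```python
import numpy as np


def pca(X):
    """Return (scores, loadings, explained variance ratio) for standardized X."""
    X = np.asarray(X, dtype=float)
    # Assumption: center and scale each variable to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s             # row i holds observation i's PC1, PC2, ... values
    loadings = Vt.T            # column j holds the variable weights for PC j+1
    explained = s**2 / np.sum(s**2)
    return scores, loadings, explained


# Made-up data: 50 observations of 4 variables, two of them correlated.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
X_toy[:, 1] += 0.8 * X_toy[:, 0]
scores, loadings, explained = pca(X_toy)
print(explained)
```

The sign of each principal component is arbitrary, so which side of the plot counts as "high PC1" depends on the software used. Only the relative positions of the points and arrows carry meaning.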

Viz 7: Is there a relationship between a customer’s account balance and their decision to churn?

In our EDA, we noticed something interesting about the Balance variable: a large number of customers have a balance of 0 at The Bank. To investigate further, our group first created a new categorical variable that indicates whether the customer has a balance of 0 or not.

Using this new variable, we created a mosaic plot. It shows that customers with a balance of zero are much more likely to stay with the bank, while customers with a nonzero balance seem more likely to withdraw. According to the Pearson residuals, these differences are statistically significant.
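Pearson residuals compare each cell of a contingency table with the count expected under independence: (observed - expected) / sqrt(expected), where expected = row total * column total / grand total. A minimal NumPy sketch is shown below. The function name and the 2x2 counts are made up for illustration (rows: zero and nonzero balance, columns: stayed and exited) and are not the dataset's counts.

```python
import numpy as np


def pearson_residuals(table):
    """Pearson residuals of a two-way contingency table under independence."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)
    expected = row_totals @ col_totals / observed.sum()
    return (observed - expected) / np.sqrt(expected)


# Made-up counts. Rows: zero balance, nonzero balance. Columns: stayed, exited.
toy_table = [[90, 10], [60, 40]]
toy_residuals = pearson_residuals(toy_table)
# The squared residuals sum to the Pearson chi-square statistic.
toy_chi2 = float((toy_residuals**2).sum())
print(toy_residuals)
print(toy_chi2)
```

A negative residual in the zero-balance, exited cell means fewer such customers than independence would predict, which is the pattern described above. Mosaic plots usually shade cells whose residuals are large in absolute value, often beyond roughly 2.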

Conclusion

From these visualizations, our group is able to answer our research questions.

Viz 2 shows that people aged 65 or younger exit the most, with more than twice as many middle-aged customers (ages 35 to 65) exiting as young customers (under 35). Among these, most people had credit scores below 669. Viz 1 supports this finding.

From Viz 3, we see that the male-to-female ratio is consistent across all three countries for retained customers and for churned customers. Overall, female customers are slightly more likely to churn than male customers. Further analysis is needed to determine whether this difference is significant.

From Viz 4, we are able to ascertain that German customers have the highest rate of exiting among the three countries. This may suggest that The Bank faces heavy competition in this market. Viz 3 supports this conclusion, as it also shows that Germany has the lowest number of retained customers and the highest number of churned customers.

Viz 5 and 6 show that it’s possible to use clustering or dimension reduction methods to recover the two groups. Viz 6 shows that we may be able to use the first principal component to split the groups.

From the mosaic plot (Viz 7), we are able to see that there appears to be a positive relationship between having a nonzero balance and withdrawing from the bank.

Synthesizing these findings, it appears that looking at a customer’s country of residence, account balance, age, number of bank products, and credit score may help one predict whether the customer will churn or not.