As the banking industry becomes increasingly competitive and saturated over time with traditional and new ways of investing, firms are starting to utilize more marketing efforts as a key differentiator. In this report, we will specifically be analyzing the effectiveness of telemarketing campaigns led by a Portuguese banking institution.
This dataset, from the UC Irvine Machine Learning Repository, contains information about the clients who were contacted in a marketing campaign of a Portuguese banking institution. There are 45,211 rows and 17 variables in this dataset, as follows:
Age: client’s age
Job: client’s occupation
Marital: client’s marital status (“divorced”, “married”, “single”, “unknown”)
Education: client’s education level (“primary”, “secondary”, “tertiary”, “unknown”)
Default: whether the client has credit in default (“yes”, “no”)
Balance: client’s average yearly balance
Housing: whether the client has a housing loan (“yes”, “no”)
Loan: whether the client has a personal loan (“yes”, “no”)
Contact: communication method the bank used to contact the client (“cellular”, “telephone”)
Day_of_week: day of the week of the last time the client was contacted
Month: last contact month of the year (“jan”, “feb”, …, “dec”)
Duration: last contact duration in seconds
Previous: number of contacts performed before this campaign, for this client
Pdays: number of days that passed after the client was last contacted in a previous campaign (-1 indicates the client was not previously contacted)
Campaign: number of contacts made during the campaign, for this client
Poutcome: outcome of the previous marketing campaign (“unknown”, “other”, “failure”, “success”)
Our main response variable for the study is denoted by “y”, which is the final result of the marketing campaigns, measured by whether the client subscribed to a term deposit. The main goal is to determine the factors and interactions that have significant effects on the likelihood of a client’s subscription, so we will be focusing on the following three research questions:
1. What type of clients are banks more likely to reach out to?
With the many demographic variables provided in the dataset, we want to determine if there are any significant characteristics of an individual that can make them more likely to be targeted by the Portuguese bank as a potential client.
2. What type of clients are most likely to subscribe to a term deposit?
We will specifically be looking for factors such as employment, age, and marital status to see which characteristics have an effect on the likelihood of subscription.
3. How successful are bank telemarketing campaigns based on contact communication type?
After analyzing the demographic variables that differentiate individuals from one another, we will determine if the contact communication types, cellular vs telephone, will have different effects on the success of the campaigns.
Before delving into whether the telemarketing campaigns for the Portuguese banking institution are successful, we wanted to first understand the type of potential clients the bank reaches out to. From the dataset, there are 8 demographic variables: age, job status, marital status, education level, average yearly balance, and whether the individual has credit in default, a personal loan, or a housing loan. We analyzed these variables to determine the characteristics the bank would consider to be an ideal client.
Looking at the graph above showing the marginal distribution of education faceted by marital status, we can see that most of the clients are married, followed by being single, then divorced. Married individuals are likely perceived to have the most stable incomes, followed by single working individuals, then divorced individuals. From the distributions, we can also see that regardless of marital status, those with secondary education are the most common, followed by tertiary and primary education. The overwhelming majority of individuals have also not taken out any personal loans.
These factors combined show that institutions seem to target those who exhibit the highest stability, as they are most likely financially capable of making a deposit. So, we can see from the faceted bar graph that the firm’s ideal target clients are those with no personal loans who are married and have at least secondary education.
Given that the goal of the bank is to increase subscriptions to its term deposits, we also looked at the variables related to personal finances: age, average yearly balance, default, and housing. We included age to better visualize the relationship, as we believe age influences one’s personal financial decisions.
From the scatterplot, we see that the majority of clients have neither a housing loan nor credit in default. We also see that many of the clients fall within the ages of 25-60, with average yearly balances below 12,500 euros. Surprisingly, there are many clients with little to no average yearly balance at every age. This scatterplot suggests that the bank was reaching out to people with lower financial stability who were nonetheless not at risk of defaulting.
Overall, the clients the bank has been reaching out to are individuals who have lower balances but also have financial, employment, and marital stability, so they would have both the ability and the need to make a deposit.
With this understanding of the types of clients banks reach out to, we can next look into what types of clients are most likely to actually subscribe to term deposits. To begin to understand this, we can first observe the proportion of clients of several employment types that subscribed to term deposits. We create the following bar graph to do this:
From the plot, we can see that the most common job among the clients that subscribed to term deposits was management, with over 25% of subscribers having careers in that field. It seems that the second most common job that roughly 16% of subscribers had was technician. On the other hand, the least common jobs were entrepreneur and housemaid, or the job was simply unknown. These results make sense because, of the jobs included in this dataset, the ones with the most financial stability and greatest salaries are management and technician, and people who have more money would generally be more likely to invest in term deposits.
Along with employment type, we can also observe how client age differs between those that subscribed and didn’t subscribe to term deposits, individually looking at both the age distribution overall and the age distributions by marital status.
Using this plot, we can first examine the differences in the age distributions between subscribers and nonsubscribers across all clients, regardless of marital status. While the peaks of these distributions are similar, it seems that the one corresponding to subscribers is at slightly greater age, suggesting that clients who subscribe to term deposits may be slightly older than those that do not. Additionally, it appears that the tails of these distributions differ; the tail of the distribution corresponding to subscribers is a bit wider than that corresponding to nonsubscribers, further supporting that there may be a larger and older age range for clients that subscribe to term deposits.
Taking it a step further and looking at the distributions corresponding to each distinct marital status, we can see that the distributions of age for each marital status look similar between clients who subscribed and those who did not subscribe to term deposits. Focusing on subscribers specifically, however, it is clear that the greatest number of clients were around 30 years old and single, as this is where the highest peak lies on the graph. The peaks of the age distributions for married and divorced subscribers are much closer to each other than either is to the peak for single clients, indicating that the proportions of subscribers from the married and divorced populations are closer than that of the single population. This may also be because more single people were contacted in the marketing campaign and are thus more frequent in the dataset.
Overall, these findings suggest that there are differences in the age distributions between subscribers and nonsubscribers. They also indicate that people who are single and of a younger age, specifically around age 30, seem to be more likely to subscribe to a term deposit.
We can perform statistical tests to further confirm these findings. First, we conduct a K-S test to determine whether the distribution of age for all clients who subscribed to term deposit is statistically different than that for clients that did not subscribe to term deposits.
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: bank_subscribers$balance and bank_nonsubscribers$balance
## D = 0.12911, p-value < 2.2e-16
## alternative hypothesis: two-sided
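Because the p-value of this K-S test is less than 2.2e-16, which is below our significance level of 0.05, we reject the null hypothesis that the two samples come from the same distribution. Note that the output lists bank_subscribers$balance and bank_nonsubscribers$balance as its data, so the test as printed compares the balance distributions of subscribers and nonsubscribers rather than their age distributions. The conclusion that the age distributions differ, as the density plot suggested, therefore relies on the same test giving a comparable result on the age columns.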
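The statistic D in the output above is the largest vertical gap between the two groups' empirical cumulative distribution functions. The following is a minimal Python sketch of that calculation, using small made-up samples rather than the bank data. The report's own output comes from R; the function name ks_statistic and the toy values are assumptions for illustration only.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample K-S statistic: the largest absolute gap between the ECDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # share of x values <= v
        fy = bisect_right(ys, v) / len(ys)  # share of y values <= v
        d = max(d, abs(fx - fy))
    return d


# Hypothetical toy samples, NOT values from the bank dataset
group_a = [25, 31, 38, 44, 52, 67]
group_b = [29, 33, 35, 41, 46, 48]
d_stat = ks_statistic(group_a, group_b)
print(round(d_stat, 3))
```

A value of D near 0 means the two empirical distributions nearly coincide, while a value of 1 means the samples do not overlap at all. With samples as large as those in this dataset, even a modest D such as 0.129 can yield a very small p-value.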
We can also perform a one-sided t-test to determine whether the mean age of subscribers is statistically greater than the mean age of nonsubscribers. In this test, the null hypothesis is that the mean ages of subscribers and nonsubscribers are equal, and the alternative hypothesis is that the difference in mean age between subscribers and nonsubscribers is positive, meaning that the mean age for subscribers is greater.
##
## Welch Two Sample t-test
##
## data: bank_subscribers$age and bank_nonsubscribers$age
## t = 4.3183, df = 6109.2, p-value = 7.986e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.5144748 Inf
## sample estimates:
## mean of x mean of y
## 41.67007 40.83899
Since the p-value of this test is 0.000007986, which is less than our significance level of 0.05, we can reject the null hypothesis. There is sufficient evidence that the mean age of clients who subscribed to term deposits is greater than that of clients who did not subscribe to term deposits, as our density plot also suggested.
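To make the Welch test more concrete, the sketch below shows how the t statistic and the Welch-Satterthwaite degrees of freedom are formed, and then uses only the figures printed in the output above to recover the implied standard error and an approximate lower confidence bound. It is written in Python purely for illustration (the report's output is from R). The toy samples are hypothetical, and the normal quantile is an assumption standing in for the t quantile, which is nearly identical at roughly 6,100 degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical toy samples, NOT values from the bank dataset
toy_t, toy_df = welch_t([2, 4, 6, 8], [1, 2, 3, 4])

# Consistency check using only numbers printed in the report's output
diff = 41.67007 - 40.83899          # difference in sample means
implied_se = diff / 4.3183          # standard error implied by t = diff / se
z95 = NormalDist().inv_cdf(0.95)    # normal approximation to the t quantile
approx_lower = diff - z95 * implied_se
print(round(diff, 5), round(implied_se, 4), round(approx_lower, 3))
```

The approximate lower bound comes out near 0.5145, in line with the reported 0.5144748. Note that the estimated difference itself is under one year of age, so the result is statistically significant but small in practical terms.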
Assuming that financial health can also have a strong effect on the likelihood of an individual making a term deposit, we also evaluated the relationship between default and subscription status.
Results from the mosaic plot colored by Pearson residuals show that clients who have credit in default and also subscribed to a term deposit appear significantly less often than would be expected if default and subscription were independent. In other words, those who have credit in default are unlikely to also make a deposit. This is likely because their default history makes them less financially capable of making a deposit. Among those who do not have credit in default, though, there is no significant departure from independence in their likelihood of making a deposit. Thus, we can conclude that those who have credit in default are not likely to make a deposit.
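The shading in a mosaic plot of this kind is based on Pearson residuals, (observed - expected) / sqrt(expected), where the expected count assumes default and subscription are independent. A minimal Python sketch of that calculation follows. The counts in the table are hypothetical and chosen only to mimic the pattern described above; they are not taken from the bank dataset.

```python
from math import sqrt


def pearson_residuals(table):
    """Pearson residuals for a two-way table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for row, rt in zip(table, row_totals):
        res_row = []
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / n  # count expected under independence
            res_row.append((obs - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals


# Hypothetical counts: rows = default (no, yes), columns = subscribed (no, yes)
toy_table = [[900, 100],
             [95, 5]]
res = pearson_residuals(toy_table)
for row in res:
    print([round(r, 2) for r in row])
```

In this toy table the residual for the default-and-subscribed cell is the most negative, while the cells for clients without a default stay close to zero because their expected counts are large. This is the same qualitative pattern as in the plot described above.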
Now we are interested in analyzing how successful the telemarketing campaign was for the bank. We do so by analyzing the different variables related to their telemarketing efforts and results.
This grouped boxplot helps us investigate how the success of getting someone to subscribe to a term deposit varies with contact type and contact duration. Based on the graph, we can see that within each contact group, the median contact duration is higher for the people who ended up subscribing to a deposit, suggesting that longer contact duration may be associated with a higher probability of successful subscription. Across all contact types, there does not appear to be a difference in duration for those who did not subscribe. However, for those who did subscribe, contact duration was slightly higher in the unknown group than in the telephone and cellular groups. We think this plot is informative because it helps us observe variables that influence whether someone subscribes or not, such as contact type and duration. While we cannot draw any conclusions regarding statistical significance from this graph alone, it does help us identify trends and relationships between the three variables.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.2133 1.0510 1.0357 0.9552 0.9473 0.73537
## Proportion of Variance 0.2453 0.1841 0.1788 0.1521 0.1496 0.09013
## Cumulative Proportion 0.2453 0.4295 0.6082 0.7603 0.9099 1.00000
We were also interested in conducting a principal component analysis in order to reduce dimensionality in our dataset. We selected the quantitative variables age, balance, duration, campaign, pdays, and previous. From the summary output of our analysis, we see that 24.53% of the variance in the data is explained by the first component, 18.41% is explained by the second component, and 17.88% is explained by the third component. In order to decide the number of components we should use, we also plotted a corresponding scree plot. We see that the proportion of variance explained by the fourth component falls below the ⅙ line, suggesting that we should use 3 components. It should be noted that there is no strong elbow in our plot; we suspect that this may be because we only had 6 dimensions in our data.
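The proportions in the summary output follow directly from the component standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The short Python sketch below redoes that arithmetic from the printed values and applies the ⅙ rule (a component is kept if it explains more than an equal share of the variance). It is an illustrative check rather than the code used for the report, and the variable names are assumptions.

```python
from itertools import accumulate

# Standard deviations of PC1..PC6 as printed in the summary output
sdev = [1.2133, 1.0510, 1.0357, 0.9552, 0.9473, 0.73537]

variances = [s ** 2 for s in sdev]
total = sum(variances)  # close to 6 when six variables are standardized
proportions = [v / total for v in variances]
cumulative = list(accumulate(proportions))

# Equal-share threshold: 1/6 for six components
threshold = 1 / len(sdev)
n_keep = sum(p > threshold for p in proportions)

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f} cumulative={c:.4f}")
print("components above threshold:", n_keep)
```

Small differences in the last decimal place relative to the printed table are expected, because the standard deviations in the output are already rounded. Note also that the first three components together explain only about 61% of the variance, so the reduction from six to three dimensions discards a substantial share of the information.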
##
## Call:
## glm(formula = ybinary ~ balance + duration + pdays + previous,
## family = "binomial", data = bank[bank$pdays >= 0, ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.465e+00 7.444e-02 -19.676 < 2e-16 ***
## balance 3.677e-05 8.817e-06 4.170 3.04e-05 ***
## duration 3.346e-03 1.269e-04 26.370 < 2e-16 ***
## pdays -3.662e-03 2.709e-04 -13.516 < 2e-16 ***
## previous 2.248e-03 5.687e-03 0.395 0.693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8919.8 on 8256 degrees of freedom
## Residual deviance: 7814.6 on 8252 degrees of freedom
## AIC: 7824.6
##
## Number of Fisher Scoring iterations: 4
While the visualizations above have helped us identify predictors of the success of bank campaigns, they do not provide information regarding the statistical significance of the predictors. We ran a logistic regression for the success of deposit subscription on the predictors balance, duration, pdays, and previous. We subsetted the dataset to only include clients who had been contacted previously, as those who were not previously contacted were coded as -1 for pdays, which would bias the coefficients in the regression. All predictors, except for previous, were statistically significant at the \(\alpha = 0.05\) level. For example, for a 1 second increase in contact duration, the log odds of successful subscription increase by 0.003346, holding the other predictors fixed. We see that balance and duration are positively associated with successful deposits, and pdays is negatively associated with successful deposits. These results are in line with our visualizations, which also suggest that higher bank balance, longer contact duration, and shorter periods between contacts improve deposit subscription success.
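Log-odds coefficients are easier to interpret once they are exponentiated into odds ratios. The Python sketch below does this for the estimates printed in the regression output and also converts a linear predictor into a probability for one hypothetical client. The client's values and the helper names are assumptions chosen for illustration, and the code is a check on the arithmetic rather than the model-fitting code itself.

```python
from math import exp

# Estimates copied from the regression output
intercept = -1.465
coef = {
    "balance": 3.677e-05,
    "duration": 3.346e-03,
    "pdays": -3.662e-03,
    "previous": 2.248e-03,
}


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit increase."""
    return exp(beta * delta)


def predicted_probability(client):
    """Inverse-logit of the linear predictor for one client."""
    eta = intercept + sum(coef[name] * value for name, value in client.items())
    return 1 / (1 + exp(-eta))


or_per_second = odds_ratio(coef["duration"])
or_per_minute = odds_ratio(coef["duration"], 60)

# Hypothetical client, NOT a row from the dataset
client = {"balance": 0, "duration": 300, "pdays": 100, "previous": 1}
p_client = predicted_probability(client)

print(round(or_per_second, 4), round(or_per_minute, 2), round(p_client, 3))
```

On this scale, each additional minute of contact multiplies the odds of subscription by about 1.22, holding the other predictors fixed. Because call duration is only known once the call has ended, this association describes completed calls rather than something the bank can set in advance.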
To be thorough, we also investigated the impact of contact communication type on different clusters of subscribers. We created a dendrogram using the subset of clients who subscribed to a new deposit and used hierarchical clustering on 4 demographic variables: age, job, marital status, and education. We colored the leaves based on the contact communication type (unknown, cellular, telephone).
From the dendrogram, we identify four possible clusters of subscribers. From the leaves, we see that one very small group of subscribers (pink branches) is closely associated with being reached via telephone (lavender-colored leaves), whereas the other three clusters are associated mainly with cellular communication (aquamarine-colored leaves). This suggests that both forms of communication were successful but on different types of subscribers. However, the strength of the relationship between subscriber demographic and contact type is harder to interpret since the majority of the dataset is cellular outreach (64.8%). We also recognize that 28.8% of the data is unknown outreach, which may impact the findings of the dendrogram should there be additional information on the unknown data.
For our first research question, we see that the target client profile would be an individual with the financial and personal means to not only make a deposit but also be willing to make a deposit given their need for greater financial stability. More specifically, banks tend to target individuals that are married and have received at least secondary education, with no personal loans.
As for the second research question, we found that clients with jobs with greater salaries are more likely to subscribe to term deposits. We also learned that subscribers seem to have a larger age range than nonsubscribers and are generally older than nonsubscribers. Regarding credit default status, we found that those with a default are significantly less likely to make a deposit. However, we cannot assume whether this is because individuals with defaults are less willing to make a subscription given their lack of significant funds, or if it is because banks may reject a client due to their credit default history.
In terms of communication, we find that outreach that has a longer duration is correlated with a higher chance of subscription. It seems like both modes of contact — telephone and cellular — can be successful, but they are stronger with different demographics of people. Additionally, we found that higher balance, higher duration, and less time between outreach efforts are related to successful subscriptions as well.
While these results have provided useful information regarding the effectiveness of telemarketing campaigns run by the Portuguese bank, it is important to note that these findings should not be assumed to hold for all other financial institutions. We therefore believe that future research in this area could expand to different countries and include more descriptive characteristics about the financial standing of the participants. From our research, it seems that the people who subscribe to deposits are not in an extremely dire financial situation, as they still have the money to subscribe, but they are also still in need of help from the bank. It would be interesting to quantify this threshold more specifically and analyze which indicators of financial status make someone more or less likely to subscribe to a term deposit. We also recognize that the success rate for subscription in this dataset was very low (13.2%), so it would be helpful to seek a more comprehensive dataset to better inform our last two research questions.