Introduction

Description of Data

This dataset consists of data on borrowers of, and loans issued by, Lending Club. There are 145 columns and 2,260,668 rows in this dataset, with each row representing one loan issued by Lending Club between 2007 and 2018. The dataset is rich in detail, covering not only the loans but also the borrowers taking them out. Unfortunately, the member_id column is NULL; comments suggest this may have been done to protect borrower identity. This means we cannot study the borrowing habits of repeat customers, but we can and will analyze general loan data on a large scale across loan issue years and grades.
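As a minimal sketch of how the empty member_id column could be confirmed, the snippet below measures the share of blank values in a column. The in-memory CSV is a made-up stand-in for the real file, and the column names loan_amnt, grade, and issue_d are assumed to match the Lending Club export; empty_fraction is a hypothetical helper, not part of the original analysis.

```python
import csv
import io

# Tiny in-memory stand-in for the real file. The column names are assumed
# to match the Lending Club export, and the values are made up.
SAMPLE_CSV = """member_id,loan_amnt,grade,issue_d
,2500,C,Dec-2018
,30000,A,Nov-2017
,5000,F,Jan-2010
"""


def empty_fraction(rows, column):
    """Return the share of rows whose value in `column` is blank or missing."""
    rows = list(rows)
    if not rows:
        return 0.0
    blank = sum(1 for row in rows if not (row.get(column) or "").strip())
    return blank / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
member_id_empty = empty_fraction(rows, "member_id")
loan_amnt_empty = empty_fraction(rows, "loan_amnt")
print(member_id_empty, loan_amnt_empty)
```

A fraction of 1.0 for a column means it carries no usable information, which is the situation described for member_id.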

This level of detail on loan data is very useful for studying demographic tendencies in borrowing. We could potentially find strong associations between the credit rating of a loan and specific characteristics of the dataset population, such as renting versus homeownership, geographical location, financial health (several metrics), etc. The depth of information available gives us considerable flexibility in which hypotheses we study and how we study them.

Research Questions

Loan Volume and Interest Rates

For one of our research questions, we are interested in learning whether the total loan amount in each state is correlated with the average interest rate in that state, given a loan’s grade. Intuitively, the total loan amount in each state might reflect the state’s economic activity, and high economic activity is sometimes indicative of good economic health. If a region (or, in this case, state) has fairly high economic activity and good economic health, then the borrowers in that area may qualify for lower interest rates because they are not at a high risk of default. In the following section, we will compose spatial polygons to see if this is the case.

We first choose 3 of the more representative grades (A, C, and F) to stand for good, medium, and poor borrower creditworthiness. Note: we exclude Iowa from these maps because its low loan volume skews the maps visually.
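The per-state quantities behind these maps can be sketched as a simple aggregation: for one grade, sum the loan amounts and average the interest rates within each state, leaving out Iowa. The records below are invented for illustration, and summarize_by_state is a hypothetical helper rather than the code used for the maps.

```python
from collections import defaultdict

# Hypothetical records: (state, grade, loan amount in dollars, interest rate in %).
loans = [
    ("CA", "A", 10000, 7.0),
    ("CA", "A", 20000, 8.0),
    ("TX", "A", 5000, 6.5),
    ("CA", "F", 15000, 28.0),
    ("IA", "A", 1000, 7.5),
]


def summarize_by_state(records, grade, exclude=("IA",)):
    """Total loan amount and mean interest rate per state for one grade."""
    totals = defaultdict(float)
    rate_sums = defaultdict(float)
    counts = defaultdict(int)
    for state, loan_grade, amount, rate in records:
        # Keep only the requested grade and skip excluded states (Iowa here).
        if loan_grade != grade or state in exclude:
            continue
        totals[state] += amount
        rate_sums[state] += rate
        counts[state] += 1
    return {
        state: {
            "total_amount": totals[state],
            "mean_rate": rate_sums[state] / counts[state],
        }
        for state in totals
    }


grade_a = summarize_by_state(loans, "A")
print(grade_a)
```

Each state's pair of values (total amount, mean rate) is what a map would encode for that grade.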

Grade A

Grade C

Grade F

From the spatial polygons, we can see that our initial intuition wasn’t very accurate, or at least that the correlation is not as strong as we expected. Across all 3 spatial polygons, there doesn’t seem to be a definitive positive or negative correlation between total loan amount and interest rate; there are states that support our hypothesis, states that contradict it, and states that are neutral to it. Take two states with very high total loan amounts, California and Texas; their rates are low but not noticeably lower than those of states with fewer loans. This, however, could be due to the size and regional socioeconomic diversity of these large states. Some states do follow our hypothesized negative correlation, such as Illinois, Maine, Florida, and North Dakota: as loan amounts rise (fall), interest rates fall (rise). On the other hand, some states run against our hypothesis, showing a positive relationship between interest rates and loan amounts, such as West Virginia and South Dakota. To better visualize this, we composed a linear plot for each grade to reflect the relationship between a state’s total loan amount and its average interest rate.

From the 6 plots, we can see that for every grade other than B, there is an inverse relationship between a state’s total loan amount and that state’s interest rate. That is, for most grades, a state with a higher total loan amount tends to have a lower interest rate. This relationship appears to steepen for borrowers with lower loan grades (D, E, and F). This makes sense intuitively, because borrowers with poor credit are more likely to default, meaning lenders would be taking on more risk by extending a line of credit. To compensate for this risk, loans to borrowers with poorer credit carry higher interest rates.
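The direction and steepness of each grade's line can be read off an ordinary least-squares fit of mean interest rate on total loan amount. The sketch below fits one such line by hand; the four data points are invented, and least_squares_line is a hypothetical helper, so the numbers say nothing about the actual plots.

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-state values for a single grade:
# total loan amount (in $ millions) and mean interest rate (in %).
total_amount_musd = [1.0, 2.0, 3.0, 4.0]
mean_rate_pct = [14.0, 13.5, 13.0, 12.5]

slope, intercept = least_squares_line(total_amount_musd, mean_rate_pct)
print(slope, intercept)
```

A negative slope corresponds to the inverse relationship described above, and a more negative slope for one grade than another corresponds to a steeper line.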

Loan Characteristics and the 2008 Financial Crisis

We also wanted to understand how important loan characteristics changed during and after the financial crisis. To do this, we considered three factors: loan titles (reasons for taking out a loan), interest rates, and loan grades. We included data from the beginning of the financial crisis (2007) up to six years after it ended (2015) because that is approximately how long it took the US economy to recover. Therefore, a total of nine years of data is represented in the following plots.

We first created word clouds for each year of data to see what people were taking out loans for. One issue we ran into was that the amount of data available grew exponentially with each successive year. As such, we decided to sample the data such that an equal proportion of data would be sampled from each year.
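One way to read "an equal proportion from each year" is to draw the same fraction of rows from every issue year, which is a stratified sample. The sketch below does that; the interpretation, the 10% fraction, the made-up records, and the helper sample_equal_fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_equal_fraction(records, key, fraction, seed=0):
    """Draw the same fraction of records from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_group = defaultdict(list)
    for record in records:
        by_group[key(record)].append(record)
    sample = []
    for group in sorted(by_group):
        members = by_group[group]
        # Keep at least one record so that small years are not dropped entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical (issue year, loan title) records with very uneven year sizes.
records = [(2008, f"title {i}") for i in range(10)]
records += [(2015, f"title {i}") for i in range(1000)]

sampled = sample_equal_fraction(records, key=lambda r: r[0], fraction=0.1)
counts = {year: sum(1 for y, _ in sampled if y == year) for year in (2008, 2015)}
print(counts)
```

Sampling a fixed fraction keeps each year's share of the data unchanged while shrinking the total; drawing a fixed number of rows per year would be the alternative if every year should carry equal weight.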

It seems that loan titles were relatively diverse during the crisis but less so after it. For example, in 2008, loans appear to have been used for debt consolidation, credit card payments, funding businesses, or paying bills, but subsequent years show that debt consolidation and credit card refinancing are among the most common reasons for taking out loans. This makes sense because people were strapped for cash during the crisis and may have needed help paying off a variety of debts, whereas during the recovery, people may have needed to refinance the debt they incurred during the crisis.

This trend may also be linked to interest rates. During the crisis, the Federal Reserve cut rates to nearly zero to stimulate consumer spending. As a result, interest rates on debt fell, which made it cheaper to borrow. The relative variety in loan titles during the crisis (2007 - 2009) could be linked to people borrowing easily to make purchases. This pointed us to the second part of this research question, where we wanted to understand how interest rates changed during the crisis and the economic recovery thereafter. Although the data provides information on loan grades A through G, we decided to focus on loans of grades A through D, as that is what Lending Club does now. We expect interest rates to be low during the crisis because of lenient monetary policy during that time and subsequent increases in rates due to a strengthening economy.

Just as we expected, interest rates during the financial crisis increased each year, but there seem to be different interest rate trends for each loan grade following the crisis. Loans of the highest grade (A) actually saw a decrease in rates while riskier loans saw rates increase. The riskier the loan, the larger the interest rate increases each year, but almost all loan grades saw rates fall after 2013.

Financial Health of Borrowers and the 2008 Financial Crisis

Preceding and following the 2008 Financial Crisis, one would imagine there would be a shift in the financial health of borrowers. Pre-2008, lenders may have been more lenient (either with fraudulent aims or under the perception of a healthy economy producing healthy borrowers), but post-2008 taught financial companies to vet borrowers heavily and raise some of the standards for the loans and borrowers. Plotting an MDS scatterplot, we see that the data appears to be clustered by loan grade in a relatively positive trend. To better visualize the data, however, we’ll create a dendrogram.

The dendrograms with labels colored by grade, loan status, and employment (not pictured here) showed essentially no pattern in distribution, although this is understandable for grades and employment given the number of grades and varying years. Instead, we found that the dendrogram with labels colored by home ownership was the most clearly segmented. It’s not quite cleanly split, but it seems RENT comprises most of the green dendrogram grouping on the right while MORTGAGE takes over most of the pink grouping on the left. OWN is too small to discern and is scattered throughout. The two large segments of the dendrogram are very distinct, however, so we will use this coloring segmentation and apply it to bivariate scatterplots to see if a clear pattern arises.

Comparing across a few continuous variables, we find that the 4 most distinct pairs, in which the dendrogram groupings were clear, are interest rate vs. annual income, loan amount vs. annual income, debt-to-income ratio vs. annual income, and number of open accounts vs. annual income. All pairs include annual income, so we could possibly make the claim that this is an important factor in loan analysis. In addition, two comparison plots showed moderately good segmentation with some overlap: number of open accounts vs. debt-to-income ratio and interest rate vs. number of open accounts.

Loan Status vs. Int Rate and Other Factors

Another topic that we wanted to explore was the relationship between various factors and the loan status. Loans which are in unhealthy statuses are likely to be the ones which are more difficult to pay back. We would expect there to be several relevant factors, such as interest rates, income level, possibly purpose, and grade of the loan.

More specifically, we look at the distribution of interest rates for each loan status for different income classes. We also facet on whether or not the loan was pre/post financial crisis, as this will have a large impact. We would expect the unhealthier loans to have higher interest rates. We would also expect higher interest rates in the low-income bracket.
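The grouping behind these boxplots can be sketched as follows: assign each loan an income class and a pre/post-crisis flag, then summarize interest rates per loan status within each combination. The income cutoffs, the cutoff year, and the sample loans are hypothetical choices for illustration, since the exact definitions used for the plots are not stated here.

```python
from collections import defaultdict
from statistics import median


def income_class(annual_income, low_cutoff=40000, high_cutoff=100000):
    """Bin annual income into three classes (the cutoffs are hypothetical)."""
    if annual_income < low_cutoff:
        return "low"
    if annual_income < high_cutoff:
        return "middle"
    return "high"


def crisis_period(issue_year, cutoff_year=2009):
    """Label a loan as pre- or post-crisis (the cutoff year is hypothetical)."""
    return "pre-crisis" if issue_year <= cutoff_year else "post-crisis"


def median_rate_by_group(records):
    """Median interest rate per (income class, period, loan status)."""
    groups = defaultdict(list)
    for income, year, status, rate in records:
        groups[(income_class(income), crisis_period(year), status)].append(rate)
    return {key: median(rates) for key, rates in groups.items()}


# Hypothetical loans: (annual income, issue year, loan status, interest rate in %).
loans = [
    (30000, 2012, "Fully Paid", 14.0),
    (35000, 2012, "Fully Paid", 16.0),
    (30000, 2012, "Late (31-120 days)", 22.0),
    (150000, 2012, "Fully Paid", 11.0),
    (150000, 2008, "Fully Paid", 12.0),
]

medians = median_rate_by_group(loans)
print(medians[("low", "post-crisis", "Fully Paid")])
```

The median is used here because it is the center line a boxplot draws; the full distribution per group is what the boxplots themselves display.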

High Income

For high-income individuals, we see both expected and unexpected results. In the post-crisis boxplot, we see that fully-paid loans have the lowest interest rates at around 11%, followed by Current loans, which is expected. However, we would expect increasing interest rates for “In Grace Period”, “Late (16-30 days)”, and “Late (31-120 days)” respectively, as these imply increasing severity of late payment. Instead, we see that the interest rate decreases across these three statuses. This is likely due to the financial flexibility of high-income individuals, perhaps allowing for larger loan amounts at lower interest rates.

Middle Income

For middle class individuals, we see that the average interest rate level is higher across all statuses compared to upper class individuals. Again, we see that “Late (31-120 days)” is greater than “Late (16-30 days)”, although both are higher than the grace period rate.

Low Income

For low-income individuals, we see the highest average interest rate of the three income classes. We also see that the three late statuses all have interest rates that are considerably higher than those of other statuses. This is likely because low-income individuals are more economically burdened and enter into a positive-feedback cycle between borrowing at higher interest and not being able to meet those higher financial obligations.

As mentioned in the above analyses, pre-crisis lending was generally a lot more lenient, and this can be seen in the similar distribution of interest rates across all classes for all loan statuses.

Now that we have looked at the varying interest rates for each income bracket, we are interested in seeing how the purpose of the loan differs for various income groups. We use a mosaic plot to check for significance in rejecting the null hypothesis that income group and loan purpose are independent. We see that there is a low standardized residual between the lower class and fully paid status, which makes sense as we would expect less of a stable fully paid relationship for individuals in a lower income class. For the other two classes, all cells are brightly shaded.

One more relationship we want to examine is between loan status and purpose of loan. Intuitively, we might expect to see some independence of loan status for the purpose of the loan. However, from the mosaic plot, we see there is a strong level of significance, and we reject the null that these two variables are independent.
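The independence check behind a shaded mosaic plot can be sketched with Pearson residuals: for each cell, compare the observed count with the count expected under independence, and sum the squared residuals to get the chi-square statistic. The 2x2 table below is invented, and pearson_residuals is a hypothetical helper; the real tables have more categories.

```python
from math import sqrt


def pearson_residuals(table):
    """Return (residuals, chi-square statistic, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    chi_square = 0.0
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("a row or column has no observations")
            residual = (observed - expected) / sqrt(expected)
            residual_row.append(residual)
            chi_square += residual ** 2
        residuals.append(residual_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return residuals, chi_square, dof


# Hypothetical counts: rows are loan statuses (Fully Paid, Late),
# columns are loan purposes (debt consolidation, small business).
observed = [
    [30, 10],
    [10, 30],
]

residuals, chi_square, dof = pearson_residuals(observed)
print(chi_square, dof)
```

Cells with large positive or negative residuals are the ones a shaded mosaic plot highlights, and the statistic is compared against a chi-square distribution with the reported degrees of freedom to decide whether to reject independence.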

Conclusion and Possible Future Work

Loan Volume and Interest Rates

It would be interesting to further study this relationship to see what other state-specific variables might exist that could help explain the difference in average interest rates across states. For example, it would be helpful to compare borrowers with similar financial profiles (home ownership, job, dti ratio, FICO score range, delinquencies, etc.) and similar reasons for taking out loans (this is a huge determining factor in interest rates) across states. We could run KS tests to see which states, if any, tend to produce loans with higher interest rates or have a meaningfully different distribution, and progress from there.
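As a rough sketch of the proposed KS comparison, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution functions of two samples. The function below computes only that statistic (no p-value), and the interest rates for the two states are invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so check each one.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical interest rates (in %) for loans of one grade in two states.
rates_state_x = [10.0, 11.0, 12.0, 13.0]
rates_state_y = [12.5, 13.5, 14.5, 15.5]

d_stat = ks_statistic(rates_state_x, rates_state_y)
print(d_stat)
```

A statistic near 0 means the two states' rate distributions look alike, while a value near 1 means they barely overlap; turning the statistic into a significance decision would additionally require the KS reference distribution and the sample sizes.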

Loan Characteristics During and After the 2008 Financial Crisis

There are many more loan characteristics with which we could visualize trends, such as loan amounts, loan status, number of delinquencies, and more. Since much of this type of analysis involves time series, it would also be helpful to not only look for trends, but also for seasonality or noise over time. We may also want to compare trends of the financial crisis to the current economic downturn; if we see similar trends in loan data today, perhaps we could make some inferences about the types of characteristics that loans will have in the near future.

Financial Health of Borrowers and the 2008 Financial Crisis

Similar to our further exploration of the loan volume and interest rates, it would be fascinating to see if we could break down Annual Income geographically and across categorical variables in order to better understand how this important variable affects loan acceptance rate and loan characteristics (term, grade, int. rate). It would also be helpful to compare this across years to discern whether the significance of annual income changes over time.

Loan Status vs. Int Rate and Other Factors

This exploration could further be improved by incorporating loan amounts into the analysis. Although it provided insights into the differences of interest rates between income classes, another consideration is the size of the loan. Simply stated, a larger loan will be harder to pay back, and so perhaps binning loans into different size categories can provide further insight. In another study, we could perhaps further split the relationship between interest rate and loan status by the loan purpose type.