People take out loans for numerous reasons: to pay for college, to buy a house or pay for renovations, to purchase a car, to refinance or consolidate existing debt, and to pay for big events (e.g., weddings and vacations), among other reasons.
From the lender's perspective, it would be useful to know which features of loans or borrowers correspond to differences in the likelihood that a loan is paid off.
Using (slightly altered) data from Lending Club here, your team is asked the following:
You work for a company that issues loans. Your company's data servers malfunctioned, causing some of the data to be "corrupted." Important information about the loans in the corrupted data was lost, including the repayment status of these loans.
Using the complete, uncorrupted data here, your task is to predict the repayment statuses of the corrupted loans. In particular, you are asked to predict which of the following groups each loan's status falls into:
Additionally, your boss is interested in knowing what features of the loans or borrowers correspond to a higher likelihood of being in the "Bad" group.
For every loan in the corrupted dataset, you must provide a prediction of which group that loan is in. Each prediction can be either a label, 0 (for Group 0) or 1 (for Group 1), or the probability that the loan is in Group 1.
Additional information on the data can be found here.
You may not use any data sources other than the datasets provided, including any data you find online. Exactly how you justify your answer is up to you. That said, we suggest the following:
Each team should submit all of the following:
Submission constitutes permission to post (anonymized) winning team entries online.
The 15 teams whose predictions have the lowest Brier Scores will advance to the judging round.
Of these 15, eight teams will make the finals, as determined by a group of expert judges, who will read the reports.
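For reference, the Brier Score used in the ranking above is not defined in this brief. The sketch below assumes the standard two-class definition: the mean squared difference between each prediction and the true outcome coded as 0 or 1. The function name `brier_score` and the four example loans are hypothetical and purely illustrative. They are not drawn from the competition data.

```python
from typing import Sequence


def brier_score(predictions: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predictions and 0/1 outcomes.

    Assumes the standard two-class Brier score; lower is better.
    Predictions may be hard labels (0 or 1) or probabilities of Group 1.
    """
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if len(predictions) == 0:
        raise ValueError("at least one prediction is required")
    for p in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"prediction {p} is outside [0, 1]")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: four loans with made-up true groups (1 = Group 1).
outcomes = [0, 1, 1, 0]
hard_labels = [0, 1, 0, 0]             # 0/1 predictions, one of them wrong
probabilities = [0.1, 0.8, 0.4, 0.2]   # predicted probability of Group 1

hard_score = brier_score(hard_labels, outcomes)
prob_score = brier_score(probabilities, outcomes)
print(f"hard labels:   {hard_score:.4f}")
print(f"probabilities: {prob_score:.4f}")
```

Under this definition, a wrong hard label costs a full squared error of 1 for that loan, while a wrong but uncertain probability costs less. This is one reason a team might prefer to submit probabilities rather than 0/1 labels.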