Prediction of loan defaulters based on a training set of more than 5 lakh (500,000) records using Python, NumPy, Pandas and XGBoost
The problem was hosted as a Machine Learning Challenge on HackerEarth. You can download the dataset from the challenge page or from the direct link to the same dataset here.
The code achieves about 97.6% accuracy in its predictions. My best submission ranked 15th on the leaderboard, but I finished 19th because the final ranking was scored on a less accurate submission. More than 200 participants achieved over 90% accuracy.
I know that, beyond accuracy alone, there must be better ways of doing things. I'm a beginner just getting started with machine learning, so suggestions and feedback on the implementation are most welcome.
The Bank Indessa has not done well in the last 3 quarters. Its NPAs (Non-Performing Assets) have reached an all-time high. It is starting to lose the confidence of its investors. As a result, its stock has fallen by 20% in the previous quarter alone.
After careful analysis, it was found that loan defaulters contributed the majority of the NPAs. With the messy data collected over all the years, the bank has decided to use machine learning to figure out a way to find these defaulters and devise a plan to reduce them.
This bank uses a pool of investors to sanction its loans. For example, if a customer has applied for a loan of $20,000, the investors, along with the bank, perform due diligence on the requested loan application. Keep this in mind while understanding the data.
In this challenge, you will help this bank by predicting the probability that a member will default.