LIVE DEMO ->
This project develops an end-to-end machine learning model to predict loan default risk for a financial services platform. The model is trained on a dataset of borrower attributes, including financial, demographic, and credit-related features. The goal is to enable data-driven decisions in the loan underwriting process.
- Predictive modeling using machine learning algorithms
- Feature engineering and selection with scikit-learn pipelines (a minimal sketch of how these pieces fit together follows this list)
- Model evaluation and hyperparameter tuning with Optuna
- Model registry and tracking with MLflow
- Testing with pytest
- Deployment of an API with FastAPI
- Frontend with a Streamlit app
- Containerization with Docker
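A minimal sketch of how the modeling pieces fit together with scikit-learn and XGBoost. The column names and parameters here are placeholders, not the project's actual configuration (see `Prediction_Model/FE_pipeline.py` and `config.py` for that):

```python
# Illustrative sketch only: column names and parameters are placeholders,
# not the project's actual configuration.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

numeric_features = ["loan_amnt", "annual_inc", "dti"]    # placeholder column names
categorical_features = ["home_ownership", "purpose"]     # placeholder column names

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("features", preprocessor),
    ("clf", XGBClassifier(eval_metric="logloss")),
])
# model.fit(X_train, y_train) would then fit preprocessing and classifier in one step
```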
- Python (>=3.7)
- Poetry (>=1.8); see how to install it here
- Dependencies listed in pyproject.toml or requirements.txt (do not use conda.yaml for dependencies)
- Clone the repository:
gh repo clone jyothisable/LoanTap-Credit-Default-Risk-Model
- Set up the virtual environment and dependencies. In a terminal, navigate to the project directory and run the following commands:
poetry install
poetry shell
┣ data/ - Contains raw and processed data files for the project
┃ ┣ LCDataDictionary.xlsx - Metadata dictionary for the loan dataset
┃ ┣ US_zip_to_cord.csv - CSV mapping U.S. ZIP codes to geographical coordinates
┃ ┣ loan.csv - Main dataset containing detailed loan information (not available on GitHub because of size limitations; download it from [Kaggle](https://www.kaggle.com/datasets/ranadeep/credit-risk-dataset))
┃ ┣ loan_reduced.csv - Filtered and reduced version of the main dataset
┃ ┣ test_data.csv - Dataset used for evaluating model performance
┃ ┗ train_data.csv - Dataset used for training machine learning models
┣ Prediction_Model/ - Contains source code for model development, training, and predictions
┃ ┣ trained_models/ - Directory for storing trained models and pipelines
┃ ┃ ┣ XBG_model_final.pkl - Final trained XGBoost model
┃ ┃ ┣ fe_eval_model.pkl - Feature engineering evaluation model
┃ ┃ ┣ fe_eval_tuned_model.pkl - Tuned model for evaluating feature engineering
┃ ┃ ┣ fe_pipeline_fitted_final.pkl - Fitted feature engineering pipeline
┃ ┃ ┗ target_pipeline_fitted.pkl - Fitted target pipeline for reverse transformations after predictions
┃ ┣ __init__.py - Initialization file for package setup
┃ ┣ FE_pipeline.py - Script for feature engineering pipeline configurations
┃ ┣ config.py - Configuration file defining project parameters and settings
┃ ┣ data_handling.py - Script for loading, cleaning, and managing datasets and pipelines
┃ ┣ evaluation.py - Script for evaluating models and feature engineering pipelines
┃ ┣ get_features.py - Utility for extracting features from the data
┃ ┣ plotting.py - Script for generating plots and visualizations
┃ ┣ predict.py - Script for running predictions using trained models
┃ ┗ train.py - Main script for training machine learning models
┣ notebooks/ - Directory for Jupyter notebooks, images, and analysis reports
┃ ┣ Designer.jpeg - Image file for branding or presentation purposes
┃ ┣ EDA_report.html - HTML report summarizing exploratory data analysis
┃ ┣ LC.png - Additional image resource related to the project
┃ ┣ loantap_logo.png - Logo image for the project
┃ ┗ model_prototyping.ipynb - Jupyter notebook for exploratory data analysis and model prototyping
┣ tests/ - Contains unit tests and integration tests to ensure code robustness
┃ ┣ __init__.py - Initialization file for the tests package
┃ ┣ data_tests.py - Tests for data handling and processing functions
┃ ┗ test_prediction.py - Tests for the prediction module
┣ .dockerignore - Specifies files and directories to ignore when building Docker images
┣ .gitignore - Specifies files and directories to exclude from Git version control
┣ MLProject - Configuration for running MLflow projects
┣ fastapi_app.py - Script to run a FastAPI web application for serving the model as an API
┣ streamlit_app.py - Script to run a Streamlit application for visualizing data and making predictions
┣ Dockerfile - Instructions to build a Docker container for the project
┣ conda.yaml - Environment configuration file for MLflow dependencies (not to be used for project setup)
┣ requirements.txt - List of required Python packages for the project (for use with pip)
┣ pyproject.toml - Configuration file defining project dependencies and settings using Poetry
┣ poetry.lock - Dependency lock file for consistent environment setup using Poetry
┣ LICENSE.md - Legal license information for the project
┗ README.md - Documentation file with an overview, setup, and usage instructions
Important
Make sure to activate Poetry by running poetry shell in the root directory before running any commands, or prefix every command below with poetry run.
To train the model, run the following commands in the root directory:
python Prediction_Model/train.py # feature engineering pipeline
python Prediction_Model/train.py # training pipeline
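Under the hood, the training step roughly fits the pipelines, trains the XGBoost model, logs the run to MLflow, and saves the fitted artifacts under Prediction_Model/trained_models/. A simplified sketch with synthetic stand-in data (the actual logic lives in Prediction_Model/train.py):

```python
# Simplified sketch of the training flow with synthetic stand-in data;
# the project's actual logic lives in Prediction_Model/train.py.
import joblib
import mlflow
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

X_train, y_train = make_classification(n_samples=1000, n_features=10, random_state=42)

with mlflow.start_run(run_name="xgb_training"):
    model = XGBClassifier(eval_metric="logloss")
    model.fit(X_train, y_train)
    mlflow.log_metric("train_f1", f1_score(y_train, model.predict(X_train)))
    # the project saves its fitted artifacts under Prediction_Model/trained_models/
    joblib.dump(model, "XBG_model_final.pkl")
```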
To make predictions on test data, run the following command in the root directory:
python Prediction_Model/predict.py
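Conceptually, prediction loads the fitted feature-engineering pipeline, the trained model, and the target pipeline, then transforms the test data and maps predictions back to the original labels. A sketch, assuming the target column is named loan_status (check Prediction_Model/config.py for the actual name):

```python
# Sketch of the prediction flow; artifact paths match the repo layout, but the
# target column name ("loan_status") is an assumption -- check Prediction_Model/config.py.
import joblib
import pandas as pd

fe_pipeline = joblib.load("Prediction_Model/trained_models/fe_pipeline_fitted_final.pkl")
model = joblib.load("Prediction_Model/trained_models/XBG_model_final.pkl")
target_pipeline = joblib.load("Prediction_Model/trained_models/target_pipeline_fitted.pkl")

test_df = pd.read_csv("data/test_data.csv")
X_test = test_df.drop(columns=["loan_status"], errors="ignore")  # assumed target column name

preds = model.predict(fe_pipeline.transform(X_test))
labels = target_pipeline.inverse_transform(preds)  # reverse the target transformation
print(labels[:5])
```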
To test the model, run the following command in the root directory:
pytest tests/test_prediction.py
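For reference, a minimal example of the kind of check such a test might contain (illustrative only, not copied from the repository's tests):

```python
# Illustrative example of the kind of check tests/test_prediction.py performs;
# not copied from the repository.
import joblib
import pandas as pd


def test_model_predicts_binary_labels():
    fe_pipeline = joblib.load("Prediction_Model/trained_models/fe_pipeline_fitted_final.pkl")
    model = joblib.load("Prediction_Model/trained_models/XBG_model_final.pkl")

    sample = pd.read_csv("data/test_data.csv").head(5)
    sample = sample.drop(columns=["loan_status"], errors="ignore")  # assumed target column

    preds = model.predict(fe_pipeline.transform(sample))
    assert len(preds) == 5
    assert set(preds).issubset({0, 1})
```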
To serve the model with FastAPI, run the following command in the root directory:
python fastapi_app.py
POST to localhost:8000/predict with Postman, or open localhost:8000/predict/docs in a browser for documentation and testing.
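A minimal sketch of what such a /predict endpoint looks like (the real app and its request schema are in fastapi_app.py; the field names below are placeholders):

```python
# Minimal FastAPI sketch; the real app and its request schema are in fastapi_app.py.
# The field names below are placeholders, not the project's actual schema.
import joblib
import pandas as pd
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
fe_pipeline = joblib.load("Prediction_Model/trained_models/fe_pipeline_fitted_final.pkl")
model = joblib.load("Prediction_Model/trained_models/XBG_model_final.pkl")


class LoanApplication(BaseModel):  # placeholder request schema
    loan_amnt: float
    annual_inc: float
    dti: float


@app.post("/predict")
def predict(application: LoanApplication):
    X = pd.DataFrame([application.dict()])
    prediction = int(model.predict(fe_pipeline.transform(X))[0])
    return {"default_risk": prediction}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```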
To run the Streamlit app locally:
streamlit run streamlit_app.py # local
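A minimal sketch of a Streamlit front end that collects inputs and forwards them to the prediction API (field names and the request payload are placeholders; the actual UI lives in streamlit_app.py):

```python
# Minimal Streamlit sketch; the real UI lives in streamlit_app.py.
# Field names and the request payload are placeholders.
import requests
import streamlit as st

st.title("LoanTap Credit Default Risk")

loan_amnt = st.number_input("Loan amount", min_value=0.0)
annual_inc = st.number_input("Annual income", min_value=0.0)
dti = st.number_input("Debt-to-income ratio", min_value=0.0)

if st.button("Predict"):
    payload = {"loan_amnt": loan_amnt, "annual_inc": annual_inc, "dti": dti}
    response = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
    st.write(response.json())
```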
To pull the Docker image from Docker Hub, run the following command:
# Pull the docker image
docker pull jyothisable/credit_risk_streamlit_app
To run the Docker container, use the following command:
# Run the docker container
docker run -p 8501:8501 jyothisable/credit_risk_streamlit_app # go to http://localhost:8501 in browser
Refer to the data dictionary here or in data/LCDataDictionary.xlsx.
The model used for this project is an XGBoost classifier. The hyperparameters used for training, tuned with Optuna, are as follows:
{
'max_depth': 9,
'learning_rate': 0.094,
'n_estimators': 507,
'gamma': 0.0062,
'subsample': 0.962,
'colsample_bytree': 0.795,
'lambda': 0.389,
'alpha': 0.0233,
'scale_pos_weight': 1.99,
'min_child_weight': 2,
'grow_policy': 'lossguide'
}
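These are XGBoost's native parameter names; when reusing them with the scikit-learn wrapper, lambda and alpha correspond to reg_lambda and reg_alpha. A sketch of rebuilding the classifier from these values (the tree_method setting is an assumption, since grow_policy='lossguide' requires a histogram-based tree method, and is not taken from the project):

```python
# Rebuilding the tuned classifier with the scikit-learn wrapper.
# 'lambda' and 'alpha' above map to reg_lambda / reg_alpha here;
# tree_method="hist" is an assumption (grow_policy="lossguide" needs a histogram-based method).
from xgboost import XGBClassifier

tuned_params = {
    "max_depth": 9,
    "learning_rate": 0.094,
    "n_estimators": 507,
    "gamma": 0.0062,
    "subsample": 0.962,
    "colsample_bytree": 0.795,
    "reg_lambda": 0.389,
    "reg_alpha": 0.0233,
    "scale_pos_weight": 1.99,
    "min_child_weight": 2,
    "grow_policy": "lossguide",
}

model = XGBClassifier(**tuned_params, tree_method="hist", eval_metric="logloss")
```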
The model achieved an F1 score of 79.12% and a recall of 85.84% on the test dataset.
This project is licensed under the MIT License.