Table of Contents
The project showcases the methods of cleaning and analysing large datasets. The analysis of this project is on a Customer Loans dataset.
- Clone the repository to your local device
- Running data_file.csv through data_file_transformation.ipynb cleans the data and creates the renditions of the database at different stages
- Running the appropriate .csv file through the querying_data.ipynb outputs important queries about the data
This project takes a database in the .csv format (specifically a database of financial and loan information). The original database is stored at data_file.csv.
-
The data_file_transformation.ipynb incrementally cleans the data at each stage through:
- Converting columns into the correct data formats
- Assessing the missing data and what columns may need to be dropped in relation to this.
- Imputing missing values which are applicable to imputation
- Checking the normality of the columns and conducting transformations in order to normalise them
- Visualising the data
- Treating the data through removal of outliers
- Checking for correlations between columns and removing columns as applicable
-
There are checkpoints at which the dataframe is saved after its transformations. In this case normalized_data.csv is created after normality transformation have been applied. The filtered_dataframe.csv is created at the end of the correlation analysis.
-
Data analysis :
- The querying_data.ipynb is the notebook to use in order to query the data. It works side-by-side with the Query class and VisualiseQuery class in db_utils.py in order to provide functions for analysing the data.
- It visualises and answers:
- The percentage of recoveries based on investor funding and total funding
- The percentage of loans that have been a loss for the company
- The percetnage of loans which have been charged off historically
- The total amount which was paid towards these loans before they were charged off
- The projected loss of the loans marked as charged off
- The loss in revenue for the loans which were charged off if they had finished their term
- The loss of revenue month-on-month
- The percentage of users which present a risk to the company revenue
- The amount of customers who are deemed a risk and how much loss the company would incur if their status was changed to 'charged off'
- The percentage of total expected revenue represented by risky customers and customers who have already defaulted on their loans
- Analysis pointing out what the indicators of loss may be based on the information of the customers who were 'charged off'
Here is the list of definitions for each data-file.csv columns:
List of definitions for each column
- **id**: unique id of the loan
- **member_id**: id of the member to took out the loan
- **loan_amount**: amount of loan the applicant received
- **funded_amount**: The total amount committed to the loan at the point in time
- **funded_amount_inv**: The total amount committed by investors for that loan at that point in time
- **term**: The number of monthly payments for the loan
- **int_rate**: Interest rate on the loan
- **instalment**: The monthly payment owned by the borrower
- **grade**: LC assigned loan grade
- **sub_grade**: LC assigned loan sub grade
- **employment_length**: Employment length in years.
- **home_ownership**: The home ownership status provided by the borrower
- **annual_inc**: The annual income of the borrower
- **verification_status**: Indicates whether the borrowers income was verified by the LC or the income source was verified
- **issue_date:** Issue date of the loan
- **loan_status**: Current status of the loan
- **payment_plan**: Indicates if a payment plan is in place for the loan. Indication borrower is struggling to pay.
- **purpose**: A category provided by the borrower for the loan request.
- **dti**: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
- **delinq_2yr**: The number of 30+ days past-due payment in the borrower's credit file for the past 2 years.
- **earliest_credit_line**: The month the borrower's earliest reported credit line was opened
- **inq_last_6mths**: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
- **mths_since_last_record**: The number of months since the last public record.
- **open_accounts**: The number of open credit lines in the borrower's credit file.
- **total_accounts**: The total number of credit lines currently in the borrower's credit file
- **out_prncp**: Remaining outstanding principal for total amount funded
- **out_prncp_inv**: Remaining outstanding principal for portion of total amount funded by investors
- **total_payment**: Payments received to date for total amount funded
- **total_rec_int**: Interest received to date
- **total_rec_late_fee**: Late fees received to date
- **recoveries**: post charge off gross recovery
- **collection_recovery_fee**: post charge off collection fee
- **last_payment_date**: Last month payment was received
- **last_payment_amount**: Last total payment amount received
- **next_payment_date**: Next scheduled payment date
- **last_credit_pull_date**: The most recent month LC pulled credit for this loan
- **collections_12_mths_ex_med**: Number of collections in 12 months excluding medical collections
- **mths_since_last_major_derog**: Months since most recent 90-day or worse rating
- **policy_code**: publicly available policy_code=1 new products not publicly available policy_code=2
- **application_type**: Indicates whether the loan is an individual application or a joint application with two co-borrowers
- Python
- Data analysis through python coding
- Python notebook use
- Cleaning data
- Visualisation of data insights in python
- Large dataset management in the financial context