This is a fictional project for study purposes. The business context and the insights are not real. The dataset comes from a health insurance company that sells various kinds of insurance, and it is available on Kaggle.
An insurance company sells health insurance to its customers and wants to start offering them vehicle insurance to diversify its products. The company surveyed its customers to find out which ones would be interested in vehicle insurance, so that it can cross-sell to them by phone. It has the capacity to make only two thousand calls. One way to reach as many interested customers as possible with the fewest calls is to build a machine learning model that sorts the customer list so that the customers most likely to buy come first. This is a classification problem framed as a ranking task, commonly called learning to rank.
Machine Learning Classification Model: Using the dataset from Kaggle, a machine learning classification model was created to be used for future predictions. The notebook used to create the model is available here.
Flask Prediction API: The model is hosted on Heroku and can be accessed through an API built with Flask. The API source code is available here.
Google Sheets Script: A Google Sheets script was developed to make predictions for several customers at once. The spreadsheet is available here. The top menu has an entry called "Health Insurance Prediction". To make predictions, the user clicks it and then clicks "Get Prediction"; the predictions for all rows in the spreadsheet then appear in the prediction column.
Information about the attributes can be found here.
Attribute | Description |
---|---|
id | Unique ID for the customer |
Gender | Gender of the customer |
Age | Age of the customer |
Driving_License | 0 : Customer does not have DL, 1 : Customer already has DL |
Region_Code | Unique code for the region of the customer |
Previously_Insured | 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance |
Vehicle_Age | Age of the Vehicle |
Vehicle_Damage | 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past. |
Annual_Premium | The amount the customer needs to pay as premium in the year |
Policy_Sales_Channel | Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc. |
Vintage | Number of days the customer has been associated with the company |
Response | 1 : Customer is interested, 0 : Customer is not interested |
- Cross-selling is a sales technique that involves selling an additional product or service to an existing customer.
- Learning to rank is a kind of problem in which the objective is to order a table by the probability that each row belongs to a specific class.
- No Policy Sales Channel is better than the others; they should all have the same weight in the model prediction.
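The ranking idea described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `rank_customers`, the customer ids and the scores are all hypothetical, and the scores stand in for the probabilities a classifier would output.

```python
def rank_customers(customer_ids, scores, budget):
    """Return the ids of the `budget` customers with the highest scores."""
    if len(customer_ids) != len(scores):
        raise ValueError("customer_ids and scores must have the same length")
    # Sort (id, score) pairs from the highest to the lowest score.
    ranked = sorted(zip(customer_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:budget]]


# Hypothetical scores for five customers; only two calls can be made.
ids = [101, 102, 103, 104, 105]
scores = [0.10, 0.85, 0.40, 0.92, 0.05]
call_list = rank_customers(ids, scores, budget=2)
print(call_list)  # [104, 102]
```

In the business problem the budget would be 2000 and the scores would come from the trained model.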
- Understand the Business problem.
- Download the dataset from Kaggle.
- Clean the dataset by removing outliers, NA values and unnecessary features.
- Explore the data to create hypotheses, think about a few insights and validate them.
- Prepare the data for the modeling algorithms by encoding variables, splitting it into training and test sets, and applying other necessary operations.
- Create the models using machine learning algorithms.
- Evaluate the created models to find the one that best fits the problem.
- Tune the model to achieve a better performance.
- Deploy the model in production so that it is available to the user.
- Find possible improvements to be explored in the future.
The final result of this project is a classification model used to rank the customer table. Six models were trained: KNN (K-Nearest Neighbors), Logistic Regression, Extra Trees, Random Forest, XGBoost and LightGBM.
The Boruta algorithm was used to select features, but it selected only one, which suggests that the dataset features are not very good at explaining whether a customer wants vehicle insurance. The features were therefore chosen based on feature importance in an Extra Trees model, and seven were selected. The models were evaluated on two metrics, Precision at K and Recall at K, computed over the first two thousand rows of the ranked table (K = 2000). The initial model performances are in the table below.
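For reference, the two metrics can be sketched as follows. This uses the standard definitions (Precision at K is the share of the top K rows that are positive, Recall at K is the share of all positives that appear in the top K); it is an assumption that the notebook follows the same definitions, and the function name and toy data are hypothetical.

```python
def precision_recall_at_k(y_true, scores, k):
    """Compute Precision at K and Recall at K for binary labels (0 or 1)."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    # Indices of the rows ordered from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = order[:k]
    hits = sum(y_true[i] for i in top_k)
    total_positives = sum(y_true)
    # Divide by the number of rows actually taken, in case k exceeds the table size.
    precision = hits / len(top_k)
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy data: three interested customers out of six.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
print(precision_recall_at_k(labels, scores, k=4))  # (0.75, 1.0)
```

Because K is fixed at 2000 while the number of interested customers is much larger, Recall at K is bounded well below 1 even for a perfect ranking, which is why the recall values in the tables are small.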
Model Name | Precision at K | Recall at K |
---|---|---|
LightGBM | 0.4153 | 0.0895 |
XGBoost | 0.4078 | 0.0879 |
Random Forest | 0.3363 | 0.0725 |
KNN | 0.3338 | 0.0719 |
Extra Trees | 0.3288 | 0.0709 |
Logistic Regression | 0.3028 | 0.0653 |
To decide on the final model, cross-validation was carried out to evaluate the algorithms more robustly. The results are in the table below.
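A minimal sketch of how a cross-validated Precision at K could be aggregated is shown here. It assumes that the "+/-" values are the standard deviation across folds, which the text does not state. The toy "rate model", the helper names and the data are hypothetical stand-ins for the real classifiers and dataset.

```python
import random
import statistics


def precision_at_k(y_true, scores, k):
    """Share of positives among the k highest-scored rows."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = order[:k]
    return sum(y_true[i] for i in top_k) / len(top_k)


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle the row indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]


def fit_rate_model(x_train, y_train):
    """Toy model: score each category by its positive rate in the training rows."""
    totals, positives = {}, {}
    for x, y in zip(x_train, y_train):
        totals[x] = totals.get(x, 0) + 1
        positives[x] = positives.get(x, 0) + y
    overall = sum(y_train) / len(y_train)
    rates = {x: positives[x] / totals[x] for x in totals}
    # Unseen categories fall back to the overall positive rate.
    return lambda x: rates.get(x, overall)


def cross_validate(x, y, n_folds, k):
    """Return the mean and standard deviation of Precision at K across folds."""
    results = []
    for fold in k_fold_indices(len(x), n_folds):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit_rate_model([x[i] for i in train], [y[i] for i in train])
        val_scores = [model(x[i]) for i in fold]
        val_labels = [y[i] for i in fold]
        results.append(precision_at_k(val_labels, val_scores, k))
    return statistics.mean(results), statistics.pstdev(results)


# Toy data: one categorical feature, 100 customers, 23 of them interested.
x = ["damaged"] * 40 + ["undamaged"] * 60
y = [1] * 20 + [0] * 20 + [1] * 3 + [0] * 57
mean_precision, std_precision = cross_validate(x, y, n_folds=5, k=5)
print(f"{mean_precision:.4f} +/- {std_precision:.4f}")
```

Each fold is scored by a model that never saw its rows, so the spread across folds gives a rough idea of how stable the ranking quality is.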
Model Name | Precision at K | Recall at K |
---|---|---|
LightGBM CV | 0.4222 +/- 0.0037 | 0.1128 +/- 0.0007 |
XGBoost CV | 0.4120 +/- 0.0055 | 0.1102 +/- 0.0013 |
Random Forest CV | 0.3526 +/- 0.0106 | 0.0942 +/- 0.0031 |
KNN CV | 0.3374 +/- 0.0059 | 0.0904 +/- 0.0015 |
Extra Trees CV | 0.3216 +/- 0.0039 | 0.0860 +/- 0.0011 |
Logistic Regression CV | 0.2958 +/- 0.0111 | 0.0792 +/- 0.0032 |
The LightGBM model performed best among all the models and was selected for deployment. After the final model was chosen, random search hyperparameter optimization was used to improve its performance. The final model evaluation metrics are in the table below.
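Random search can be sketched as below. The search space and the `toy_evaluate` function are hypothetical: the parameter names match LightGBM's scikit-learn wrapper, but the values tried in the project are not given in the text, and in practice the evaluation function would return the cross-validated Precision at K of a model trained with the sampled parameters.

```python
import random


def random_search(param_space, evaluate, n_iter, seed=0):
    """Sample n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per parameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, not the one used in the project.
param_space = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [15, 31, 63],
    "max_depth": [-1, 5, 10],
}


def toy_evaluate(params):
    """Placeholder objective that stands in for a cross-validated metric."""
    return -abs(params["learning_rate"] - 0.05) - abs(params["num_leaves"] - 31) / 100


best_params, best_score = random_search(param_space, toy_evaluate, n_iter=20)
print(best_params, best_score)
```

Unlike a grid search, the number of evaluations is fixed by `n_iter` rather than by the size of the search space, which keeps the cost predictable when each evaluation means retraining a model.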
Model Name | Precision at K | Recall at K |
---|---|---|
LightGBM | 0.433 +/- 0.0067 | 0.1158 +/- 0.0018 |
Applied to the initial dataset of 381,109 clients, the model would reach 701 more clients interested in vehicle insurance than picking 2,000 clients at random from the database. This represents an increase of 297.03% in the number of successful calls.
Although the dataset features are weak predictors of whether a customer wants vehicle insurance, the resulting model sorts the table better than a random ordering. The model can help the company achieve a higher success rate when calling customers. However, additional features would help improve the model's predictive power.
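The two reported figures can be cross-checked with a little arithmetic. The baseline and model counts below are inferred from the 701 extra clients and the 297.03% increase, assuming the increase is measured relative to the random selection; they are not taken from the notebook.

```python
extra_customers = 701        # reported gain over random selection
relative_increase = 2.9703   # reported increase of 297.03%

# Implied number of interested clients among 2000 random calls.
baseline_hits = round(extra_customers / relative_increase)
# Implied number of interested clients among the 2000 calls ranked by the model.
model_hits = baseline_hits + extra_customers

print(baseline_hits, model_hits)                 # 236 937
print(f"{extra_customers / baseline_hits:.2%}")  # 297.03%
```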
- Improve model prediction capabilities by adding new features.
- Explore the dataset to find possible insights.
- Try other machine learning algorithms.