Ahmet Zamanis
- Introduction
- Setup
- Data cleaning
- Feature engineering
- Preprocessing pipeline
- Modeling & hyperparameter optimization
- Model testing
- Sensitivity analysis
- Conclusions
In this report, we’ll build and test binary classification algorithms on a dataset of used cars up for sale. Our goal is to predict whether a car is a “kick”: a car that is heavily damaged or unusable, and should not be purchased. The dataset was sourced from OpenML; it was apparently used for a past Kaggle competition, and possibly for a Chalearn competition.
We will test logistic regression, support vector machine and XGBoost classifiers with the scikit-learn package. As the dataset is quite large, we’ll train our logistic regression & SVM models with Stochastic Gradient Descent. We’ll also build a deep neural network model with PyTorch Lightning. We’ll perform hyperparameter tuning for all models with the Optuna framework. We’ll compare the performances of all models, both with classification performance metrics, and with a sensitivity analysis based on a hypothetical business scenario.
Our target variable is highly imbalanced, as positive observations are strongly in the minority. Training classifiers and evaluating their performance can be tricky in such cases. To try and address the imbalance, we’ll train our scikit-learn models with class weights, use focal loss as our loss function for the neural network, and use suitable metrics & plots to correctly assess performance.
Show packages
# Data handling
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Categorical encoding
from feature_engine.encoding import OneHotEncoder
from feature_engine.creation import CyclicalFeatures
from category_encoders.target_encoder import TargetEncoder
# Preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
# Modeling
from sklearn.utils.class_weight import compute_class_weight
from sklearn.kernel_approximation import RBFSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier
# Neural network modeling
import torch, torchvision, torchmetrics
import lightning.pytorch as pl
from optuna.integration.pytorch_lightning import PyTorchLightningPruningCallback
# Hyperparameter tuning
import optuna
# Loss functions & performance metrics
from sklearn.metrics import log_loss
from sklearn.metrics import hinge_loss
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.metrics import PrecisionRecallDisplay, precision_recall_curve
# Utilities
import warnings
from itertools import product
Show settings
# Set printing options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
pl.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
We load our dataset, which is in .arff format, and carry out some data cleaning operations. Most of the operations are explained in the code comments, so I’ll keep the text to a minimum in this section.
# Load raw data
raw_data = loadarff("./RawData/kick.arff")
# Convert to pandas dataframe
df = pd.DataFrame(raw_data[0])
# Print first & last 5 rows, all columns
print(df)
IsBadBuy PurchDate Auction VehYear VehicleAge Make
0 b'0' 1260144000.0000 b'ADESA' 2006.0000 3.0000 b'MAZDA' \
1 b'0' 1260144000.0000 b'ADESA' 2004.0000 5.0000 b'DODGE'
2 b'0' 1260144000.0000 b'ADESA' 2005.0000 4.0000 b'DODGE'
3 b'0' 1260144000.0000 b'ADESA' 2004.0000 5.0000 b'DODGE'
4 b'0' 1260144000.0000 b'ADESA' 2005.0000 4.0000 b'FORD'
... ... ... ... ... ... ...
72978 b'1' 1259712000.0000 b'ADESA' 2001.0000 8.0000 b'MERCURY'
72979 b'0' 1259712000.0000 b'ADESA' 2007.0000 2.0000 b'CHEVROLET'
72980 b'0' 1259712000.0000 b'ADESA' 2005.0000 4.0000 b'JEEP'
72981 b'0' 1259712000.0000 b'ADESA' 2006.0000 3.0000 b'CHEVROLET'
72982 b'0' 1259712000.0000 b'ADESA' 2006.0000 3.0000 b'MAZDA'
Model Trim SubModel Color
0 b'MAZDA3' b'i' b'4D SEDAN I' b'RED' \
1 b'1500 RAM PICKUP 2WD' b'ST' b'QUAD CAB 4.7L SLT' b'WHITE'
2 b'STRATUS V6' b'SXT' b'4D SEDAN SXT FFV' b'MAROON'
3 b'NEON' b'SXT' b'4D SEDAN' b'SILVER'
4 b'FOCUS' b'ZX3' b'2D COUPE ZX3' b'SILVER'
... ... ... ... ...
72978 b'SABLE' b'GS' b'4D SEDAN GS' b'BLACK'
72979 b'MALIBU 4C' b'LS' b'4D SEDAN LS' b'SILVER'
72980 b'GRAND CHEROKEE 2WD V' b'Lar' b'4D WAGON LAREDO' b'SILVER'
72981 b'IMPALA' b'LS' b'4D SEDAN LS' b'WHITE'
72982 b'MAZDA6' b's' b'4D SEDAN S' b'SILVER'
Transmission WheelTypeID WheelType VehOdo Nationality
0 b'AUTO' b'1' b'Alloy' 89046.0000 b'OTHER ASIAN' \
1 b'AUTO' b'1' b'Alloy' 93593.0000 b'AMERICAN'
2 b'AUTO' b'2' b'Covers' 73807.0000 b'AMERICAN'
3 b'AUTO' b'1' b'Alloy' 65617.0000 b'AMERICAN'
4 b'MANUAL' b'2' b'Covers' 69367.0000 b'AMERICAN'
... ... ... ... ... ...
72978 b'AUTO' b'1' b'Alloy' 45234.0000 b'AMERICAN'
72979 b'AUTO' b'?' b'?' 71759.0000 b'AMERICAN'
72980 b'AUTO' b'1' b'Alloy' 88500.0000 b'AMERICAN'
72981 b'AUTO' b'1' b'Alloy' 79554.0000 b'AMERICAN'
72982 b'AUTO' b'1' b'Alloy' 66855.0000 b'OTHER ASIAN'
Size TopThreeAmericanName MMRAcquisitionAuctionAveragePrice
0 b'MEDIUM' b'OTHER' 8155.0000 \
1 b'LARGE TRUCK' b'CHRYSLER' 6854.0000
2 b'MEDIUM' b'CHRYSLER' 3202.0000
3 b'COMPACT' b'CHRYSLER' 1893.0000
4 b'COMPACT' b'FORD' 3913.0000
... ... ... ...
72978 b'MEDIUM' b'FORD' 1996.0000
72979 b'MEDIUM' b'GM' 6418.0000
72980 b'MEDIUM SUV' b'CHRYSLER' 8545.0000
72981 b'LARGE' b'GM' 6420.0000
72982 b'MEDIUM' b'OTHER' 7535.0000
MMRAcquisitionAuctionCleanPrice MMRAcquisitionRetailAveragePrice
0 9829.0000 11636.0000 \
1 8383.0000 10897.0000
2 4760.0000 6943.0000
3 2675.0000 4658.0000
4 5054.0000 7723.0000
... ... ...
72978 2993.0000 2656.0000
72979 7325.0000 7431.0000
72980 9959.0000 9729.0000
72981 7604.0000 7434.0000
72982 8771.0000 8638.0000
MMRAcquisitonRetailCleanPrice MMRCurrentAuctionAveragePrice
0 13600.0000 7451.0000 \
1 12572.0000 7456.0000
2 8457.0000 4035.0000
3 5690.0000 1844.0000
4 8707.0000 3247.0000
... ... ...
72978 3732.0000 2190.0000
72979 8411.0000 6785.0000
72980 11256.0000 8375.0000
72981 8712.0000 6590.0000
72982 9973.0000 7730.0000
MMRCurrentAuctionCleanPrice MMRCurrentRetailAveragePrice
0 8552.0000 11597.0000 \
1 9222.0000 11374.0000
2 5557.0000 7146.0000
3 2646.0000 4375.0000
4 4384.0000 6739.0000
... ... ...
72978 3055.0000 4836.0000
72979 8132.0000 10151.0000
72980 9802.0000 11831.0000
72981 7684.0000 10099.0000
72982 9102.0000 11954.0000
MMRCurrentRetailCleanPrice PRIMEUNIT AUCGUART BYRNO VNZIP1
0 12409.0000 b'?' b'?' b'21973' b'33619' \
1 12791.0000 b'?' b'?' b'19638' b'33619'
2 8702.0000 b'?' b'?' b'19638' b'33619'
3 5518.0000 b'?' b'?' b'19638' b'33619'
4 7911.0000 b'?' b'?' b'19638' b'33619'
... ... ... ... ... ...
72978 5937.0000 b'?' b'?' b'18111' b'30212'
72979 11652.0000 b'?' b'?' b'18881' b'30212'
72980 14402.0000 b'?' b'?' b'18111' b'30212'
72981 11228.0000 b'?' b'?' b'18881' b'30212'
72982 13246.0000 b'?' b'?' b'18111' b'30212'
VNST VehBCost IsOnlineSale WarrantyCost
0 b'FL' 7100.0000 b'0' 1113.0000
1 b'FL' 7600.0000 b'0' 1053.0000
2 b'FL' 4900.0000 b'0' 1389.0000
3 b'FL' 4100.0000 b'0' 630.0000
4 b'FL' 4000.0000 b'0' 1020.0000
... ... ... ... ...
72978 b'GA' 4200.0000 b'0' 993.0000
72979 b'GA' 6200.0000 b'0' 1038.0000
72980 b'GA' 8200.0000 b'0' 1893.0000
72981 b'GA' 7000.0000 b'0' 1974.0000
72982 b'GA' 8000.0000 b'0' 1313.0000
[72983 rows x 33 columns]
The categorical variables in the dataset have the bytes datatype, and need to be converted to strings. Also, some missing values are recorded as “?”, which will be replaced with proper missing values.
# Convert object columns from bytes to string datatype
object_cols = df.select_dtypes(["object"]).columns
for column in object_cols:
df[column] = df[column].apply(lambda x: x.decode("utf-8"))
del column
# Replace "?" string values with NAs
for column in object_cols:
df.loc[df[column] == "?", column] = np.nan
del column
We have missing values in several columns. Two columns, PRIMEUNIT and AUCGUART, are missing to a very high degree. We’ll inspect and handle each column one by one.
# Print n. of missing values in each column
pd.isnull(df).sum()
IsBadBuy 0
PurchDate 0
Auction 0
VehYear 0
VehicleAge 0
Make 0
Model 0
Trim 2360
SubModel 8
Color 8
Transmission 9
WheelTypeID 3169
WheelType 3174
VehOdo 0
Nationality 5
Size 5
TopThreeAmericanName 5
MMRAcquisitionAuctionAveragePrice 18
MMRAcquisitionAuctionCleanPrice 18
MMRAcquisitionRetailAveragePrice 18
MMRAcquisitonRetailCleanPrice 18
MMRCurrentAuctionAveragePrice 315
MMRCurrentAuctionCleanPrice 315
MMRCurrentRetailAveragePrice 315
MMRCurrentRetailCleanPrice 315
PRIMEUNIT 69564
AUCGUART 69564
BYRNO 0
VNZIP1 0
VNST 0
VehBCost 68
IsOnlineSale 0
WarrantyCost 0
dtype: int64
The target variable is highly imbalanced.
print("Target class (im)balance: ")
df["IsBadBuy"].value_counts(normalize = True)
Target class (im)balance:
IsBadBuy
0 0.8770
1 0.1230
Name: proportion, dtype: float64
# Purchase date is in UNIX timestamp format. Convert to datetime
df["PurchDate"]
df["PurchDate"] = pd.to_datetime(df["PurchDate"], unit = "s")
# 3 unique auctioneers. ADESA, MANHEIM and other. Mostly MANHEIM.
df["Auction"].value_counts(normalize = True)
Auction
MANHEIM 0.5624
OTHER 0.2398
ADESA 0.1978
Name: proportion, dtype: float64
# Vehicle years range from 2001 to 2010
print(
df[["VehYear", "VehicleAge"]].describe()
)
# Vehicle age almost always equals purchase year - vehicle year.
purch_year = df["PurchDate"].dt.year
veh_year = df["VehYear"]
print("\nCases where purchase year - vehicle year = vehicle age: " +
str(((purch_year - veh_year) == df["VehicleAge"]).sum())
)
VehYear VehicleAge
count 72983.0000 72983.0000
mean 2005.3431 4.1766
std 1.7313 1.7122
min 2001.0000 0.0000
25% 2004.0000 3.0000
50% 2005.0000 4.0000
75% 2007.0000 5.0000
max 2010.0000 9.0000
Cases where purchase year - vehicle year = vehicle age: 72976
The “Make” column records the car’s brand, and has many categorical levels (high cardinality), though there are columns with many more.
- Some of these brands are sub-brands of others. For example, Scion is a sub-brand of Toyota, and Infiniti is a sub-brand of Nissan.
- We could recode these to reduce cardinality, but most of these sub-brands offer cars of higher segments compared to the main brands (mainly in the US), so keeping them separate could be more useful.
# There are brands with very few observations.
print(
df["Make"].value_counts()
)
# Recode TOYOTA SCION into SCION
df.loc[df["Make"] == "TOYOTA SCION", "Make"] = "SCION"
Make
CHEVROLET 17248
DODGE 12912
FORD 11305
CHRYSLER 8844
PONTIAC 4258
KIA 2484
SATURN 2163
NISSAN 2085
HYUNDAI 1811
JEEP 1644
SUZUKI 1328
TOYOTA 1144
MITSUBISHI 1030
MAZDA 979
MERCURY 913
BUICK 720
GMC 649
HONDA 497
OLDSMOBILE 243
VOLKSWAGEN 134
ISUZU 134
SCION 129
LINCOLN 97
INFINITI 42
VOLVO 37
CADILLAC 33
ACURA 33
LEXUS 31
SUBARU 28
MINI 24
PLYMOUTH 2
TOYOTA SCION 1
HUMMER 1
Name: count, dtype: int64
The “Model” column records the car model. There are more than 1000 models, so if we want to use this column as a feature, we’ll likely use target encoding. More on this in the feature engineering & preprocessing sections.
# There are 1063 unique models, many with only 1 observation.
df["Model"].value_counts()
Model
PT CRUISER 2329
IMPALA 1990
TAURUS 1425
CALIBER 1375
CARAVAN GRAND FWD V6 1289
...
ESCAPE 1
PRIZM 1.8L I-4 SFI D 1
CAMRY V6 3.0L / 3.3L 1
LHS 1
PATRIOT 2WD 4C 2.0L 1
Name: count, Length: 1063, dtype: int64
Trim refers to slightly different versions / packages of a car model. These values are meaningless, and possibly misleading, without the associated model value.
# 134 trims, many with only 1 observation. 2360 missing values.
df["Trim"].value_counts()
Trim
Bas 13950
LS 10174
SE 9348
SXT 3825
LT 3540
...
Har 1
LL 1
JLX 1
JLS 1
L 3 1
Name: count, Length: 134, dtype: int64
We also have submodels, which are likewise likely meaningless or misleading without the main model value. This column offers information such as the car’s drivetrain, engine type or chassis type.
# 863 submodels, many with only 1 observation. 8 missing values.
df["SubModel"].value_counts()
SubModel
4D SEDAN 15236
4D SEDAN LS 4718
4D SEDAN SE 3859
4D WAGON 2230
MINIVAN 3.3L 1258
...
EXT CAB 5.4L FX2 SPORT 1
2D COUPE ZX5 S 1
4D WAGON OUTBACK 1
EXT CAB 6.0L SLE 1
MINIVAN 3.3L FFV EL 1
Name: count, Length: 863, dtype: int64
# 2784 unique model & submodel combinations, many with only 1 observation.
(df["Model"] + " " + df["SubModel"]).value_counts()
PT CRUISER 4D SEDAN 1894
IMPALA 4D SEDAN LS 1300
CALIBER 4D WAGON 1097
PT CRUISER 2.4L I4 S 4D SEDAN 957
STRATUS V6 2.7L V6 M 4D SEDAN SXT FFV 770
...
F150 PICKUP 2WD V8 5 REG CAB 5.4L W/T 1
VIBE 4D WAGON 2.4L 1
XL-7 4WD 2.7L V6 MPI 4D SPORT UTILITY 1
F150 PICKUP 4WD V8 CREW CAB 5.4L KING RANCH 1
PATRIOT 2WD 4C 2.0L 4D SUV SPORT 1
Name: count, Length: 2784, dtype: int64
# 3000+ unique cars in dataset (some could be different namings / spellings of the same car)
(df["Model"] + " " + df["Trim"] + " " + df["SubModel"]).value_counts()
PT CRUISER Bas 4D SEDAN 1355
IMPALA LS 4D SEDAN LS 1300
CALIBER SE 4D WAGON 939
STRATUS V6 2.7L V6 M SXT 4D SEDAN SXT FFV 770
TAURUS SE 4D SEDAN SE 761
...
CIVIC VP 2D COUPE DX VALUE PACKAGE AUTO 1
2500 RAM PICKUP 2WD Lar QUAD CAB 5.7L LARAMIE 1
DURANGO 4WD V8 Lim 4D SUV 4.7L FFV LIMITED 1
LEGACY L 4D WAGON L 1
PATRIOT 2WD 4C 2.0L Spo 4D SUV SPORT 1
Name: count, Length: 3136, dtype: int64
# 8 missing values for color, recode them as NOT AVAIL
print(df["Color"].value_counts())
df.loc[pd.isna(df["Color"]), "Color"] = "NOT AVAIL"
Color
SILVER 14875
WHITE 12123
BLUE 10347
GREY 7887
BLACK 7627
RED 6257
GOLD 5231
GREEN 3194
MAROON 2046
BEIGE 1584
BROWN 436
ORANGE 415
PURPLE 373
YELLOW 244
OTHER 242
NOT AVAIL 94
Name: count, dtype: int64
Some of the missing values in our data can be manually worked out from the car model, or other information. Transmission is one such column.
# 9 missing values for transmission. Try to work them out from car model. 1 occurence of manual spelled differently.
print(df["Transmission"].value_counts())
# Replace "Manual" with MANUAL in transmission
df.loc[df["Transmission"] == "Manual", "Transmission"] = "MANUAL"
Transmission
AUTO 70398
MANUAL 2575
Manual 1
Name: count, dtype: int64
We’ll print the model information for cars with missing Transmission values. We can then fill these in with manual search (or personal interest & knowledge in my case). We’ll apply this method to several other columns with missing values.
# Work out & impute transmission NAs from car model
print("Rows with Transmission values missing: ")
print(df.loc[pd.isna(df["Transmission"]), ["VehYear", "Make", "Model", "Trim", "SubModel"]])
transmission_nas = ["AUTO", "MANUAL", "MANUAL", "MANUAL", "AUTO", "MANUAL", "MANUAL", "AUTO", "AUTO"]
df.loc[pd.isna(df["Transmission"]), "Transmission"] = transmission_nas
Rows with Transmission values missing:
VehYear Make Model Trim SubModel
15906 2005.0000 MERCURY MONTEGO 3.0L V6 EFI Bas 4D SEDAN LUXURY
24567 2006.0000 DODGE STRATUS V6 2.7L V6 M SXT NaN
24578 2006.0000 DODGE STRATUS V6 2.7L V6 M SXT NaN
70432 2001.0000 CHEVROLET S10 PICKUP 2WD 4C 2. Bas NaN
70434 2004.0000 CHEVROLET IMPALA 3.4L V6 SFI Bas NaN
70437 2004.0000 PONTIAC GRAND AM V6 3.4L V6 SE NaN
70445 2002.0000 CHEVROLET CAVALIER 4C 2.2L I4 Bas NaN
70446 2002.0000 MERCURY MOUNTAINEER 2WD V8 4 NaN NaN
70450 2006.0000 FORD FREESTAR FWD V6 3.9L SE NaN
There are two columns that record essentially the same information: WheelType and WheelTypeID. We’ll keep only one, and recode the missing values simply as “Other”.
# 3169 missing values in WheelTypeID, 3174 in WheelType. Crosscheck these columns.
print(df["WheelTypeID"].value_counts())
print("\n")
print(df["WheelType"].value_counts())
WheelTypeID
1 36050
2 33004
3 755
0 5
Name: count, dtype: int64
WheelType
Alloy 36050
Covers 33004
Special 755
Name: count, dtype: int64
print("N. of WheelTypeID NAs that are also WheelType NAs: " +
str(pd.isnull(df.loc[pd.isnull(df["WheelTypeID"]), "WheelType"]).sum())
)
print("\nRemaining 5 rows with WheelType NAs are WheelTypeID = 0: \n" +
str(df.loc[df["WheelTypeID"] == "0", "WheelType"])
)
N. of WheelTypeID NAs that are also WheelType NAs: 3169
Remaining 5 rows with WheelType NAs are WheelTypeID = 0:
2992 NaN
3926 NaN
42640 NaN
47731 NaN
69331 NaN
Name: WheelType, dtype: object
print("Cases where WheelTypeID 1 = WheelType Alloy: " +
str((df.loc[df["WheelTypeID"] == "1", "WheelType"] == "Alloy").sum())
)
print("Cases where WheelTypeID 2 = WheelType Covers: " +
str((df.loc[df["WheelTypeID"] == "2", "WheelType"] == "Covers").sum())
)
print("Cases where WheelTypeID 3 = WheelType Special: " +
str((df.loc[df["WheelTypeID"] == "3", "WheelType"] == "Special").sum())
)
Cases where WheelTypeID 1 = WheelType Alloy: 36050
Cases where WheelTypeID 2 = WheelType Covers: 33004
Cases where WheelTypeID 3 = WheelType Special: 755
# Recode WheelType NAs as Other, drop WheelTypeID column
df.loc[pd.isnull(df["WheelType"]), "WheelType"] = "Other"
df = df.drop("WheelTypeID", axis = 1)
# 5 missing values in Nationality. Work these out from car make.
df["Nationality"].value_counts()
Nationality
AMERICAN 61028
OTHER ASIAN 8033
TOP LINE ASIAN 3722
OTHER 195
Name: count, dtype: int64
# Work out the 5 missing Nationality values from make
print("Makes of rows with missing Nationality values: ")
print(df.loc[pd.isnull(df["Nationality"]), "Make"])
nationality_nas = ["AMERICAN", "AMERICAN", "OTHER ASIAN", "AMERICAN", "AMERICAN"]
df.loc[pd.isnull(df["Nationality"]), "Nationality"] = nationality_nas
Makes of rows with missing Nationality values:
10888 GMC
25169 DODGE
37986 HYUNDAI
69948 JEEP
69958 JEEP
Name: Make, dtype: object
# 5 missing values in size. Work these out from the model.
df["Size"].value_counts()
Size
MEDIUM 30785
LARGE 8850
MEDIUM SUV 8090
COMPACT 7205
VAN 5854
LARGE TRUCK 3170
SMALL SUV 2276
SPECIALTY 1915
CROSSOVER 1759
LARGE SUV 1433
SMALL TRUCK 864
SPORTS 777
Name: count, dtype: int64
# Work out the 5 missing Size values from make & model
print("Rows with Size values missing: ")
print(df.loc[pd.isnull(df["Size"]), ["VehYear", "Make", "Model"]])
print("\nSize values of other rows with the same models: ")
print(
df.loc[df["Model"].str.contains("SIERRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("NITRO 4WD"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("ELANTRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("PATRIOT 2WD"), "Size"].iloc[0]
)
size_nas = ["LARGE TRUCK", "MEDIUM SUV", "MEDIUM", "SMALL SUV", "SMALL SUV"]
df.loc[pd.isnull(df["Size"]), "Size"] = size_nas
Rows with Size values missing:
VehYear Make Model
10888 2002.0000 GMC SIERRA 1500
25169 2008.0000 DODGE NITRO 4WD V6
37986 2006.0000 HYUNDAI ELANTRA 2.0L I4 MPI
69948 2008.0000 JEEP PATRIOT 2WD 4C
69958 2008.0000 JEEP PATRIOT 2WD 4C
Size values of other rows with the same models:
LARGE TRUCK
MEDIUM SUV
MEDIUM
SMALL SUV
We have a column recording whether a car’s make is GM, Chrysler, Ford, or another brand. I doubt we need this, as this information is already built into the Make column.
# Unnecessary column, information already incorporated in Make. Drop it
print(df["TopThreeAmericanName"].value_counts())
df = df.drop("TopThreeAmericanName", axis = 1)
TopThreeAmericanName
GM 25314
CHRYSLER 23399
FORD 12315
OTHER 11950
Name: count, dtype: int64
There are several columns that record the MMR values of the cars. MMR stands for Manheim Market Report, a valuation offered by car auction company Manheim.
- The MMR columns titled “Acquisition” record the MMR valuations of the car at the time of purchase. There are several different valuations available, such as auction & retail prices, each for average & clean condition cars respectively.
- The MMR columns titled “Current” are the car’s current MMR valuation prices. For the sake of realism, we’ll assume we’re performing this analysis before purchasing the cars, so we’ll drop the Current MMR columns.
print("Missing values in MMR columns:")
print(pd.isnull(
df[['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']]
).sum()
)
# Drop current MMR prices to make the exercise more realistic
df = df.drop([
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice'], axis = 1)
Missing values in MMR columns:
MMRAcquisitionAuctionAveragePrice 18
MMRAcquisitionAuctionCleanPrice 18
MMRAcquisitionRetailAveragePrice 18
MMRAcquisitonRetailCleanPrice 18
MMRCurrentAuctionAveragePrice 315
MMRCurrentAuctionCleanPrice 315
MMRCurrentRetailAveragePrice 315
MMRCurrentRetailCleanPrice 315
dtype: int64
# Some MMR values are zero
print("N. of rows with at least one MMR value of 0: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 0) |
(df["MMRAcquisitionAuctionCleanPrice"] == 0) |
(df["MMRAcquisitionRetailAveragePrice"] == 0) |
(df["MMRAcquisitonRetailCleanPrice"] == 0)
]
)
)
)
# Some MMR values are one
print("N. of rows with at least one MMR value of 1: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 1) |
(df["MMRAcquisitionAuctionCleanPrice"] == 1) |
(df["MMRAcquisitionRetailAveragePrice"] == 1) |
(df["MMRAcquisitonRetailCleanPrice"] == 1)
]
)
)
)
# Drop rows with NAs in MMR
df = df.dropna(subset = [
'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice'])
# Drop rows with 0s in MMR
df = df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] > 0) &
(df["MMRAcquisitionAuctionCleanPrice"] > 0) &
(df["MMRAcquisitionRetailAveragePrice"] > 0) &
(df["MMRAcquisitonRetailCleanPrice"] > 0)].copy()
N. of rows with at least one MMR value of 0: 828
N. of rows with at least one MMR value of 1: 131
The PRIMEUNIT and AUCGUART columns mostly consist of missing values. But for rows with known values, they could be important predictors.
- PRIMEUNIT records whether a car attracted unusually high demand. “YES” values are very rare even among non-missing rows. The missing values could simply mean “NO”.
- AUCGUART records the level of inspection passed by a car before being auctioned. “GREEN” means fully inspected, “YELLOW” means partially inspected, “RED” means “you buy what you see”. There are no “YELLOW” values in the data, which means the missing values could simply mean “YELLOW”.
- For both columns, we’ll recode missing values as “UNKNOWN”. Later we’ll have options to handle these based on the categorical encoding method we use.
# 95% missing column
print(df["PRIMEUNIT"].value_counts(normalize = True))
# Fill NAs in PRIMEUNIT with UNKNOWN.
df.loc[pd.isnull(df["PRIMEUNIT"]), "PRIMEUNIT"] = "UNKNOWN"
PRIMEUNIT
NO 0.9817
YES 0.0183
Name: proportion, dtype: float64
# 95% missing column
print(df["AUCGUART"].value_counts(normalize = True))
# Fill NAs in AUCGUART with UNKNOWN.
df.loc[pd.isnull(df["AUCGUART"]), "AUCGUART"] = "UNKNOWN"
AUCGUART
GREEN 0.9770
RED 0.0230
Name: proportion, dtype: float64
# BYRNO is buyer no. 74 unique buyers, some with only 1 observation.
df["BYRNO"].value_counts()
BYRNO
99761 3896
18880 3553
835 2951
3453 2888
22916 2795
...
99741 1
1157 1
3582 1
10425 1
1086 1
Name: count, Length: 74, dtype: int64
# VNZIP1 is zipcode of purchase location, 153 locations, some with only 1 obs.
df["VNZIP1"].value_counts()
VNZIP1
32824 3676
27542 3368
75236 2410
74135 2286
80022 2097
...
76101 1
85248 1
85260 1
80112 1
25071 1
Name: count, Length: 153, dtype: int64
# VNST is purchase state.
df["VNST"].value_counts()
VNST
TX 13482
FL 10318
CA 7009
NC 6971
AZ 6102
CO 4938
SC 4232
OK 3552
GA 2425
TN 1737
VA 1652
MD 1124
UT 870
PA 806
OH 793
MO 752
AL 684
NV 542
MS 488
IN 486
IA 482
IL 454
LA 347
NJ 317
WV 291
NM 237
KY 228
OR 208
ID 187
WA 135
NH 97
AR 68
MN 62
NE 26
MA 15
MI 14
NY 6
Name: count, dtype: int64
Vehicle purchase price is likely an important predictor, which is why we’ll drop the few rows with missing purchase prices. There is one car with a purchase price of 1, which may mean it was a symbolic transaction. We’ll keep this row for now.
# VehBCost is purchase price. 68 missing values, drop these rows. One car has a purchase price of 1.
print(df["VehBCost"].describe())
df = df.dropna(subset = "VehBCost")
count 72069.0000
mean 6727.6698
std 1766.7283
min 1.0000
25% 5430.0000
50% 6700.0000
75% 7900.0000
max 45469.0000
Name: VehBCost, dtype: float64
The warranty cost pays for 36 months’ worth of warranty, or coverage until the car racks up 36k miles. Likely another important predictor, especially relative to the purchase price.
# Warranty cost is for 36 months, or until 36k miles
df["WarrantyCost"].describe()
count 72069.0000
mean 1276.4975
std 597.7557
min 462.0000
25% 837.0000
50% 1169.0000
75% 1623.0000
max 7498.0000
Name: WarrantyCost, dtype: float64
We can create some new features from our existing ones. We can also explicitly map some interaction terms, to make them easier to discover for some models.
# Time features from date: Purchase year, month, day of week
df["PurchaseYear"] = df["PurchDate"].dt.year
df["PurchaseMonth"] = df["PurchDate"].dt.month
df["PurchaseDay"] = df["PurchDate"].dt.weekday
df = df.drop("PurchDate", axis = 1)
The model and submodel names incorporate plenty of technical information on the cars. We’ll create numerous binary columns that flag various traits of a car: Engine type, drivetrain type, chassis type and number of doors. I also thought about retrieving engine displacement in liters as a numeric feature, but this information is not openly stated for many cars, and would introduce many missing values.
# Engine type features from Model: V6, V8, I4/I-4, 4C, 6C
df["EngineV6"] = df["Model"].str.contains("V6").astype(int)
df["EngineV8"] = df["Model"].str.contains("V8").astype(int)
df["EngineI4"] = df["Model"].str.contains("I4|I-4", regex = True).astype(int)
df["Engine4C"] = df["Model"].str.contains("4C").astype(int)
df["Engine6C"] = df["Model"].str.contains("6C").astype(int)
# Drivetrain type features from Model: 2WD, 4WD, AWD, FWD, RWD
df["2WD"] = df["Model"].str.contains("2WD").astype(int)
df["4WD"] = df["Model"].str.contains("4WD").astype(int)
df["AWD"] = df["Model"].str.contains("AWD").astype(int)
df["FWD"] = df["Model"].str.contains("FWD").astype(int)
df["RWD"] = df["Model"].str.contains("RWD").astype(int)
# Chassis type features from SubModel: WAGON, SEDAN, COUPE, HATCHBACK, CONVERTIBLE
# Work out and recode these features manually for rows where SubModel is missing
print("Rows where SubModel is missing:")
print(df.loc[pd.isnull(df["SubModel"]), "Model"])
df["ChassisWagon"] = df["SubModel"].str.contains("WAGON")
df.loc[pd.isnull(df["ChassisWagon"]), "ChassisWagon"] = [
False, False, False, False, False, False, False, True]
df["ChassisWagon"] = df["ChassisWagon"].astype(int)
df["ChassisSedan"] = df["SubModel"].str.contains("SEDAN")
df.loc[pd.isnull(df["ChassisSedan"]), "ChassisSedan"] = [
True, True, False, True, True, True, False, False]
df["ChassisSedan"] = df["ChassisSedan"].astype(int)
df["ChassisCoupe"] = df["SubModel"].str.contains("COUPE")
df.loc[pd.isnull(df["ChassisCoupe"]), "ChassisCoupe"] = False
df["ChassisCoupe"] = df["ChassisCoupe"].astype(int)
df["ChassisHatch"] = df["SubModel"].str.contains("HATCHBACK")
df.loc[pd.isnull(df["ChassisHatch"]), "ChassisHatch"] = False
df["ChassisHatch"] = df["ChassisHatch"].astype(int)
df["ChassisConvertible"] = df["SubModel"].str.contains("CONVERTIBLE")
df.loc[pd.isnull(df["ChassisConvertible"]), "ChassisConvertible"] = False
df["ChassisConvertible"] = df["ChassisConvertible"].astype(int)
Rows where SubModel is missing:
24567 STRATUS V6 2.7L V6 M
24578 STRATUS V6 2.7L V6 M
70432 S10 PICKUP 2WD 4C 2.
70434 IMPALA 3.4L V6 SFI
70437 GRAND AM V6 3.4L V6
70445 CAVALIER 4C 2.2L I4
70446 MOUNTAINEER 2WD V8 4
70450 FREESTAR FWD V6 3.9L
Name: Model, dtype: object
# Door type feature from SubModel: 4D
df["FourDoors"] = df["SubModel"].str.contains("4D")
# Work out and recode this feature manually for rows where SubModel is missing (displayed in previous cell)
df.loc[pd.isnull(df["FourDoors"]), "FourDoors"] = [
True, True, False, True, True, True, False, False]
df["FourDoors"] = df["FourDoors"].astype(int)
I decided to combine the Model & SubModel columns to have one “car” feature. This column will have very high cardinality, but we’ll use target encoding, and models with this feature performed slightly better than models without it. I did not include Trim in this feature, but it’s possible to give it a try.
# Recode SubModel NAs into empty string
df.loc[pd.isnull(df["SubModel"]), "SubModel"] = ""
# Combine Model & SubModel as one feature
df["ModelSubModel"] = df["Model"] + " " + df["SubModel"]
# Drop trim, submodel
df = df.drop(["Trim", "SubModel"], axis = 1)
Some obvious interaction features we can add are miles traveled per year, the premium / discount paid on the MMR prices, and the warranty cost to purchase price ratio.
# Miles per year feature
df["MilesPerYear"] = df["VehOdo"] / df["VehicleAge"]
df.loc[df["MilesPerYear"] == np.inf, "MilesPerYear"] = df["VehOdo"] # Replace inf values raised when vehicle age = 0
# Premiums / discounts paid on MMR prices: VehBCost - MMR price / MMR price
df["PremiumAuctionAvg"] = (df["VehBCost"] - df["MMRAcquisitionAuctionAveragePrice"]) / df["MMRAcquisitionAuctionAveragePrice"]
df["PremiumAuctionClean"] = (df["VehBCost"] - df["MMRAcquisitionAuctionCleanPrice"]) / df["MMRAcquisitionAuctionCleanPrice"]
df["PremiumRetailAvg"] = (df["VehBCost"] - df["MMRAcquisitionRetailAveragePrice"]) / df["MMRAcquisitionRetailAveragePrice"]
df["PremiumRetailClean"] = (df["VehBCost"] - df["MMRAcquisitonRetailCleanPrice"]) / df["MMRAcquisitonRetailCleanPrice"]
The row with a purchase price of 1 results in an astronomically high value for the warranty cost / purchase price ratio, so we drop it after all. We have plenty of data, and it’s likely better to keep it accurate.
# Warranty ratio to purchase price
df["WarrantyRatio"] = df["WarrantyCost"] / df["VehBCost"]
# The observation with purchase price = 1 skews the WarrantyRatio feature greatly. Drop it.
print(df[["VehBCost", "WarrantyRatio"]].sort_values(by = "WarrantyRatio", ascending = False).iloc[0])
df = df.loc[df["VehBCost"] != 1].copy()
VehBCost 1.0000
WarrantyRatio 1590.0000
Name: 20442, dtype: float64
Most machine learning algorithms require all features to be numeric. We’ll use one-hot encoding for most of our categorical columns.
- This can be done before splitting training & testing data, as it won’t leak information from the testing set. However, we can’t use this approach in future prediction sets if they have new categorical levels unseen in our training data.
- Adding too many columns may cause our models to suffer from the curse of dimensionality, which affects different algorithms to different extents. We’ll one-hot encode only the columns with relatively fewer levels, and leave the high-cardinality columns to target encoding.
- A categorical column with N levels can be encoded with N-1 binary columns, as the Nth value will be reflected by all columns taking the value of 0. We’ll take advantage of this to keep our column count low.
- We’ll also create binary columns only for known values of PRIMEUNIT and AUCGUART, effectively encoding the missing values as 0 for all binary columns.
# One hot encode some categoricals in-place
encoder_onehot = OneHotEncoder(
drop_last = True, # Create N-1 binary columns to encode N categorical levels
drop_last_binary = True,
variables = ['Auction', 'VehYear', 'Color', 'Transmission', 'WheelType',
'Nationality', 'Size', 'PurchaseYear'],
ignore_format = True)
df = encoder_onehot.fit_transform(df)
# One hot encode PRIMEUNIT and AUCGUART only using known values (effectively making "unknown" the default / 0 case)
df["PRIMEUNIT_YES"] = (df["PRIMEUNIT"] == "YES").astype(int)
df["PRIMEUNIT_NO"] = (df["PRIMEUNIT"] == "NO").astype(int)
df["AUCGUART_GREEN"] = (df["AUCGUART"] == "GREEN").astype(int)
df["AUCGUART_RED"] = (df["AUCGUART"] == "RED").astype(int)
df = df.drop(["PRIMEUNIT", "AUCGUART"], axis = 1)
Month and day of week are cyclical features: The “distance” between month 1 and month 12 is not eleven, it is one. Simply label encoding the months as 1-12, or the weekdays as 1-7, would mislead our models.
- Instead, we’ll use cyclical encoding, which encodes each feature using a sine and cosine pair, conveying their cyclical nature to our algorithms (see the sketch after this list).
- Another option is one-hot encoding, but it creates considerably more columns than cyclical encoding. It could also be advantageous to avoid “splitting” related information into many columns, to ensure they are evaluated together.
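The snippet below is a rough illustration of what cyclical encoding computes, assuming CyclicalFeatures’ default behavior of dividing each value by the variable’s maximum before applying sine and cosine; the exact scaling used by feature_engine may differ slightly.
# Illustration of cyclical encoding (not the library implementation):
# map each month to a point on the unit circle, so month 12 and month 1 end up close together
import numpy as np
months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / months.max())
month_cos = np.cos(2 * np.pi * months / months.max())
print(np.round(month_sin, 4))
print(np.round(month_cos, 4))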
# Cyclical encode month and day of week features
encoder_cyclical = CyclicalFeatures(
variables = ["PurchaseMonth", "PurchaseDay"], drop_original = True)
df = encoder_cyclical.fit_transform(df)
The remaining, high-cardinality categorical features will be target encoded, which won’t change the number of columns.
# View final feature set before preprocessing
print(df.head())
IsBadBuy VehicleAge Make Model VehOdo
0 0 3.0000 MAZDA MAZDA3 89046.0000 \
1 0 5.0000 DODGE 1500 RAM PICKUP 2WD 93593.0000
2 0 4.0000 DODGE STRATUS V6 73807.0000
3 0 5.0000 DODGE NEON 65617.0000
4 0 4.0000 FORD FOCUS 69367.0000
MMRAcquisitionAuctionAveragePrice MMRAcquisitionAuctionCleanPrice
0 8155.0000 9829.0000 \
1 6854.0000 8383.0000
2 3202.0000 4760.0000
3 1893.0000 2675.0000
4 3913.0000 5054.0000
MMRAcquisitionRetailAveragePrice MMRAcquisitonRetailCleanPrice BYRNO
0 11636.0000 13600.0000 21973 \
1 10897.0000 12572.0000 19638
2 6943.0000 8457.0000 19638
3 4658.0000 5690.0000 19638
4 7723.0000 8707.0000 19638
VNZIP1 VNST VehBCost IsOnlineSale WarrantyCost EngineV6 EngineV8
0 33619 FL 7100.0000 0 1113.0000 0 0 \
1 33619 FL 7600.0000 0 1053.0000 0 0
2 33619 FL 4900.0000 0 1389.0000 1 0
3 33619 FL 4100.0000 0 630.0000 0 0
4 33619 FL 4000.0000 0 1020.0000 0 0
EngineI4 Engine4C Engine6C 2WD 4WD AWD FWD RWD ChassisWagon
0 0 0 0 0 0 0 0 0 0 \
1 0 0 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
ChassisSedan ChassisCoupe ChassisHatch ChassisConvertible FourDoors
0 1 0 0 0 1 \
1 0 0 0 0 0
2 1 0 0 0 1
3 1 0 0 0 1
4 0 1 0 0 0
ModelSubModel MilesPerYear PremiumAuctionAvg
0 MAZDA3 4D SEDAN I 29682.0000 -0.1294 \
1 1500 RAM PICKUP 2WD QUAD CAB 4.7L SLT 18718.6000 0.1088
2 STRATUS V6 4D SEDAN SXT FFV 18451.7500 0.5303
3 NEON 4D SEDAN 13123.4000 1.1659
4 FOCUS 2D COUPE ZX3 17341.7500 0.0222
PremiumAuctionClean PremiumRetailAvg PremiumRetailClean WarrantyRatio
0 -0.2776 -0.3898 -0.4779 0.1568 \
1 -0.0934 -0.3026 -0.3955 0.1386
2 0.0294 -0.2943 -0.4206 0.2835
3 0.5327 -0.1198 -0.2794 0.1537
4 -0.2085 -0.4821 -0.5406 0.2550
Auction_ADESA Auction_OTHER VehYear_2006.0 VehYear_2004.0
0 1 0 1 0 \
1 1 0 0 1
2 1 0 0 0
3 1 0 0 1
4 1 0 0 0
VehYear_2005.0 VehYear_2007.0 VehYear_2001.0 VehYear_2003.0
0 0 0 0 0 \
1 0 0 0 0
2 1 0 0 0
3 0 0 0 0
4 1 0 0 0
VehYear_2002.0 VehYear_2008.0 VehYear_2009.0 Color_RED Color_WHITE
0 0 0 0 1 0 \
1 0 0 0 0 1
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Color_MAROON Color_SILVER Color_BLACK Color_GOLD Color_GREY
0 0 0 0 0 0 \
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
Color_BLUE Color_BEIGE Color_PURPLE Color_ORANGE Color_GREEN
0 0 0 0 0 0 \
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Color_BROWN Color_YELLOW Color_NOT AVAIL Transmission_AUTO
0 0 0 0 1 \
1 0 0 0 1
2 0 0 0 1
3 0 0 0 1
4 0 0 0 0
WheelType_Alloy WheelType_Covers WheelType_Other
0 1 0 0 \
1 1 0 0
2 0 1 0
3 1 0 0
4 0 1 0
Nationality_OTHER ASIAN Nationality_AMERICAN Nationality_TOP LINE ASIAN
0 1 0 0 \
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
Size_MEDIUM Size_LARGE TRUCK Size_COMPACT Size_LARGE Size_VAN
0 1 0 0 0 0 \
1 0 1 0 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 1 0 0
Size_MEDIUM SUV Size_LARGE SUV Size_SPECIALTY Size_SPORTS
0 0 0 0 0 \
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
Size_CROSSOVER Size_SMALL SUV PurchaseYear_2009 PRIMEUNIT_YES
0 0 0 1 0 \
1 0 0 1 0
2 0 0 1 0
3 0 0 1 0
4 0 0 1 0
PRIMEUNIT_NO AUCGUART_GREEN AUCGUART_RED PurchaseMonth_sin
0 0 0 0 -0.0000 \
1 0 0 0 -0.0000
2 0 0 0 -0.0000
3 0 0 0 -0.0000
4 0 0 0 -0.0000
PurchaseMonth_cos PurchaseDay_sin PurchaseDay_cos
0 1.0000 0.0000 1.0000
1 1.0000 0.0000 1.0000
2 1.0000 0.0000 1.0000
3 1.0000 0.0000 1.0000
4 1.0000 0.0000 1.0000
The rest of our preprocessing will take place in a scikit-learn pipeline, to ensure we avoid any target leakage, and to keep our modeling workflow clean.
- First, we’ll set aside 20% of our data for testing. We’ll ensure the split is performed with stratification: We want to maintain the target class balance as much as possible.
- We’ll use the remaining 80% for hyperparameter tuning and validation. We’ll test the final models with the best hyperparameters on the testing data.
# Split target and features
y = df["IsBadBuy"].astype(int)
x = df.drop("IsBadBuy", axis = 1)
# Perform train-test split
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size = 0.2, random_state = 1923, stratify = y
)
We will use several target encoders to encode the remaining, high-cardinality categorical features.
- Target encoders derive a numeric value based on the target values for each categorical level. This brings the obvious risk of leaking the target values from the testing data if not applied properly.
- Another potential issue is overfitting: If a categorical level has very few observations, the value derived from the target will be unreliable.
  - The target encoder applies a form of regularization to alleviate this: The encoded values for infrequent categorical levels are blended with the values for all observations. The degree of this regularization can be controlled by the min_samples_leaf and smoothing parameters of TargetEncoder. We’ll use the default settings (see the sketch after this list).
  - The target encoder can also make use of a hierarchy of columns: Categorical levels of one column may be subsets of levels of another column. We have two such hierarchies in our data: The make-model-submodel hierarchy, and the states-zipcodes hierarchy.
  - We’ll map the target encoders for these columns with their respective higher hierarchies, so regularization can be applied by blending in the higher hierarchy values instead of values for all observations.
- Aside from these risks and considerations, target encoding brings several benefits:
  - It does not create any new columns, and it does not split the information from one feature into several columns.
  - It is also able to handle any missing values, or previously unseen categorical levels in prediction data, by applying the encoded values for all observations, or the values for the hierarchy, as done in regularization.
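As a rough sketch of the blending idea behind this regularization: the encoded value for a level is a weighted average of that level’s target mean and a prior (the overall or higher-hierarchy mean), with the weight growing as the level’s observation count increases. The function below mirrors a common smoothing scheme and borrows the min_samples_leaf and smoothing parameter names from TargetEncoder, but the exact formula used by category_encoders may differ between versions.
# Rough sketch of target encoding's smoothing / blending, not the library's exact implementation
import numpy as np

def smoothed_target_mean(level_mean, level_count, prior, min_samples_leaf = 20, smoothing = 10):
    # Weight approaches 1 as the categorical level gets more observations
    weight = 1 / (1 + np.exp(-(level_count - min_samples_leaf) / smoothing))
    # Blend the level's own target mean with the prior (overall or hierarchy mean)
    return weight * level_mean + (1 - weight) * prior

# A rare level stays close to the prior, a frequent level keeps its own mean
print(smoothed_target_mean(0.30, 5, 0.12))
print(smoothed_target_mean(0.30, 500, 0.12))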
# Target encoder: ModelSubModel encoded with Model, Make hierarchy
hier_submodels = pd.DataFrame(x[["Make", "Model"]]).rename({
"Make": "HIER_ModelSubModel_1", # Make as first (top) level of hierarchy
"Model": "HIER_ModelSubModel_2"}, # Model as second level of hierarchy
axis = 1)
encode_target_submodel = TargetEncoder(
cols = ["ModelSubModel"], # Column to encode (bottom level of hierarchy)
hierarchy = hier_submodels)
# Target encoder: Model encoded with Make hierarchy
hier_models = pd.DataFrame(x["Make"]).rename({"Make": "HIER_Model_1"}, axis = 1)
encode_target_model = TargetEncoder(cols = ["Model"], hierarchy = hier_models)
# Target encoder: Zipcodes encoded with states hierarchy
hier_states = pd.DataFrame(x["VNST"]).rename({"VNST": "HIER_VNZIP1_1"}, axis = 1)
encode_target_zip = TargetEncoder(cols = ["VNZIP1"], hierarchy = hier_states)
# Target encoder: Make, buyer, state encoded without hierarchy (apply last):
encode_target = TargetEncoder(cols = ["Make", "BYRNO", "VNST"])
Many machine learning algorithms can be sensitive to the value scales of features.
- Those with very large absolute values may dominate training, and prevent smaller-scale values from having much effect.
- All our models except the tree-based XGBoost algorithm can be sensitive to scale. Neural networks especially can suffer from various numerical stability issues in training.
- We’ll scale all features between 0 and 1 as our final preprocessing step (see the illustration after this list).
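For illustration, min-max scaling maps each feature x to (x - min(x)) / (max(x) - min(x)), with the minimum and maximum learned from the training data only (which the pipeline below takes care of). A quick sketch on a few VehOdo values from the data:
# Illustration of min-max scaling on a handful of odometer values
import numpy as np
odo = np.array([45234.0, 65617.0, 89046.0])
odo_scaled = (odo - odo.min()) / (odo.max() - odo.min())
print(odo_scaled)  # all values now lie in [0, 1]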
For hyperparameter validation, we will use 3-fold crossvalidation:
- The 80% training data will be split into 3 equal folds. Each hyperparameter configuration we evaluate will be trained on two folds and validated on the third, for a total of three different combinations. Stratification will be applied just like in the train-test split.
- We could have more robust validation scores with more folds, but the neural network tuning can be computationally expensive, and I wanted to compare each model on “equal terms”.
- An even more reliable tuning & testing scheme is nested crossvalidation: Splitting the full data into N training-testing folds (outer resampling), and splitting each training fold into K folds for parameter tuning (inner resampling). This is likely not necessary with such a big dataset.
- Since we’ll build a custom model validation function for each model, we’ll retrieve the train-test indices from the 3-fold CV splitter, for a total of three train-test index pairs.
# Scaling method
scaler_minmax = MinMaxScaler()
# Full preprocessing pipeline
pipe_process = Pipeline(steps = [
("target_encoder_zipcode", encode_target_zip),
("target_encoder_submodel", encode_target_submodel),
("target_encoder_model", encode_target_model),
("target_encoder", encode_target),
("minmax_scaler", scaler_minmax)
])
# Inner validation method for hyperparameter tuning
cv_kfold = StratifiedKFold(n_splits = 3)
# Get training / inner validation split indices (3 pairs) from k-fold splitter
cv_indices = list(cv_kfold.split(x_train, y_train))
The hyperparameter optimization will follow 3 main steps for every model:
- A validation function will be defined, to perform 3-fold crossvalidation with a given hyperparameter set.
- An Optuna objective function will be defined, which will suggest hyperparameter values from a predefined range, validate the parameter set (with our validation function), and report the average validation score of the parameter set.
  - One iteration of the objective function for one set of hyperparameters is called an Optuna trial.
- An Optuna study will be created and performed. The study performs a given number of trials and records their results.
  - The study uses a sampler algorithm to intelligently suggest parameter values for trials, and a pruner algorithm to stop unpromising trials early on and save time.
Since the process is very similar for all models, we’ll go over the full process only for the first model, logistic regression. For the other models, only the differences will be explained. Of course, the actual tuning process takes hours, so I’ll just display the code I used without running it and load previously saved results.
Logistic regression predicts the log-odds of an event occurring (also called logits) with a linear model. It’s a simple, robust method that can perform very well compared to more advanced models.
- The main assumption of logistic regression is a linear relationship between the predictors and the logits. The less linear the relationships are, the less accurate logistic regression will be.
- Logistic regression won’t find and model interaction effects unless they are explicitly mapped as new features (we did this for some features).
- Like plain linear regression, logistic regression can suffer from multicollinearity and the curse of dimensionality in the feature set, so regularization is almost always applied to the model coefficients to avoid overfitting (see the sketch after this list).
  - L1 (lasso) regularization can potentially shrink model coefficients to zero, effectively dropping predictors from the model and performing feature selection.
  - L2 (ridge) regularization can’t shrink coefficients to zero, but just reduces their contribution to the model.
  - Elastic net regularization is a blend of L1 and L2 regularization, controlled by a ratio parameter. This is what we will use.
- Logistic regression (implemented with LogisticRegression in scikit-learn) is normally fit on all training observations at once: The model coefficients are all estimated in one go using a solver algorithm. This calculation can be a bit slow with a large dataset like ours, especially since we’ll fit hundreds of models for hyperparameter tuning. To save time, we’ll train our logistic regression (and SVM) models with stochastic gradient descent:
  - In SGD, the model is trained on only one observation in one training step. The training error (loss) is calculated and the model parameters are updated at every training step. When the model has seen each training observation once, we complete a training epoch.
  - We perform multiple epochs of training. Training is usually early stopped when the validation score stops improving, which is usually assessed after each epoch.
  - This approach is similar to how neural networks are trained. A key difference is that NNs use minibatch SGD: A training step consists of a batch of observations, which can range from 16-32 to 1024-2048 observations (usually a power of 2). This is a faster approach than single-observation SGD. It’s possible to implement this manually in scikit-learn, but it likely won’t bring much improvement for logistic regression or SVM.
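For reference, the sketch below spells out the elastic net penalty that gets added to the training loss, following my reading of scikit-learn’s SGD parameterization: alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2). Treat the exact scaling as approximate.
# Sketch of the elastic net penalty added to the log loss (approximate parameterization)
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    l1 = np.sum(np.abs(w))     # L1 part: can push coefficients to exactly zero
    l2 = 0.5 * np.sum(w ** 2)  # L2 part: shrinks coefficients without zeroing them
    return alpha * (l1_ratio * l1 + (1 - l1_ratio) * l2)

# A stronger alpha or a higher l1_ratio penalizes the same coefficients more heavily
print(elastic_net_penalty(np.array([0.5, -0.2, 0.0]), alpha = 0.003, l1_ratio = 0.1))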
We can train linear classification models with SGD in scikit-learn using SGDClassifier. Below, we will define a validation function that takes a set of hyperparameters, builds a logistic regression model with SGDClassifier and validates the performance of the parameter set.
- In one instance of the function, a model will be created, trained & validated for each CV fold, for a total of three times.
- The model will be trained with the .partial_fit() method, which trains the SGDClassifier for only one epoch. We’ll call the method repeatedly to train the model for several epochs. Our function will handle early stopping manually by recording each epoch’s score. By default, 10 consecutive epochs with less than 0.0001 improvement in the validation error will terminate the model training.
  - Class weights will be applied to both training and validation error calculations: The errors for the minority class will be penalized more heavily, forcing our models to pay more attention to them.
  - The class weight for each class will be inversely proportional to its class frequency, calculated automatically by a scikit-learn utility (only using the training data to avoid leakage). Using these class weights, we will create a sample weights vector for every training & validation set observation, and pass these to the model training and scoring functions respectively.
- The function will return the average CV score of the parameter set, and the number of training epochs that achieved it. It will also perform pruning if directed by the Optuna pruner:
  - Unpromising trials (determined according to the validation scores of the first epochs) will be stopped early by the Optuna pruner. To allow pruning, the score of each epoch for the first CV fold will be reported to the Optuna study.
  - The best number of epochs is not a hyperparameter to be suggested & tuned by Optuna, but we will observe the value it takes during a trial, and report it back.
The SGDClassifier is a general method for training a linear model with SGD. Its parameters determine what type of linear model is trained:
- The loss parameter sets the training loss function. For probabilistic logistic regression, we will use log loss, which compares the probability predictions (between 0-1) to the actual target labels (0 or 1). We will also use log loss to score our predictions on the validation set.
- The penalty parameter sets the regularization type. We’ll use elastic net regularization.
- learning_rate is another key parameter in SGD training (especially for neural networks). It determines the magnitude of the coefficient updates in each training step.
  - With a higher LR, the coefficients will be updated more aggressively in each training step. The models will train faster, and may be more likely to find optimal solutions, but also more likely to overfit and suffer from erratic model updates.
  - With a lower LR, the coefficients will be updated more conservatively in each training step. The models will train slower, and may be less likely to reach optimal solutions, but will also be more robust against overfitting.
  - In neural networks, a learning rate scheduler is often used to dynamically adjust the learning rate. We’ll explore this in more detail for our NN model, but for SGDClassifier, we’ll use the default “optimal” heuristic based on the regularization strength (see the sketch after this list).
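For reference, the scikit-learn documentation describes the “optimal” schedule as eta = 1 / (alpha * (t + t0)), where t is the update step and t0 is chosen by a heuristic due to Léon Bottou; the sketch below only illustrates how stronger regularization implies smaller steps, with a placeholder t0.
# The "optimal" learning rate schedule (as described in the scikit-learn docs);
# t0 below is a placeholder value, the library chooses it with its own heuristic
def optimal_learning_rate(t, alpha, t0 = 50.0):
    return 1.0 / (alpha * (t + t0))

# A larger regularization strength alpha yields smaller learning rates
print(optimal_learning_rate(t = 100, alpha = 0.0005))
print(optimal_learning_rate(t = 100, alpha = 0.05))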
We have two hyperparameters to tune for the logistic regression model: alpha and l1_ratio.
- alpha is the regularization strength. A higher value means the model coefficients will be penalized more heavily to combat overfitting. Generally, we’d need a higher alpha value with more redundancy in the feature set. In practice, the optimal value is usually a small fraction, and too much regularization leads to the opposite problem of underfitting.
- l1_ratio will determine what fraction of our elastic net regularization is L1 regularization.
Show validation function definition
# Define model validation function for logistic regression trained with SGD
def validate_logistic(
alpha, # Regularization strength hyperparameter
l1_ratio, # Ratio of L1 regularization hyperparameter
trial, # Pass the Optuna trial of the parameter set
tol = 1e-4, # Min. required change in validation score to count as improvement
verbose = 0 # Set to 1 to print model training progress
):
# Record the validation scores of the parameter set on each CV fold
cv_scores = []
# Record best epoch number for each CV fold
best_epochs = []
# For each CV fold,
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Compute class weight & sample weight vectors
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0])
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Define Logistic Regression classifier with SGD
model_logistic = SGDClassifier(
loss = "log_loss", # Log loss metric for probabilistic logistic regression
penalty = "elasticnet", # Elastic net regularization: Blend of L1 and L2
learning_rate = "optimal", # Dynamically adjusted learning rate, based on a heuristic that uses regularization strength
random_state = 1923, # Set random seed for replicable results
verbose = verbose, # Whether to print training progress or not
n_jobs = -1, # Use all CPU cores to parallelize
alpha = alpha, # Hyperparameters to validate
l1_ratio = l1_ratio # Hyperparameters to validate
)
# Perform epoch by epoch training with early stopping & pruning
epoch_scores = [] # Record val. score of each epoch
n_iter_no_change = 0 # Record epochs with no improvement
tol = tol # Min. change in val. score to count as improvement
for epoch in range(1000):
# Train model for 1 epoch
model_logistic.partial_fit(
x_tr, y_tr, classes = classes, sample_weight = sample_weight)
# Predict validation set & score predictions
y_pred = model_logistic.predict_proba(x_val)
epoch_score = log_loss(y_val, y_pred, sample_weight = sample_weight_val)
# For first CV fold, report intermediate score of Optuna trial
if i == 0:
trial.report(epoch_score, epoch)
# Prune trial if necessary
if trial.should_prune():
raise optuna.TrialPruned()
# Count epochs with no improvement after first 10 epochs
if (epoch > 9) and (epoch_score > min(epoch_scores) - tol):
n_iter_no_change += 1
# Reset epochs with no improvement if an improvement took place
if (epoch > 9) and (epoch_score <= min(epoch_scores) - tol):
n_iter_no_change = 0
# Append epoch score to list of epoch scores
epoch_scores.append(epoch_score)
# Early stop training if necessary
if n_iter_no_change >= 10:
print("\nEarly stopping at epoch " + str(epoch) + "\n")
break
# Append best epoch score to list of CV scores
cv_scores.append(min(epoch_scores))
# Append best epoch number to list of best epochs
best_epochs.append(epoch_scores.index(min(epoch_scores)) + 1)
# Return the average CV score & median best epochs
return np.mean(cv_scores), np.median(best_epochs)
Next, we define the Optuna objective function for logistic regression. Here, we decide the search space for each hyperparameter we want to tune.
- We’ll tune alpha between a very small float value of 5e-5 and a relatively large value of 0.5, though the optimal value is likely close to 0. Therefore, we will instruct Optuna to sample from the log domain (see the sketch after this list):
  - The logarithm of the search space will be taken, values will be sampled from this space, and the exponent will be taken to return the original value. This will make lower values more likely to be suggested. We’ll also do this for other hyperparameters in other models that usually take an optimal value closer to 0.
- l1_ratio is constrained between 0 and 1, so we will use that as our search space. A value of 0 will result in pure L2 regularization, and a value of 1 will result in pure L1 regularization. Anything in between is elastic net regularization, a blend of the two.
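As a quick illustration of what log-domain sampling amounts to (independent of Optuna’s internals): draw uniformly in log space, then exponentiate, so small values are proposed far more often than a plain uniform draw would suggest.
# Illustration of log-uniform sampling over the alpha search space
import numpy as np
rng = np.random.default_rng(1923)
log_uniform_draws = np.exp(rng.uniform(np.log(5e-5), np.log(0.5), size = 5))
print(log_uniform_draws)  # draws concentrate towards the lower end of the [5e-5, 0.5] range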
# Define objective function for hyperparameter tuning with Optuna
def objective_logistic(trial):
# Define hyperparameter ranges to tune over, and suggest a value for
# each hyperparameter
alpha = trial.suggest_float(
"reg_strength", # Parameter name in trials result table
5e-5, # Minimum value to try
0.5, # Maximum value to try
log = True # Sampling from the log domain makes lower values more likely
)
l1_ratio = trial.suggest_float("l1_ratio", 0, 1)
# Validate the parameter set
score, epoch = validate_logistic(
alpha = alpha, l1_ratio = l1_ratio, trial = trial)
# Report best n. of epochs to Optuna
trial.set_user_attr("n_epochs", epoch)
return score
Now we create the Optuna study. This object will manage the tuning process and store the information from trials.
- We set a sampler algorithm: The sampler suggests hyperparameter values for trials intelligently, often based on the results of previous trials.
  - In our case, we use the suggested default, the Tree-structured Parzen Estimator algorithm (see the documentation for a list of papers related to TPE). In short, the TPE algorithm fits Gaussian Mixture Models to estimate the optimal parameters for each trial.
- We set a pruner algorithm: The pruner decides whether to complete a trial, or prune it based on its intermediate validation scores. This saves us a great deal of time by stopping unpromising trials quickly.
  - We use the Hyperband pruner algorithm, which is suggested for the TPE sampler. In short, Hyperband allocates a finite budget equally to each hyperparameter configuration to be compared. There is the option to increase the total budget, at the expense of eliminating candidate configurations more quickly (see the documentation and original paper for more detail).
- We name our study, and ensure we set the correct objective: Since a lower log loss value is better, we aim to minimize the reported metric.
# Create Optuna study
study_logistic = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923), # Sampling algorithm
pruner = optuna.pruners.HyperbandPruner(), # Pruning algorithm
study_name = "tune_logistic",
direction = "minimize" # Objective is to minimize the reported metric
)
We can finally run the hyperparameter optimization. It is recommended to run the TPE algorithm with 100 to 1000 trials. The models in the final edition of this report were tuned with 500 or 1000 trials.
- With pruning in place, even if improvement stops after a few trials, the remaining trials will mostly be pruned.
- For logistic regression, one study of 500 trials usually took around 10-15 minutes, with the SGDClassifier parallelized with n_jobs = -1 and 12 CPU cores available.
# Perform Optuna study
study_logistic.optimize(
objective_logistic,
n_trials = 500,
show_progress_bar = True)
The trials were exported to .csv for future use, and the best set of hyperparameters will be imported.
# Retrieve and export trials
trials_logistic = study_logistic.trials_dataframe().sort_values("value", ascending = True)
trials_logistic.to_csv("./ModifiedData/trials_logistic.csv", index = False)
# Import best trial
best_trial_logistic = pd.read_csv("./ModifiedData/trials_logistic.csv")
best_trial_logistic = best_trial_logistic.loc[
best_trial_logistic["state"] == "COMPLETE"].iloc[0,]
best_trial_logistic
number 471
value 0.5757
datetime_start 2023-07-10 14:21:53.276629
datetime_complete 2023-07-10 14:21:57.814307
duration 0 days 00:00:04.537678
params_l1_ratio 0.0003
params_reg_strength 0.0028
user_attrs_n_epochs 20.0000
system_attrs_completed_rung_0 0.5715
system_attrs_completed_rung_1 0.5715
system_attrs_completed_rung_2 NaN
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 298, dtype: object
We see the best tune has a very low L1 regularization ratio, though not zero. The regularization strength is also fairly low, but nowhere near the lower bound of our search space. This suggests we are likely close to the optimal solution.
The median best number of epochs for the 3 CV folds was 20, which is the number of epochs we’ll use in our final logistic regression model, created below.
-
We combine our preprocessing pipeline with our logistic regression model, to perform all steps in one go.
-
We'll train our final model with the optimal hyperparameters, for the optimal number of epochs derived from cross-validation, with no early stopping.
-
Even with the built-in validation disabled, SGDClassifier performs early stopping based on the training loss.
- This can be misleading, as lower training loss doesn't necessarily translate into better predictive performance on unseen data (in the case of overfitting). To avoid this, we set an arbitrarily high number for the n_iter_no_change parameter.
# Create pipeline with final logistic regression model
pipe_logistic = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("Logistic", SGDClassifier( # Model step
loss = "log_loss",
penalty = "elasticnet",
alpha = best_trial_logistic["params_reg_strength"],
l1_ratio = best_trial_logistic["params_l1_ratio"],
max_iter = 20, # Best n. of epochs found from validation
n_iter_no_change = 1000, # Arbitrary high value to ensure training doesn't early stop based on training loss
verbose = 1, # Print training progress
random_state = 1923
)
)
]
)
Our second model, SVM, aims to classify observations by separating them with decision boundaries drawn in the feature space.
-
Normally, the decision boundaries are linear, but non-linear transformations can be applied to the feature space to model non-linear relationships.
- There are numerous kernel functions that can be used to transform a feature space. The best choice will depend on the nature of the relationships in the data, but we will use a general-purpose radial basis function. The model performance will greatly depend on the suitability of the chosen kernel.
-
SVM can also be susceptible to multicollinearity and the curse of dimensionality, so we will apply & tune regularization parameters. Regularization affects the size of the margin around the decision boundary:
-
With higher regularization, we have a larger margin, which may generalize better to new data (reduce overfitting), but may also lead to underfitting by not separating the training observations enough.
-
With lower regularization, we have a smaller margin. We can separate the training observations more accurately, but this could lead to overfitting on unseen data.
-
-
SVM does not natively output probability predictions between 0 and 1. Instead, it calculates a decision function, based on the distance of an observation from the decision boundaries. This has implications for our performance evaluations, which we will discuss more in model testing.
In scikit-learn, a linear SVM classifier can normally be applied with LinearSVC. A non-linear SVM with built-in kernel options can be applied with SVC. Both approaches, especially the latter, will be computationally expensive with a large dataset. Just like with logistic regression, we can use SGDClassifier to train our SVM much more quickly with an SGD algorithm.
-
Before we apply the SGDClassifier, we'll add a kernel approximation step to our pipeline, to apply a non-linear transformation to our feature space. RBFSampler uses random Fourier features to quickly approximate an RBF feature space (original paper).
-
This may or may not be a good kernel choice for our data. The only way to be sure is to try different options. Here, we are using a general purpose non-linear kernel to compare a non-linear SVM with other models. Generally, it is suggested to use the “most linear decision boundaries that sufficiently separate the data”, as they tend to be more robust and efficient.
-
RBFSampler has two parameters, gamma and n_components.
  - gamma is an RBF function parameter that affects the reach of each training observation, and therefore the “smoothness” of the decision boundary. We leave this value to a default heuristic dependent on the number of features and their variance.
  - A higher gamma means the decision boundary is affected mostly by the closest points, making it less linear and less smooth. This can separate the training observations better, but may also cause overfitting on unseen data.
  - A lower gamma does the opposite: a decision boundary affected by observations further away, more linear and smoother. This may generalize better to new data, but may also cause underfitting.
  - n_components refers to the number of Monte Carlo samples per feature used to approximate the RBF feature space. A higher value may increase the quality of the approximation, but we leave this at its default too.
After the kernel approximation gives us a non-linear feature space, the SGDClassifier will act as a linear SVM if we change our loss metric to hinge loss (a short standalone sketch of this combination follows after this list).
-
Hinge loss is the default loss metric used to train SVM models. It is calculated from the output of the decision function. It slightly penalizes even “correct” predictions if they were too close to the decision boundary, i.e. close to being misclassified. This property aims to maximize the distances between the observations and the decision boundary.
-
We’ll apply class weights to hinge loss, just like we did with log loss. We’ll also use class weighted hinge loss to score predictions on the validation data.
-
SVM is normally fit with L2 regularization only, but we’ll try elastic net regularization just like with logistic regression.
-
The validation function and the Optuna tuning process is pretty much the same with logistic regression. The code is available below. The Optuna study with 500 trials took roughly 15 minutes with 12 CPU cores available.
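Here is a minimal standalone sketch of this combination (with made-up random data, separate from our actual pipeline): RBFSampler approximates an RBF feature space, and SGDClassifier with hinge loss then fits a linear SVM on the transformed features, scored with hinge loss computed from the decision function.
# Standalone sketch: kernel approximation + linear SVM trained with SGD
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import hinge_loss

rng = np.random.default_rng(1923)
x_demo = rng.normal(size = (200, 10)) # 200 made-up observations, 10 features
y_demo = (x_demo[:, 0] * x_demo[:, 1] > 0).astype(int) # Non-linear made-up target

# Map the features into a 100-dimensional approximation of the RBF kernel space
rbf_demo = RBFSampler(gamma = "scale", n_components = 100, random_state = 1923)
x_rbf = rbf_demo.fit_transform(x_demo)
print(x_rbf.shape) # (200, 100)

# A linear SVM on the transformed features, trained with SGD
svm_demo = SGDClassifier(loss = "hinge", random_state = 1923).fit(x_rbf, y_demo)

# Hinge loss is computed from the decision function, not from probabilities
print(hinge_loss(y_demo, svm_demo.decision_function(x_rbf)))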
Show validation function definition
# Define model validation function for SVM
def validate_svm(alpha, l1_ratio, trial, tol = 1e-4, verbose = 0):
# Record the CV scores of the parameter set
cv_scores = []
# Record best epochs for each CV fold
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Compute class weight
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0]) # Weight for the positive class where y == 1
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Define RBF kernel approximator
kernel_rbf = RBFSampler(
gamma = "scale",
n_components = 100,
random_state = 1923
)
# Define preprocessing & kernel trick pipeline
pipe_svm = Pipeline(steps = [
("preprocessing", pipe_process),
("kernel_rbf", kernel_rbf)
])
# Perform preprocessing
x_tr = pipe_svm.fit_transform(x_tr, y_tr)
x_val = pipe_svm.transform(x_val)
# Define SVM classifier with SGD
model_svm = SGDClassifier(
loss = "hinge", # Hinge loss for SVM
penalty = "elasticnet", # Elastic net penalty as opposed to default L2
learning_rate = "optimal", # Dynamically adjusted based on reg. strength
random_state = 1923,
verbose = verbose,
n_jobs = -1,
alpha = alpha,
l1_ratio = l1_ratio
)
# Perform epoch by epoch training with early stopping & pruning
epoch_scores = []
n_iter_no_change = 0
tol = tol
for epoch in range(1000):
# Train model for 1 epoch
model_svm.partial_fit(x_tr, y_tr, classes = classes,
sample_weight = sample_weight)
# Score epoch
pred_decision = model_svm.decision_function(x_val)
epoch_score = hinge_loss(y_val, pred_decision,
sample_weight = sample_weight_val)
# For first CV fold, report intermediate score of trial to Optuna
if i == 0:
trial.report(epoch_score, epoch)
# Prune trial if necessary
if trial.should_prune():
raise optuna.TrialPruned()
# Count epochs with no improvement after first 10 epochs
if (epoch > 9) and (epoch_score > min(epoch_scores) - tol):
n_iter_no_change += 1
# Reset epochs with no improvement if an improvement takes place
if (epoch > 9) and (epoch_score <= min(epoch_scores) - tol):
n_iter_no_change = 0
# Append epoch score to list of epoch scores
epoch_scores.append(epoch_score)
# Early stop training if necessary
if n_iter_no_change >= 10:
print("Early stopping at epoch " + str(epoch))
break
# Append best epoch score to list of CV scores
cv_scores.append(min(epoch_scores))
# Append best epoch number to list of best epochs
best_epochs.append(epoch_scores.index(min(epoch_scores)) + 1)
# Return the average CV score & median best n. epochs
return np.mean(cv_scores), np.median(best_epochs)
Show Optuna objective function
# Define objective function for hyperparameter tuning
def objective_svm(trial):
# Define parameter ranges to tune over
alpha = trial.suggest_float("reg_strength", 5e-5, 0.5, log = True)
l1_ratio = trial.suggest_float("l1_ratio", 0, 1)
# Validate the parameter set
score, epoch = validate_svm(
alpha = alpha, l1_ratio = l1_ratio, trial = trial)
# Report best n. of epochs to Optuna
trial.set_user_attr("n_epochs", epoch)
return score
# Import best trial
best_trial_svm = pd.read_csv("./ModifiedData/trials_svm.csv")
best_trial_svm = best_trial_svm.loc[
best_trial_svm["state"] == "COMPLETE"].iloc[0,]
best_trial_svm
number 154
value 0.7448
datetime_start 2023-07-10 14:30:33.857628
datetime_complete 2023-07-10 14:30:38.622736
duration 0 days 00:00:04.765108
params_l1_ratio 0.1330
params_reg_strength 0.0004
user_attrs_n_epochs 16.0000
system_attrs_completed_rung_0 0.7396
system_attrs_completed_rung_1 0.7376
system_attrs_completed_rung_2 NaN
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 303, dtype: object
The optimal hyperparameter set has lower regularization strength compared to logistic regression, but a higher ratio of L1 regularization. This could suggest the RBF transformation made our feature set more redundant, but the parameter values are likely not directly comparable across different algorithms.
We create the final SVM model, combining it with the preprocessing & kernel approximation pipeline.
- We also wrap our SVM model in a CalibratedClassifierCV object. This will approximate probability values between 0 and 1 from the SVM decision function, so we can retrieve probability predictions, calculate some performance metrics that depend on them, and compare SVM with other models.
- Keep in mind the probabilities acquired in this manner may be unreliable. The documentation states the default sigmoid method may be unsuitable for imbalanced problems, so we use the non-parametric, isotonic version.
# Create pipeline with final SVM model
pipe_svm = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("KernelTrick", RBFSampler( # Non-linear transformation of features with
# RBF kernel
gamma = "scale", # RBF kernel parameter, left to a default heuristic
n_components = 100, # N. of Monte Carlo samples used to approximate
# each feature
random_state = 1923
)
),
("SVM", CalibratedClassifierCV( # Prob. calibrator to estimate predicted probs.
estimator = SGDClassifier( # Model step
loss = "hinge", # Hinge loss to train a linear SVM
penalty = "elasticnet", # Elastic net regularization instead of default L2
alpha = best_trial_svm["params_reg_strength"],
l1_ratio = best_trial_svm["params_l1_ratio"],
max_iter = 16, # Best n. of epochs found from validation
n_iter_no_change = 1000, # Arbitrary high value to ensure training doesn't early stop based on training loss
verbose = 1, # Print training progress
random_state = 1923
),
method = "isotonic" # Non-parametric method to estimate predicted probs.
)
)
]
)
Next is our XGBoost model, a gradient boosting algorithm. Gradient boosting builds a number of models sequentially, each one trained on the prediction errors (residuals) of the previous one. The final ensemble model is a combination of all models.
-
One iteration of gradient boosting (equivalent to building one model) is called a round. By default, XGBoost builds a decision tree model in each round.
-
Tree-based models learn relationships & make predictions by splitting the data according to the predictor values.
-
This means they can natively model non-linear relationships, and find interaction effects even if they are not explicitly mapped as features (though doing so may still improve performance).
-
However, they can’t extrapolate relationships between predictors and the target outside of the observed training data.
- For example, for a feature X and target Y, a relationship of Y = 2X will be easily modeled and extrapolated by a linear model, even if X takes a very large value not seen in training. A tree-based model, however, will continue predicting according to the highest X value seen in training (a short standalone sketch of this follows after this list).
-
-
No hard assumptions are made about the data, though too many features can make it harder to identify the optimal splits and potentially hamper prediction performance.
-
Normally, a single decision tree with too few predictors & splits will tend to underfit, and one with too many predictors & splits will tend to overfit. In XGBoost, we control the complexity of each tree with several hyperparameters. Generally, we build numerous shallow trees (called weak learners) and ensemble them to arrive at a robust final prediction.
-
XGBoost can apply several types of regularization & sampling to further control the tradeoff between overfitting and underfitting. Tuning the numerous hyperparameters is crucial for good performance.
-
While model performance may not be impacted by multicollinearity in the feature set, the feature importance scores that can be extracted from tree-based models can be misleading. These only look at the relevancy between the target and the features, without taking their redundancy with one another into account (for more on relevancy, redundancy & feature selection, see this article).
-
While there are a lot of hyperparameters to tune for XGBoost, the code is simpler compared to SGDClassifier.
-
We can simply pass a validation set while training an XGBoost model, which automatically handles early stopping, records the best validation score and the best round.
-
Callbacks are objects that monitor model training and perform a certain action based on a condition, most commonly early stopping. Optuna offers a special pruning callback integration for XGBoost, so we don’t have to manually report scores to the Optuna study.
-
XGBoost supports GPU training, which will save us some training time. With 1000 trials, the Optuna study took roughly 25-30 minutes on my GPU.
-
The scale_pos_weight hyperparameter can be used to balance classes in binary classification problems, but it's apparently not applied to validation score calculations (see discussion), so we create sample weight vectors and pass them to .fit() as we did before.
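Below is a minimal standalone sketch (with made-up data and models not used elsewhere in this report) of the extrapolation limitation mentioned in the list above: a linear model extrapolates Y = 2X beyond the training range, while a single decision tree predicts a constant past the largest X it saw.
# Standalone sketch: tree-based models don't extrapolate beyond the training range
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

x_demo = np.arange(0, 100).reshape(-1, 1) # X from 0 to 99
y_demo = 2 * x_demo.ravel() # Y = 2X

linear_demo = LinearRegression().fit(x_demo, y_demo)
tree_demo = DecisionTreeRegressor(random_state = 1923).fit(x_demo, y_demo)

x_new = np.array([[1000]]) # Far outside the training range
print(linear_demo.predict(x_new)) # ~2000, extrapolates the linear relationship
print(tree_demo.predict(x_new)) # 198, stuck at the prediction for the largest X seen in training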
Show validation function definition
# Define model validation function for XGBoost
def validate_xgb(
params_dict, # Dictionary of hyperparameter values
trial, verbose = 0):
# Record best epoch scores for each CV fold
cv_scores = []
# Record best epochs for each CV fold
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Compute class weight
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0])
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Create pruning callback for first CV split
if i == 0:
callback_pruner = [optuna.integration.XGBoostPruningCallback(
trial, "validation_0-logloss")]
else:
callback_pruner = None
# Define XGBoost classifier
model_xgb = XGBClassifier(
objective = "binary:logistic",
n_estimators = 5000, # Arbitrarily high number of training rounds, as early stopping is active
early_stopping_rounds = 50, # Stop training after 50 rounds with no improvement
eval_metric = "logloss", # Validation metric
tree_method = "gpu_hist", # Train with GPU
gpu_id = 0,
verbosity = 0,
random_state = 1923,
callbacks = callback_pruner,
learning_rate = params_dict["learning_rate"],
max_depth = params_dict["max_depth"],
min_child_weight = params_dict["min_child_weight"],
gamma = params_dict["gamma"],
reg_alpha = params_dict["reg_alpha"],
reg_lambda = params_dict["reg_lambda"],
subsample = params_dict["subsample"],
colsample_bytree = params_dict["colsample_bytree"]
)
# Train & validate model
model_xgb.fit(
X = x_tr,
y = y_tr,
sample_weight = sample_weight,
eval_set = [(x_val, y_val)], # Validation set
sample_weight_eval_set = [sample_weight_val],
verbose = verbose)
# Append best epoch score to list of CV scores
cv_scores.append(model_xgb.best_score)
# Append best epoch number to list of best epochs
best_epochs.append(model_xgb.best_iteration + 1)
# Return the average CV score & median best n. rounds
return np.mean(cv_scores), np.median(best_epochs)
Let’s discuss the numerous parameters of XGBoost we will tune, and the search spaces.
- learning_rate is the magnitude of the model update after each boosting round, as in SGD training. The same considerations as explained before apply. A general heuristic is to tune the XGBoost LR between 0.05 and 0.3 for most problems. It's also possible to use LR scheduling, but we will save this for the NN model.
- max_depth is the maximum depth of each tree, an integer value. A higher value will make the model more complex, and more likely to overfit. The default is 6, but we will tune it over a broad range.
- min_child_weight is a parameter that affects tree complexity. A higher value will make it harder for the model to justify new splits. This is usually tuned as an integer parameter.
- gamma is a special regularization parameter for tree models. It is the minimum loss reduction required to justify another split. The optimal value is often closer to zero.
- alpha and lambda are L1 and L2 regularization respectively, applied to model weights. L1 is 0 by default, and L2 is 1 by default. We'll tune both over a wide interval around their defaults, to try different blends of L1 and L2 regularization.
- subsample and colsample_bytree are the subsampling parameters, both set to 1 by default (meaning no subsampling).
  - Setting a lower value for subsample will subsample the training observations by that ratio in every boosting round.
  - A lower value for colsample_bytree will subsample the features by that ratio in every boosting round.
  - Subsampling can help make the models more robust against overfitting. It's possible to subsample features even more frequently than once per round using other hyperparameters (see documentation).
# Define objective function for hyperparameter tuning
def objective_xgb(trial):
# Suggest parameter values from parameter ranges to tune over
learning_rate = trial.suggest_float("learning_rate", 0.05, 0.3)
max_depth = trial.suggest_int("max_depth", 2, 20)
min_child_weight = trial.suggest_int("min_child_weight", 1, 20, log = True)
gamma = trial.suggest_float("gamma", 5e-5, 0.5, log = True)
reg_alpha = trial.suggest_float("l1_reg", 5e-5, 1, log = True)
reg_lambda = trial.suggest_float("l2_reg", 0, 2)
subsample = trial.suggest_float("subsample", 0.5, 1)
colsample_bytree = trial.suggest_float("colsample_bytree", 0.25, 1)
# Make dictionary of parameters
params_dict = {
"learning_rate": learning_rate,
"max_depth": max_depth,
"min_child_weight": min_child_weight,
"gamma": gamma,
"reg_alpha": reg_alpha,
"reg_lambda": reg_lambda,
"subsample": subsample,
"colsample_bytree": colsample_bytree
}
# Validate parameter set
score, rounds = validate_xgb(
params_dict = params_dict, trial = trial)
# Report best n. rounds to Optuna
trial.set_user_attr("n_rounds", rounds)
return score
# Import best trial
best_trial_xgb = pd.read_csv("./ModifiedData/trials_xgb.csv")
best_trial_xgb = best_trial_xgb.loc[
best_trial_xgb["state"] == "COMPLETE"].iloc[0,]
best_trial_xgb
number 482
value 0.5666
datetime_start 2023-07-10 14:53:42.307668
datetime_complete 2023-07-10 14:53:46.155453
duration 0 days 00:00:03.847785
params_colsample_bytree 0.4027
params_gamma 0.0029
params_l1_reg 0.0134
params_l2_reg 1.6294
params_learning_rate 0.2291
params_max_depth 5
params_min_child_weight 6
params_subsample 0.9745
user_attrs_n_rounds 22.0000
system_attrs_completed_rung_0 0.6182
system_attrs_completed_rung_1 0.5709
system_attrs_completed_rung_2 0.5678
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 0, dtype: object
The optimal set of hyperparameters yields trees on the shallow side, with some regularization. Most interesting is the colsample_bytree parameter: a value of roughly 0.4 means each tree is built with a subsample of 40% of the features in our data. This is more evidence that we may have many redundant (or simply useless) features in our dataset.
# Create pipeline with final XGB model
pipe_xgb = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("XGBoost", XGBClassifier( # Model step
objective = "binary:logistic",
n_estimators = 22, # Best number of rounds from validation
eval_metric = "logloss",
tree_method = "gpu_hist",
gpu_id = 0,
verbosity = 1,
random_state = 1923,
learning_rate = best_trial_xgb["params_learning_rate"],
max_depth = best_trial_xgb["params_max_depth"],
min_child_weight = best_trial_xgb["params_min_child_weight"],
gamma = best_trial_xgb["params_gamma"],
reg_alpha = best_trial_xgb["params_l1_reg"],
reg_lambda = best_trial_xgb["params_l2_reg"],
subsample = best_trial_xgb["params_subsample"],
colsample_bytree = best_trial_xgb["params_colsample_bytree"]
)
)
]
)
Our final model is a neural network. There is much to explain about how neural networks work, and there are many complex architectures, but I’ll summarize the basic idea by comparing with a linear regression:
-
In a linear regression equation, the inputs are the features. These inputs are multiplied with their respective weights (coefficients) and summed up, along with a bias (intercept) value. This gives us the output / prediction.
-
One such linear regression equation is equivalent to one neuron / linear unit in a neural network. A layer is an organization of numerous neurons / units, or another transformation that is applied to the inputs.
-
The inputs of a neural network are equivalent to the original features, as they are fed into the network. The output layer is the last layer that gives the final output / prediction of the network.
-
A hidden layer is any layer between the network inputs and the output layer. The inputs of neurons in a hidden layer consist of the outputs of the previous layer (for the first hidden layer, simply the input features). Hidden layer neurons each apply their own weights and bias terms to the inputs, and pass their outputs to the next layer. In general, deep neural networks are those with multiple or many hidden layers.
-
We can think of a multiple linear regression model (with numerous predictors) as a single neuron / linear unit with multiple inputs.
-
When we fit a linear regression model, we estimate a single coefficient for each predictor, and one intercept term. In a neural network, we estimate a weight for each input of each neuron, and a bias term for each neuron.
-
Activation functions can be applied after a layer, to apply a mathematical transformation to its outputs before passing them on to the next layer as inputs.
-
With activation functions and hidden layers, the network can learn non-linear and complex relationships.
-
With a binary classification problem, a sigmoid activation function is applied to the network’s final, single output, which will transform it from a predicted logit to a probability value between 0 and 1.
Of course, this is the simplest summary of a basic neural network. Here is an excellent source for more information on many aspects of deep learning.
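To make the analogy concrete, here is a tiny standalone sketch (not part of the model code in this report) showing that a single linear unit with one output is structurally the same as a linear regression equation, with one weight per input and one bias term:
# Standalone sketch: one linear unit = one linear regression equation
import torch

linear_unit = torch.nn.Linear(in_features = 5, out_features = 1) # 5 hypothetical inputs
x = torch.randn(3, 5) # 3 observations, 5 features
print(linear_unit(x).shape) # torch.Size([3, 1]): one output per observation
print(linear_unit.weight.shape) # torch.Size([1, 5]): one weight per input
print(linear_unit.bias.shape) # torch.Size([1]): one bias term per neuron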
To build and train our NN model, we will use PyTorch, along with the associated Lightning framework, which makes it much easier to organize PyTorch code. We'll define several custom PyTorch and Lightning classes according to our needs, based on existing class templates.
First, we need to define a Dataset class. This class will store our data as Torch tensors. Its methods will return one row at a time, as a pair of feature tensors & target tensor, and the data length. We create a separate class for training & testing data, as the latter won't have associated target values.
Show Dataset class creation
# Define TrainDataset class: Takes in preprocessed features & targets
class TrainDataset(torch.utils.data.Dataset):
# Initialize class with training features & targets
def __init__(self, x_train, y_train):
# Store features as Torch tensor
self.x = torch.tensor(x_train, dtype = torch.float32)
# Store targets as Torch tensor
self.y = torch.tensor(y_train.values, dtype = torch.float32).unsqueeze(1)
# Return data length
def __len__(self):
return len(self.x)
# Return a pair of features & target (one training observation)
def __getitem__(self, idx):
return self.x[idx], self.y[idx]
# Define TestDataset class: Takes in preprocessed features only
class TestDataset(torch.utils.data.Dataset):
# Initialize class with training features
def __init__(self, x_test):
self.x = torch.tensor(x_test, dtype = torch.float32) # Store features
# Return data length
def __len__(self):
return len(self.x)
# Return one set of features
def __getitem__(self, idx):
return self.x[idx]
Next, we define our Lightning Module. This is the core class that will store our model architecture & hyperparameters, define our training, validation & prediction loops, and more. The explanation of each class method is better left to the code comments below. We'll discuss the model architecture and hyperparameters afterwards.
Show Lightning Module class creation
# Define Lightning module
class SeluDropoutModel(pl.LightningModule):
# Initialize model with hyperparameters dictionary
def __init__(self, hyperparams_dict):
# Delegate function to parent class
super().__init__()
# Save external hyperparameters so they are available when loading saved models
self.save_hyperparameters(logger = False)
# Initialize validation metric: Average precision
self.val_avg_precision = torchmetrics.classification.AveragePrecision(
task = "binary")
# Define hyperparameters
self.n_hidden_layers = hyperparams_dict["n_hidden_layers"]
self.input_size = hyperparams_dict["input_size"]
self.hidden_size = hyperparams_dict["hidden_size"]
self.learning_rate = hyperparams_dict["learning_rate"]
self.l2 = hyperparams_dict["l2"]
self.dropout = hyperparams_dict["dropout"]
self.loss_alpha = hyperparams_dict["loss_alpha"]
self.loss_gamma = hyperparams_dict["loss_gamma"]
# Define network architecture
# Initialize layers list with first hidden layer
self.layers_list = torch.nn.ModuleList([
torch.nn.Linear(self.input_size, self.hidden_size), # Hidden layer 1
torch.nn.SELU(), # Activation 1
torch.nn.AlphaDropout(self.dropout) # Dropout 1
])
# Add extra hidden layers to layers list, according to hyperparameter
for n in range(0, (self.n_hidden_layers - 1)):
self.layers_list.extend([
torch.nn.Linear(self.hidden_size, self.hidden_size), # Hidden layer N
torch.nn.SELU(), # Activation N
torch.nn.AlphaDropout(self.dropout) # Dropout N
])
# Add final output layer to layers list
self.layers_list.append(
torch.nn.Linear(self.hidden_size, 1) # Output layer
# No sigmoid activation here, because the loss function has that built-in
)
# Create full network from layers list
self.network = torch.nn.Sequential(*self.layers_list)
# Sigmoid activation for prediction step only, not part of forward
# propagation as the loss function has sigmoid activation built in
self.sigmoid = torch.nn.Sequential(torch.nn.Sigmoid())
# Initialize weights to conform with self-normalizing SELU activation
for layer in self.network:
if isinstance(layer, torch.nn.Linear):
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity = "linear")
torch.nn.init.zeros_(layer.bias)
# Define forward propagation
def forward(self, x):
output = self.network(x.view(x.size(0), -1))
return output # Returns logits, not probabilities
# Define training loop
def training_step(self, batch, batch_idx):
# Perform training
x, y = batch
output = self.forward(x)
# Calculate loss of training predictions
# Loss function applies sigmoid activation before calculating focal loss
loss = torchvision.ops.sigmoid_focal_loss(
output, y,
alpha = self.loss_alpha, # Positive class weight parameter
gamma = self.loss_gamma, # Parameter that discounts "easy" training examples
# in loss calculation
reduction = "mean")
# Log training loss
self.log(
"train_loss", loss,
on_step = False, on_epoch = True, prog_bar = True, logger = True)
# Return training loss
return loss
# Define validation loop
def validation_step(self, batch, batch_idx):
# Make predictions, apply sigmoid activation to get probabilities
x, y = batch
output = self.forward(x)
pred = self.sigmoid(output)
# Update & log avg. precision score, ensure y is in int32 format for metric
self.val_avg_precision(pred, y.type(torch.int32))
self.log(
"val_avg_precision", self.val_avg_precision,
on_step = True, on_epoch = True, prog_bar = True, logger = True)
# Return validation score
return self.val_avg_precision
# Define prediction method (because the default just runs forward(), which
# doesn't have sigmoid activation and doesn't return probabilities)
def predict_step(self, batch, batch_idx):
# Run the forward propagation
x = batch
output = self.forward(x)
# Apply sigmoid activation to return probability predictions
pred = self.sigmoid(output)
return pred
# Define optimization algorithm, LR scheduler
def configure_optimizers(self):
# Adam optimizer with L2 regularization
optimizer = torch.optim.Adam(
self.parameters(), lr = self.learning_rate, weight_decay = self.l2)
# Cyclic LR scheduler
lr_scheduler = torch.optim.lr_scheduler.CyclicLR(
optimizer,
base_lr = self.learning_rate, # Lowest LR at start of each cycle
max_lr = (self.learning_rate * 5), # Highest LR at the end of first cycle
mode = "exp_range", # Exponentially decreasing max. LR
gamma = 0.99995, # Reduction factor for max. LR
step_size_up = 200, # Training steps to go from base LR to max. LR. once
# Heuristic: (2-8 * (steps (batches) in one epoch))
cycle_momentum = False # Not compatible with Adam optimizer
)
return {
"optimizer": optimizer,
"lr_scheduler": {
"scheduler": lr_scheduler,
"interval": "step",
"frequency": 1
}
}
The model architecture above consists of at least one hidden layer, and it is possible to add more hidden layers depending on the n_hidden_layers parameter, which we'll tune.
-
In theory, one hidden layer enables us to solve non-linear separation problems, and two hidden layers enable us to solve problems with multiple disconnected decision boundaries. More complex problems may require more layers, but in general, we want to use the simplest architecture that solves our problem (here’s a good article that talks about this in more detail).
-
The first hidden layer will take in our features as inputs, so the input_size hyperparameter has to equal the number of features. We'll give this as a hyperparameter to the model, but we won't tune it.
- The output size of the first hidden layer will equal the input & output sizes of any subsequent hidden layer(s). We will tune this as the hidden_size hyperparameter.
  - It's also possible to set a different hyperparameter for the output size of each subsequent hidden layer and tune them separately, so the layers constantly get smaller as we move forward in the network, performing a reduction in the feature space (this is a key part of autoencoder architectures).
  - The output size of a hidden layer also equals the number of neurons / units in that layer, as 1 neuron = 1 output. Generally, we want hidden layer sizes to be between the network input size (number of features) and the final output size (one in this case).
-
-
The output layer will only have one output, the predicted logit for the positive class. We won’t apply a sigmoid activation at the end of the network, as our loss function applies sigmoid activation before calculating the loss.
-
Each hidden layer will be followed by a SELU activation function layer. This stands for “scaled exponential linear unit”. The SELU activation will enable us to model non-linear relationships, while also enforcing a self-normalizing property on the layer outputs (see original paper).
-
Neural networks can be prone to numerical instability issues, so it’s a common approach to apply forms of scaling / normalization inside the network, to the outputs of layers, before passing them on to the next layer.
-
The most common method for in-network normalization is adding batch normalization layers, but apparently it can interact with the application of dropout layers (see a detailed discussion related to this).
-
The scaling / normalization we apply to our features in preprocessing can also be applied as the first layer of a network.
-
-
The SELU activation after each hidden layer will be followed by a dropout layer.
-
This is a form of regularization which randomly drops out a fraction of the next layer’s inputs at each training step. Lightning automatically disables dropout when predicting with a trained NN model.
-
When neural networks overfit, they tend to learn noise as a very specific combination of weights across layers. Dropout breaks these combinations, forcing the network to generalize better.
-
We will use alpha dropout, which is applied after SELU activation, and maintains the self-normalizing property of SELU outputs.
-
The probability of dropping each input is another hyperparameter we will tune, as dropout. Obviously, a higher probability will result in more severe regularization and possible underfitting.
-
Our training loss function is going to be focal loss, which was developed for highly imbalanced classification problems in image recognition (see original paper). This loss function discounts the errors made for “easily classified” training examples, focusing the model’s attention on difficult training examples.
  - This loss function has two parameters, alpha and gamma, which we will tune as hyperparameters (a short standalone sketch of their effect follows after this list).
  - A higher alpha value assigns more weight to positive class observations when calculating the loss, similar to our class weighting for previous models' loss functions.
  - The gamma value controls how much “easy” training examples are discounted when calculating the loss. If gamma = 0, no discounting is applied, and the loss function becomes equivalent to cross entropy, which is a commonly used default loss function for NN classification.
    - Cross entropy is practically equivalent to log loss, which we used previously.
Previously, we used the same loss function both to train our models, and to score their performance on validation data.
-
In general, we want to train & tune models with a loss function that scores the probability predictions directly, instead of the class predictions (more on this in model testing).
-
However, when we tune the parameters of focal loss, our loss function becomes part of the optimization problem (just like the class weights we applied to the loss function for the previous models). If we use focal loss also as our validation metric, this will be a form of leakage: We’ll simply search for the focal loss parameters that minimize focal loss by changing the metric itself, not necessarily those that maximize model performance (moving the goalposts in a sense).
-
Therefore, to validate our NN hyperparameters’ performance, we will use the average precision performance metric (which will be explained in model testing). Keep in mind this value is to be maximized, and not minimized like a loss function.
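Below is a small standalone sketch (with made-up predictions) of this validation metric, computed both with the torchmetrics class used in the Lightning module and with scikit-learn:
# Standalone sketch: average precision, to be maximized
import torch, torchmetrics
from sklearn.metrics import average_precision_score

y_true_demo = torch.tensor([0, 0, 1, 1, 0, 1])
y_prob_demo = torch.tensor([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

metric = torchmetrics.classification.AveragePrecision(task = "binary")
print(metric(y_prob_demo, y_true_demo)) # ~0.9167
print(average_precision_score(y_true_demo.numpy(), y_prob_demo.numpy())) # Same value; higher is better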
Finally, we choose an optimizer and an optional learning rate scheduler.
The optimizer manages how the model weights are updated after each training step.
-
Recall that NN models are usually trained with a form of minibatch gradient descent: in one training step, a batch of training observations is fed to the model, the loss is computed for the predictions, and the model weights are updated, taking the learning rate into account.
-
Minibatch SGD is available as an optimizer, but the commonly used default is the Adam algorithm. It is a robust general purpose optimizer that accelerates convergence and adjusts the learning rate dynamically. This is what we will use.
-
Adam can also apply L2 regularization to the model weights. We'll tune this as our l2 hyperparameter.
The learning rate scheduler will dynamically adjust the learning rate, to ensure we get the best tradeoff between fast & optimal convergence, and the risk of overfitting & erratic training steps.
-
A simple approach is starting training with a relatively higher LR, and gradually lowering it after a certain number of training steps / epochs are elapsed with no improvement.
-
We'll try a more sophisticated approach: a cyclic LR scheduler will cycle the LR from a base_lr value to a max_lr value, then back down to the base_lr, in a certain number of training steps. After each cycle, the max_lr will be reduced in an exponential fashion. We'll tune only the base_lr as our hyperparameter learning_rate, but it's possible to tune other parameters of the cyclic scheduler (a short standalone sketch of this schedule follows after this list).
-
The rationale behind this approach is that a higher LR rate may appear to harm the training process initially, but yield better results later on. By cycling the LR in this fashion, we still decrease the “average” LR in the long term, while periodically increasing the LR in the short term, to avoid getting stuck in locally optimal solutions (more on this in the original paper, and a good summary article).
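A small standalone sketch of this schedule is below (with arbitrary parameters, not the tuned values): the LR cycles up and down, and the peak decays slightly from one cycle to the next.
# Standalone sketch: cyclic LR schedule with exponentially decaying peaks
import torch

params = [torch.nn.Parameter(torch.zeros(1))] # Dummy parameter for the optimizer
optimizer = torch.optim.Adam(params, lr = 1e-3)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr = 1e-3, max_lr = 5e-3, mode = "exp_range",
    gamma = 0.99995, step_size_up = 200, cycle_momentum = False)

lrs = []
for step in range(800):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
print(max(lrs[:400]), max(lrs[400:])) # The second cycle's peak LR is slightly lower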
The last Lightning class we need to create is simply an exact copy of the existing Optuna pruner callback, due to compatibility issues (similar issues were discussed here and here).
Show Optuna pruner creation
# Optuna code uses old pytorch_lightning namespace which causes an error with the
# pruning callback integration. Create copy of the Optuna pruning callback with
# the new lightning.pytorch namespace as a workaround.
class OptunaPruning(PyTorchLightningPruningCallback, pl.Callback):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
Now, we define our validation function for the neural network model. There are a few more PyTorch and Lightning functions & classes we will use to load our data & validate our models.
- We create two DataLoader instances in the loop, one each for our training & validation data. The dataloader creates minibatches from the data according to the batch_size parameter, and serves these batches to the model for training, validation & prediction.
  - Batch size mainly affects the training speed, though it could affect the model performance as well (for example, smaller batch sizes could result in noisier model updates). Generally, it's suggested to use the largest batch size possible with the available memory.
  - The training data is shuffled for batching, but not the validation or prediction data.
  - The validation & prediction steps could also be carried out without batching if the available memory is enough.
  - The num_workers argument controls the number of parallel processes used by the dataloader. I've often had issues with this parameter, so I did not use parallelization.
- We create an early stopping callback and an Optuna pruning callback.
- Finally, we create the Trainer. The trainer takes the model, dataloaders, callbacks and additional settings, and performs training, testing and prediction.
  - We set parameters such as a high number of maximum training epochs (as early stopping is active), GPU training options, and logging / checkpointing / model summary options.
  - With GPU training and no parallelization on the dataloaders, the Optuna study of 500 trials took roughly 4 hours on my machine.
Show validation function definition
# Define validation function for NN
def validate_nn(hyperparams_dict, trial, tol = 1e-4):
# Store the CV scores for the parameter set
cv_scores = []
# Store the best checkpoint strings from Lightning Trainer
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Load data into TrainDataset
train_data = TrainDataset(x_tr, y_tr)
val_data = TrainDataset(x_val, y_val)
# Create data loaders
train_loader = torch.utils.data.DataLoader(
train_data,
batch_size = 1024, # Large batch size for fast training
num_workers = 0, # DataLoader parallelization disabled, as it causes issues
shuffle = True # Shuffle training data before creating training batches
)
val_loader = torch.utils.data.DataLoader(
val_data, batch_size = len(x_val), num_workers = 0,
shuffle = False # Don't shuffle val. data before creating val. batches
)
# Create callbacks list
callbacks = []
# Create early stopping callback
callback_earlystop = pl.callbacks.EarlyStopping(
monitor = "val_avg_precision",
mode = "max", # Goal is to increase the val. metric, not decrease it
min_delta = tol,
patience = 10 # Early stop after 10 epochs without at least tol improvement
)
callbacks.append(callback_earlystop)
# Create Optuna pruner callback only for first CV fold
if i == 0:
callback_pruner = OptunaPruning(trial, monitor = "val_avg_precision")
callbacks.append(callback_pruner)
# Create Lightning Trainer
trainer = pl.Trainer(
max_epochs = 500, # Arbitrary high number of training epochs, as early stopping is active
log_every_n_steps = 5, # The default is 50, but there are fewer than 50 training steps (batches) per epoch
accelerator = "gpu", devices = "auto", precision = "16-mixed", # GPU training
callbacks = callbacks,
logger = True,
enable_model_summary = False, # Disable printing model summary
enable_progress_bar = False, # Disable prog. bar for Optuna trials
enable_checkpointing = False # Disable checkpoints for Optuna trials
)
# Create & train model
model = SeluDropoutModel(hyperparams_dict = hyperparams_dict)
trainer.fit(model, train_loader, val_loader)
# Retrieve best val score and n. of epochs
score = callbacks[0].best_score.cpu().numpy()
epoch = trainer.current_epoch - callbacks[0].wait_count # Starts from 1
cv_scores.append(score)
best_epochs.append(epoch)
# Return the mean CV score, median best epoch
return np.mean(cv_scores), np.median(best_epochs)
Our tuning objective function is similar to the one for XGBoost.
-
We’ll try networks with 1 to 3 hidden layers. I’d expect 1 or 2 hidden layers to be enough for this problem, but in previous iterations, 3-4 hidden layers performed close to optimal as well.
-
We won’t sample a hidden size value directly, but we’ll sample an exponent for the base of 2, yielding a hidden size in powers of 2, from 2 to 64. All hidden layers will have the same size.
-
The default learning rate for the Adam algorithm is 1e-3. We’ll tune from a broad range around that, as learning rate scheduling is in place.
-
The focal loss parameter alpha takes a value from 0 to 1, which will be our search space. A gamma value of 2 is suggested as a good setting by the authors of the original paper, so we'll tune from a range around that.
# Define Optuna objective
def objective_nn(trial):
# Define parameter ranges to tune over & suggest param set for trial
n_hidden_layers = trial.suggest_int("n_hidden_layers", 1, 3)
hidden_size = 2 ** trial.suggest_int("hidden_size", 1, 6) # Tune hidden size as
# exponent of 2
learning_rate = trial.suggest_float("learning_rate", 5e-4, 5e-2)
l2 = trial.suggest_float("l2", 5e-5, 1e-2, log = True)
dropout = trial.suggest_float("dropout", 1e-3, 0.25, log = True)
loss_alpha = trial.suggest_float("loss_alpha", 0, 1)
loss_gamma = trial.suggest_float("loss_gamma", 0, 4)
# Create hyperparameters dict
hyperparams_dict = {
"input_size": 90, # N. of input features
"n_hidden_layers": n_hidden_layers,
"hidden_size": hidden_size,
"learning_rate": learning_rate,
"l2": l2,
"dropout": dropout,
"loss_alpha": loss_alpha,
"loss_gamma": loss_gamma
}
# Validate hyperparameter set
score, epoch = validate_nn(hyperparams_dict = hyperparams_dict, trial = trial)
# Report best n. of epochs
trial.set_user_attr("n_epochs", epoch)
return score
# Create study
study_nn = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923),
pruner = optuna.pruners.HyperbandPruner(),
study_name = "tune_nn",
direction = "maximize" # Objective is to maximize reported metric, not minimize
)
# Import best trial
best_trial_nn = pd.read_csv("./ModifiedData/trials_nn.csv")
best_trial_nn = best_trial_nn.loc[
best_trial_nn["state"] == "COMPLETE"].iloc[0,]
best_trial_nn
number 371
value 0.4530
datetime_start 2023-07-10 11:07:10.155796
datetime_complete 2023-07-10 11:07:51.176352
duration 0 days 00:00:41.020556
params_dropout 0.0240
params_hidden_size 3
params_l2 0.0003
params_learning_rate 0.0167
params_loss_alpha 0.4922
params_loss_gamma 0.1408
params_n_hidden_layers 3
user_attrs_n_epochs 17.0000
system_attrs_completed_rung_0 0.4447
system_attrs_completed_rung_1 0.4515
system_attrs_completed_rung_2 0.4557
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 7, dtype: object
The best tune has 3 hidden layers, each with a size of 8 (2^3, from the tuned exponent of 3).
-
This could mean our feature space is fairly non-linear, but also on the smaller side, as the hidden layer size is much smaller compared to the number of input features.
- This effectively results in a reduction of the feature space, which was also done by XGBoost in its own way.
-
Small amounts of L2 and dropout regularization are applied, but the lower bounds of the search spaces are not reached, so we should be close to the optimal values.
-
The base learning rate is considerably higher compared to the default of 0.001.
-
An alpha of roughly 0.49 means our focal loss function puts considerably more weight on the positive cases, and a low gamma of roughly 0.14 means “easy” training examples are slightly discounted from the loss calculation.
  - A heuristic suggested by the authors of the original paper was alpha = 0.25 and gamma = 2. Our tune is considerably far from that, but previous iterations with similar alpha and gamma values also performed very similarly.
The median best number of epochs for this parameter set was 17, which will be set later, in the Trainer for our final model. For now, we only create our final model instance with the best hyperparameters.
# Define NN model with best parameters (n. epochs is determined later in Trainer)
hyperparams_dict = {
"input_size": 90,
"n_hidden_layers": best_trial_nn["params_n_hidden_layers"],
"hidden_size": 2 ** best_trial_nn["params_hidden_size"],
"learning_rate": best_trial_nn["params_learning_rate"],
"l2": best_trial_nn["params_l2"],
"dropout": best_trial_nn["params_dropout"],
"loss_alpha": best_trial_nn["params_loss_alpha"],
"loss_gamma": best_trial_nn["params_loss_gamma"]
}
model_nn = SeluDropoutModel(hyperparams_dict)
Now, we’ll train the final versions of our models on the full training data, and test their performance on the testing data.
The steps for retrieving the final predictions are mostly explained in the code comments below.
# Compute class weight & sample weight vectors
classes = list(set(y_train))
class_weight = compute_class_weight("balanced", classes = classes, y = y_train)
sample_weight_train = np.where(y_train == 1, class_weight[1], class_weight[0])
sample_weight_test = np.where(y_test == 1, class_weight[1], class_weight[0])
We'll use a dummy classifier as our baseline to compare against. It always predicts the training data's class ratios (the prior class probabilities) as its probability predictions (a tiny standalone sketch of this follows after the code below).
# Create dummy classifier which always predicts the prior class probabilities
model_dummy = DummyClassifier(strategy = "prior")
# Make dict of models
models_dict = {
"Dummy": model_dummy,
"Logistic": pipe_logistic,
"SVM": pipe_svm,
"XGBoost": pipe_xgb,
"Neural net": model_nn
}
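As a tiny standalone sketch of this baseline's behavior (with a hypothetical 88% / 12% label split, roughly matching our target imbalance):
# Standalone sketch: DummyClassifier with strategy = "prior" ignores the features
import numpy as np
from sklearn.dummy import DummyClassifier

y_demo = np.array([0] * 88 + [1] * 12) # Hypothetical 88% / 12% class split
x_demo = np.zeros((100, 1)) # Features are ignored by the dummy classifier
dummy_demo = DummyClassifier(strategy = "prior").fit(x_demo, y_demo)
print(dummy_demo.predict_proba(np.zeros((3, 1)))) # Every row is [0.88, 0.12]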
We build, train & retrieve positive class prediction probabilities for every model in a loop.
Show training & prediction loop
# Train & predict with each model
preds_dict = {}
for key in models_dict.keys():
# Retrieve model
model = models_dict[key]
# Fit dummy classifier
if key == "Dummy":
model.fit(x_train, y_train)
# Fit NN classifier
elif key == "Neural net":
# Apply scikit preprocessing pipeline
x_tr = pipe_process.fit_transform(x_train, y_train)
x_test1 = pipe_process.transform(x_test)
# Create train & test Datasets, dataloaders
train_data = TrainDataset(x_tr, y_train)
test_data = TestDataset(x_test1)
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 1024, num_workers = 0, shuffle = True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size = len(test_data), num_workers = 0, shuffle = False)
# Create Lightning Trainer
trainer = pl.Trainer(
max_epochs = 17, # Best n. of epochs from validation
log_every_n_steps = 5, # The default is 50, but there are fewer than 50 training steps (batches) per epoch
accelerator = "gpu", devices = "auto", precision = "16-mixed", # GPU training
logger = True,
enable_progress_bar = True, # Display training progress bar
enable_checkpointing = True # Save model checkpoint
)
# Train model
trainer.fit(model, train_loader)
# Fit scikit-learn classifiers
else:
# Create unique sample weights argument to be passed into pipeline.fit
kwargs = {model.steps[-1][0] + "__sample_weight": sample_weight_train}
# Fit pipeline & model
model.fit(x_train, y_train, **kwargs)
# Predict positive class probabilities with NN
if key == "Neural net":
y_prob = trainer.predict(model_nn, test_loader)
# Convert list of float16 Torch tensors to single float32 np.array
preds_dict[key] = np.float32(y_prob[0].numpy().reshape(1, -1)[0])
# Predict positive class probabilities with scikit-learn classifiers
else:
y_prob = model.predict_proba(x_test)
y_prob = np.array([x[1] for x in y_prob]) # Retrieve only the positive class probabilities
preds_dict[key] = y_prob
The first set of metrics we’ll calculate are all dependent on the threshold probability: Precision, recall and F1 score.
-
The threshold probability is the minimum positive class probability required to classify an observation as positive. Numerous classification metrics take changing values depending on the threshold used to calculate them.
-
For binary classification, a default threshold of 0.5 is often assumed, but it’s important to use the ideal threshold depending on the problem, especially with a major class imbalance.
-
Precision refers to the fraction of positive class predictions that were correct (true positives / (true positives + false positives)).
-
Recall refers to the fraction of positive cases correctly found, out of all positive cases in the data (true positives / (true positives + false negatives)).
-
There is a natural tradeoff between precision and recall:
-
A higher threshold probability naturally means fewer cases are classified as positive. This generally results in higher precision and lower recall.
-
A lower threshold probability naturally means more cases are classified as positive. This generally results in higher recall and lower precision.
-
-
The F1 score is the harmonic mean of precision and recall at a given threshold probability. It aims to summarize both metrics at a given threshold. By default, precision and recall are weighted equally, but it’s possible to calculate differently weighted versions.
-
We’ll calculate precision, recall and F1 score at various threshold probabilities, for each model. We’ll retrieve the “optimal” threshold probability that maximizes F1 score, and the precision & recall values at this point.
- We will plot all three scores against the threshold probabilities to visualize the relationship between them.
-
We won’t consider plain accuracy, as it can be very misleading with a large class imbalance. To give an extreme example, simply predicting the negative class all the time will give us an accuracy of roughly 88%, as that is the ratio of negative cases in our data.
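The small standalone sketch below (with made-up predictions) illustrates the tradeoff described above: raising the threshold increases precision and lowers recall for the same set of probability predictions.
# Standalone sketch: precision, recall & F1 at two different thresholds
import numpy as np

y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.1, 0.2, 0.3, 0.4, 0.55, 0.7, 0.45, 0.6, 0.8, 0.9])

for threshold in [0.5, 0.65]:
    y_pred = (y_prob_demo >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true_demo == 1))
    fp = np.sum((y_pred == 1) & (y_true_demo == 0))
    fn = np.sum((y_pred == 0) & (y_true_demo == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(threshold, round(precision, 2), round(recall, 2), round(f1, 2))
# Threshold 0.5: precision 0.6, recall 0.75 | threshold 0.65: precision 0.67, recall 0.5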
Show dynamic metric calculations
# Retrieve F1 scores at different threshold probabilities for each model
scores_f1 = {}
scores_f1_best = {}
scores_precision = {}
scores_precision_best = {}
scores_recall = {}
scores_recall_best = {}
threshold_probs = {}
threshold_probs_best = {}
for key in preds_dict.keys():
# Retrieve the precision & recall pairs at various thresholds, calculate F1
# scores from them
precision, recall, thresholds = precision_recall_curve(y_test, preds_dict[key])
f1_scores = 2 * recall * precision / (recall + precision)
# Add results to respective dictionaries
scores_f1[key] = f1_scores
scores_f1_best[key] = max(f1_scores)
scores_precision[key] = precision
scores_precision_best[key] = precision[np.argmax(f1_scores)]
scores_recall[key] = recall
scores_recall_best[key] = recall[np.argmax(f1_scores)]
threshold_probs[key] = thresholds
threshold_probs_best[key] = thresholds[np.argmax(f1_scores)]
Next, we will calculate a few metrics not dependent on the threshold probability, as a “summary” of overall model performance.
-
Average precision is also known as the PRAUC: The area under the precision-recall curve. It takes a value between 0 and 1, higher being better. We’ll explain and discuss this metric later with the precision-recall curve plot.
-
Brier score is a probability scoring metric that compares predicted probabilities with actual class labels. Similar to log loss, a lower value is better. Unlike log loss, it takes a value between 0 and 1.
-
We’ll use the class weighted version as we did with our training loss metrics. The scores will be misleading without class weights due to the imbalance, as they will be dominated by the scores for the majority class.
-
It’s always a good idea to assess the probability predictions themselves, rather than just class predictions. They may be more important and useful than class predictions in many cases (weather forecasts, risk assessments, medical tests).
- Intuitively, it also makes sense to aim for the best probability predictions regardless of the threshold probability, as they are the raw output of our models. This is why we trained & validated our models with loss metrics.
-
-
The Brier skill score compares the Brier score of a classifier with another baseline Brier score. This time, a higher score means a bigger improvement over the baseline. We’ll calculate this with our dummy classifier as the baseline.
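As a small standalone sketch of these two scores (with made-up values): the Brier score is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes, and the Brier skill score expresses it relative to a baseline.
# Standalone sketch: Brier score & Brier skill score
import numpy as np
from sklearn.metrics import brier_score_loss

y_true_demo = np.array([1, 0, 1, 0])
y_prob_demo = np.array([0.9, 0.1, 0.4, 0.3])

brier = np.mean((y_prob_demo - y_true_demo) ** 2)
print(brier) # 0.1175, computed manually
print(brier_score_loss(y_true_demo, y_prob_demo)) # Same value from scikit-learn

# Brier skill score vs. a hypothetical baseline that always predicts 0.5
brier_ref = brier_score_loss(y_true_demo, np.repeat(0.5, 4))
print(1 - brier / brier_ref) # 0.53, higher is better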
Show static metric calculations
# Retrieve AP & Brier scores (weighted) for each model
scores_avg_precision = {}
scores_brier = {}
for key in preds_dict.keys():
# Retrieve average precision scores
avg_precision = average_precision_score(y_test, preds_dict[key])
scores_avg_precision[key] = avg_precision
# Retrieve Brier scores
brier_score = brier_score_loss(
y_test, preds_dict[key], sample_weight = sample_weight_test)
scores_brier[key] = brier_score
# Retrieve Brier skill scores for each model, with dummy classifier as reference
scores_brier_skill = {}
for key in preds_dict.keys():
brier_skill = 1 - (scores_brier[key] / scores_brier["Dummy"])
scores_brier_skill[key] = brier_skill
Now, let’s start reviewing our metrics. First, we start with a summary table.
Show code to get summary table
# Retrieve dataframe of scores
df_scores = pd.DataFrame(
{
"Avg. precision (PRAUC)": scores_avg_precision.values(),
"Brier score (class weighted)": scores_brier.values(),
"Brier skill score (class weighted)": scores_brier_skill.values(),
"Best F1 score": scores_f1_best.values(),
"Precision at best F1": scores_precision_best.values(),
"Recall at best F1": scores_recall_best.values(),
"Threshold prob. at best F1": threshold_probs_best.values()
}, index = preds_dict.keys()
)
print(df_scores)
| Model | Avg. precision (PRAUC) | Brier score (class weighted) | Brier skill score (class weighted) | Best F1 score | Precision at best F1 | Recall at best F1 | Threshold prob. at best F1 |
|---|---|---|---|---|---|---|---|
| Dummy | 0.1226 | 0.3925 | 0.0000 | 0.2184 | 0.1226 | 1.0000 | 0.1226 |
| Logistic | 0.4166 | 0.1961 | 0.5004 | 0.4060 | 0.3877 | 0.4261 | 0.5926 |
| SVM | 0.3445 | 0.2098 | 0.4654 | 0.3694 | 0.3469 | 0.3950 | 0.6205 |
| XGBoost | 0.4704 | 0.1914 | 0.5123 | 0.4318 | 0.4443 | 0.4199 | 0.6399 |
| Neural net | 0.4376 | 0.2807 | 0.2848 | 0.4094 | 0.4124 | 0.4063 | 0.2233 |
All models are a large improvement over the baseline classifier. XGBoost seems to perform the best overall, in almost all summary metrics.
- The NN model follows XGBoost in average precision, and has the second highest F1 score at the "optimal" threshold, but also the worst Brier score.
  - We calculated the Brier scores by fully balancing the contributions of each class with class weights. All models except the NN were trained in the same fashion, but the NN was trained with less weight on the positive class.
  - This results in a "worse" weighted Brier score for the NN, though it would likely have the best unweighted Brier score. Which score matters more for assessing performance depends on the use case.
  - We also see the "optimal" threshold is much lower for the NN model than for the class weighted models. We'll discuss this in more detail with the predicted probabilities plot.
- The logistic regression model comes next overall, and beats the other models in recall at the optimal threshold probability. It's likely a very efficient alternative to the XGBoost and NN models.
- The SVM model performs considerably worse than the other models, which could have several reasons: the kernel approximation we performed may not be well suited to this feature set, and the approximated probability predictions may not be very reliable. Still, it's likely an effective model on its own.
We can visualize the tradeoff between precision and recall with the precision-recall curve (PRC) plot.
Show code to plot PRC curves
# Plot precision-recall curves
fig, ax = plt.subplots()
for key in preds_dict.keys():
_ = PrecisionRecallDisplay.from_predictions(y_test, preds_dict[key], name = key, ax = ax)
_ = plt.title("Precision-recall curves of classifiers")
_ = plt.legend(loc = "upper right")
plt.show()
plt.close("all")
In the plot above, each curve starts at a threshold probability of 1 on the left, and ends at a threshold probability of 0 on the right. As a consequence, precision drops and recall increases as we move to the right.
- A very skilled classifier would lose very little precision as it increases recall: its curve would stay close to the top right corner.
- Calculating the area under the precision-recall curve (PRAUC) summarizes this tradeoff, and the overall model performance across all thresholds. This is equivalent to the average precision metric we calculated previously.
- The XGBoost model has the best PRC curve at pretty much any threshold probability. The NN model outperforms the logistic model at lower thresholds, while the logistic model outperforms the NN model at some higher threshold values.
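As a quick check of this equivalence, the average precision can be recomputed directly from the precision-recall pairs as a step-wise sum. A small sketch for the XGBoost model, using the objects defined above:
# Average precision as the step-wise area under the PR curve; this should
# closely match the average_precision_score value computed earlier
prec_xgb, rec_xgb, _ = precision_recall_curve(y_test, preds_dict["XGBoost"])
prauc_stepwise = -np.sum(np.diff(rec_xgb) * prec_xgb[:-1])
print(prauc_stepwise, scores_avg_precision["XGBoost"])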
Another way to look at the tradeoff between precision and recall is to plot both, along with the F1 scores, across various threshold probabilities.
Show code to get F1 plots dataframes
# Get dataframes for F1 score - threshold prob plots
# Logistic
df_f1_logistic = pd.DataFrame(
{"F1 score": scores_f1["Logistic"][:-1], # N. scores = N. thresholds + 1
"Precision": scores_precision["Logistic"][:-1],
"Recall": scores_recall["Logistic"][:-1],
"Threshold prob.": threshold_probs["Logistic"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# SVM
df_f1_svm = pd.DataFrame(
{"F1 score": scores_f1["SVM"][:-1],
"Precision": scores_precision["SVM"][:-1],
"Recall": scores_recall["SVM"][:-1],
"Threshold prob.": threshold_probs["SVM"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# XGBoost
df_f1_xgb = pd.DataFrame(
{"F1 score": scores_f1["XGBoost"][:-1],
"Precision": scores_precision["XGBoost"][:-1],
"Recall": scores_recall["XGBoost"][:-1],
"Threshold prob.": threshold_probs["XGBoost"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# NN
df_f1_nn = pd.DataFrame(
{"F1 score": scores_f1["Neural net"][:-1],
"Precision": scores_precision["Neural net"][:-1],
"Recall": scores_recall["Neural net"][:-1],
"Threshold prob.": threshold_probs["Neural net"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
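The four blocks above are nearly identical; the same long dataframes could also be built in a loop keyed by model name. A sketch of this alternative (the plotting code below keeps using the explicit dataframes):
# Equivalent, more compact construction of the same long dataframes
df_f1_dict = {}
for key in ["Logistic", "SVM", "XGBoost", "Neural net"]:
    df_f1_dict[key] = pd.DataFrame({
        "F1 score": scores_f1[key][:-1],  # N. scores = N. thresholds + 1
        "Precision": scores_precision[key][:-1],
        "Recall": scores_recall[key][:-1],
        "Threshold prob.": threshold_probs[key]
    }).melt(
        value_vars = ["F1 score", "Precision", "Recall"],
        var_name = "Metric", value_name = "Score", id_vars = "Threshold prob.")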
Show code to plot F1 scores
# Plot F1 score - threshold prob. plots
fig, ax = plt.subplots(4, sharex = True, sharey = True)
_ = fig.suptitle("F1 - precision - recall scores across threshold probabilities")
# Logistic
_ = sns.lineplot(
ax = ax[0],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_logistic, legend = False)
_ = ax[0].set_title("Logistic")
# SVM
_ = sns.lineplot(
ax = ax[1],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_svm, legend = False)
_ = ax[1].set_title("SVM")
# XGBoost
_ = sns.lineplot(
ax = ax[2],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_xgb, legend = False)
_ = ax[2].set_title("XGBoost")
# NN
_ = sns.lineplot(
ax = ax[3],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_nn, legend = True)
_ = ax[3].set_title("Neural net")
plt.show()
plt.close("all")
We see a major difference between the class weighted models, and the NN model trained with focal loss.
- The class weighted models reach their maximum F1 scores around threshold probabilities of 0.6 - 0.65. Recall declines and precision increases mainly after a threshold of 0.3.
- In contrast, the NN model reaches its maximum F1 score around a threshold probability of 0.22. Recall sharply declines and precision sharply increases after a threshold of roughly 0.1.
- This suggests the class weighted models assign much higher positive class probabilities to their predictions compared to the NN model. We'll see this clearly in the next plot.
- In a real-life application, a plot like the one above can be used to see how each score changes with the threshold, and to choose a threshold depending on which score matters most for the problem at hand; a small example of such a threshold selection is sketched below.
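For instance, we could pick the threshold that maximizes recall while keeping precision at or above a required level. A minimal sketch for the XGBoost model, using the dictionaries computed earlier (the 0.5 precision floor is an arbitrary example):
# Sketch: choose the threshold that maximizes recall subject to a minimum
# precision requirement
min_precision = 0.5
prec = scores_precision["XGBoost"][:-1]  # drop the final point, which has no threshold
rec = scores_recall["XGBoost"][:-1]
thresh = threshold_probs["XGBoost"]
candidate_recalls = np.where(prec >= min_precision, rec, -np.inf)
best_idx = np.argmax(candidate_recalls)
print(thresh[best_idx], prec[best_idx], rec[best_idx])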
Let’s look at the distributions of positive class probabilities predicted by our models to better see the difference between the class weighted models and the focal loss NN model.
Show code to get predicted prob. dataframes
# Get dataframes for predicted probability plots
# Logistic
df_preds_logistic = pd.DataFrame({
"Prob. predictions": preds_dict["Logistic"],
"Actual labels": y_test,
})
# SVM
df_preds_svm = pd.DataFrame({
"Prob. predictions": preds_dict["SVM"],
"Actual labels": y_test,
})
# XGBoost
df_preds_xgb = pd.DataFrame({
"Prob. predictions": preds_dict["XGBoost"],
"Actual labels": y_test,
})
# NN
df_preds_nn = pd.DataFrame({
"Prob. predictions": preds_dict["Neural net"],
"Actual labels": y_test,
})
Show code to plot predicted prob. distributions
# Plot predicted probability distributions of classifiers
fig, ax = plt.subplots(4, sharex = True, sharey = True)
_ = fig.suptitle("Distributions of positive class probability predictions")
# Logistic
_ = sns.histplot(
ax = ax[0],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_logistic,
legend = False)
_ = ax[0].set_title("Logistic")
_ = ax[0].set_ylabel("")
# SVM
_ = sns.histplot(
ax = ax[1],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_svm,
legend = False)
_ = ax[1].set_title("SVM")
_ = ax[1].set_ylabel("")
# XGBoost
_ = sns.histplot(
ax = ax[2],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_xgb,
legend = False)
_ = ax[2].set_title("XGBoost")
_ = ax[2].set_ylabel("")
# NN
_ = sns.histplot(
ax = ax[3],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_nn,
legend = True)
_ = ax[3].set_title("Neural net")
_ = ax[3].set_xlabel("Probability predictions for positive class")
_ = ax[3].set_ylabel("N. of times predicted")
plt.show()
plt.close("all")
In the plot above, we have a stacked histogram of the predicted positive class probabilities, colored by their actual target values.
- A very skilled classifier would separate the probability predictions of the two classes as much as possible:
  - The predicted probabilities for negative cases (blue bins) would be mostly gathered to the left, close to 0.
  - The predicted probabilities for positive cases (orange bins) would be mostly gathered to the right, close to 1.
- The class weighted models predict relatively higher positive class probabilities for all cases, including the negatives.
  - The probabilities for the negative cases are close to normally distributed, roughly centered around 0.35 - 0.4.
    - The approximated probabilities for the SVM model are not smoothly distributed because of the isotonic calibration we used. Using the sigmoid method results in similar, normal-shaped distributions, just like the logistic and XGBoost predictions.
  - The probabilities for the positive cases are generally higher, but more uniformly distributed. Many are above 0.8, but many others fall between 0.2 - 0.6 and overlap with the negative cases.
  - This suggests the models can easily classify the most extreme positive cases, but have trouble telling the less obvious positives apart from the negatives.
  - Because they were trained with class weights, these models are "tuned" to predict relatively high positive probabilities overall. This is why their "optimal" threshold probabilities are much higher.
- The probability predictions of the NN model are distributed quite differently: the vast majority of cases have very low positive class probabilities, mostly below 0.2.
  - The probabilities for the negative cases are right skewed, with a median of roughly 0.1.
  - The positive cases are spread fairly uniformly, but with a considerable number of observations at the right tail: "outlier" positive cases with predicted probabilities well above 0.8. These are likely the most extreme and easily identified positive cases.
  - This explains why the "optimal" threshold probability is much lower: the NN model, trained with focal loss, puts less weight on the positive cases and predicts lower positive probabilities across the board, compared to the class weighted models.
- Both types of probability predictions likely have their pros and cons, and the best approach depends on the use case.
  - The probabilities predicted by the class weighted models could be considered "unrealistic", as they are tuned to be relatively high across the board.
    - For this problem, it's unlikely that most cars have a moderately high risk of being kicks. On the other hand, the XGBoost model clearly performs the best overall in terms of classification performance.
  - In contrast, the probabilities predicted by the focal loss NN model could be considered more realistic, as they are low for the vast majority of cars, and much higher for a small number of them.
    - Intuitively, we'd expect most cars to carry a small risk of being kicks, with a few high-risk ones. The NN model's probability predictions also seem to separate the riskiest cars from the rest to a higher degree.

The classification models output probability predictions of class membership, and it's up to us to decide how to use these probabilities for our real-life problem. We'll illustrate this with a simple hypothetical business case.
In our scenario, we evaluate which cars to purchase from our testing data, based on their predicted probabilities of being kicks.
- We'll assume we make a profit from purchasing good cars, and incur a loss from purchasing kicks.
- We'll aim to maximize our equity $E$ by changing two decision variables: the number of cars we intend to purchase, $N$, and the threshold probability $t$ for classifying a car as a kick.
  - We'll sort the cars in ascending order of their predicted kick probabilities $p$.
  - We'll take the top $N$ cars, purchase each one if $p < t$, and pass on it if $p \ge t$.
- If we purchase a good car, we'll assume the car was worth what we paid, and we'll earn 10% of the purchase price $C$ as profit, i.e. a return on our investment from using it.
  - The net change in our equity will be $-C + C + 0.1C = 0.1C$.
- If we purchase a kick, we'll assume the car is useless, and we can only recover 20% of the purchase price as salvage value.
  - The net change in our equity will be $-C + 0.2C = -0.8C$.

With these assumptions, we can represent our equity value with the following expression:

$E = \sum_{i=1}^{N} (0.1 \, C_i - 0.9 \, C_i \, k_i) \, d_i$, where

- $k_i = 0$ for good cars, $k_i = 1$ for kicks,
- $d_i = 1$ if $p_i < t$, and $d_i = 0$ if $p_i \ge t$.
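As a quick sanity check of this formula with made-up numbers: one good car and one kick, both priced at $10,000 and both purchased, would change our equity by $1,000 - $8,000 = -$7,000. The same result in code (hypothetical toy values):
# Toy check of the equity formula: one good car and one kick, both priced at
# 10,000 and both below the threshold (purchased)
prices = np.array([10_000, 10_000])
kicks = np.array([0, 1])
decisions = np.array([1, 1])
print(np.sum((0.1 * prices - 0.9 * prices * kicks) * decisions))  # -7000.0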
I originally wanted to treat this as a linear programming problem, and solve for the optimal number of cars to buy and the optimal threshold probability using each model's predicted probabilities. After some research, I realized this is more complex than I expected, as the problem involves conditional constraints.
Instead, we'll perform a sensitivity analysis, calculating the equity at various combinations of our two decision variables for each model. This will still give us quasi-optimal solutions, which shouldn't be too far from the true optima.
The code for the necessary calculations is available below, explained by the code comments. We'll try threshold probabilities from 0 to 0.99 in steps of 0.01, and numbers of cars to buy from 0 up to the full test set of 14k+ cars, in steps of 100. We'll calculate an equity value for each combination, using each model's predictions. With well over 10k calculations per model, this takes several minutes on my machine; a faster, vectorized alternative is sketched after the function definition below.
Show code to get sensitivity analysis input dataframes
# Retrieve prob. predictions, target labels, purchase prices in dataframes, and sort in ascending order of predicted kick probability
# Dummy
df_dummy = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Dummy"]
})
df_dummy = df_dummy.sort_values("ProbKick", ascending = True)
# Logistic
df_logistic = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Logistic"]
})
df_logistic = df_logistic.sort_values("ProbKick", ascending = True)
# SVM
df_svm = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["SVM"]
})
df_svm = df_svm.sort_values("ProbKick", ascending = True)
# XGBoost
df_xgb = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["XGBoost"]
})
df_xgb = df_xgb.sort_values("ProbKick", ascending = True)
# Neural net
df_nn = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Neural net"]
})
df_nn = df_nn.sort_values("ProbKick", ascending = True)
Show code to define profit-loss calculation function
# Define function to calculate profit-loss, given threshold probability, number of cars to purchase and probability predictions
def calc_profit(threshold, num_purchases, df_probs):
# Retrieve arrays of prices, labels, predicted probs
price = df_probs["Price"].values
kick = df_probs["Kick"].values
prob = df_probs["ProbKick"].values
# Retrieve n. of cars to purchase
n = num_purchases
# Get vector of purchase decisions for the top N cars at this threshold prob.
decision = (prob[0:n] < threshold).astype(int)
# Calculate profit/loss for each car
profit = [
((0.1 * price[i]) - (0.9 * price[i] * kick[i])) * decision[i] for i in range(n)
]
# Return n. of cars actually purchased, total profit / loss
return sum(decision), sum(profit)
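As noted earlier, evaluating the full grid with this function is slow. One possible vectorized alternative (a sketch, not the implementation used in this report) exploits the fact that the dataframes are sorted by predicted probability: the cars purchased among the top $N$ at threshold $t$ are simply the first $\min(N, n_t)$ rows, where $n_t$ is the number of cars with predicted probability below $t$, so cumulative sums give the profit directly.
# Sketch of a faster, vectorized grid evaluation (hypothetical helper, assuming
# the same sorted dataframes as above)
def calc_profit_grid(thresholds, num_purchases, df_probs):
    price = df_probs["Price"].values
    kick = df_probs["Kick"].values
    prob = df_probs["ProbKick"].values  # assumed sorted in ascending order
    # Per-car equity change if purchased: +10% of price for good cars, -80% for kicks
    per_car = 0.1 * price - 0.9 * price * kick
    cum_profit = np.concatenate([[0.0], np.cumsum(per_car)])
    rows = []
    for t in thresholds:
        # N. of cars with predicted probability strictly below this threshold
        n_below = np.searchsorted(prob, t, side = "left")
        for n in num_purchases:
            bought = min(n, n_below)
            rows.append((t, bought, cum_profit[bought]))
    return pd.DataFrame(rows, columns = ["Threshold", "Purchases", "Profits"])
This should reproduce the same purchase counts and profit values as looping over calc_profit, while avoiding recomputing the per-car profits for every combination.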
Show code to calculate profit-loss values
# Get combinations of thresholds - n. of cars to purchase
thresholds = np.arange(0, 1, 0.01) # Threshold probabilities from 0 to 0.99, in steps of 0.01
num_buys = np.arange(0, len(y_test), 100) # Candidate numbers of cars to purchase, in steps of 100
thresholds_buys = list(product(thresholds, num_buys))
# Calculate n. of cars actually bought and profit at each threshold / n. purchases combination, for each model
# Dummy
output_dummy = [calc_profit(x, y, df_dummy) for x, y in thresholds_buys]
decisions_dummy = [x[0] for x in output_dummy]
profits_dummy = [x[1] for x in output_dummy]
# Logistic
output_logistic = [calc_profit(x, y, df_logistic) for x, y in thresholds_buys]
decisions_logistic = [x[0] for x in output_logistic]
profits_logistic = [x[1] for x in output_logistic]
# SVM
output_svm = [calc_profit(x, y, df_svm) for x, y in thresholds_buys]
decisions_svm = [x[0] for x in output_svm]
profits_svm = [x[1] for x in output_svm]
# XGBoost
output_xgb = [calc_profit(x, y, df_xgb) for x, y in thresholds_buys]
decisions_xgb = [x[0] for x in output_xgb]
profits_xgb = [x[1] for x in output_xgb]
# NN
output_nn = [calc_profit(x, y, df_nn) for x, y in thresholds_buys]
decisions_nn = [x[0] for x in output_nn]
profits_nn = [x[1] for x in output_nn]
# Make long dataframes of threshold-purchases-profit values for each model
# Dummy
df_long_dummy = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_dummy,
"Profits": profits_dummy,
"Model": "Dummy"
})
# Logistic
df_long_logistic = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_logistic,
"Profits": profits_logistic,
"Model": "Logistic"
})
# SVM
df_long_svm = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_svm,
"Profits": profits_svm,
"Model": "SVM"
})
# XGBoost
df_long_xgb = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_xgb,
"Profits": profits_xgb,
"Model": "XGBoost"
})
# Neural net
df_long_nn = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_nn,
"Profits": profits_nn,
"Model": "Neural net"
})
# Combine long dataframes into one
df_long = pd.concat(
[df_long_dummy, df_long_logistic, df_long_svm, df_long_xgb, df_long_nn])
# Drop rows with duplicates of purchase-profit-model columns (cases where a
# probability t results in buying n number of cars, and a higher t doesn't result
# in more cars bought, or a different profit value)
df_long = df_long.drop_duplicates(["Purchases", "Profits", "Model"])
We’ll plot and compare the results of our sensitivity analysis for each model.
Show code to plot sensitivity analysis
# 2D lineplots of thresholds-purchases-profits
fig, ax = plt.subplots(3)
_ = fig.suptitle("Sensitivity analysis of classifier threshold probability, number of cars purchased and profit")
_ = sns.lineplot(
ax = ax[0],
data = df_long, x = "Threshold", y = "Profits", hue = "Model")
_ = ax[0].set_xlabel("Threshold probability")
_ = ax[0].set_ylabel("Profits, $mil")
_ = ax[0].legend(
title = "Model", bbox_to_anchor=(1.05, 1.0), fontsize="small", loc='best')
_ = sns.lineplot(
ax = ax[1],
data = df_long, x = "Threshold", y = "Purchases", hue = "Model", legend = False)
_ = ax[1].set_xlabel("Threshold probability (lowest value that results in N. cars purchased)")
_ = ax[1].set_ylabel("N. cars purchased")
_ = sns.lineplot(
ax = ax[2],
data = df_long, x = "Purchases", y = "Profits", hue = "Model", legend = False)
_ = ax[2].set_xlabel("N. cars purchased")
_ = ax[2].set_ylabel("Profits, $mil")
plt.show()
plt.close("all")
We see very similar, smooth curves for the class weighted models.
- The highest profits are reached around a threshold probability of 0.45, with 9k-10k cars purchased out of more than 14k.
- Profits increase steadily as we raise the threshold up to this point, and decline steadily afterwards.

In contrast, the curves for the NN model, trained with focal loss, look quite different.
- Profits rise sharply to their highest values around a threshold probability of 0.15, also with 9k-10k cars purchased. They then decline, plateau for a bit, and decline again.
- I believe the shape of the NN model's profit-threshold curve suggests it tells the best and worst cars apart more easily than the class weighted models do.

The most profitable model overall is XGBoost, though the NN and logistic regression come close. SVM trails behind considerably.
- The NN and logistic models may reach slightly higher profits at some sub-optimal numbers of cars purchased, so they could still be preferred to XGBoost if the number of purchases were limited.
Below, we view the quasi-optimal solutions for each model as a table.
Show code to retrieve quasi-optimal solutions
# Quasi-optimal combinations of threshold prob - n. purchases
optimal_dummy = df_long_dummy.loc[np.argmax(df_long_dummy["Profits"])]
optimal_logistic = df_long_logistic.loc[np.argmax(df_long_logistic["Profits"])]
optimal_svm = df_long_svm.loc[np.argmax(df_long_svm["Profits"])]
optimal_xgb = df_long_xgb.loc[np.argmax(df_long_xgb["Profits"])]
optimal_nn = df_long_nn.loc[np.argmax(df_long_nn["Profits"])]
# Put solutions in a dataframe
df_optimal = pd.concat([
optimal_dummy, optimal_logistic, optimal_svm, optimal_xgb, optimal_nn],
axis = 1).transpose()
# Scale profits to millions
df_optimal["Profits"] = df_optimal["Profits"] / 1e6
# Rename columns
df_optimal = df_optimal.rename({
"Threshold": "Optimal threshold prob.",
"Purchases": "N. cars purchased",
"Profits": "Profits, $mil"
}, axis = 1)
# Reorder columns
df_optimal = df_optimal[[
"Model", "Optimal threshold prob.", "N. cars purchased", "Profits, $mil"]]
print(df_optimal)
Model Optimal threshold prob. N. cars purchased Profits, $mil
1888 Dummy 0.1300 300 0.0246
6917 Logistic 0.4700 10200 3.2618
7060 SVM 0.4800 10000 2.7546
6468 XGBoost 0.4400 8800 3.3290
2269 Neural net 0.1500 9400 3.2445
We tested several classifiers, from a simple logistic regression to a neural network. We went over a typical hyperparameter tuning & performance evaluation workflow, with classification metrics suited to the problem at hand. XGBoost performed a bit better than the NN model overall, while logistic regression held up very well given its efficiency and simplicity. SVM trailed behind, but still performed decently for a highly imbalanced problem, even though we didn't pay much attention to the kernel choice. Training & tuning the logistic & SVM models with stochastic gradient descent saved a lot of time on a large dataset.
We applied class weights to the training loss functions to try and handle the class imbalance. We also tried focal loss, which was specifically designed for heavily imbalanced problems. It’s hard to say which approach is “better”, as we tried them across different algorithms. It’s clear they result in very different probability predictions, so that should be taken into consideration for real-life applications. It would be interesting to implement focal loss as a custom scikit-learn metric, and repeat this experiment by training the other models with focal loss.
Finally, my main goal with this project was to practice building & tuning a neural network model directly using PyTorch and Lightning, instead of using predefined architectures or wrappers. However, the final part of the analysis sparked my interest in linear programming and optimization problems. I plan to explore this topic further, as it seems complementary to machine learning methods.
Any comments or suggestions are welcome.