Ahmet Zamanis
- Introduction
- Setup
- Data cleaning
- Feature engineering
- Preprocessing pipeline
- Modeling & hyperparameter optimization
- Model testing
- Sensitivity analysis
- Conclusions
In this report, we’ll build and test binary classification algorithms on a dataset of used cars up for sale. Our goal is to predict whether a car is a “kick”: a car that is heavily damaged or unusable, and should not be purchased. The dataset was sourced from OpenML; it was apparently used for a past Kaggle competition, and possibly for a Chalearn competition.
We will test logistic regression, support vector machine and XGBoost classifiers with the scikit-learn package. As the dataset is quite large, we’ll train our logistic regression & SVM models with Stochastic Gradient Descent. We’ll also build a deep neural network model with PyTorch Lightning. We’ll perform hyperparameter tuning for all models with the Optuna framework. We’ll compare the performances of all models, both with classification performance metrics, and with a sensitivity analysis based on a hypothetical business scenario.
Our target variable is highly imbalanced, as positive observations are strongly in the minority. Training classifiers and evaluating their performance can be tricky in such cases. To try and address the imbalance, we’ll train our scikit-learn models with class weights, use focal loss as our loss function for the neural network, and use suitable metrics & plots to correctly assess performance.
Show packages
# Data handling
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Categorical encoding
from feature_engine.encoding import OneHotEncoder
from feature_engine.creation import CyclicalFeatures
from category_encoders.target_encoder import TargetEncoder
# Preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
# Modeling
from sklearn.utils.class_weight import compute_class_weight
from sklearn.kernel_approximation import RBFSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier
# Neural network modeling
import torch, torchvision, torchmetrics
import lightning.pytorch as pl
from optuna.integration.pytorch_lightning import PyTorchLightningPruningCallback
# Hyperparameter tuning
import optuna
# Loss functions & performance metrics
from sklearn.metrics import log_loss
from sklearn.metrics import hinge_loss
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.metrics import PrecisionRecallDisplay, precision_recall_curve
# Utilities
import warnings
from itertools import product
Show settings
# Set printing options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
pl.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
We load our dataset, which is in .arff format, and carry out some data cleaning operations. Most of the operations are explained in the code comments, so I’ll keep the text to a minimum in this section.
# Load raw data
raw_data = loadarff("./RawData/kick.arff")
# Convert to pandas dataframe
df = pd.DataFrame(raw_data[0])
# Print first & last 5 rows, all columns
print(df)
IsBadBuy PurchDate Auction VehYear VehicleAge Make
0 b'0' 1260144000.0000 b'ADESA' 2006.0000 3.0000 b'MAZDA' \
1 b'0' 1260144000.0000 b'ADESA' 2004.0000 5.0000 b'DODGE'
2 b'0' 1260144000.0000 b'ADESA' 2005.0000 4.0000 b'DODGE'
3 b'0' 1260144000.0000 b'ADESA' 2004.0000 5.0000 b'DODGE'
4 b'0' 1260144000.0000 b'ADESA' 2005.0000 4.0000 b'FORD'
... ... ... ... ... ... ...
72978 b'1' 1259712000.0000 b'ADESA' 2001.0000 8.0000 b'MERCURY'
72979 b'0' 1259712000.0000 b'ADESA' 2007.0000 2.0000 b'CHEVROLET'
72980 b'0' 1259712000.0000 b'ADESA' 2005.0000 4.0000 b'JEEP'
72981 b'0' 1259712000.0000 b'ADESA' 2006.0000 3.0000 b'CHEVROLET'
72982 b'0' 1259712000.0000 b'ADESA' 2006.0000 3.0000 b'MAZDA'
Model Trim SubModel Color
0 b'MAZDA3' b'i' b'4D SEDAN I' b'RED' \
1 b'1500 RAM PICKUP 2WD' b'ST' b'QUAD CAB 4.7L SLT' b'WHITE'
2 b'STRATUS V6' b'SXT' b'4D SEDAN SXT FFV' b'MAROON'
3 b'NEON' b'SXT' b'4D SEDAN' b'SILVER'
4 b'FOCUS' b'ZX3' b'2D COUPE ZX3' b'SILVER'
... ... ... ... ...
72978 b'SABLE' b'GS' b'4D SEDAN GS' b'BLACK'
72979 b'MALIBU 4C' b'LS' b'4D SEDAN LS' b'SILVER'
72980 b'GRAND CHEROKEE 2WD V' b'Lar' b'4D WAGON LAREDO' b'SILVER'
72981 b'IMPALA' b'LS' b'4D SEDAN LS' b'WHITE'
72982 b'MAZDA6' b's' b'4D SEDAN S' b'SILVER'
Transmission WheelTypeID WheelType VehOdo Nationality
0 b'AUTO' b'1' b'Alloy' 89046.0000 b'OTHER ASIAN' \
1 b'AUTO' b'1' b'Alloy' 93593.0000 b'AMERICAN'
2 b'AUTO' b'2' b'Covers' 73807.0000 b'AMERICAN'
3 b'AUTO' b'1' b'Alloy' 65617.0000 b'AMERICAN'
4 b'MANUAL' b'2' b'Covers' 69367.0000 b'AMERICAN'
... ... ... ... ... ...
72978 b'AUTO' b'1' b'Alloy' 45234.0000 b'AMERICAN'
72979 b'AUTO' b'?' b'?' 71759.0000 b'AMERICAN'
72980 b'AUTO' b'1' b'Alloy' 88500.0000 b'AMERICAN'
72981 b'AUTO' b'1' b'Alloy' 79554.0000 b'AMERICAN'
72982 b'AUTO' b'1' b'Alloy' 66855.0000 b'OTHER ASIAN'
Size TopThreeAmericanName MMRAcquisitionAuctionAveragePrice
0 b'MEDIUM' b'OTHER' 8155.0000 \
1 b'LARGE TRUCK' b'CHRYSLER' 6854.0000
2 b'MEDIUM' b'CHRYSLER' 3202.0000
3 b'COMPACT' b'CHRYSLER' 1893.0000
4 b'COMPACT' b'FORD' 3913.0000
... ... ... ...
72978 b'MEDIUM' b'FORD' 1996.0000
72979 b'MEDIUM' b'GM' 6418.0000
72980 b'MEDIUM SUV' b'CHRYSLER' 8545.0000
72981 b'LARGE' b'GM' 6420.0000
72982 b'MEDIUM' b'OTHER' 7535.0000
MMRAcquisitionAuctionCleanPrice MMRAcquisitionRetailAveragePrice
0 9829.0000 11636.0000 \
1 8383.0000 10897.0000
2 4760.0000 6943.0000
3 2675.0000 4658.0000
4 5054.0000 7723.0000
... ... ...
72978 2993.0000 2656.0000
72979 7325.0000 7431.0000
72980 9959.0000 9729.0000
72981 7604.0000 7434.0000
72982 8771.0000 8638.0000
MMRAcquisitonRetailCleanPrice MMRCurrentAuctionAveragePrice
0 13600.0000 7451.0000 \
1 12572.0000 7456.0000
2 8457.0000 4035.0000
3 5690.0000 1844.0000
4 8707.0000 3247.0000
... ... ...
72978 3732.0000 2190.0000
72979 8411.0000 6785.0000
72980 11256.0000 8375.0000
72981 8712.0000 6590.0000
72982 9973.0000 7730.0000
MMRCurrentAuctionCleanPrice MMRCurrentRetailAveragePrice
0 8552.0000 11597.0000 \
1 9222.0000 11374.0000
2 5557.0000 7146.0000
3 2646.0000 4375.0000
4 4384.0000 6739.0000
... ... ...
72978 3055.0000 4836.0000
72979 8132.0000 10151.0000
72980 9802.0000 11831.0000
72981 7684.0000 10099.0000
72982 9102.0000 11954.0000
MMRCurrentRetailCleanPrice PRIMEUNIT AUCGUART BYRNO VNZIP1
0 12409.0000 b'?' b'?' b'21973' b'33619' \
1 12791.0000 b'?' b'?' b'19638' b'33619'
2 8702.0000 b'?' b'?' b'19638' b'33619'
3 5518.0000 b'?' b'?' b'19638' b'33619'
4 7911.0000 b'?' b'?' b'19638' b'33619'
... ... ... ... ... ...
72978 5937.0000 b'?' b'?' b'18111' b'30212'
72979 11652.0000 b'?' b'?' b'18881' b'30212'
72980 14402.0000 b'?' b'?' b'18111' b'30212'
72981 11228.0000 b'?' b'?' b'18881' b'30212'
72982 13246.0000 b'?' b'?' b'18111' b'30212'
VNST VehBCost IsOnlineSale WarrantyCost
0 b'FL' 7100.0000 b'0' 1113.0000
1 b'FL' 7600.0000 b'0' 1053.0000
2 b'FL' 4900.0000 b'0' 1389.0000
3 b'FL' 4100.0000 b'0' 630.0000
4 b'FL' 4000.0000 b'0' 1020.0000
... ... ... ... ...
72978 b'GA' 4200.0000 b'0' 993.0000
72979 b'GA' 6200.0000 b'0' 1038.0000
72980 b'GA' 8200.0000 b'0' 1893.0000
72981 b'GA' 7000.0000 b'0' 1974.0000
72982 b'GA' 8000.0000 b'0' 1313.0000
[72983 rows x 33 columns]
The categorical variables in the dataset have the bytes datatype, and need to be converted to strings. Also, some missing values are recorded as “?”, which will be replaced with proper missing values.
# Convert object columns from bytes to string datatype
object_cols = df.select_dtypes(["object"]).columns
for column in object_cols:
df[column] = df[column].apply(lambda x: x.decode("utf-8"))
del column
# Replace "?" string values with NAs
for column in object_cols:
df.loc[df[column] == "?", column] = np.nan
del column
We have missing values in several columns. Two columns, PRIMEUNIT and AUCGUART, are missing to a very high degree. We’ll inspect and handle each column one by one.
# Print n. of missing values in each column
pd.isnull(df).sum()
IsBadBuy 0
PurchDate 0
Auction 0
VehYear 0
VehicleAge 0
Make 0
Model 0
Trim 2360
SubModel 8
Color 8
Transmission 9
WheelTypeID 3169
WheelType 3174
VehOdo 0
Nationality 5
Size 5
TopThreeAmericanName 5
MMRAcquisitionAuctionAveragePrice 18
MMRAcquisitionAuctionCleanPrice 18
MMRAcquisitionRetailAveragePrice 18
MMRAcquisitonRetailCleanPrice 18
MMRCurrentAuctionAveragePrice 315
MMRCurrentAuctionCleanPrice 315
MMRCurrentRetailAveragePrice 315
MMRCurrentRetailCleanPrice 315
PRIMEUNIT 69564
AUCGUART 69564
BYRNO 0
VNZIP1 0
VNST 0
VehBCost 68
IsOnlineSale 0
WarrantyCost 0
dtype: int64
The target variable is highly imbalanced.
print("Target class (im)balance: ")
df["IsBadBuy"].value_counts(normalize = True)
Target class (im)balance:
IsBadBuy
0 0.8770
1 0.1230
Name: proportion, dtype: float64
# Purchase date is in UNIX timestamp format. Convert to datetime
df["PurchDate"]
df["PurchDate"] = pd.to_datetime(df["PurchDate"], unit = "s")
# 3 unique auctioneers. ADESA, MANHEIM and other. Mostly MANHEIM.
df["Auction"].value_counts(normalize = True)
Auction
MANHEIM 0.5624
OTHER 0.2398
ADESA 0.1978
Name: proportion, dtype: float64
# Vehicle years range from 2001 to 2010
print(
df[["VehYear", "VehicleAge"]].describe()
)
# Vehicle age almost always equals purchase year - vehicle year.
purch_year = df["PurchDate"].dt.year
veh_year = df["VehYear"]
print("\nCases where purchase year - vehicle year = vehicle age: " +
str(((purch_year - veh_year) == df["VehicleAge"]).sum())
)
VehYear VehicleAge
count 72983.0000 72983.0000
mean 2005.3431 4.1766
std 1.7313 1.7122
min 2001.0000 0.0000
25% 2004.0000 3.0000
50% 2005.0000 4.0000
75% 2007.0000 5.0000
max 2010.0000 9.0000
Cases where purchase year - vehicle year = vehicle age: 72976
The “Make” column records the car’s brand, and has many categorical levels (high cardinality), though there are columns with many more.
- Some of these brands are sub-brands of others. For example, Scion is a sub-brand of Toyota, and Infiniti is a sub-brand of Nissan.
- We could recode these to reduce cardinality, but most of these sub-brands offer cars of higher segments compared to the main brands (mainly in the US), so keeping them separate could be more useful.
# There are brands with very few observations.
print(
df["Make"].value_counts()
)
# Recode TOYOTA SCION into SCION
df.loc[df["Make"] == "TOYOTA SCION", "Make"] = "SCION"
Make
CHEVROLET 17248
DODGE 12912
FORD 11305
CHRYSLER 8844
PONTIAC 4258
KIA 2484
SATURN 2163
NISSAN 2085
HYUNDAI 1811
JEEP 1644
SUZUKI 1328
TOYOTA 1144
MITSUBISHI 1030
MAZDA 979
MERCURY 913
BUICK 720
GMC 649
HONDA 497
OLDSMOBILE 243
VOLKSWAGEN 134
ISUZU 134
SCION 129
LINCOLN 97
INFINITI 42
VOLVO 37
CADILLAC 33
ACURA 33
LEXUS 31
SUBARU 28
MINI 24
PLYMOUTH 2
TOYOTA SCION 1
HUMMER 1
Name: count, dtype: int64
The “Model” column records the car model. There are more than 1000 models, so if we want to use this column as a feature, we’ll likely use target encoding. More on this in the feature engineering & preprocessing sections.
# There are 1063 unique models, many with only 1 observation.
df["Model"].value_counts()
Model
PT CRUISER 2329
IMPALA 1990
TAURUS 1425
CALIBER 1375
CARAVAN GRAND FWD V6 1289
...
ESCAPE 1
PRIZM 1.8L I-4 SFI D 1
CAMRY V6 3.0L / 3.3L 1
LHS 1
PATRIOT 2WD 4C 2.0L 1
Name: count, Length: 1063, dtype: int64
Trim refers to slightly different versions / packages of a car model. These values are meaningless, and possibly misleading, without the associated model value.
# 134 trims, many with only 1 observation. 2360 missing values.
df["Trim"].value_counts()
Trim
Bas 13950
LS 10174
SE 9348
SXT 3825
LT 3540
...
Har 1
LL 1
JLX 1
JLS 1
L 3 1
Name: count, Length: 134, dtype: int64
We also have submodels, which are likewise likely meaningless or misleading without the main model value. This column offers information such as the car’s drivetrain, engine type or chassis type.
# 863 submodels, many with only 1 observation. 8 missing values.
df["SubModel"].value_counts()
SubModel
4D SEDAN 15236
4D SEDAN LS 4718
4D SEDAN SE 3859
4D WAGON 2230
MINIVAN 3.3L 1258
...
EXT CAB 5.4L FX2 SPORT 1
2D COUPE ZX5 S 1
4D WAGON OUTBACK 1
EXT CAB 6.0L SLE 1
MINIVAN 3.3L FFV EL 1
Name: count, Length: 863, dtype: int64
# 2784 unique model & submodel combinations, many with only 1 observation.
(df["Model"] + " " + df["SubModel"]).value_counts()
PT CRUISER 4D SEDAN 1894
IMPALA 4D SEDAN LS 1300
CALIBER 4D WAGON 1097
PT CRUISER 2.4L I4 S 4D SEDAN 957
STRATUS V6 2.7L V6 M 4D SEDAN SXT FFV 770
...
F150 PICKUP 2WD V8 5 REG CAB 5.4L W/T 1
VIBE 4D WAGON 2.4L 1
XL-7 4WD 2.7L V6 MPI 4D SPORT UTILITY 1
F150 PICKUP 4WD V8 CREW CAB 5.4L KING RANCH 1
PATRIOT 2WD 4C 2.0L 4D SUV SPORT 1
Name: count, Length: 2784, dtype: int64
# 3000+ unique cars in dataset (some could be different namings / spellings of the same car)
(df["Model"] + " " + df["Trim"] + " " + df["SubModel"]).value_counts()
PT CRUISER Bas 4D SEDAN 1355
IMPALA LS 4D SEDAN LS 1300
CALIBER SE 4D WAGON 939
STRATUS V6 2.7L V6 M SXT 4D SEDAN SXT FFV 770
TAURUS SE 4D SEDAN SE 761
...
CIVIC VP 2D COUPE DX VALUE PACKAGE AUTO 1
2500 RAM PICKUP 2WD Lar QUAD CAB 5.7L LARAMIE 1
DURANGO 4WD V8 Lim 4D SUV 4.7L FFV LIMITED 1
LEGACY L 4D WAGON L 1
PATRIOT 2WD 4C 2.0L Spo 4D SUV SPORT 1
Name: count, Length: 3136, dtype: int64
# 8 missing values for color, recode them as NOT AVAIL
print(df["Color"].value_counts())
df.loc[pd.isna(df["Color"]), "Color"] = "NOT AVAIL"
Color
SILVER 14875
WHITE 12123
BLUE 10347
GREY 7887
BLACK 7627
RED 6257
GOLD 5231
GREEN 3194
MAROON 2046
BEIGE 1584
BROWN 436
ORANGE 415
PURPLE 373
YELLOW 244
OTHER 242
NOT AVAIL 94
Name: count, dtype: int64
Some of the missing values in our data can be manually worked out from the car model, or other information. Transmission is one such column.
# 9 missing values for transmission. Try to work them out from car model. 1 occurence of manual spelled differently.
print(df["Transmission"].value_counts())
# Replace "Manual" with MANUAL in transmission
df.loc[df["Transmission"] == "Manual", "Transmission"] = "MANUAL"
Transmission
AUTO 70398
MANUAL 2575
Manual 1
Name: count, dtype: int64
We’ll print the model information for cars with missing Transmission values. We can then fill these in with manual search (or personal interest & knowledge in my case). We’ll apply this method to several other columns with missing values.
# Work out & impute transmission NAs from car model
print("Rows with Transmission values missing: ")
print(df.loc[pd.isna(df["Transmission"]), ["VehYear", "Make", "Model", "Trim", "SubModel"]])
transmission_nas = ["AUTO", "MANUAL", "MANUAL", "MANUAL", "AUTO", "MANUAL", "MANUAL", "AUTO", "AUTO"]
df.loc[pd.isna(df["Transmission"]), "Transmission"] = transmission_nas
Rows with Transmission values missing:
VehYear Make Model Trim SubModel
15906 2005.0000 MERCURY MONTEGO 3.0L V6 EFI Bas 4D SEDAN LUXURY
24567 2006.0000 DODGE STRATUS V6 2.7L V6 M SXT NaN
24578 2006.0000 DODGE STRATUS V6 2.7L V6 M SXT NaN
70432 2001.0000 CHEVROLET S10 PICKUP 2WD 4C 2. Bas NaN
70434 2004.0000 CHEVROLET IMPALA 3.4L V6 SFI Bas NaN
70437 2004.0000 PONTIAC GRAND AM V6 3.4L V6 SE NaN
70445 2002.0000 CHEVROLET CAVALIER 4C 2.2L I4 Bas NaN
70446 2002.0000 MERCURY MOUNTAINEER 2WD V8 4 NaN NaN
70450 2006.0000 FORD FREESTAR FWD V6 3.9L SE NaN
There are two columns that record essentially the same information: WheelType and WheelTypeID. We’ll keep only one, and recode the missing values simply as “Other”.
# 3169 missing values in WheelTypeID, 3174 in WheelType. Crosscheck these columns.
print(df["WheelTypeID"].value_counts())
print("\n")
print(df["WheelType"].value_counts())
WheelTypeID
1 36050
2 33004
3 755
0 5
Name: count, dtype: int64
WheelType
Alloy 36050
Covers 33004
Special 755
Name: count, dtype: int64
print("N. of WheelTypeID NAs that are also WheelType NAs: " +
str(pd.isnull(df.loc[pd.isnull(df["WheelTypeID"]), "WheelType"]).sum())
)
print("\nRemaining 5 rows with WheelType NAs are WheelTypeID = 0: \n" +
str(df.loc[df["WheelTypeID"] == "0", "WheelType"])
)
N. of WheelTypeID NAs that are also WheelType NAs: 3169
Remaining 5 rows with WheelType NAs are WheelTypeID = 0:
2992 NaN
3926 NaN
42640 NaN
47731 NaN
69331 NaN
Name: WheelType, dtype: object
print("Cases where WheelTypeID 1 = WheelType Alloy: " +
str((df.loc[df["WheelTypeID"] == "1", "WheelType"] == "Alloy").sum())
)
print("Cases where WheelTypeID 2 = WheelType Covers: " +
str((df.loc[df["WheelTypeID"] == "2", "WheelType"] == "Covers").sum())
)
print("Cases where WheelTypeID 3 = WheelType Special: " +
str((df.loc[df["WheelTypeID"] == "3", "WheelType"] == "Special").sum())
)
Cases where WheelTypeID 1 = WheelType Alloy: 36050
Cases where WheelTypeID 2 = WheelType Covers: 33004
Cases where WheelTypeID 3 = WheelType Special: 755
# Recode WheelType NAs as Other, drop WheelTypeID column
df.loc[pd.isnull(df["WheelType"]), "WheelType"] = "Other"
df = df.drop("WheelTypeID", axis = 1)
# 5 missing values in Nationality. Work these out from car make.
df["Nationality"].value_counts()
Nationality
AMERICAN 61028
OTHER ASIAN 8033
TOP LINE ASIAN 3722
OTHER 195
Name: count, dtype: int64
# Work out the 5 missing Nationality values from make
print("Makes of rows with missing Nationality values: ")
print(df.loc[pd.isnull(df["Nationality"]), "Make"])
nationality_nas = ["AMERICAN", "AMERICAN", "OTHER ASIAN", "AMERICAN", "AMERICAN"]
df.loc[pd.isnull(df["Nationality"]), "Nationality"] = nationality_nas
Makes of rows with missing Nationality values:
10888 GMC
25169 DODGE
37986 HYUNDAI
69948 JEEP
69958 JEEP
Name: Make, dtype: object
# 5 missing values in size. Work these out from the model.
df["Size"].value_counts()
Size
MEDIUM 30785
LARGE 8850
MEDIUM SUV 8090
COMPACT 7205
VAN 5854
LARGE TRUCK 3170
SMALL SUV 2276
SPECIALTY 1915
CROSSOVER 1759
LARGE SUV 1433
SMALL TRUCK 864
SPORTS 777
Name: count, dtype: int64
# Work out the 5 missing Size values from make & model
print("Rows with Size values missing: ")
print(df.loc[pd.isnull(df["Size"]), ["VehYear", "Make", "Model"]])
print("\nSize values of other rows with the same models: ")
print(
df.loc[df["Model"].str.contains("SIERRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("NITRO 4WD"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("ELANTRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("PATRIOT 2WD"), "Size"].iloc[0]
)
size_nas = ["LARGE TRUCK", "MEDIUM SUV", "MEDIUM", "SMALL SUV", "SMALL SUV"]
df.loc[pd.isnull(df["Size"]), "Size"] = size_nas
Rows with Size values missing:
VehYear Make Model
10888 2002.0000 GMC SIERRA 1500
25169 2008.0000 DODGE NITRO 4WD V6
37986 2006.0000 HYUNDAI ELANTRA 2.0L I4 MPI
69948 2008.0000 JEEP PATRIOT 2WD 4C
69958 2008.0000 JEEP PATRIOT 2WD 4C
Size values of other rows with the same models:
LARGE TRUCK
MEDIUM SUV
MEDIUM
SMALL SUV
We have a column recording whether a car’s make is GM, Chrysler, Ford, or another brand. I doubt we need this, as this information is already built into the Make column.
# Unnecessary column, information already incorporated in Make. Drop it
print(df["TopThreeAmericanName"].value_counts())
df = df.drop("TopThreeAmericanName", axis = 1)
TopThreeAmericanName
GM 25314
CHRYSLER 23399
FORD 12315
OTHER 11950
Name: count, dtype: int64
There are several columns that record the MMR values of the cars. MMR stands for Manheim Market Report, a valuation offered by car auction company Manheim.
- The MMR columns titled “Acquisition” record the MMR valuations of the car at the time of purchase. There are several different valuations available, such as auction & retail prices, each for average & clean condition cars respectively.
- The MMR columns titled “Current” are the car’s current MMR valuation prices. For the sake of realism, we’ll assume we’re performing this analysis before purchasing the cars, so we’ll drop the Current MMR columns.
print("Missing values in MMR columns:")
print(pd.isnull(
df[['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']]
).sum()
)
# Drop current MMR prices to make the exercise more realistic
df = df.drop([
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice'], axis = 1)
Missing values in MMR columns:
MMRAcquisitionAuctionAveragePrice 18
MMRAcquisitionAuctionCleanPrice 18
MMRAcquisitionRetailAveragePrice 18
MMRAcquisitonRetailCleanPrice 18
MMRCurrentAuctionAveragePrice 315
MMRCurrentAuctionCleanPrice 315
MMRCurrentRetailAveragePrice 315
MMRCurrentRetailCleanPrice 315
dtype: int64
# Some MMR values are zero
print("N. of rows with at least one MMR value of 0: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 0) |
(df["MMRAcquisitionAuctionCleanPrice"] == 0) |
(df["MMRAcquisitionRetailAveragePrice"] == 0) |
(df["MMRAcquisitonRetailCleanPrice"] == 0)
]
)
)
)
# Some MMR values are one
print("N. of rows with at least one MMR value of 1: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 1) |
(df["MMRAcquisitionAuctionCleanPrice"] == 1) |
(df["MMRAcquisitionRetailAveragePrice"] == 1) |
(df["MMRAcquisitonRetailCleanPrice"] == 1)
]
)
)
)
# Drop rows with NAs in MMR
df = df.dropna(subset = [
'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice'])
# Drop rows with 0s in MMR
df = df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] > 0) &
(df["MMRAcquisitionAuctionCleanPrice"] > 0) &
(df["MMRAcquisitionRetailAveragePrice"] > 0) &
(df["MMRAcquisitonRetailCleanPrice"] > 0)].copy()
N. of rows with at least one MMR value of 0: 828
N. of rows with at least one MMR value of 1: 131
The PRIMEUNIT and AUCGUART columns mostly consist of missing values. But for rows with known values, they could be important predictors.
- PRIMEUNIT records whether a car attracted unusually high demand. “YES” values are very rare even among non-missing rows. The missing values could simply mean “NO”.
- AUCGUART records the level of inspection passed by a car before being auctioned. “GREEN” means fully inspected, “YELLOW” means partially inspected, “RED” means “you buy what you see”. There are no “YELLOW” values in the data, which means the missing values could simply mean “YELLOW”.
- For both columns, we’ll recode missing values as “UNKNOWN”. Later we’ll have options to handle these based on the categorical encoding method we use.
# 95% missing column
print(df["PRIMEUNIT"].value_counts(normalize = True))
# Fill NAs in PRIMEUNIT with UNKNOWN.
df.loc[pd.isnull(df["PRIMEUNIT"]), "PRIMEUNIT"] = "UNKNOWN"
PRIMEUNIT
NO 0.9817
YES 0.0183
Name: proportion, dtype: float64
# 95% missing column
print(df["AUCGUART"].value_counts(normalize = True))
# Fill NAs in AUCGUART with UNKNOWN.
df.loc[pd.isnull(df["AUCGUART"]), "AUCGUART"] = "UNKNOWN"
AUCGUART
GREEN 0.9770
RED 0.0230
Name: proportion, dtype: float64
# BYRNO is buyer no. 74 unique buyers, some with only 1 observation.
df["BYRNO"].value_counts()
BYRNO
99761 3896
18880 3553
835 2951
3453 2888
22916 2795
...
99741 1
1157 1
3582 1
10425 1
1086 1
Name: count, Length: 74, dtype: int64
# VNZIP1 is zipcode of purchase location, 153 locations, some with only 1 obs.
df["VNZIP1"].value_counts()
VNZIP1
32824 3676
27542 3368
75236 2410
74135 2286
80022 2097
...
76101 1
85248 1
85260 1
80112 1
25071 1
Name: count, Length: 153, dtype: int64
# VNST is purchase state.
df["VNST"].value_counts()
VNST
TX 13482
FL 10318
CA 7009
NC 6971
AZ 6102
CO 4938
SC 4232
OK 3552
GA 2425
TN 1737
VA 1652
MD 1124
UT 870
PA 806
OH 793
MO 752
AL 684
NV 542
MS 488
IN 486
IA 482
IL 454
LA 347
NJ 317
WV 291
NM 237
KY 228
OR 208
ID 187
WA 135
NH 97
AR 68
MN 62
NE 26
MA 15
MI 14
NY 6
Name: count, dtype: int64
Vehicle purchase price is likely an important predictor, which is why we’ll drop the few rows with missing purchase prices. There is one car with a purchase price of 1, which may mean it was a symbolic transaction. We’ll keep this row for now.
# VehBCost is purchase price. 68 missing values, drop these rows. One car has a purchase price of 1.
print(df["VehBCost"].describe())
df = df.dropna(subset = "VehBCost")
count 72069.0000
mean 6727.6698
std 1766.7283
min 1.0000
25% 5430.0000
50% 6700.0000
75% 7900.0000
max 45469.0000
Name: VehBCost, dtype: float64
The warranty cost pays for 36 months’ worth of warranty, or coverage until the car racks up 36k miles. Likely another important predictor, especially relative to the purchase price.
# Warranty cost is for 36 months, or until 36k miles
df["WarrantyCost"].describe()
count 72069.0000
mean 1276.4975
std 597.7557
min 462.0000
25% 837.0000
50% 1169.0000
75% 1623.0000
max 7498.0000
Name: WarrantyCost, dtype: float64
We can create some new features from our existing ones. We can also explicitly map some interaction terms, to make them easier to discover for some models.
# Time features from date: Purchase year, month, day of week
df["PurchaseYear"] = df["PurchDate"].dt.year
df["PurchaseMonth"] = df["PurchDate"].dt.month
df["PurchaseDay"] = df["PurchDate"].dt.weekday
df = df.drop("PurchDate", axis = 1)
The model and submodel names incorporate plenty of technical information on the cars. We’ll create numerous binary columns that flag various traits of a car: Engine type, drivetrain type, chassis type and number of doors. I also thought about retrieving engine displacement in liters as a numeric feature, but this information is not openly stated for many cars, and would introduce many missing values.
# Engine type features from Model: V6, V8, I4/I-4, 4C, 6C
df["EngineV6"] = df["Model"].str.contains("V6").astype(int)
df["EngineV8"] = df["Model"].str.contains("V8").astype(int)
df["EngineI4"] = df["Model"].str.contains("I4|I-4", regex = True).astype(int)
df["Engine4C"] = df["Model"].str.contains("4C").astype(int)
df["Engine6C"] = df["Model"].str.contains("6C").astype(int)
# Drivetrain type features from Model: 2WD, 4WD, AWD, FWD, RWD
df["2WD"] = df["Model"].str.contains("2WD").astype(int)
df["4WD"] = df["Model"].str.contains("4WD").astype(int)
df["AWD"] = df["Model"].str.contains("AWD").astype(int)
df["FWD"] = df["Model"].str.contains("FWD").astype(int)
df["RWD"] = df["Model"].str.contains("RWD").astype(int)
# Chassis type features from SubModel: WAGON, SEDAN, COUPE, HATCHBACK, CONVERTIBLE
# Work out and recode these features manually for rows where SubModel is missing
print("Rows where SubModel is missing:")
print(df.loc[pd.isnull(df["SubModel"]), "Model"])
df["ChassisWagon"] = df["SubModel"].str.contains("WAGON")
df.loc[pd.isnull(df["ChassisWagon"]), "ChassisWagon"] = [
False, False, False, False, False, False, False, True]
df["ChassisWagon"] = df["ChassisWagon"].astype(int)
df["ChassisSedan"] = df["SubModel"].str.contains("SEDAN")
df.loc[pd.isnull(df["ChassisSedan"]), "ChassisSedan"] = [
True, True, False, True, True, True, False, False]
df["ChassisSedan"] = df["ChassisSedan"].astype(int)
df["ChassisCoupe"] = df["SubModel"].str.contains("COUPE")
df.loc[pd.isnull(df["ChassisCoupe"]), "ChassisCoupe"] = False
df["ChassisCoupe"] = df["ChassisCoupe"].astype(int)
df["ChassisHatch"] = df["SubModel"].str.contains("HATCHBACK")
df.loc[pd.isnull(df["ChassisHatch"]), "ChassisHatch"] = False
df["ChassisHatch"] = df["ChassisHatch"].astype(int)
df["ChassisConvertible"] = df["SubModel"].str.contains("CONVERTIBLE")
df.loc[pd.isnull(df["ChassisConvertible"]), "ChassisConvertible"] = False
df["ChassisConvertible"] = df["ChassisConvertible"].astype(int)
Rows where SubModel is missing:
24567 STRATUS V6 2.7L V6 M
24578 STRATUS V6 2.7L V6 M
70432 S10 PICKUP 2WD 4C 2.
70434 IMPALA 3.4L V6 SFI
70437 GRAND AM V6 3.4L V6
70445 CAVALIER 4C 2.2L I4
70446 MOUNTAINEER 2WD V8 4
70450 FREESTAR FWD V6 3.9L
Name: Model, dtype: object
# Door type feature from SubModel: 4D
df["FourDoors"] = df["SubModel"].str.contains("4D")
# Work out and recode this feature manually for rows where SubModel is missing (displayed in previous cell)
df.loc[pd.isnull(df["FourDoors"]), "FourDoors"] = [
True, True, False, True, True, True, False, False]
df["FourDoors"] = df["FourDoors"].astype(int)
I decided to combine the Model & SubModel columns to have one “car” feature. This column will have very high cardinality, but we’ll use target encoding, and models with this feature performed slightly better than models without it. I did not include Trim in this feature, but it’s possible to give it a try.
# Recode SubModel NAs into empty string
df.loc[pd.isnull(df["SubModel"]), "SubModel"] = ""
# Combine Model & SubModel as one feature
df["ModelSubModel"] = df["Model"] + " " + df["SubModel"]
# Drop trim, submodel
df = df.drop(["Trim", "SubModel"], axis = 1)
Some obvious interaction features we can add are miles traveled per year, the premium / discount paid on the MMR prices, and the warranty cost to purchase price ratio.
# Miles per year feature
df["MilesPerYear"] = df["VehOdo"] / df["VehicleAge"]
df.loc[df["MilesPerYear"] == np.inf, "MilesPerYear"] = df["VehOdo"] # Replace inf values raised when vehicle age = 0
# Premiums / discounts paid on MMR prices: VehBCost - MMR price / MMR price
df["PremiumAuctionAvg"] = (df["VehBCost"] - df["MMRAcquisitionAuctionAveragePrice"]) / df["MMRAcquisitionAuctionAveragePrice"]
df["PremiumAuctionClean"] = (df["VehBCost"] - df["MMRAcquisitionAuctionCleanPrice"]) / df["MMRAcquisitionAuctionCleanPrice"]
df["PremiumRetailAvg"] = (df["VehBCost"] - df["MMRAcquisitionRetailAveragePrice"]) / df["MMRAcquisitionRetailAveragePrice"]
df["PremiumRetailClean"] = (df["VehBCost"] - df["MMRAcquisitonRetailCleanPrice"]) / df["MMRAcquisitonRetailCleanPrice"]
The row with a purchase price of 1 results in an astronomically high value for the warranty cost / purchase price ratio, so we drop it after all. We have plenty of data, and it’s likely better to keep it accurate.
# Warranty ratio to purchase price
df["WarrantyRatio"] = df["WarrantyCost"] / df["VehBCost"]
# The observation with purchase price = 1 skews the WarrantyRatio feature greatly. Drop it.
print(df[["VehBCost", "WarrantyRatio"]].sort_values(by = "WarrantyRatio", ascending = False).iloc[0])
df = df.loc[df["VehBCost"] != 1].copy()
VehBCost 1.0000
WarrantyRatio 1590.0000
Name: 20442, dtype: float64
Most machine learning algorithms require all features to be numeric. We’ll use one-hot encoding for most of our categorical columns.
- This can be done before splitting training & testing data, as it won’t leak information from the testing set. However, we can’t use this approach in future prediction sets if they have new categorical levels unseen in our training data.
- Adding too many columns may cause our models to suffer from the curse of dimensionality, which affects different algorithms to different extents. We’ll one-hot encode only the columns with relatively fewer levels, and leave the high-cardinality columns to target encoding.
- A categorical column with N levels can be encoded with N-1 binary columns, as the Nth value will be reflected by all columns taking the value of 0. We’ll take advantage of this to keep our column count low.
- We’ll also create binary columns only for known values of PRIMEUNIT and AUCGUART, effectively encoding the missing values as 0 for all binary columns.
# One hot encode some categoricals in-place
encoder_onehot = OneHotEncoder(
drop_last = True, # Create N-1 binary columns to encode N categorical levels
drop_last_binary = True,
variables = ['Auction', 'VehYear', 'Color', 'Transmission', 'WheelType',
'Nationality', 'Size', 'PurchaseYear'],
ignore_format = True)
df = encoder_onehot.fit_transform(df)
# One hot encode PRIMEUNIT and AUCGUART only using known values (effectively making "unknown" the default / 0 case)
df["PRIMEUNIT_YES"] = (df["PRIMEUNIT"] == "YES").astype(int)
df["PRIMEUNIT_NO"] = (df["PRIMEUNIT"] == "NO").astype(int)
df["AUCGUART_GREEN"] = (df["AUCGUART"] == "GREEN").astype(int)
df["AUCGUART_RED"] = (df["AUCGUART"] == "RED").astype(int)
df = df.drop(["PRIMEUNIT", "AUCGUART"], axis = 1)
Month and day of week are cyclical features: The “distance” between month 1 and month 12 is not eleven, it is one. Simply label encoding the months as 1-12, or the weekdays as 1-7, would mislead our models.
- Instead, we’ll use cyclical encoding, which encodes each feature using a sine and cosine pair, conveying their cyclical nature to our algorithms (see the sketch after this list).
- Another option is one-hot encoding, but it creates considerably more columns than cyclical encoding. It could also be advantageous to avoid “splitting” related information into many columns, to ensure they are evaluated together.
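The snippet below is a rough illustration of what cyclical encoding computes, assuming CyclicalFeatures’ default behavior of dividing each value by the variable’s maximum before applying sine and cosine; the exact scaling used by feature_engine may differ slightly.
# Illustration of cyclical encoding (not the library implementation):
# map each month to a point on the unit circle, so month 12 and month 1 end up close together
import numpy as np
months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / months.max())
month_cos = np.cos(2 * np.pi * months / months.max())
print(np.round(month_sin, 4))
print(np.round(month_cos, 4))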
# Cyclical encode month and day of week features
encoder_cyclical = CyclicalFeatures(
variables = ["PurchaseMonth", "PurchaseDay"], drop_original = True)
df = encoder_cyclical.fit_transform(df)
The remaining, high-cardinality categorical features will be target encoded, which won’t change the number of columns.
# View final feature set before preprocessing
print(df.head())
IsBadBuy VehicleAge Make Model VehOdo
0 0 3.0000 MAZDA MAZDA3 89046.0000 \
1 0 5.0000 DODGE 1500 RAM PICKUP 2WD 93593.0000
2 0 4.0000 DODGE STRATUS V6 73807.0000
3 0 5.0000 DODGE NEON 65617.0000
4 0 4.0000 FORD FOCUS 69367.0000
MMRAcquisitionAuctionAveragePrice MMRAcquisitionAuctionCleanPrice
0 8155.0000 9829.0000 \
1 6854.0000 8383.0000
2 3202.0000 4760.0000
3 1893.0000 2675.0000
4 3913.0000 5054.0000
MMRAcquisitionRetailAveragePrice MMRAcquisitonRetailCleanPrice BYRNO
0 11636.0000 13600.0000 21973 \
1 10897.0000 12572.0000 19638
2 6943.0000 8457.0000 19638
3 4658.0000 5690.0000 19638
4 7723.0000 8707.0000 19638
VNZIP1 VNST VehBCost IsOnlineSale WarrantyCost EngineV6 EngineV8
0 33619 FL 7100.0000 0 1113.0000 0 0 \
1 33619 FL 7600.0000 0 1053.0000 0 0
2 33619 FL 4900.0000 0 1389.0000 1 0
3 33619 FL 4100.0000 0 630.0000 0 0
4 33619 FL 4000.0000 0 1020.0000 0 0
EngineI4 Engine4C Engine6C 2WD 4WD AWD FWD RWD ChassisWagon
0 0 0 0 0 0 0 0 0 0 \
1 0 0 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
ChassisSedan ChassisCoupe ChassisHatch ChassisConvertible FourDoors
0 1 0 0 0 1 \
1 0 0 0 0 0
2 1 0 0 0 1
3 1 0 0 0 1
4 0 1 0 0 0
ModelSubModel MilesPerYear PremiumAuctionAvg
0 MAZDA3 4D SEDAN I 29682.0000 -0.1294 \
1 1500 RAM PICKUP 2WD QUAD CAB 4.7L SLT 18718.6000 0.1088
2 STRATUS V6 4D SEDAN SXT FFV 18451.7500 0.5303
3 NEON 4D SEDAN 13123.4000 1.1659
4 FOCUS 2D COUPE ZX3 17341.7500 0.0222
PremiumAuctionClean PremiumRetailAvg PremiumRetailClean WarrantyRatio
0 -0.2776 -0.3898 -0.4779 0.1568 \
1 -0.0934 -0.3026 -0.3955 0.1386
2 0.0294 -0.2943 -0.4206 0.2835
3 0.5327 -0.1198 -0.2794 0.1537
4 -0.2085 -0.4821 -0.5406 0.2550
Auction_ADESA Auction_OTHER VehYear_2006.0 VehYear_2004.0
0 1 0 1 0 \
1 1 0 0 1
2 1 0 0 0
3 1 0 0 1
4 1 0 0 0
VehYear_2005.0 VehYear_2007.0 VehYear_2001.0 VehYear_2003.0
0 0 0 0 0 \
1 0 0 0 0
2 1 0 0 0
3 0 0 0 0
4 1 0 0 0
VehYear_2002.0 VehYear_2008.0 VehYear_2009.0 Color_RED Color_WHITE
0 0 0 0 1 0 \
1 0 0 0 0 1
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Color_MAROON Color_SILVER Color_BLACK Color_GOLD Color_GREY
0 0 0 0 0 0 \
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
Color_BLUE Color_BEIGE Color_PURPLE Color_ORANGE Color_GREEN
0 0 0 0 0 0 \
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
Color_BROWN Color_YELLOW Color_NOT AVAIL Transmission_AUTO
0 0 0 0 1 \
1 0 0 0 1
2 0 0 0 1
3 0 0 0 1
4 0 0 0 0
WheelType_Alloy WheelType_Covers WheelType_Other
0 1 0 0 \
1 1 0 0
2 0 1 0
3 1 0 0
4 0 1 0
Nationality_OTHER ASIAN Nationality_AMERICAN Nationality_TOP LINE ASIAN
0 1 0 0 \
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
Size_MEDIUM Size_LARGE TRUCK Size_COMPACT Size_LARGE Size_VAN
0 1 0 0 0 0 \
1 0 1 0 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 1 0 0
Size_MEDIUM SUV Size_LARGE SUV Size_SPECIALTY Size_SPORTS
0 0 0 0 0 \
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
Size_CROSSOVER Size_SMALL SUV PurchaseYear_2009 PRIMEUNIT_YES
0 0 0 1 0 \
1 0 0 1 0
2 0 0 1 0
3 0 0 1 0
4 0 0 1 0
PRIMEUNIT_NO AUCGUART_GREEN AUCGUART_RED PurchaseMonth_sin
0 0 0 0 -0.0000 \
1 0 0 0 -0.0000
2 0 0 0 -0.0000
3 0 0 0 -0.0000
4 0 0 0 -0.0000
PurchaseMonth_cos PurchaseDay_sin PurchaseDay_cos
0 1.0000 0.0000 1.0000
1 1.0000 0.0000 1.0000
2 1.0000 0.0000 1.0000
3 1.0000 0.0000 1.0000
4 1.0000 0.0000 1.0000
The rest of our preprocessing will take place in a scikit-learn pipeline, to ensure we avoid any target leakage, and to keep our modeling workflow clean.
- First, we’ll set aside 20% of our data for testing. We’ll ensure the split is performed with stratification: We want to maintain the target class balance as much as possible.
- We’ll use the remaining 80% for hyperparameter tuning and validation. We’ll test the final models with the best hyperparameters on the testing data.
# Split target and features
y = df["IsBadBuy"].astype(int)
x = df.drop("IsBadBuy", axis = 1)
# Perform train-test split
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size = 0.2, random_state = 1923, stratify = y
)
We will use several target encoders to encode the remaining, high-cardinality categorical features.
- Target encoders derive a numeric value based on the target values for each categorical level. This brings the obvious risk of leaking the target values from the testing data if not applied properly.
- Another potential issue is overfitting: If a categorical level has very few observations, the value derived from the target will be unreliable.
  - The target encoder applies a form of regularization to alleviate this: The encoded values for infrequent categorical levels are blended with the values for all observations. The degree of this regularization can be controlled by the min_samples_leaf and smoothing parameters of TargetEncoder. We’ll use the default settings (see the sketch after this list).
  - The target encoder can also make use of a hierarchy of columns: Categorical levels of one column may be subsets of levels of another column. We have two such hierarchies in our data: The make-model-submodel hierarchy, and the states-zipcodes hierarchy.
  - We’ll map the target encoders for these columns with their respective higher hierarchies, so regularization can be applied by blending in the higher hierarchy values instead of values for all observations.
- Aside from these risks and considerations, target encoding brings several benefits:
  - It does not create any new columns, and it does not split the information from one feature into several columns.
  - It is also able to handle any missing values, or previously unseen categorical levels in prediction data, by applying the encoded values for all observations, or the values for the hierarchy, as done in regularization.
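As a rough sketch of the blending idea behind this regularization: the encoded value for a level is a weighted average of that level’s target mean and a prior (the overall or higher-hierarchy mean), with the weight growing as the level’s observation count increases. The function below mirrors a common smoothing scheme and borrows the min_samples_leaf and smoothing parameter names from TargetEncoder, but the exact formula used by category_encoders may differ between versions.
# Rough sketch of target encoding's smoothing / blending, not the library's exact implementation
import numpy as np

def smoothed_target_mean(level_mean, level_count, prior, min_samples_leaf = 20, smoothing = 10):
    # Weight approaches 1 as the categorical level gets more observations
    weight = 1 / (1 + np.exp(-(level_count - min_samples_leaf) / smoothing))
    # Blend the level's own target mean with the prior (overall or hierarchy mean)
    return weight * level_mean + (1 - weight) * prior

# A rare level stays close to the prior, a frequent level keeps its own mean
print(smoothed_target_mean(0.30, 5, 0.12))
print(smoothed_target_mean(0.30, 500, 0.12))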
# Target encoder: ModelSubModel encoded with Model, Make hierarchy
hier_submodels = pd.DataFrame(x[["Make", "Model"]]).rename({
"Make": "HIER_ModelSubModel_1", # Make as first (top) level of hierarchy
"Model": "HIER_ModelSubModel_2"}, # Model as second level of hierarchy
axis = 1)
encode_target_submodel = TargetEncoder(
cols = ["ModelSubModel"], # Column to encode (bottom level of hierarchy)
hierarchy = hier_submodels)
# Target encoder: Model encoded with Make hierarchy
hier_models = pd.DataFrame(x["Make"]).rename({"Make": "HIER_Model_1"}, axis = 1)
encode_target_model = TargetEncoder(cols = ["Model"], hierarchy = hier_models)
# Target encoder: Zipcodes encoded with states hierarchy
hier_states = pd.DataFrame(x["VNST"]).rename({"VNST": "HIER_VNZIP1_1"}, axis = 1)
encode_target_zip = TargetEncoder(cols = ["VNZIP1"], hierarchy = hier_states)
# Target encoder: Make, buyer, state encoded without hierarchy (apply last):
encode_target = TargetEncoder(cols = ["Make", "BYRNO", "VNST"])
Many machine learning algorithms can be sensitive to the value scales of features.
- Those with very large absolute values may dominate training, and prevent smaller-scale values from having much effect.
- All our models except the tree-based XGBoost algorithm can be sensitive to scale. Neural networks especially can suffer from various numerical stability issues in training.
- We’ll scale all features between 0 and 1 as our final preprocessing step (see the illustration after this list).
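For illustration, min-max scaling maps each feature x to (x - min(x)) / (max(x) - min(x)), with the minimum and maximum learned from the training data only (which the pipeline below takes care of). A quick sketch on a few VehOdo values from the data:
# Illustration of min-max scaling on a handful of odometer values
import numpy as np
odo = np.array([45234.0, 65617.0, 89046.0])
odo_scaled = (odo - odo.min()) / (odo.max() - odo.min())
print(odo_scaled)  # all values now lie in [0, 1]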
For hyperparameter validation, we will use 3-fold crossvalidation:
- The 80% training data will be split into 3 equal folds. Each hyperparameter configuration we evaluate will be trained on two folds and validated on the third, for a total of three different combinations. Stratification will be applied just like in the train-test split.
- We could have more robust validation scores with more folds, but the neural network tuning can be computationally expensive, and I wanted to compare each model on “equal terms”.
- An even more reliable tuning & testing scheme is nested crossvalidation: Splitting the full data into N training-testing folds (outer resampling), and splitting each training fold into K folds for parameter tuning (inner resampling). This is likely not necessary with such a big dataset.
- Since we’ll build a custom model validation function for each model, we’ll retrieve the train-test indices from the 3-fold CV splitter, for a total of three train-test index pairs.
# Scaling method
scaler_minmax = MinMaxScaler()
# Full preprocessing pipeline
pipe_process = Pipeline(steps = [
("target_encoder_zipcode", encode_target_zip),
("target_encoder_submodel", encode_target_submodel),
("target_encoder_model", encode_target_model),
("target_encoder", encode_target),
("minmax_scaler", scaler_minmax)
])
# Inner validation method for hyperparameter tuning
cv_kfold = StratifiedKFold(n_splits = 3)
# Get training / inner validation split indices (3 pairs) from k-fold splitter
cv_indices = list(cv_kfold.split(x_train, y_train))
The hyperparameter optimization will follow 3 main steps for every model:
- A validation function will be defined, to perform 3-fold crossvalidation with a given hyperparameter set.
- An Optuna objective function will be defined, which will suggest hyperparameter values from a predefined range, validate the parameter set (with our validation function), and report the average validation score of the parameter set.
  - One iteration of the objective function for one set of hyperparameters is called an Optuna trial.
- An Optuna study will be created and performed. The study performs a given number of trials and records their results.
  - The study uses a sampler algorithm to intelligently suggest parameter values for trials, and a pruner algorithm to stop unpromising trials early on and save time.
Since the process is very similar for all models, we’ll go over the full process only for the first model, logistic regression. For the other models, only the differences will be explained. Of course, the actual tuning process takes hours, so I’ll just display the code I used without running it and load previously saved results.
Logistic regression predicts the log-odds of an event occurring (also called logits) with a linear model. It’s a simple, robust method that can perform very well compared to more advanced models.
- The main assumption of logistic regression is a linear relationship between the predictors and the logits. The less linear the relationships are, the less accurate logistic regression will be.
- Logistic regression won’t find and model interaction effects unless they are explicitly mapped as new features (we did this for some features).
- Like plain linear regression, logistic regression can suffer from multicollinearity and the curse of dimensionality in the feature set, so regularization is almost always applied to the model coefficients to avoid overfitting (see the sketch after this list).
  - L1 (lasso) regularization can potentially shrink model coefficients to zero, effectively dropping predictors from the model and performing feature selection.
  - L2 (ridge) regularization can’t shrink coefficients to zero, but just reduces their contribution to the model.
  - Elastic net regularization is a blend of L1 and L2 regularization, controlled by a ratio parameter. This is what we will use.
- Logistic regression (implemented with LogisticRegression in scikit-learn) is normally fit on all training observations at once: The model coefficients are all estimated in one go using a solver algorithm. This calculation can be a bit slow with a large dataset like ours, especially since we’ll fit hundreds of models for hyperparameter tuning. To save time, we’ll train our logistic regression (and SVM) models with stochastic gradient descent:
  - In SGD, the model is trained on only one observation in one training step. The training error (loss) is calculated and the model parameters are updated at every training step. When the model has seen each training observation once, we complete a training epoch.
  - We perform multiple epochs of training. Training is usually early stopped when the validation score stops improving, which is usually assessed after each epoch.
  - This approach is similar to how neural networks are trained. A key difference is that NNs use minibatch SGD: A training step consists of a batch of observations, which can range from 16-32 to 1024-2048 observations (usually a power of 2). This is a faster approach than single-observation SGD. It’s possible to implement this manually in scikit-learn, but it likely won’t bring much improvement for logistic regression or SVM.
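For reference, the sketch below spells out the elastic net penalty that gets added to the training loss, following my reading of scikit-learn’s SGD parameterization: alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2). Treat the exact scaling as approximate.
# Sketch of the elastic net penalty added to the log loss (approximate parameterization)
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    l1 = np.sum(np.abs(w))     # L1 part: can push coefficients to exactly zero
    l2 = 0.5 * np.sum(w ** 2)  # L2 part: shrinks coefficients without zeroing them
    return alpha * (l1_ratio * l1 + (1 - l1_ratio) * l2)

# A stronger alpha or a higher l1_ratio penalizes the same coefficients more heavily
print(elastic_net_penalty(np.array([0.5, -0.2, 0.0]), alpha = 0.003, l1_ratio = 0.1))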
We can train linear classification models with SGD in scikit-learn using SGDClassifier. Below, we will define a validation function that takes a set of hyperparameters, builds a logistic regression model with SGDClassifier and validates the performance of the parameter set.
- In one instance of the function, a model will be created, trained & validated for each CV fold, for a total of three times.
- The model will be trained with the .partial_fit() method, which trains the SGDClassifier for only one epoch. We’ll call the method repeatedly to train the model for several epochs. Our function will handle early stopping manually by recording each epoch’s score. By default, 10 consecutive epochs with less than 0.0001 improvement in the validation error will terminate the model training.
  - Class weights will be applied to both training and validation error calculations: The errors for the minority class will be penalized more heavily, forcing our models to pay more attention to them.
  - The class weight for each class will be inversely proportional to its class frequency, calculated automatically by a scikit-learn utility (only using the training data to avoid leakage). Using these class weights, we will create a sample weights vector for every training & validation set observation, and pass these to the model training and scoring functions respectively.
- The function will return the average CV score of the parameter set, and the number of training epochs that achieved it. It will also perform pruning if directed by the Optuna pruner:
  - Unpromising trials (determined according to the validation scores of the first epochs) will be stopped early by the Optuna pruner. To allow pruning, the score of each epoch for the first CV fold will be reported to the Optuna study.
  - The best number of epochs is not a hyperparameter to be suggested & tuned by Optuna, but we will observe the value it takes during a trial, and report it back.
The SGDClassifier is a general method for training a linear model with SGD. Its parameters determine what type of linear model is trained:
- The loss parameter sets the training loss function. For probabilistic logistic regression, we will use log loss, which compares the probability predictions (between 0-1) to the actual target labels (0 or 1). We will also use log loss to score our predictions on the validation set.
- The penalty parameter sets the regularization type. We’ll use elastic net regularization.
- learning_rate is another key parameter in SGD training (especially for neural networks). It determines the magnitude of the coefficient updates in each training step.
  - With a higher LR, the coefficients will be updated more aggressively in each training step. The models will train faster, and may be more likely to find optimal solutions, but also more likely to overfit and suffer from erratic model updates.
  - With a lower LR, the coefficients will be updated more conservatively in each training step. The models will train slower, and may be less likely to reach optimal solutions, but will also be more robust against overfitting.
  - In neural networks, a learning rate scheduler is often used to dynamically adjust the learning rate. We’ll explore this in more detail for our NN model, but for SGDClassifier, we’ll use the default “optimal” heuristic based on the regularization strength (see the sketch after this list).
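For reference, the scikit-learn documentation describes the “optimal” schedule as eta = 1 / (alpha * (t + t0)), where t is the update step and t0 is chosen by a heuristic due to Léon Bottou; the sketch below only illustrates how stronger regularization implies smaller steps, with a placeholder t0.
# The "optimal" learning rate schedule (as described in the scikit-learn docs);
# t0 below is a placeholder value, the library chooses it with its own heuristic
def optimal_learning_rate(t, alpha, t0 = 50.0):
    return 1.0 / (alpha * (t + t0))

# A larger regularization strength alpha yields smaller learning rates
print(optimal_learning_rate(t = 100, alpha = 0.0005))
print(optimal_learning_rate(t = 100, alpha = 0.05))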
We have two hyperparameters to tune for the logistic regression model: alpha and l1_ratio.
- alpha is the regularization strength. A higher value means the model coefficients will be penalized more heavily to combat overfitting. Generally, we’d need a higher alpha value with more redundancy in the feature set. In practice, the optimal value is usually a small fraction, and too much regularization leads to the opposite problem of underfitting.
- l1_ratio will determine what fraction of our elastic net regularization is L1 regularization.
Show validation function definition
# Define model validation function for logistic regression trained with SGD
def validate_logistic(
alpha, # Regularization strength hyperparameter
l1_ratio, # Ratio of L1 regularization hyperparameter
trial, # Pass the Optuna trial of the parameter set
tol = 1e-4, # Min. required change in validation score to count as improvement
verbose = 0 # Set to 1 to print model training progress
):
# Record the validation scores of the parameter set on each CV fold
cv_scores = []
# Record best epoch number for each CV fold
best_epochs = []
# For each CV fold,
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Compute class weight & sample weight vectors
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0])
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Define Logistic Regression classifier with SGD
model_logistic = SGDClassifier(
loss = "log_loss", # Log loss metric for probabilistic logistic regression
penalty = "elasticnet", # Elastic net regularization: Blend of L1 and L2
learning_rate = "optimal", # Dynamically adjusted learning rate, based on a heuristic that uses regularization strength
random_state = 1923, # Set random seed for replicable results
verbose = verbose, # Whether to print training progress or not
n_jobs = -1, # Use all CPU cores to parallelize
alpha = alpha, # Hyperparameters to validate
l1_ratio = l1_ratio # Hyperparameters to validate
)
# Perform epoch by epoch training with early stopping & pruning
epoch_scores = [] # Record val. score of each epoch
n_iter_no_change = 0 # Record epochs with no improvement
tol = tol # Min. change in val. score to count as improvement
for epoch in range(1000):
# Train model for 1 epoch
model_logistic.partial_fit(
x_tr, y_tr, classes = classes, sample_weight = sample_weight)
# Predict validation set & score predictions
y_pred = model_logistic.predict_proba(x_val)
epoch_score = log_loss(y_val, y_pred, sample_weight = sample_weight_val)
# For first CV fold, report intermediate score of Optuna trial
if i == 0:
trial.report(epoch_score, epoch)
# Prune trial if necessary
if trial.should_prune():
raise optuna.TrialPruned()
# Count epochs with no improvement after first 10 epochs
if (epoch > 9) and (epoch_score > min(epoch_scores) - tol):
n_iter_no_change += 1
# Reset epochs with no improvement if an improvement took place
if (epoch > 9) and (epoch_score <= min(epoch_scores) - tol):
n_iter_no_change = 0
# Append epoch score to list of epoch scores
epoch_scores.append(epoch_score)
# Early stop training if necessary
if n_iter_no_change >= 10:
print("\nEarly stopping at epoch " + str(epoch) + "\n")
break
# Append best epoch score to list of CV scores
cv_scores.append(min(epoch_scores))
# Append best epoch number to list of best epochs
best_epochs.append(epoch_scores.index(min(epoch_scores)) + 1)
# Return the average CV score & median best epochs
return np.mean(cv_scores), np.median(best_epochs)
Next, we define the Optuna objective function for logistic regression. Here, we decide the search space for each hyperparameter we want to tune.
- We’ll tune alpha between a very small float value of 5e-5 and a relatively large value of 0.5, though the optimal value is likely close to 0. Therefore, we will instruct Optuna to sample from the log domain (see the sketch after this list):
  - The logarithm of the search space will be taken, values will be sampled from this space, and the exponent will be taken to return the original value. This will make lower values more likely to be suggested. We’ll also do this for other hyperparameters in other models that usually take an optimal value closer to 0.
- l1_ratio is constrained between 0 and 1, so we will use that as our search space. A value of 0 will result in pure L2 regularization, and a value of 1 will result in pure L1 regularization. Anything in between is elastic net regularization, a blend of the two.
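As a quick illustration of what log-domain sampling amounts to (independent of Optuna’s internals): draw uniformly in log space, then exponentiate, so small values are proposed far more often than a plain uniform draw would suggest.
# Illustration of log-uniform sampling over the alpha search space
import numpy as np
rng = np.random.default_rng(1923)
log_uniform_draws = np.exp(rng.uniform(np.log(5e-5), np.log(0.5), size = 5))
print(log_uniform_draws)  # draws concentrate towards the lower end of the [5e-5, 0.5] range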
# Define objective function for hyperparameter tuning with Optuna
def objective_logistic(trial):
# Define hyperparameter ranges to tune over, and suggest a value for
# each hyperparameter
alpha = trial.suggest_float(
"reg_strength", # Parameter name in trials result table
5e-5, # Minimum value to try
0.5, # Maximum value to try
log = True # Sampling from the log domain makes lower values more likely
)
l1_ratio = trial.suggest_float("l1_ratio", 0, 1)
# Validate the parameter set
score, epoch = validate_logistic(
alpha = alpha, l1_ratio = l1_ratio, trial = trial)
# Report best n. of epochs to Optuna
trial.set_user_attr("n_epochs", epoch)
return score
Now we create the Optuna study. This object will manage the tuning process and store the information from trials.
- We set a sampler algorithm: The sampler suggests hyperparameter values for trials intelligently, often based on the results of previous trials.
  - In our case, we use the suggested default, the Tree-structured Parzen Estimator algorithm (see the documentation for a list of papers related to TPE). In short, the TPE algorithm fits Gaussian Mixture Models to estimate the optimal parameters for each trial.
- We set a pruner algorithm: The pruner decides whether to complete a trial, or prune it based on its intermediate validation scores. This saves us a great deal of time by stopping unpromising trials quickly.
  - We use the Hyperband pruner algorithm, which is suggested for the TPE sampler. In short, Hyperband allocates a finite budget equally to each hyperparameter configuration to be compared. There is the option to increase the total budget, at the expense of eliminating candidate configurations more quickly (see the documentation and original paper for more detail).
- We name our study, and ensure we set the correct objective: Since a lower log loss value is better, we aim to minimize the reported metric.
# Create Optuna study
study_logistic = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923), # Sampling algorithm
pruner = optuna.pruners.HyperbandPruner(), # Pruning algorithm
study_name = "tune_logistic",
direction = "minimize" # Objective is to minimize the reported metric
)
We can finally run the hyperparameter optimization. It is recommended to run the TPE algorithm with 100 to 1000 trials. The models in the final edition of this report were tuned with 500 or 1000 trials.
- With pruning in place, even if improvement stops after a few trials, the remaining trials will mostly be pruned.
- For logistic regression, one study of 500 trials usually took around 10-15 minutes, with the SGDClassifier parallelized with n_jobs = -1 and 12 CPU cores available.
# Perform Optuna study
study_logistic.optimize(
objective_logistic,
n_trials = 500,
show_progress_bar = True)
The trials were exported to .csv for future use, and the best set of hyperparameters will be imported.
# Retrieve and export trials
trials_logistic = study_logistic.trials_dataframe().sort_values("value", ascending = True)
trials_logistic.to_csv("./ModifiedData/trials_logistic.csv", index = False)
# Import best trial
best_trial_logistic = pd.read_csv("./ModifiedData/trials_logistic.csv")
best_trial_logistic = best_trial_logistic.loc[
best_trial_logistic["state"] == "COMPLETE"].iloc[0,]
best_trial_logistic
number 471
value 0.5757
datetime_start 2023-07-10 14:21:53.276629
datetime_complete 2023-07-10 14:21:57.814307
duration 0 days 00:00:04.537678
params_l1_ratio 0.0003
params_reg_strength 0.0028
user_attrs_n_epochs 20.0000
system_attrs_completed_rung_0 0.5715
system_attrs_completed_rung_1 0.5715
system_attrs_completed_rung_2 NaN
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 298, dtype: object
We see the best tune has a very low L1 regularization ratio, though not zero. The regularization strength is also fairly low, but nowhere near the lower bound of our search space. This suggests we are likely close to the optimal solution.
The median best number of epochs for the 3 CV folds was 20, which is the number of epochs we’ll use in our final logistic regression model, created below.
-
We combine our preprocessing pipeline with our logistic regression model, to perform all steps in one go.
-
We'll train our final model with the optimal hyperparameters, for the optimal number of epochs derived from cross-validation, with no early stopping.
-
Even with the built-in validation disabled, SGDClassifier performs early stopping based on the training loss.
- This can be misleading, as lower training loss doesn't necessarily translate into better predictive performance on unseen data (in the case of overfitting). To avoid this, we set an arbitrarily high number for the n_iter_no_change parameter.
# Create pipeline with final logistic regression model
pipe_logistic = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("Logistic", SGDClassifier( # Model step
loss = "log_loss",
penalty = "elasticnet",
alpha = best_trial_logistic["params_reg_strength"],
l1_ratio = best_trial_logistic["params_l1_ratio"],
max_iter = 20, # Best n. of epochs found from validation
n_iter_no_change = 1000, # Arbitrary high value to ensure training doesn't early stop based on training loss
verbose = 1, # Print training progress
random_state = 1923
)
)
]
)
Our second model, SVM, aims to classify observations by separating them with decision boundaries drawn in the feature space.
-
Normally, the decision boundaries are linear, but non-linear transformations can be applied to the feature space to model non-linear relationships.
- There are numerous kernel functions that can be used to transform a feature space. The best choice will depend on the nature of the relationships in the data, but we will use a general-purpose radial basis function. The model performance will greatly depend on the suitability of the chosen kernel.
-
SVM can also be susceptible to multicollinearity and the curse of dimensionality, so we will apply & tune regularization parameters. Regularization affects the size of the margin around the decision boundary:
-
With higher regularization, we have a larger margin, which may generalize better to new data (reduce overfitting), but may also lead to underfitting by not separating the training observations enough.
-
With lower regularization, we have a smaller margin. We can separate the training observations more accurately, but this could lead to overfitting on unseen data.
-
-
SVM does not natively output probability predictions between 0 and 1. Instead, it calculates a decision function, based on the distance of an observation from the decision boundaries. This has implications for our performance evaluations, which we will discuss more in model testing.
In scikit-learn, a linear SVM classifier can normally be applied with LinearSVC. A non-linear SVM with built-in kernel options can be applied with SVC. Both approaches, especially the latter, will be computationally expensive with a large dataset. Just like with logistic regression, we can use SGDClassifier to train our SVM much more quickly with an SGD algorithm.
-
Before we apply the SGDClassifier, we'll add a kernel approximation step to our pipeline, to apply a non-linear transformation to our feature space. RBFSampler uses random Fourier features to quickly approximate an RBF feature space (original paper).
-
This may or may not be a good kernel choice for our data. The only way to be sure is to try different options. Here, we are using a general purpose non-linear kernel to compare a non-linear SVM with other models. Generally, it is suggested to use the “most linear decision boundaries that sufficiently separate the data”, as they tend to be more robust and efficient.
-
RBFSampler has two parameters, gamma and n_components.
  - gamma is an RBF function parameter that affects the reach of each training observation, and therefore the “smoothness” of the decision boundary. We leave this value to a default heuristic dependent on the number of features and their variance.
  - A higher gamma means the decision boundary is affected mostly by the closest points, making it less linear and less smooth. This can separate the training observations better, but may also cause overfitting on unseen data.
  - A lower gamma does the opposite: a decision boundary affected by observations further away, more linear and smoother. This may generalize better to new data, but may also cause underfitting.
  - n_components refers to the number of Monte Carlo samples per feature used to approximate the RBF feature space. A higher value may increase the quality of the approximation, but we leave this at its default too.
After the kernel approximation gives us a non-linear feature space, the SGDClassifier will act as a linear SVM if we change our loss metric to hinge loss (a short standalone sketch of this combination follows after this list).
-
Hinge loss is the default loss metric used to train SVM models. It is calculated from the output of the decision function. It slightly penalizes even “correct” predictions if they were too close to the decision boundary, i.e. close to being misclassified. This property aims to maximize the distances between the observations and the decision boundary.
-
We’ll apply class weights to hinge loss, just like we did with log loss. We’ll also use class weighted hinge loss to score predictions on the validation data.
-
SVM is normally fit with L2 regularization only, but we’ll try elastic net regularization just like with logistic regression.
-
The validation function and the Optuna tuning process is pretty much the same with logistic regression. The code is available below. The Optuna study with 500 trials took roughly 15 minutes with 12 CPU cores available.
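Here is a minimal standalone sketch of this combination (with made-up random data, separate from our actual pipeline): RBFSampler approximates an RBF feature space, and SGDClassifier with hinge loss then fits a linear SVM on the transformed features, scored with hinge loss computed from the decision function.
# Standalone sketch: kernel approximation + linear SVM trained with SGD
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import hinge_loss

rng = np.random.default_rng(1923)
x_demo = rng.normal(size = (200, 10)) # 200 made-up observations, 10 features
y_demo = (x_demo[:, 0] * x_demo[:, 1] > 0).astype(int) # Non-linear made-up target

# Map the features into a 100-dimensional approximation of the RBF kernel space
rbf_demo = RBFSampler(gamma = "scale", n_components = 100, random_state = 1923)
x_rbf = rbf_demo.fit_transform(x_demo)
print(x_rbf.shape) # (200, 100)

# A linear SVM on the transformed features, trained with SGD
svm_demo = SGDClassifier(loss = "hinge", random_state = 1923).fit(x_rbf, y_demo)

# Hinge loss is computed from the decision function, not from probabilities
print(hinge_loss(y_demo, svm_demo.decision_function(x_rbf)))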
Show validation function definition
# Define model validation function for SVM
def validate_svm(alpha, l1_ratio, trial, tol = 1e-4, verbose = 0):
# Record the CV scores of the parameter set
cv_scores = []
# Record best epochs for each CV fold
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Compute class weight
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0]) # Weight for the positive class where y == 1
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Define RBF kernel approximator
kernel_rbf = RBFSampler(
gamma = "scale",
n_components = 100,
random_state = 1923
)
# Define preprocessing & kernel trick pipeline
pipe_svm = Pipeline(steps = [
("preprocessing", pipe_process),
("kernel_rbf", kernel_rbf)
])
# Perform preprocessing
x_tr = pipe_svm.fit_transform(x_tr, y_tr)
x_val = pipe_svm.transform(x_val)
# Define SVM classifier with SGD
model_svm = SGDClassifier(
loss = "hinge", # Hinge loss for SVM
penalty = "elasticnet", # Elastic net penalty as opposed to default L2
learning_rate = "optimal", # Dynamically adjusted based on reg. strength
random_state = 1923,
verbose = verbose,
n_jobs = -1,
alpha = alpha,
l1_ratio = l1_ratio
)
# Perform epoch by epoch training with early stopping & pruning
epoch_scores = []
n_iter_no_change = 0
tol = tol
for epoch in range(1000):
# Train model for 1 epoch
model_svm.partial_fit(x_tr, y_tr, classes = classes,
sample_weight = sample_weight)
# Score epoch
pred_decision = model_svm.decision_function(x_val)
epoch_score = hinge_loss(y_val, pred_decision,
sample_weight = sample_weight_val)
# For first CV fold, report intermediate score of trial to Optuna
if i == 0:
trial.report(epoch_score, epoch)
# Prune trial if necessary
if trial.should_prune():
raise optuna.TrialPruned()
# Count epochs with no improvement after first 10 epochs
if (epoch > 9) and (epoch_score > min(epoch_scores) - tol):
n_iter_no_change += 1
# Reset epochs with no improvement if an improvement takes place
if (epoch > 9) and (epoch_score <= min(epoch_scores) - tol):
n_iter_no_change = 0
# Append epoch score to list of epoch scores
epoch_scores.append(epoch_score)
# Early stop training if necessary
if n_iter_no_change >= 10:
print("Early stopping at epoch " + str(epoch))
break
# Append best epoch score to list of CV scores
cv_scores.append(min(epoch_scores))
# Append best epoch number to list of best epochs
best_epochs.append(epoch_scores.index(min(epoch_scores)) + 1)
# Return the average CV score & median best n. epochs
return np.mean(cv_scores), np.median(best_epochs)
Show Optuna objective function
# Define objective function for hyperparameter tuning
def objective_svm(trial):
# Define parameter ranges to tune over
alpha = trial.suggest_float("reg_strength", 5e-5, 0.5, log = True)
l1_ratio = trial.suggest_float("l1_ratio", 0, 1)
# Validate the parameter set
score, epoch = validate_svm(
alpha = alpha, l1_ratio = l1_ratio, trial = trial)
# Report best n. of epochs to Optuna
trial.set_user_attr("n_epochs", epoch)
return score
# Import best trial
best_trial_svm = pd.read_csv("./ModifiedData/trials_svm.csv")
best_trial_svm = best_trial_svm.loc[
best_trial_svm["state"] == "COMPLETE"].iloc[0,]
best_trial_svm
number 154
value 0.7448
datetime_start 2023-07-10 14:30:33.857628
datetime_complete 2023-07-10 14:30:38.622736
duration 0 days 00:00:04.765108
params_l1_ratio 0.1330
params_reg_strength 0.0004
user_attrs_n_epochs 16.0000
system_attrs_completed_rung_0 0.7396
system_attrs_completed_rung_1 0.7376
system_attrs_completed_rung_2 NaN
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 303, dtype: object
The optimal hyperparameter set has lower regularization strength compared to logistic regression, but a higher ratio of L1 regularization. This could suggest the RBF transformation made our feature set more redundant, but the parameter values are likely not directly comparable across different algorithms.
We create the final SVM model, combining it with the preprocessing & kernel approximation pipeline.
- We also wrap our SVM model in a CalibratedClassifierCV object. This will approximate probability values between 0 and 1 from the SVM decision function, so we can retrieve probability predictions, calculate some performance metrics that depend on them, and compare SVM with other models.
- Keep in mind the probabilities acquired in this manner may be unreliable. The documentation states the default sigmoid method may be unsuitable for imbalanced problems, so we use the non-parametric, isotonic version.
# Create pipeline with final SVM model
pipe_svm = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("KernelTrick", RBFSampler( # Non-linear transformation of features with
# RBF kernel
gamma = "scale", # RBF kernel parameter, left to a default heuristic
n_components = 100, # N. of Monte Carlo samples used to approximate
# each feature
random_state = 1923
)
),
("SVM", CalibratedClassifierCV( # Prob. calibrator to estimate predicted probs.
estimator = SGDClassifier( # Model step
loss = "hinge", # Hinge loss to train a linear SVM
penalty = "elasticnet", # Elastic net regularization instead of default L2
alpha = best_trial_svm["params_reg_strength"],
l1_ratio = best_trial_svm["params_l1_ratio"],
max_iter = 16, # Best n. of epochs found from validation
n_iter_no_change = 1000, # Arbitrary high value to ensure training doesn't early stop based on training loss
verbose = 1, # Print training progress
random_state = 1923
),
method = "isotonic" # Non-parametric method to estimate predicted probs.
)
)
]
)
Next is our XGBoost model, a gradient boosting algorithm. Gradient boosting builds a number of models sequentially, each one trained on the prediction errors (residuals) of the previous one. The final ensemble model is a combination of all models.
-
One iteration of gradient boosting (equivalent to building one model) is called a round. By default, XGBoost builds a decision tree model in each round.
-
Tree-based models learn relationships & make predictions by splitting the data according to the predictor values.
-
This means they can natively model non-linear relationships, and find interaction effects even if they are not explicitly mapped as features (though doing so may still improve performance).
-
However, they can’t extrapolate relationships between predictors and the target outside of the observed training data.
- For example, for a feature X and target Y, a relationship of Y = 2X will be easily modeled and extrapolated by a linear model, even if X takes a very large value not seen in training. A tree-based model, however, will continue predicting according to the highest X value seen in training (a short standalone sketch of this follows after this list).
-
-
No hard assumptions are made about the data, though too many features can make it harder to identify the optimal splits and potentially hamper prediction performance.
-
Normally, a single decision tree with too few predictors & splits will tend to underfit, and one with too many predictors & splits will tend to overfit. In XGBoost, we control the complexity of each tree with several hyperparameters. Generally, we build numerous shallow trees (called weak learners) and ensemble them to arrive at a robust final prediction.
-
XGBoost can apply several types of regularization & sampling to further control the tradeoff between overfitting and underfitting. Tuning the numerous hyperparameters is crucial for good performance.
-
While model performance may not be impacted by multicollinearity in the feature set, the feature importance scores that can be extracted from tree-based models can be misleading. These only look at the relevancy between the target and the features, without taking their redundancy with one another into account (for more on relevancy, redundancy & feature selection, see this article).
-
While there are a lot of hyperparameters to tune for XGBoost, the code is simpler compared to SGDClassifier.
-
We can simply pass a validation set while training an XGBoost model, which automatically handles early stopping, records the best validation score and the best round.
-
Callbacks are objects that monitor model training and perform a certain action based on a condition, most commonly early stopping. Optuna offers a special pruning callback integration for XGBoost, so we don’t have to manually report scores to the Optuna study.
-
XGBoost supports GPU training, which will save us some training time. With 1000 trials, the Optuna study took roughly 25-30 minutes on my GPU.
-
The scale_pos_weight hyperparameter can be used to balance classes in binary classification problems, but it's apparently not applied to validation score calculations (see discussion), so we create sample weight vectors and pass them to .fit() as we did before.
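Below is a minimal standalone sketch (with made-up data and models not used elsewhere in this report) of the extrapolation limitation mentioned in the list above: a linear model extrapolates Y = 2X beyond the training range, while a single decision tree predicts a constant past the largest X it saw.
# Standalone sketch: tree-based models don't extrapolate beyond the training range
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

x_demo = np.arange(0, 100).reshape(-1, 1) # X from 0 to 99
y_demo = 2 * x_demo.ravel() # Y = 2X

linear_demo = LinearRegression().fit(x_demo, y_demo)
tree_demo = DecisionTreeRegressor(random_state = 1923).fit(x_demo, y_demo)

x_new = np.array([[1000]]) # Far outside the training range
print(linear_demo.predict(x_new)) # ~2000, extrapolates the linear relationship
print(tree_demo.predict(x_new)) # 198, stuck at the prediction for the largest X seen in training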
Show validation function definition
# Define model validation function for XGBoost
def validate_xgb(
params_dict, # Dictionary of hyperparameter values
trial, verbose = 0):
# Record best epoch scores for each CV fold
cv_scores = []
# Record best epochs for each CV fold
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Compute class weight
classes = list(set(y_tr))
class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0])
sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Create pruning callback for first CV split
if i == 0:
callback_pruner = [optuna.integration.XGBoostPruningCallback(
trial, "validation_0-logloss")]
else:
callback_pruner = None
# Define XGBoost classifier
model_xgb = XGBClassifier(
objective = "binary:logistic",
n_estimators = 5000, # Arbitrarily high number of training rounds, as early stopping is active
early_stopping_rounds = 50, # Stop training after 50 rounds with no improvement
eval_metric = "logloss", # Validation metric
tree_method = "gpu_hist", # Train with GPU
gpu_id = 0,
verbosity = 0,
random_state = 1923,
callbacks = callback_pruner,
learning_rate = params_dict["learning_rate"],
max_depth = params_dict["max_depth"],
min_child_weight = params_dict["min_child_weight"],
gamma = params_dict["gamma"],
reg_alpha = params_dict["reg_alpha"],
reg_lambda = params_dict["reg_lambda"],
subsample = params_dict["subsample"],
colsample_bytree = params_dict["colsample_bytree"]
)
# Train & validate model
model_xgb.fit(
X = x_tr,
y = y_tr,
sample_weight = sample_weight,
eval_set = [(x_val, y_val)], # Validation set
sample_weight_eval_set = [sample_weight_val],
verbose = verbose)
# Append best epoch score to list of CV scores
cv_scores.append(model_xgb.best_score)
# Append best epoch number to list of best epochs
best_epochs.append(model_xgb.best_iteration + 1)
# Return the average CV score & median best n. rounds
return np.mean(cv_scores), np.median(best_epochs)
Let’s discuss the numerous parameters of XGBoost we will tune, and the search spaces.
- learning_rate is the magnitude of the model update after each boosting round, as in SGD training. The same considerations as explained before apply. A general heuristic is to tune the XGBoost LR between 0.05 and 0.3 for most problems. It's also possible to use LR scheduling, but we will save this for the NN model.
- max_depth is the maximum depth of each tree, an integer value. A higher value will make the model more complex, and more likely to overfit. The default is 6, but we will tune it over a broad range.
- min_child_weight is a parameter that affects tree complexity. A higher value will make it harder for the model to justify new splits. This is usually tuned as an integer parameter.
- gamma is a special regularization parameter for tree models. It is the minimum loss reduction required to justify another split. The optimal value is often closer to zero.
- alpha and lambda are L1 and L2 regularization respectively, applied to model weights. L1 is 0 by default, and L2 is 1 by default. We'll tune both over a wide interval around their defaults, to try different blends of L1 and L2 regularization.
- subsample and colsample_bytree are the subsampling parameters, both set to 1 by default (meaning no subsampling).
  - Setting a lower value for subsample will subsample the training observations by that ratio in every boosting round.
  - A lower value for colsample_bytree will subsample the features by that ratio in every boosting round.
  - Subsampling can help make the models more robust against overfitting. It's possible to subsample features even more frequently than once per round using other hyperparameters (see documentation).
# Define objective function for hyperparameter tuning
def objective_xgb(trial):
# Suggest parameter values from parameter ranges to tune over
learning_rate = trial.suggest_float("learning_rate", 0.05, 0.3)
max_depth = trial.suggest_int("max_depth", 2, 20)
min_child_weight = trial.suggest_int("min_child_weight", 1, 20, log = True)
gamma = trial.suggest_float("gamma", 5e-5, 0.5, log = True)
reg_alpha = trial.suggest_float("l1_reg", 5e-5, 1, log = True)
reg_lambda = trial.suggest_float("l2_reg", 0, 2)
subsample = trial.suggest_float("subsample", 0.5, 1)
colsample_bytree = trial.suggest_float("colsample_bytree", 0.25, 1)
# Make dictionary of parameters
params_dict = {
"learning_rate": learning_rate,
"max_depth": max_depth,
"min_child_weight": min_child_weight,
"gamma": gamma,
"reg_alpha": reg_alpha,
"reg_lambda": reg_lambda,
"subsample": subsample,
"colsample_bytree": colsample_bytree
}
# Validate parameter set
score, rounds = validate_xgb(
params_dict = params_dict, trial = trial)
# Report best n. rounds to Optuna
trial.set_user_attr("n_rounds", rounds)
return score
# Import best trial
best_trial_xgb = pd.read_csv("./ModifiedData/trials_xgb.csv")
best_trial_xgb = best_trial_xgb.loc[
best_trial_xgb["state"] == "COMPLETE"].iloc[0,]
best_trial_xgb
number 482
value 0.5666
datetime_start 2023-07-10 14:53:42.307668
datetime_complete 2023-07-10 14:53:46.155453
duration 0 days 00:00:03.847785
params_colsample_bytree 0.4027
params_gamma 0.0029
params_l1_reg 0.0134
params_l2_reg 1.6294
params_learning_rate 0.2291
params_max_depth 5
params_min_child_weight 6
params_subsample 0.9745
user_attrs_n_rounds 22.0000
system_attrs_completed_rung_0 0.6182
system_attrs_completed_rung_1 0.5709
system_attrs_completed_rung_2 0.5678
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 0, dtype: object
The optimal set of hyperparameters yields trees on the shallow side, with some regularization. Most interesting is the colsample_bytree parameter: a value of roughly 0.4 means each tree is built with a subsample of 40% of the features in our data. This is more evidence that we may have many redundant (or simply useless) features in our dataset.
# Create pipeline with final XGB model
pipe_xgb = Pipeline(steps = [
("preprocessing", pipe_process), # Preprocessing steps
("XGBoost", XGBClassifier( # Model step
objective = "binary:logistic",
n_estimators = 22, # Best number of rounds from validation
eval_metric = "logloss",
tree_method = "gpu_hist",
gpu_id = 0,
verbosity = 1,
random_state = 1923,
learning_rate = best_trial_xgb["params_learning_rate"],
max_depth = best_trial_xgb["params_max_depth"],
min_child_weight = best_trial_xgb["params_min_child_weight"],
gamma = best_trial_xgb["params_gamma"],
reg_alpha = best_trial_xgb["params_l1_reg"],
reg_lambda = best_trial_xgb["params_l2_reg"],
subsample = best_trial_xgb["params_subsample"],
colsample_bytree = best_trial_xgb["params_colsample_bytree"]
)
)
]
)
Our final model is a neural network. There is much to explain about how neural networks work, and there are many complex architectures, but I’ll summarize the basic idea by comparing with a linear regression:
-
In a linear regression equation, the inputs are the features. These inputs are multiplied with their respective weights (coefficients) and summed up, along with a bias (intercept) value. This gives us the output / prediction.
-
One such linear regression equation is equivalent to one neuron / linear unit in a neural network. A layer is an organization of numerous neurons / units, or another transformation that is applied to the inputs.
-
The inputs of a neural network are equivalent to the original features, as they are fed into the network. The output layer is the last layer that gives the final output / prediction of the network.
-
A hidden layer is any layer between the network inputs and the output layer. The inputs of neurons in a hidden layer consist of the outputs of the previous layer (for the first hidden layer, simply the input features). Hidden layer neurons each apply their own weights and bias terms to the inputs, and pass their outputs to the next layer. In general, deep neural networks are those with multiple or many hidden layers.
-
We can think of a multiple linear regression model (with numerous predictors) as a single neuron / linear unit with multiple inputs.
-
When we fit a linear regression model, we estimate a single coefficient for each predictor, and one intercept term. In a neural network, we estimate a weight for each input of each neuron, and a bias term for each neuron.
-
Activation functions can be applied after a layer, to apply a mathematical transformation to its outputs before passing them on to the next layer as inputs.
-
With activation functions and hidden layers, the network can learn non-linear and complex relationships.
-
With a binary classification problem, a sigmoid activation function is applied to the network’s final, single output, which will transform it from a predicted logit to a probability value between 0 and 1.
Of course, this is the simplest summary of a basic neural network. Here is an excellent source for more information on many aspects of deep learning.
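To make the analogy concrete, here is a tiny standalone sketch (not part of the model code in this report) showing that a single linear unit with one output is structurally the same as a linear regression equation, with one weight per input and one bias term:
# Standalone sketch: one linear unit = one linear regression equation
import torch

linear_unit = torch.nn.Linear(in_features = 5, out_features = 1) # 5 hypothetical inputs
x = torch.randn(3, 5) # 3 observations, 5 features
print(linear_unit(x).shape) # torch.Size([3, 1]): one output per observation
print(linear_unit.weight.shape) # torch.Size([1, 5]): one weight per input
print(linear_unit.bias.shape) # torch.Size([1]): one bias term per neuron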
To build and train our NN model, we will use PyTorch, along with the associated Lightning framework, which makes it much easier to organize PyTorch code. We'll define several custom PyTorch and Lightning classes according to our needs, based on existing class templates.
First, we need to define a Dataset class. This class will store our data as Torch tensors. Its methods will return one row at a time, as a pair of feature tensors & target tensor, and the data length. We create a separate class for training & testing data, as the latter won't have associated target values.
Show Dataset class creation
# Define TrainDataset class: Takes in preprocessed features & targets
class TrainDataset(torch.utils.data.Dataset):
# Initialize class with training features & targets
def __init__(self, x_train, y_train):
# Store features as Torch tensor
self.x = torch.tensor(x_train, dtype = torch.float32)
# Store targets as Torch tensor
self.y = torch.tensor(y_train.values, dtype = torch.float32).unsqueeze(1)
# Return data length
def __len__(self):
return len(self.x)
# Return a pair of features & target (one training observation)
def __getitem__(self, idx):
return self.x[idx], self.y[idx]
# Define TestDataset class: Takes in preprocessed features only
class TestDataset(torch.utils.data.Dataset):
# Initialize class with training features
def __init__(self, x_test):
self.x = torch.tensor(x_test, dtype = torch.float32) # Store features
# Return data length
def __len__(self):
return len(self.x)
# Return one set of features
def __getitem__(self, idx):
return self.x[idx]
Next, we define our Lightning Module. This is the core class that will store our model architecture & hyperparameters, define our training, validation & prediction loops, and more. The explanation of each class method is better left to the code comments below. We'll discuss the model architecture and hyperparameters afterwards.
Show Lightning Module class creation
# Define Lightning module
class SeluDropoutModel(pl.LightningModule):
# Initialize model with hyperparameters dictionary
def __init__(self, hyperparams_dict):
# Delegate function to parent class
super().__init__()
# Save external hyperparameters so they are available when loading saved models
self.save_hyperparameters(logger = False)
# Initialize validation metric: Average precision
self.val_avg_precision = torchmetrics.classification.AveragePrecision(
task = "binary")
# Define hyperparameters
self.n_hidden_layers = hyperparams_dict["n_hidden_layers"]
self.input_size = hyperparams_dict["input_size"]
self.hidden_size = hyperparams_dict["hidden_size"]
self.learning_rate = hyperparams_dict["learning_rate"]
self.l2 = hyperparams_dict["l2"]
self.dropout = hyperparams_dict["dropout"]
self.loss_alpha = hyperparams_dict["loss_alpha"]
self.loss_gamma = hyperparams_dict["loss_gamma"]
# Define network architecture
# Initialize layers list with first hidden layer
self.layers_list = torch.nn.ModuleList([
torch.nn.Linear(self.input_size, self.hidden_size), # Hidden layer 1
torch.nn.SELU(), # Activation 1
torch.nn.AlphaDropout(self.dropout) # Dropout 1
])
# Add extra hidden layers to layers list, according to hyperparameter
for n in range(0, (self.n_hidden_layers - 1)):
self.layers_list.extend([
torch.nn.Linear(self.hidden_size, self.hidden_size), # Hidden layer N
torch.nn.SELU(), # Activation N
torch.nn.AlphaDropout(self.dropout) # Dropout N
])
# Add final output layer to layers list
self.layers_list.append(
torch.nn.Linear(self.hidden_size, 1) # Output layer
# No sigmoid activation here, because the loss function has that built-in
)
# Create full network from layers list
self.network = torch.nn.Sequential(*self.layers_list)
# Sigmoid activation for prediction step only, not part of forward
# propagation as the loss function has sigmoid activation built in
self.sigmoid = torch.nn.Sequential(torch.nn.Sigmoid())
# Initialize weights to conform with self-normalizing SELU activation
for layer in self.network:
if isinstance(layer, torch.nn.Linear):
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity = "linear")
torch.nn.init.zeros_(layer.bias)
# Define forward propagation
def forward(self, x):
output = self.network(x.view(x.size(0), -1))
return output # Returns logits, not probabilities
# Define training loop
def training_step(self, batch, batch_idx):
# Perform training
x, y = batch
output = self.forward(x)
# Calculate loss of training predictions
# Loss function applies sigmoid activation before calculating focal loss
loss = torchvision.ops.sigmoid_focal_loss(
output, y,
alpha = self.loss_alpha, # Positive class weight parameter
gamma = self.loss_gamma, # Parameter that discounts "easy" training examples
# in loss calculation
reduction = "mean")
# Log training loss
self.log(
"train_loss", loss,
on_step = False, on_epoch = True, prog_bar = True, logger = True)
# Return training loss
return loss
# Define validation loop
def validation_step(self, batch, batch_idx):
# Make predictions, apply sigmoid activation to get probabilities
x, y = batch
output = self.forward(x)
pred = self.sigmoid(output)
# Update & log avg. precision score, ensure y is in int32 format for metric
self.val_avg_precision(pred, y.type(torch.int32))
self.log(
"val_avg_precision", self.val_avg_precision,
on_step = True, on_epoch = True, prog_bar = True, logger = True)
# Return validation score
return self.val_avg_precision
# Define prediction method (because the default just runs forward(), which
# doesn't have sigmoid activation and doesn't return probabilities)
def predict_step(self, batch, batch_idx):
# Run the forward propagation
x = batch
output = self.forward(x)
# Apply sigmoid activation to return probability predictions
pred = self.sigmoid(output)
return pred
# Define optimization algorithm, LR scheduler
def configure_optimizers(self):
# Adam optimizer with L2 regularization
optimizer = torch.optim.Adam(
self.parameters(), lr = self.learning_rate, weight_decay = self.l2)
# Cyclic LR scheduler
lr_scheduler = torch.optim.lr_scheduler.CyclicLR(
optimizer,
base_lr = self.learning_rate, # Lowest LR at start of each cycle
max_lr = (self.learning_rate * 5), # Highest LR at the end of first cycle
mode = "exp_range", # Exponentially decreasing max. LR
gamma = 0.99995, # Reduction factor for max. LR
step_size_up = 200, # Training steps to go from base LR to max. LR. once
# Heuristic: (2-8 * (steps (batches) in one epoch))
cycle_momentum = False # Not compatible with Adam optimizer
)
return {
"optimizer": optimizer,
"lr_scheduler": {
"scheduler": lr_scheduler,
"interval": "step",
"frequency": 1
}
}
The model architecture above consists of at least one hidden layer, and it is possible to add more hidden layers depending on the n_hidden_layers parameter, which we'll tune.
-
In theory, one hidden layer enables us to solve non-linear separation problems, and two hidden layers enable us to solve problems with multiple disconnected decision boundaries. More complex problems may require more layers, but in general, we want to use the simplest architecture that solves our problem (here’s a good article that talks about this in more detail).
-
The first hidden layer will take in our features as inputs, so the input_size hyperparameter has to equal the number of features. We'll give this as a hyperparameter to the model, but we won't tune it.
- The output size of the first hidden layer will equal the input & output sizes of any subsequent hidden layer(s). We will tune this as the hidden_size hyperparameter.
  - It's also possible to set a different hyperparameter for the output size of each subsequent hidden layer and tune them separately, so the layers constantly get smaller as we move forward in the network, performing a reduction in the feature space (this is a key part of autoencoder architectures).
  - The output size of a hidden layer also equals the number of neurons / units in that layer, as 1 neuron = 1 output. Generally, we want hidden layer sizes to be between the network input size (number of features) and the final output size (one in this case).
-
-
The output layer will only have one output, the predicted logit for the positive class. We won’t apply a sigmoid activation at the end of the network, as our loss function applies sigmoid activation before calculating the loss.
-
Each hidden layer will be followed by a SELU activation function layer. This stands for “scaled exponential linear unit”. The SELU activation will enable us to model non-linear relationships, while also enforcing a self-normalizing property on the layer outputs (see original paper).
-
Neural networks can be prone to numerical instability issues, so it’s a common approach to apply forms of scaling / normalization inside the network, to the outputs of layers, before passing them on to the next layer.
-
The most common method for in-network normalization is adding batch normalization layers, but apparently it can interact with the application of dropout layers (see a detailed discussion related to this).
-
The scaling / normalization we apply to our features in preprocessing can also be applied as the first layer of a network.
-
-
The SELU activation after each hidden layer will be followed by a dropout layer.
-
This is a form of regularization which randomly drops out a fraction of the next layer’s inputs at each training step. Lightning automatically disables dropout when predicting with a trained NN model.
-
When neural networks overfit, they tend to learn noise as a very specific combination of weights across layers. Dropout breaks these combinations, forcing the network to generalize better.
-
We will use alpha dropout, which is applied after SELU activation, and maintains the self-normalizing property of SELU outputs.
-
The probability of dropping each input is another hyperparameter we will tune, as dropout. Obviously, a higher probability will result in more severe regularization and possible underfitting.
-
Our training loss function is going to be focal loss, which was developed for highly imbalanced classification problems in image recognition (see original paper). This loss function discounts the errors made for “easily classified” training examples, focusing the model’s attention on difficult training examples.
  - This loss function has two parameters, alpha and gamma, which we will tune as hyperparameters (a short standalone sketch of their effect follows after this list).
  - A higher alpha value assigns more weight to positive class observations when calculating the loss, similar to our class weighting for previous models' loss functions.
  - The gamma value controls how much “easy” training examples are discounted when calculating the loss. If gamma = 0, no discounting is applied, and the loss function becomes equivalent to cross entropy, which is a commonly used default loss function for NN classification.
    - Cross entropy is practically equivalent to log loss, which we used previously.
Previously, we used the same loss function both to train our models, and to score their performance on validation data.
-
In general, we want to train & tune models with a loss function that scores the probability predictions directly, instead of the class predictions (more on this in model testing).
-
However, when we tune the parameters of focal loss, our loss function becomes part of the optimization problem (just like the class weights we applied to the loss function for the previous models). If we use focal loss also as our validation metric, this will be a form of leakage: We’ll simply search for the focal loss parameters that minimize focal loss by changing the metric itself, not necessarily those that maximize model performance (moving the goalposts in a sense).
-
Therefore, to validate our NN hyperparameters’ performance, we will use the average precision performance metric (which will be explained in model testing). Keep in mind this value is to be maximized, and not minimized like a loss function.
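Below is a small standalone sketch (with made-up predictions) of this validation metric, computed both with the torchmetrics class used in the Lightning module and with scikit-learn:
# Standalone sketch: average precision, to be maximized
import torch, torchmetrics
from sklearn.metrics import average_precision_score

y_true_demo = torch.tensor([0, 0, 1, 1, 0, 1])
y_prob_demo = torch.tensor([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

metric = torchmetrics.classification.AveragePrecision(task = "binary")
print(metric(y_prob_demo, y_true_demo)) # ~0.9167
print(average_precision_score(y_true_demo.numpy(), y_prob_demo.numpy())) # Same value; higher is better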
Finally, we choose an optimizer and an optional learning rate scheduler.
The optimizer manages how the model weights are updated after each training step.
-
Recall that NN models are usually trained with a form of minibatch gradient descent: in one training step, a batch of training observations is fed to the model, the loss is computed for the predictions, and the model weights are updated, taking the learning rate into account.
-
Minibatch SGD is available as an optimizer, but the commonly used default is the Adam algorithm. It is a robust general purpose optimizer that accelerates convergence and adjusts the learning rate dynamically. This is what we will use.
-
Adam can also apply L2 regularization to the model weights. We'll tune this as our l2 hyperparameter.
The learning rate scheduler will dynamically adjust the learning rate, to ensure we get the best tradeoff between fast & optimal convergence, and the risk of overfitting & erratic training steps.
-
A simple approach is starting training with a relatively higher LR, and gradually lowering it after a certain number of training steps / epochs are elapsed with no improvement.
-
We'll try a more sophisticated approach: a cyclic LR scheduler will cycle the LR from a base_lr value to a max_lr value, then back down to the base_lr, in a certain number of training steps. After each cycle, the max_lr will be reduced in an exponential fashion. We'll tune only the base_lr as our hyperparameter learning_rate, but it's possible to tune other parameters of the cyclic scheduler (a short standalone sketch of this schedule follows after this list).
-
The rationale behind this approach is that a higher LR rate may appear to harm the training process initially, but yield better results later on. By cycling the LR in this fashion, we still decrease the “average” LR in the long term, while periodically increasing the LR in the short term, to avoid getting stuck in locally optimal solutions (more on this in the original paper, and a good summary article).
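A small standalone sketch of this schedule is below (with arbitrary parameters, not the tuned values): the LR cycles up and down, and the peak decays slightly from one cycle to the next.
# Standalone sketch: cyclic LR schedule with exponentially decaying peaks
import torch

params = [torch.nn.Parameter(torch.zeros(1))] # Dummy parameter for the optimizer
optimizer = torch.optim.Adam(params, lr = 1e-3)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr = 1e-3, max_lr = 5e-3, mode = "exp_range",
    gamma = 0.99995, step_size_up = 200, cycle_momentum = False)

lrs = []
for step in range(800):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
print(max(lrs[:400]), max(lrs[400:])) # The second cycle's peak LR is slightly lower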
The last Lightning class we need to create is simply an exact copy of the existing Optuna pruner callback, due to compatibility issues (similar issues were discussed here and here).
Show Optuna pruner creation
# Optuna code uses old pytorch_lightning namespace which causes an error with the
# pruning callback integration. Create copy of the Optuna pruning callback with
# the new lightning.pytorch namespace as a workaround.
class OptunaPruning(PyTorchLightningPruningCallback, pl.Callback):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
Now, we define our validation function for the neural network model. There are a few more PyTorch and Lightning functions & classes we will use to load our data & validate our models.
- We create two DataLoader instances in the loop, one each for our training & validation data. The dataloader creates minibatches from the data according to the batch_size parameter, and serves these batches to the model for training, validation & prediction.
  - Batch size mainly affects the training speed, though it could affect the model performance as well (for example, smaller batch sizes could result in noisier model updates). Generally, it's suggested to use the largest batch size possible with the available memory.
  - The training data is shuffled for batching, but not the validation or prediction data.
  - The validation & prediction steps could also be carried out without batching if the available memory is enough.
  - The num_workers argument controls the number of parallel processes used by the dataloader. I've often had issues with this parameter, so I did not use parallelization.
- We create an early stopping callback and an Optuna pruning callback.
- Finally, we create the Trainer. The trainer takes the model, dataloaders, callbacks and additional settings, and performs training, testing and prediction.
  - We set parameters such as a high number of maximum training epochs (as early stopping is active), GPU training options, and logging / checkpointing / model summary options.
  - With GPU training and no parallelization on the dataloaders, the Optuna study of 500 trials took roughly 4 hours on my machine.
Show validation function definition
# Define validation function for NN
def validate_nn(hyperparams_dict, trial, tol = 1e-4):
# Store the CV scores for the parameter set
cv_scores = []
# Store the best checkpoint strings from Lightning Trainer
best_epochs = []
for i, (train_index, val_index) in enumerate(cv_indices):
# Split training-validation data
x_tr = x_train.iloc[train_index, ]
y_tr = y_train.iloc[train_index, ]
x_val = x_train.iloc[val_index, ]
y_val = y_train.iloc[val_index, ]
# Perform preprocessing
x_tr = pipe_process.fit_transform(x_tr, y_tr)
x_val = pipe_process.transform(x_val)
# Load data into TrainDataset
train_data = TrainDataset(x_tr, y_tr)
val_data = TrainDataset(x_val, y_val)
# Create data loaders
train_loader = torch.utils.data.DataLoader(
train_data,
batch_size = 1024, # Large batch size for fast training
num_workers = 0, # DataLoader parallelization disabled, as it causes issues
shuffle = True # Shuffle training data before creating training batches
)
val_loader = torch.utils.data.DataLoader(
val_data, batch_size = len(x_val), num_workers = 0,
shuffle = False # Don't shuffle val. data before creating val. batches
)
# Create callbacks list
callbacks = []
# Create early stopping callback
callback_earlystop = pl.callbacks.EarlyStopping(
monitor = "val_avg_precision",
mode = "max", # Goal is to increase the val. metric, not decrease it
min_delta = tol,
patience = 10 # Early stop after 10 epochs without at least tol improvement
)
callbacks.append(callback_earlystop)
# Create Optuna pruner callback only for first CV fold
if i == 0:
callback_pruner = OptunaPruning(trial, monitor = "val_avg_precision")
callbacks.append(callback_pruner)
# Create Lightning Trainer
trainer = pl.Trainer(
max_epochs = 500, # Arbitrary high number of training epochs, as early stopping is active
log_every_n_steps = 5, # The default is 50, but there are fewer than 50 training steps (batches) per epoch
accelerator = "gpu", devices = "auto", precision = "16-mixed", # GPU training
callbacks = callbacks,
logger = True,
enable_model_summary = False, # Disable printing model summary
enable_progress_bar = False, # Disable prog. bar for Optuna trials
enable_checkpointing = False # Disable checkpoints for Optuna trials
)
# Create & train model
model = SeluDropoutModel(hyperparams_dict = hyperparams_dict)
trainer.fit(model, train_loader, val_loader)
# Retrieve best val score and n. of epochs
score = callbacks[0].best_score.cpu().numpy()
epoch = trainer.current_epoch - callbacks[0].wait_count # Starts from 1
cv_scores.append(score)
best_epochs.append(epoch)
# Return the mean CV score, median best epoch
return np.mean(cv_scores), np.median(best_epochs)
Our tuning objective function is similar to the one for XGBoost.
-
We’ll try networks with 1 to 3 hidden layers. I’d expect 1 or 2 hidden layers to be enough for this problem, but in previous iterations, 3-4 hidden layers performed close to optimal as well.
-
We won’t sample a hidden size value directly, but we’ll sample an exponent for the base of 2, yielding a hidden size in powers of 2, from 2 to 64. All hidden layers will have the same size.
-
The default learning rate for the Adam algorithm is 1e-3. We’ll tune from a broad range around that, as learning rate scheduling is in place.
-
The focal loss parameter alpha takes a value from 0 to 1, which will be our search space. A gamma value of 2 is suggested as a good setting by the authors of the original paper, so we'll tune from a range around that.
# Define Optuna objective
def objective_nn(trial):
# Define parameter ranges to tune over & suggest param set for trial
n_hidden_layers = trial.suggest_int("n_hidden_layers", 1, 3)
hidden_size = 2 ** trial.suggest_int("hidden_size", 1, 6) # Tune hidden size as
# exponent of 2
learning_rate = trial.suggest_float("learning_rate", 5e-4, 5e-2)
l2 = trial.suggest_float("l2", 5e-5, 1e-2, log = True)
dropout = trial.suggest_float("dropout", 1e-3, 0.25, log = True)
loss_alpha = trial.suggest_float("loss_alpha", 0, 1)
loss_gamma = trial.suggest_float("loss_gamma", 0, 4)
# Create hyperparameters dict
hyperparams_dict = {
"input_size": 90, # N. of input features
"n_hidden_layers": n_hidden_layers,
"hidden_size": hidden_size,
"learning_rate": learning_rate,
"l2": l2,
"dropout": dropout,
"loss_alpha": loss_alpha,
"loss_gamma": loss_gamma
}
# Validate hyperparameter set
score, epoch = validate_nn(hyperparams_dict = hyperparams_dict, trial = trial)
# Report best n. of epochs
trial.set_user_attr("n_epochs", epoch)
return score
# Create study
study_nn = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923),
pruner = optuna.pruners.HyperbandPruner(),
study_name = "tune_nn",
direction = "maximize" # Objective is to maximize reported metric, not minimize
)
# Import best trial
best_trial_nn = pd.read_csv("./ModifiedData/trials_nn.csv")
best_trial_nn = best_trial_nn.loc[
best_trial_nn["state"] == "COMPLETE"].iloc[0,]
best_trial_nn
number 371
value 0.4530
datetime_start 2023-07-10 11:07:10.155796
datetime_complete 2023-07-10 11:07:51.176352
duration 0 days 00:00:41.020556
params_dropout 0.0240
params_hidden_size 3
params_l2 0.0003
params_learning_rate 0.0167
params_loss_alpha 0.4922
params_loss_gamma 0.1408
params_n_hidden_layers 3
user_attrs_n_epochs 17.0000
system_attrs_completed_rung_0 0.4447
system_attrs_completed_rung_1 0.4515
system_attrs_completed_rung_2 0.4557
system_attrs_completed_rung_3 NaN
state COMPLETE
Name: 7, dtype: object
The best tune has 3 hidden layers, each with a size of 8 (2^3, from the tuned exponent of 3).
-
This could mean our feature space is fairly non-linear, but also on the smaller side, as the hidden layer size is much smaller compared to the number of input features.
- This effectively results in a reduction of the feature space, which was also done by XGBoost in its own way.
-
Small amounts of L2 and dropout regularization are applied, but the lower bounds of the search spaces are not reached, so we should be close to the optimal values.
-
The base learning rate is considerably higher compared to the default of 0.001.
-
An alpha of roughly 0.49 means our focal loss function puts considerably more weight on the positive cases, and a low gamma of roughly 0.14 means “easy” training examples are slightly discounted from the loss calculation.
  - A heuristic suggested by the authors of the original paper was alpha = 0.25 and gamma = 2. Our tune is considerably far from that, but previous iterations with similar alpha and gamma values also performed very similarly.
The median best number of epochs for this parameter set was 17, which will be set later, in the Trainer for our final model. For now, we only create our final model instance with the best hyperparameters.
# Define NN model with best parameters (n. epochs is determined later in Trainer)
hyperparams_dict = {
"input_size": 90,
"n_hidden_layers": best_trial_nn["params_n_hidden_layers"],
"hidden_size": 2 ** best_trial_nn["params_hidden_size"],
"learning_rate": best_trial_nn["params_learning_rate"],
"l2": best_trial_nn["params_l2"],
"dropout": best_trial_nn["params_dropout"],
"loss_alpha": best_trial_nn["params_loss_alpha"],
"loss_gamma": best_trial_nn["params_loss_gamma"]
}
model_nn = SeluDropoutModel(hyperparams_dict)
Now, we’ll train the final versions of our models on the full training data, and test their performance on the testing data.
The steps for retrieving the final predictions are mostly explained in the code comments below.
# Compute class weight & sample weight vectors
classes = list(set(y_train))
class_weight = compute_class_weight("balanced", classes = classes, y = y_train)
sample_weight_train = np.where(y_train == 1, class_weight[1], class_weight[0])
sample_weight_test = np.where(y_test == 1, class_weight[1], class_weight[0])
We'll use a dummy classifier as our baseline to compare against. It always predicts the training data's class ratios (the prior class probabilities) as its probability predictions (a tiny standalone sketch of this follows after the code below).
# Create dummy classifier which always predicts the prior class probabilities
model_dummy = DummyClassifier(strategy = "prior")
# Make dict of models
models_dict = {
"Dummy": model_dummy,
"Logistic": pipe_logistic,
"SVM": pipe_svm,
"XGBoost": pipe_xgb,
"Neural net": model_nn
}
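As a tiny standalone sketch of this baseline's behavior (with a hypothetical 88% / 12% label split, roughly matching our target imbalance):
# Standalone sketch: DummyClassifier with strategy = "prior" ignores the features
import numpy as np
from sklearn.dummy import DummyClassifier

y_demo = np.array([0] * 88 + [1] * 12) # Hypothetical 88% / 12% class split
x_demo = np.zeros((100, 1)) # Features are ignored by the dummy classifier
dummy_demo = DummyClassifier(strategy = "prior").fit(x_demo, y_demo)
print(dummy_demo.predict_proba(np.zeros((3, 1)))) # Every row is [0.88, 0.12]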
We build, train & retrieve positive class prediction probabilities for every model in a loop.
Show training & prediction loop
# Train & predict with each model
preds_dict = {}
for key in models_dict.keys():
# Retrieve model
model = models_dict[key]
# Fit dummy classifier
if key == "Dummy":
model.fit(x_train, y_train)
# Fit NN classifier
elif key == "Neural net":
# Apply scikit preprocessing pipeline
x_tr = pipe_process.fit_transform(x_train, y_train)
x_test1 = pipe_process.transform(x_test)
# Create train & test Datasets, dataloaders
train_data = TrainDataset(x_tr, y_train)
test_data = TestDataset(x_test1)
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 1024, num_workers = 0, shuffle = True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size = len(test_data), num_workers = 0, shuffle = False)
# Create Lightning Trainer
trainer = pl.Trainer(
max_epochs = 17, # Best n. of epochs from validation
log_every_n_steps = 5, # The default is 50, but there are fewer than 50 training steps (batches) per epoch
accelerator = "gpu", devices = "auto", precision = "16-mixed", # GPU training
logger = True,
enable_progress_bar = True, # Display training progress bar
enable_checkpointing = True # Save model checkpoint
)
# Train model
trainer.fit(model, train_loader)
# Fit scikit-learn classifiers
else:
# Create unique sample weights argument to be passed into pipeline.fit
kwargs = {model.steps[-1][0] + "__sample_weight": sample_weight_train}
# Fit pipeline & model
model.fit(x_train, y_train, **kwargs)
# Predict positive class probabilities with NN
if key == "Neural net":
y_prob = trainer.predict(model_nn, test_loader)
# Convert list of float16 Torch tensors to single float32 np.array
preds_dict[key] = np.float32(y_prob[0].numpy().reshape(1, -1)[0])
# Predict positive class probabilities with scikit-learn classifiers
else:
y_prob = model.predict_proba(x_test)
y_prob = np.array([x[1] for x in y_prob]) # Retrieve only the positive class probabilities
preds_dict[key] = y_prob
The first set of metrics we’ll calculate are all dependent on the threshold probability: Precision, recall and F1 score.
-
The threshold probability is the minimum positive class probability required to classify an observation as positive. Numerous classification metrics take changing values depending on the threshold used to calculate them.
-
For binary classification, a default threshold of 0.5 is often assumed, but it’s important to use the ideal threshold depending on the problem, especially with a major class imbalance.
-
Precision refers to the fraction of positive class predictions that were correct (true positives / (true positives + false positives)).
-
Recall refers to the fraction of positive cases correctly found, out of all positive cases in the data (true positives / (true positives + false negatives)).
-
There is a natural tradeoff between precision and recall:
-
A higher threshold probability naturally means fewer cases are classified as positive. This generally results in higher precision and lower recall.
-
A lower threshold probability naturally means more cases are classified as positive. This generally results in higher recall and lower precision.
-
-
The F1 score is the harmonic mean of precision and recall at a given threshold probability. It aims to summarize both metrics at a given threshold. By default, precision and recall are weighted equally, but it’s possible to calculate differently weighted versions.
-
We’ll calculate precision, recall and F1 score at various threshold probabilities, for each model. We’ll retrieve the “optimal” threshold probability that maximizes F1 score, and the precision & recall values at this point.
- We will plot all three scores against the threshold probabilities to visualize the relationship between them.
-
We won’t consider plain accuracy, as it can be very misleading with a large class imbalance. To give an extreme example, simply predicting the negative class all the time will give us an accuracy of roughly 88%, as that is the ratio of negative cases in our data.
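The small standalone sketch below (with made-up predictions) illustrates the tradeoff described above: raising the threshold increases precision and lowers recall for the same set of probability predictions.
# Standalone sketch: precision, recall & F1 at two different thresholds
import numpy as np

y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.1, 0.2, 0.3, 0.4, 0.55, 0.7, 0.45, 0.6, 0.8, 0.9])

for threshold in [0.5, 0.65]:
    y_pred = (y_prob_demo >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true_demo == 1))
    fp = np.sum((y_pred == 1) & (y_true_demo == 0))
    fn = np.sum((y_pred == 0) & (y_true_demo == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(threshold, round(precision, 2), round(recall, 2), round(f1, 2))
# Threshold 0.5: precision 0.6, recall 0.75 | threshold 0.65: precision 0.67, recall 0.5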
Show dynamic metric calculations
# Retrieve F1 scores at different threshold probabilities for each model
scores_f1 = {}
scores_f1_best = {}
scores_precision = {}
scores_precision_best = {}
scores_recall = {}
scores_recall_best = {}
threshold_probs = {}
threshold_probs_best = {}
for key in preds_dict.keys():
# Retrieve the precision & recall pairs at various thresholds, calculate F1
# scores from them
precision, recall, thresholds = precision_recall_curve(y_test, preds_dict[key])
f1_scores = 2 * recall * precision / (recall + precision)
# Add results to respective dictionaries
scores_f1[key] = f1_scores
scores_f1_best[key] = max(f1_scores)
scores_precision[key] = precision
scores_precision_best[key] = precision[np.argmax(f1_scores)]
scores_recall[key] = recall
scores_recall_best[key] = recall[np.argmax(f1_scores)]
threshold_probs[key] = thresholds
threshold_probs_best[key] = thresholds[np.argmax(f1_scores)]
Next, we will calculate a few metrics not dependent on the threshold probability, as a “summary” of overall model performance.
-
Average precision is also known as the PRAUC: The area under the precision-recall curve. It takes a value between 0 and 1, higher being better. We’ll explain and discuss this metric later with the precision-recall curve plot.
-
Brier score is a probability scoring metric that compares predicted probabilities with actual class labels. Similar to log loss, a lower value is better. Unlike log loss, it takes a value between 0 and 1.
-
We’ll use the class weighted version as we did with our training loss metrics. The scores will be misleading without class weights due to the imbalance, as they will be dominated by the scores for the majority class.
-
It’s always a good idea to assess the probability predictions themselves, rather than just class predictions. They may be more important and useful than class predictions in many cases (weather forecasts, risk assessments, medical tests).
- Intuitively, it also makes sense to aim for the best probability predictions regardless of the threshold probability, as they are the raw output of our models. This is why we trained & validated our models with loss metrics.
-
-
The Brier skill score compares the Brier score of a classifier with another baseline Brier score. This time, a higher score means a bigger improvement over the baseline. We’ll calculate this with our dummy classifier as the baseline.
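As a small standalone sketch of these two scores (with made-up values): the Brier score is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes, and the Brier skill score expresses it relative to a baseline.
# Standalone sketch: Brier score & Brier skill score
import numpy as np
from sklearn.metrics import brier_score_loss

y_true_demo = np.array([1, 0, 1, 0])
y_prob_demo = np.array([0.9, 0.1, 0.4, 0.3])

brier = np.mean((y_prob_demo - y_true_demo) ** 2)
print(brier) # 0.1175, computed manually
print(brier_score_loss(y_true_demo, y_prob_demo)) # Same value from scikit-learn

# Brier skill score vs. a hypothetical baseline that always predicts 0.5
brier_ref = brier_score_loss(y_true_demo, np.repeat(0.5, 4))
print(1 - brier / brier_ref) # 0.53, higher is better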
Show static metric calculations
# Retrieve AP & Brier scores (weighted) for each model
scores_avg_precision = {}
scores_brier = {}
for key in preds_dict.keys():
# Retrieve average precision scores
avg_precision = average_precision_score(y_test, preds_dict[key])
scores_avg_precision[key] = avg_precision
# Retrieve Brier scores
brier_score = brier_score_loss(
y_test, preds_dict[key], sample_weight = sample_weight_test)
scores_brier[key] = brier_score
# Retrieve Brier skill scores for each model, with dummy classifier as reference
scores_brier_skill = {}
for key in preds_dict.keys():
brier_skill = 1 - (scores_brier[key] / scores_brier["Dummy"])
scores_brier_skill[key] = brier_skill
Now, let’s start reviewing our metrics. First, we start with a summary table.
Show code to get summary table
# Retrieve dataframe of scores
df_scores = pd.DataFrame(
{
"Avg. precision (PRAUC)": scores_avg_precision.values(),
"Brier score (class weighted)": scores_brier.values(),
"Brier skill score (class weighted)": scores_brier_skill.values(),
"Best F1 score": scores_f1_best.values(),
"Precision at best F1": scores_precision_best.values(),
"Recall at best F1": scores_recall_best.values(),
"Threshold prob. at best F1": threshold_probs_best.values()
}, index = preds_dict.keys()
)
print(df_scores)
| Model | Avg. precision (PRAUC) | Brier score (class weighted) | Brier skill score (class weighted) | Best F1 score | Precision at best F1 | Recall at best F1 | Threshold prob. at best F1 |
|---|---|---|---|---|---|---|---|
| Dummy | 0.1226 | 0.3925 | 0.0000 | 0.2184 | 0.1226 | 1.0000 | 0.1226 |
| Logistic | 0.4166 | 0.1961 | 0.5004 | 0.4060 | 0.3877 | 0.4261 | 0.5926 |
| SVM | 0.3445 | 0.2098 | 0.4654 | 0.3694 | 0.3469 | 0.3950 | 0.6205 |
| XGBoost | 0.4704 | 0.1914 | 0.5123 | 0.4318 | 0.4443 | 0.4199 | 0.6399 |
| Neural net | 0.4376 | 0.2807 | 0.2848 | 0.4094 | 0.4124 | 0.4063 | 0.2233 |
All models are a large improvement over the baseline classifier. XGBoost seems to perform the best overall, in almost all summary metrics.
- The NN model follows XGBoost in average precision, and has the second highest F1 score at the "optimal" threshold, but also the worst Brier score.
  - We calculated the Brier scores by fully balancing the contributions of each class with class weights. All models except the NN were trained in the same fashion, but the NN was trained with less weight on the positive class.
  - This results in a "worse" weighted Brier score for the NN, though it would likely have the best unweighted Brier score. Which score matters more for assessing performance depends on the use case.
  - We also see the "optimal" threshold is much lower for the NN model than for the class weighted models. We'll discuss this in more detail with the predicted probabilities plot.
- The logistic regression model comes next overall, and beats the other models in recall at the optimal threshold probability. It's likely a very efficient alternative to the XGBoost and NN models.
- The SVM model performs considerably worse than the other models, which could have several reasons: the kernel approximation we performed may not be well suited to this feature set, and the approximated probability predictions may not be very reliable. Still, it's likely an effective model on its own.
We can visualize the tradeoff between precision and recall with the precision-recall curve (PRC) plot.
Show code to plot PRC curves
# Plot precision-recall curves
fig, ax = plt.subplots()
for key in preds_dict.keys():
_ = PrecisionRecallDisplay.from_predictions(y_test, preds_dict[key], name = key, ax = ax)
_ = plt.title("Precision-recall curves of classifiers")
_ = plt.legend(loc = "upper right")
plt.show()
plt.close("all")
In the plot above, each curve starts at a threshold probability of 1 on the left, and ends at a threshold probability of 0 on the right. As a consequence, precision drops and recall increases as we move to the right.
- A very skilled classifier would lose very little precision as it increases recall: its curve would stay close to the top right corner.
- Calculating the area under the precision-recall curve (PRAUC) summarizes this tradeoff, and the overall model performance across all thresholds. This is equivalent to the average precision metric we calculated previously.
- The XGBoost model has the best PRC curve at pretty much any threshold probability. The NN model outperforms the logistic model at lower thresholds, while the logistic model outperforms the NN model at some higher threshold values.
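As a quick check of this equivalence, the average precision can be recomputed directly from the precision-recall pairs as a step-wise sum. A small sketch for the XGBoost model, using the objects defined above:
# Average precision as the step-wise area under the PR curve; this should
# closely match the average_precision_score value computed earlier
prec_xgb, rec_xgb, _ = precision_recall_curve(y_test, preds_dict["XGBoost"])
prauc_stepwise = -np.sum(np.diff(rec_xgb) * prec_xgb[:-1])
print(prauc_stepwise, scores_avg_precision["XGBoost"])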
Another way to look at the tradeoff between precision and recall is to plot both, along with the F1 scores, across various threshold probabilities.
Show code to get F1 plots dataframes
# Get dataframes for F1 score - threshold prob plots
# Logistic
df_f1_logistic = pd.DataFrame(
{"F1 score": scores_f1["Logistic"][:-1], # N. scores = N. thresholds + 1
"Precision": scores_precision["Logistic"][:-1],
"Recall": scores_recall["Logistic"][:-1],
"Threshold prob.": threshold_probs["Logistic"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# SVM
df_f1_svm = pd.DataFrame(
{"F1 score": scores_f1["SVM"][:-1],
"Precision": scores_precision["SVM"][:-1],
"Recall": scores_recall["SVM"][:-1],
"Threshold prob.": threshold_probs["SVM"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# XGBoost
df_f1_xgb = pd.DataFrame(
{"F1 score": scores_f1["XGBoost"][:-1],
"Precision": scores_precision["XGBoost"][:-1],
"Recall": scores_recall["XGBoost"][:-1],
"Threshold prob.": threshold_probs["XGBoost"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
# NN
df_f1_nn = pd.DataFrame(
{"F1 score": scores_f1["Neural net"][:-1],
"Precision": scores_precision["Neural net"][:-1],
"Recall": scores_recall["Neural net"][:-1],
"Threshold prob.": threshold_probs["Neural net"]
}
).melt(
value_vars = ["F1 score", "Precision", "Recall"],
var_name = "Metric",
value_name = "Score",
id_vars = "Threshold prob."
)
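The four blocks above are nearly identical; the same long dataframes could also be built in a loop keyed by model name. A sketch of this alternative (the plotting code below keeps using the explicit dataframes):
# Equivalent, more compact construction of the same long dataframes
df_f1_dict = {}
for key in ["Logistic", "SVM", "XGBoost", "Neural net"]:
    df_f1_dict[key] = pd.DataFrame({
        "F1 score": scores_f1[key][:-1],  # N. scores = N. thresholds + 1
        "Precision": scores_precision[key][:-1],
        "Recall": scores_recall[key][:-1],
        "Threshold prob.": threshold_probs[key]
    }).melt(
        value_vars = ["F1 score", "Precision", "Recall"],
        var_name = "Metric", value_name = "Score", id_vars = "Threshold prob.")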
Show code to plot F1 scores
# Plot F1 score - threshold prob. plots
fig, ax = plt.subplots(4, sharex = True, sharey = True)
_ = fig.suptitle("F1 - precision - recall scores across threshold probabilities")
# Logistic
_ = sns.lineplot(
ax = ax[0],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_logistic, legend = False)
_ = ax[0].set_title("Logistic")
# SVM
_ = sns.lineplot(
ax = ax[1],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_svm, legend = False)
_ = ax[1].set_title("SVM")
# XGBoost
_ = sns.lineplot(
ax = ax[2],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_xgb, legend = False)
_ = ax[2].set_title("XGBoost")
# NN
_ = sns.lineplot(
ax = ax[3],
x = "Threshold prob.", y = "Score", hue = "Metric",
data = df_f1_nn, legend = True)
_ = ax[3].set_title("Neural net")
plt.show()
plt.close("all")
We see a major difference between the class weighted models, and the NN model trained with focal loss.
- The class weighted models reach their maximum F1 scores around threshold probabilities of 0.6 - 0.65. Recall declines and precision increases mainly after a threshold of 0.3.
- In contrast, the NN model reaches its maximum F1 score around a threshold probability of 0.22. Recall sharply declines and precision sharply increases after a threshold of roughly 0.1.
- This suggests the class weighted models assign much higher positive class probabilities to their predictions compared to the NN model. We'll see this clearly in the next plot.
- In a real-life application, a plot like the one above can be used to see how each score changes with the threshold, and to choose a threshold depending on which score matters most for the problem at hand; a small example of such a threshold selection is sketched below.
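For instance, we could pick the threshold that maximizes recall while keeping precision at or above a required level. A minimal sketch for the XGBoost model, using the dictionaries computed earlier (the 0.5 precision floor is an arbitrary example):
# Sketch: choose the threshold that maximizes recall subject to a minimum
# precision requirement
min_precision = 0.5
prec = scores_precision["XGBoost"][:-1]  # drop the final point, which has no threshold
rec = scores_recall["XGBoost"][:-1]
thresh = threshold_probs["XGBoost"]
candidate_recalls = np.where(prec >= min_precision, rec, -np.inf)
best_idx = np.argmax(candidate_recalls)
print(thresh[best_idx], prec[best_idx], rec[best_idx])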
Let’s look at the distributions of positive class probabilities predicted by our models to better see the difference between the class weighted models and the focal loss NN model.
Show code to get predicted prob. dataframes
# Get dataframes for predicted probability plots
# Logistic
df_preds_logistic = pd.DataFrame({
"Prob. predictions": preds_dict["Logistic"],
"Actual labels": y_test,
})
# SVM
df_preds_svm = pd.DataFrame({
"Prob. predictions": preds_dict["SVM"],
"Actual labels": y_test,
})
# XGBoost
df_preds_xgb = pd.DataFrame({
"Prob. predictions": preds_dict["XGBoost"],
"Actual labels": y_test,
})
# NN
df_preds_nn = pd.DataFrame({
"Prob. predictions": preds_dict["Neural net"],
"Actual labels": y_test,
})
Show code to plot predicted prob. distributions
# Plot predicted probability distributions of classifiers
fig, ax = plt.subplots(4, sharex = True, sharey = True)
_ = fig.suptitle("Distributions of positive class probability predictions")
# Logistic
_ = sns.histplot(
ax = ax[0],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_logistic,
legend = False)
_ = ax[0].set_title("Logistic")
_ = ax[0].set_ylabel("")
# SVM
_ = sns.histplot(
ax = ax[1],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_svm,
legend = False)
_ = ax[1].set_title("SVM")
_ = ax[1].set_ylabel("")
# XGBoost
_ = sns.histplot(
ax = ax[2],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_xgb,
legend = False)
_ = ax[2].set_title("XGBoost")
_ = ax[2].set_ylabel("")
# NN
_ = sns.histplot(
ax = ax[3],
x = "Prob. predictions",
hue = "Actual labels",
multiple = "stack",
data = df_preds_nn,
legend = True)
_ = ax[3].set_title("Neural net")
_ = ax[3].set_xlabel("Probability predictions for positive class")
_ = ax[3].set_ylabel("N. of times predicted")
plt.show()
plt.close("all")
In the plot above, we have a stacked histogram of the predicted positive class probabilities, colored by their actual target values.
- A very skilled classifier would separate the probability predictions of the two classes as much as possible:
  - The predicted probabilities for negative cases (blue bins) would be mostly gathered to the left, close to 0.
  - The predicted probabilities for positive cases (orange bins) would be mostly gathered to the right, close to 1.
- The class weighted models predict relatively higher positive class probabilities for all cases, including the negatives.
  - The probabilities for the negative cases are close to normally distributed, roughly centered around 0.35 - 0.4.
    - The approximated probabilities for the SVM model are not smoothly distributed because of the isotonic calibration we used. Using the sigmoid method results in similar, normal-shaped distributions, just like the logistic and XGBoost predictions.
  - The probabilities for the positive cases are generally higher, but more uniformly distributed. Many are above 0.8, but many others fall between 0.2 - 0.6 and overlap with the negative cases.
  - This suggests the models can easily classify the most extreme positive cases, but have trouble telling the less obvious positives apart from the negatives.
  - Because they were trained with class weights, these models are "tuned" to predict relatively high positive probabilities overall. This is why their "optimal" threshold probabilities are much higher.
- The probability predictions of the NN model are distributed quite differently: the vast majority of cases have very low positive class probabilities, mostly below 0.2.
  - The probabilities for the negative cases are right skewed, with a median of roughly 0.1.
  - The positive cases are spread fairly uniformly, but with a considerable number of observations at the right tail: "outlier" positive cases with predicted probabilities well above 0.8. These are likely the most extreme and easily identified positive cases.
  - This explains why the "optimal" threshold probability is much lower: the NN model, trained with focal loss, puts less weight on the positive cases and predicts lower positive probabilities across the board, compared to the class weighted models.
- Both types of probability predictions likely have their pros and cons, and the best approach depends on the use case.
  - The probabilities predicted by the class weighted models could be considered "unrealistic", as they are tuned to be relatively high across the board.
    - For this problem, it's unlikely that most cars have a moderately high risk of being kicks. On the other hand, the XGBoost model clearly performs the best overall in terms of classification performance.
  - In contrast, the probabilities predicted by the focal loss NN model could be considered more realistic, as they are low for the vast majority of cars, and much higher for a small number of them.
    - Intuitively, we'd expect most cars to carry a small risk of being kicks, with a few high-risk ones. The NN model's probability predictions also seem to separate the riskiest cars from the rest to a higher degree.

The classification models output probability predictions of class membership, and it's up to us to decide how to use these probabilities for our real-life problem. We'll illustrate this with a simple hypothetical business case.
In our scenario, we evaluate which cars to purchase from our testing data, based on their predicted probabilities of being kicks.
- We'll assume we make a profit from purchasing good cars, and incur a loss from purchasing kicks.
- We'll aim to maximize our equity $E$ by changing two decision variables: the number of cars we intend to purchase, $N$, and the threshold probability $t$ for classifying a car as a kick.
  - We'll sort the cars in ascending order of their predicted kick probabilities $p$.
  - We'll take the top $N$ cars, purchase each one if $p < t$, and pass on it if $p \ge t$.
- If we purchase a good car, we'll assume the car was worth what we paid, and we'll earn 10% of the purchase price $C$ as profit, i.e. a return on our investment from using it.
  - The net change in our equity will be $-C + C + 0.1C = 0.1C$.
- If we purchase a kick, we'll assume the car is useless, and we can only recover 20% of the purchase price as salvage value.
  - The net change in our equity will be $-C + 0.2C = -0.8C$.

With these assumptions, we can represent our equity value with the following expression:

$E = \sum_{i=1}^{N} (0.1 \, C_i - 0.9 \, C_i \, k_i) \, d_i$, where

- $k_i = 0$ for good cars, $k_i = 1$ for kicks,
- $d_i = 1$ if $p_i < t$, and $d_i = 0$ if $p_i \ge t$.
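As a quick sanity check of this formula with made-up numbers: one good car and one kick, both priced at $10,000 and both purchased, would change our equity by $1,000 - $8,000 = -$7,000. The same result in code (hypothetical toy values):
# Toy check of the equity formula: one good car and one kick, both priced at
# 10,000 and both below the threshold (purchased)
prices = np.array([10_000, 10_000])
kicks = np.array([0, 1])
decisions = np.array([1, 1])
print(np.sum((0.1 * prices - 0.9 * prices * kicks) * decisions))  # -7000.0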
I originally wanted to treat this as a linear programming problem, and solve for the optimal number of cars to buy and the optimal threshold probability using each model's predicted probabilities. After some research, I realized this is more complex than I expected, as the problem involves conditional constraints.
Instead, we'll perform a sensitivity analysis, calculating the equity at various combinations of our two decision variables for each model. This will still give us quasi-optimal solutions, which shouldn't be too far from the true optima.
The code for the necessary calculations is available below, explained by the code comments. We'll try threshold probabilities from 0 to 0.99 in steps of 0.01, and numbers of cars to buy from 0 up to the full test set of 14k+ cars, in steps of 100. We'll calculate an equity value for each combination, using each model's predictions. With well over 10k calculations per model, this takes several minutes on my machine; a faster, vectorized alternative is sketched after the function definition below.
Show code to get sensitivity analysis input dataframes
# Retrieve prob. predictions, target labels, purchase prices in dataframes, and sort in ascending order of predicted kick probability
# Dummy
df_dummy = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Dummy"]
})
df_dummy = df_dummy.sort_values("ProbKick", ascending = True)
# Logistic
df_logistic = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Logistic"]
})
df_logistic = df_logistic.sort_values("ProbKick", ascending = True)
# SVM
df_svm = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["SVM"]
})
df_svm = df_svm.sort_values("ProbKick", ascending = True)
# XGBoost
df_xgb = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["XGBoost"]
})
df_xgb = df_xgb.sort_values("ProbKick", ascending = True)
# Neural net
df_nn = pd.DataFrame({
"Price": x_test["VehBCost"],
"Kick": y_test,
"ProbKick": preds_dict["Neural net"]
})
df_nn = df_nn.sort_values("ProbKick", ascending = True)
Show code to define profit-loss calculation function
# Define function to calculate profit-loss, given threshold probability, number of cars to purchase and probability predictions
def calc_profit(threshold, num_purchases, df_probs):
# Retrieve arrays of prices, labels, predicted probs
price = df_probs["Price"].values
kick = df_probs["Kick"].values
prob = df_probs["ProbKick"].values
# Retrieve n. of cars to purchase
n = num_purchases
# Get vector of purchase decisions for the top N cars at this threshold prob.
decision = (prob[0:n] < threshold).astype(int)
# Calculate profit/loss for each car
profit = [
((0.1 * price[i]) - (0.9 * price[i] * kick[i])) * decision[i] for i in range(n)
]
# Return n. of cars actually purchased, total profit / loss
return sum(decision), sum(profit)
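As noted earlier, evaluating the full grid with this function is slow. One possible vectorized alternative (a sketch, not the implementation used in this report) exploits the fact that the dataframes are sorted by predicted probability: the cars purchased among the top $N$ at threshold $t$ are simply the first $\min(N, n_t)$ rows, where $n_t$ is the number of cars with predicted probability below $t$, so cumulative sums give the profit directly.
# Sketch of a faster, vectorized grid evaluation (hypothetical helper, assuming
# the same sorted dataframes as above)
def calc_profit_grid(thresholds, num_purchases, df_probs):
    price = df_probs["Price"].values
    kick = df_probs["Kick"].values
    prob = df_probs["ProbKick"].values  # assumed sorted in ascending order
    # Per-car equity change if purchased: +10% of price for good cars, -80% for kicks
    per_car = 0.1 * price - 0.9 * price * kick
    cum_profit = np.concatenate([[0.0], np.cumsum(per_car)])
    rows = []
    for t in thresholds:
        # N. of cars with predicted probability strictly below this threshold
        n_below = np.searchsorted(prob, t, side = "left")
        for n in num_purchases:
            bought = min(n, n_below)
            rows.append((t, bought, cum_profit[bought]))
    return pd.DataFrame(rows, columns = ["Threshold", "Purchases", "Profits"])
This should reproduce the same purchase counts and profit values as looping over calc_profit, while avoiding recomputing the per-car profits for every combination.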
Show code to calculate profit-loss values
# Get combinations of thresholds - n. of cars to purchase
thresholds = np.arange(0, 1, 0.01) # Threshold probabilities from 0 to 0.99, in steps of 0.01
num_buys = np.arange(0, len(y_test), 100) # Candidate numbers of cars to purchase, in steps of 100
thresholds_buys = list(product(thresholds, num_buys))
# Calculate n. of cars actually bought and profit at each threshold / n. purchases combination, for each model
# Dummy
output_dummy = [calc_profit(x, y, df_dummy) for x, y in thresholds_buys]
decisions_dummy = [x[0] for x in output_dummy]
profits_dummy = [x[1] for x in output_dummy]
# Logistic
output_logistic = [calc_profit(x, y, df_logistic) for x, y in thresholds_buys]
decisions_logistic = [x[0] for x in output_logistic]
profits_logistic = [x[1] for x in output_logistic]
# SVM
output_svm = [calc_profit(x, y, df_svm) for x, y in thresholds_buys]
decisions_svm = [x[0] for x in output_svm]
profits_svm = [x[1] for x in output_svm]
# XGBoost
output_xgb = [calc_profit(x, y, df_xgb) for x, y in thresholds_buys]
decisions_xgb = [x[0] for x in output_xgb]
profits_xgb = [x[1] for x in output_xgb]
# NN
output_nn = [calc_profit(x, y, df_nn) for x, y in thresholds_buys]
decisions_nn = [x[0] for x in output_nn]
profits_nn = [x[1] for x in output_nn]
# Make long dataframes of threshold-purchases-profit values for each model
# Dummy
df_long_dummy = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_dummy,
"Profits": profits_dummy,
"Model": "Dummy"
})
# Logistic
df_long_logistic = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_logistic,
"Profits": profits_logistic,
"Model": "Logistic"
})
# SVM
df_long_svm = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_svm,
"Profits": profits_svm,
"Model": "SVM"
})
# XGBoost
df_long_xgb = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_xgb,
"Profits": profits_xgb,
"Model": "XGBoost"
})
# Neural net
df_long_nn = pd.DataFrame({
"Threshold": [x[0] for x in thresholds_buys],
"Purchases": decisions_nn,
"Profits": profits_nn,
"Model": "Neural net"
})
# Combine long dataframes into one
df_long = pd.concat(
[df_long_dummy, df_long_logistic, df_long_svm, df_long_xgb, df_long_nn])
# Drop rows with duplicates of purchase-profit-model columns (cases where a
# probability t results in buying n number of cars, and a higher t doesn't result
# in more cars bought, or a different profit value)
df_long = df_long.drop_duplicates(["Purchases", "Profits", "Model"])
We’ll plot and compare the results of our sensitivity analysis for each model.
Show code to plot sensitivity analysis
# 2D lineplots of thresholds-purchases-profits
fig, ax = plt.subplots(3)
_ = fig.suptitle("Sensitivity analysis of classifier threshold probability, number of cars purchased and profit")
_ = sns.lineplot(
ax = ax[0],
data = df_long, x = "Threshold", y = "Profits", hue = "Model")
_ = ax[0].set_xlabel("Threshold probability")
_ = ax[0].set_ylabel("Profits, $mil")
_ = ax[0].legend(
title = "Model", bbox_to_anchor=(1.05, 1.0), fontsize="small", loc='best')
_ = sns.lineplot(
ax = ax[1],
data = df_long, x = "Threshold", y = "Purchases", hue = "Model", legend = False)
_ = ax[1].set_xlabel("Threshold probability (lowest value that results in N. cars purchased)")
_ = ax[1].set_ylabel("N. cars purchased")
_ = sns.lineplot(
ax = ax[2],
data = df_long, x = "Purchases", y = "Profits", hue = "Model", legend = False)
_ = ax[2].set_xlabel("N. cars purchased")
_ = ax[2].set_ylabel("Profits, $mil")
plt.show()
plt.close("all")
We see very similar, smooth curves for the class weighted models.
- The highest profits are reached around a threshold probability of 0.45, with 9k-10k cars purchased out of more than 14k.
- Profits increase steadily as we raise the threshold up to this point, and decline steadily afterwards.

In contrast, the curves for the NN model, trained with focal loss, look quite different.
- Profits rise sharply to their highest values around a threshold probability of 0.15, also with 9k-10k cars purchased. They then decline, plateau for a bit, and decline again.
- I believe the shape of the NN model's profit-threshold curve suggests it tells the best and worst cars apart more easily than the class weighted models do.

The most profitable model overall is XGBoost, though the NN and logistic regression come close. SVM trails behind considerably.
- The NN and logistic models may reach slightly higher profits at some sub-optimal numbers of cars purchased, so they could still be preferred to XGBoost if the number of purchases were limited.
Below, we view the quasi-optimal solutions for each model as a table.
Show code to retrieve quasi-optimal solutions
# Quasi-optimal combinations of threshold prob - n. purchases
optimal_dummy = df_long_dummy.loc[np.argmax(df_long_dummy["Profits"])]
optimal_logistic = df_long_logistic.loc[np.argmax(df_long_logistic["Profits"])]
optimal_svm = df_long_svm.loc[np.argmax(df_long_svm["Profits"])]
optimal_xgb = df_long_xgb.loc[np.argmax(df_long_xgb["Profits"])]
optimal_nn = df_long_nn.loc[np.argmax(df_long_nn["Profits"])]
# Put solutions in a dataframe
df_optimal = pd.concat([
optimal_dummy, optimal_logistic, optimal_svm, optimal_xgb, optimal_nn],
axis = 1).transpose()
# Scale profits to millions
df_optimal["Profits"] = df_optimal["Profits"] / 1e6
# Rename columns
df_optimal = df_optimal.rename({
"Threshold": "Optimal threshold prob.",
"Purchases": "N. cars purchased",
"Profits": "Profits, $mil"
}, axis = 1)
# Reorder columns
df_optimal = df_optimal[[
"Model", "Optimal threshold prob.", "N. cars purchased", "Profits, $mil"]]
print(df_optimal)
Model Optimal threshold prob. N. cars purchased Profits, $mil
1888 Dummy 0.1300 300 0.0246
6917 Logistic 0.4700 10200 3.2618
7060 SVM 0.4800 10000 2.7546
6468 XGBoost 0.4400 8800 3.3290
2269 Neural net 0.1500 9400 3.2445
We tested several classifiers, from a simple logistic regression to a neural network. We went over a typical hyperparameter tuning & performance evaluation workflow, with classification metrics suited to the problem at hand. XGBoost performed a bit better than the NN model overall, while logistic regression held up very well given its efficiency and simplicity. SVM trailed behind, but still performed decently for a highly imbalanced problem, even though we didn't pay much attention to the kernel choice. Training & tuning the logistic & SVM models with stochastic gradient descent saved a lot of time on a large dataset.
We applied class weights to the training loss functions to try and handle the class imbalance. We also tried focal loss, which was specifically designed for heavily imbalanced problems. It’s hard to say which approach is “better”, as we tried them across different algorithms. It’s clear they result in very different probability predictions, so that should be taken into consideration for real-life applications. It would be interesting to implement focal loss as a custom scikit-learn metric, and repeat this experiment by training the other models with focal loss.
Finally, my main goal with this project was to practice building & tuning a neural network model directly using PyTorch and Lightning, instead of using predefined architectures or wrappers. However, the final part of the analysis sparked my interest in linear programming and optimization problems. I plan to explore this topic further, as it seems complementary to machine learning methods.
Any comments or suggestions are welcome.