Loss in performance by using init_score #6723

Open
dwayne298 opened this issue Nov 15, 2024 · 1 comment

Description

I need to use init_score to provide predictions from a prior model, but I'm seeing some behaviour I don't understand.

Setup

  • Build a CV model with 20 trees.
  • Build a second CV model with 80 trees, using the same folds as the first model and feeding the first model's out-of-fold CV predictions in as init_score.
  • Early stopping is used to avoid overfitting (the second model stops at 2 trees).
  • The first model alone has a CV metric of 2197.57.
  • The resulting combined performance is 2194.64.

  • Repeat the above, but with 10 trees in the first model and 90 in the second.
  • The second model stops at 15 trees.
  • The first model alone has a CV metric of 2483.26.
  • The resulting combined performance is 2205.03.

Query

I would have expected the two situations to give similar performance: since both models use the same folds, each setup should be equivalent to building a single model with 100 trees. But I have seen over multiple examples that limiting the number of trees in the first model leads to worse results. Is there any reason or intuition for why this is the case? (A sketch of the single 100-tree baseline I have in mind is included after the reproducible example below.)

Reproducible example

import lightgbm as lgb
import pandas as pd
import numpy as np
import sklearn.model_selection as skms

# run the script a second time with num_iters changed to 10 to reproduce the second scenario
total_iters = 100
num_iters = 20

# create data
np.random.seed(5)
data = pd.DataFrame({
    "a": np.random.random(10_000),
    "b": np.random.random(10_000),
    "c": np.random.random(10_000),
    "d": np.random.random(10_000),
})
data["target"] = np.exp(5 + 3 * data["a"] + data["b"] - 2 * data["c"] + 1.5 * data["d"] + np.random.gamma(0.1, 1, 10_000))

# build first cv model
dataset = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
)
kf = skms.KFold(n_splits=3, shuffle=True, random_state=309)
kf_splits = kf.split(np.zeros(len(data)))

custom_folds = list()
for train_idx, test_idx in kf_splits:
    custom_folds.append((train_idx, test_idx))

cv_results = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)

# need cv preds to feed into second model - check my cv preds give same metric as lightgbm
print(cv_results["valid gamma_deviance-mean"])
def replicate_metrics(num_iters, model):
    list_metrics = []
    cv_preds = []
    for num_iter in range(1, num_iters + 1):
        metric_list = []
        for cv_idx, cv_fold in enumerate(custom_folds):
            mdl_temp = model.boosters[cv_idx]

            # predict from booster
            cv_preds_tmp = mdl_temp.predict(
                dataset.get_data().loc[cv_fold[1]],
                num_iteration=num_iter,
            )
            
            # gamma deviance for this fold: 2 * sum(y / mu - log(y / mu) - 1)
            tmp = data["target"].loc[cv_fold[1]] / (cv_preds_tmp + 1.0e-9)
            metric_list.append(2 * sum(tmp - np.log(tmp) - 1))
            # keep the final-iteration out-of-fold predictions to reuse as init_score
            if num_iter == num_iters:
                cv_preds.append(cv_preds_tmp)
        list_metrics.append(np.mean(metric_list))        
    
    # re-order the concatenated out-of-fold predictions back to the original row order
    cv_preds = (
        pd.DataFrame(
            {
                "idx": np.concatenate([idx[1] for idx in custom_folds]),
                "cv_pred": np.concatenate(
                    cv_preds
                ),
            }
        )
        .sort_values(by=["idx"])
        .reset_index(drop=True)
        .pop("cv_pred")
    )
        
    print(list_metrics)
    
    return cv_preds

cv_preds = replicate_metrics(len(cv_results["valid gamma_deviance-mean"]), cv_results["cvbooster"])

# second model: the gamma objective works on the log link, so init_score is
# supplied as the log of the first model's out-of-fold predictions
dataset2 = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
    init_score=np.log(cv_preds),
)

cv_results2 = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": total_iters - num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset2,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)  
print(cv_results2["valid gamma_deviance-mean"])
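
For comparison with the Query above, here is a minimal sketch of the single 100-tree baseline. It reuses data, dataset, custom_folds and total_iters from the script above and simply runs one cv call with no init_score; it is only meant to illustrate the comparison:

cv_results_baseline = lgb.cv(
    params={
        "objective": "gamma",
        "boosting_type": "gbdt",
        "n_estimators": total_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
)
print(cv_results_baseline["valid gamma_deviance-mean"][-1])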

Environment info

Package versions:
LightGBM: 4.5.0
numpy: 1.22.3
pandas: 1.4.1
sklearn: 1.1.1

Command(s) you used to install LightGBM

python -m venv venv
venv\Scripts\activate.ps1
python -m pip install -r requirements.txt
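
The requirements.txt itself isn't shown here; a hypothetical version pinning the versions listed under "Package versions" above would be roughly:

# hypothetical requirements.txt, pinned to the versions listed above
lightgbm==4.5.0
numpy==1.22.3
pandas==1.4.1
scikit-learn==1.1.1
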
@dwayne298 (Author)

I think my initial reasoning was fundamentally wrong: the init_score I'm feeding in is essentially out-of-fold ("test") predictions, rather than the in-sample scores a single 100-tree model would see on its training rows. That leads to the question: is it possible to feed in a different init_score for each fold?
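
For context, the only workaround I can currently see is to skip lgb.cv and run the folds manually with lgb.train, so that each fold's training and validation Datasets can carry their own init_score. Below is a rough sketch of that idea; fold_init_scores_train and fold_init_scores_valid are hypothetical lists of per-fold score arrays on the log scale, and I haven't verified that this matches what lgb.cv does internally:

# sketch of a possible per-fold init_score setup (not maintainer-confirmed):
# run the folds manually with lgb.train so each fold gets its own init_score
features = data.drop(["target"], axis=1)

boosters = []
for fold_idx, (train_idx, valid_idx) in enumerate(custom_folds):
    train_set = lgb.Dataset(
        data=features.iloc[train_idx],
        label=data["target"].iloc[train_idx],
        init_score=fold_init_scores_train[fold_idx],  # hypothetical per-fold scores (log scale)
    )
    valid_set = lgb.Dataset(
        data=features.iloc[valid_idx],
        label=data["target"].iloc[valid_idx],
        init_score=fold_init_scores_valid[fold_idx],  # hypothetical per-fold scores (log scale)
        reference=train_set,
    )
    booster = lgb.train(
        params={
            "objective": "gamma",
            "metric": "gamma_deviance",
            "early_stopping_round": 5,
        },
        train_set=train_set,
        num_boost_round=total_iters - num_iters,
        valid_sets=[valid_set],
    )
    boosters.append(booster)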
