Loss in performance by using init_score #6723

Open
dwayne298 opened this issue Nov 15, 2024 · 1 comment

Description

I need to use init_score to provide predictions from a prior model, but I'm seeing some behaviour I don't understand.

Setup

  • Build a CV model with 20 trees.
  • Build a second CV model with 80 trees, using the same folds as the first model and feeding the first model's out-of-fold CV predictions in as init_score.
  • Early stopping is used to avoid overfitting (the second model stops at 2 trees).
  • The first model alone has a CV metric of 2197.57.
  • The resulting combined performance is 2194.64.

  • Repeat the above, but with 10 trees in the first model and 90 in the second.
  • The second model stops at 15 trees.
  • The first model alone has a CV metric of 2483.26.
  • The resulting combined performance is 2205.03.

Query

I would have expected the two situations to give similar performance: since both models use the same folds, each setup should be equivalent to building a single model with 100 trees. But I have seen over multiple examples that limiting the number of trees in the first model leads to worse results. Is there any reason or intuition for why this is the case? (A sketch of the single 100-tree baseline I have in mind is included after the reproducible example below.)

Reproducible example

import lightgbm as lgb
import pandas as pd
import numpy as np
import sklearn.model_selection as skms

# run the script a second time with num_iters changed to 10 to reproduce the second scenario
total_iters = 100
num_iters = 20

# create data
np.random.seed(5)
data = pd.DataFrame({
    "a": np.random.random(10_000),
    "b": np.random.random(10_000),
    "c": np.random.random(10_000),
    "d": np.random.random(10_000),
})
data["target"] = np.exp(5 + 3 * data["a"] + data["b"] - 2 * data["c"] + 1.5 * data["d"] + np.random.gamma(0.1, 1, 10_000))

# build first cv model
dataset = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
)
kf = skms.KFold(n_splits=3, shuffle=True, random_state=309)
kf_splits = kf.split(np.zeros(len(data)))

custom_folds = list()
for train_idx, test_idx in kf_splits:
    custom_folds.append((train_idx, test_idx))

cv_results = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)

# need cv preds to feed into second model - check my cv preds give same metric as lightgbm
print(cv_results["valid gamma_deviance-mean"])
def replicate_metrics(num_iters, model):
    list_metrics = []
    cv_preds = []
    for num_iter in range(1, num_iters + 1):
        metric_list = []
        for cv_idx, cv_fold in enumerate(custom_folds):
            mdl_temp = model.boosters[cv_idx]

            # predict from booster
            cv_preds_tmp = mdl_temp.predict(
                dataset.get_data().loc[cv_fold[1]],
                num_iteration=num_iter,
            )
            
            # gamma deviance for this fold: 2 * sum(y / mu - log(y / mu) - 1)
            tmp = data["target"].loc[cv_fold[1]] / (cv_preds_tmp + 1.0e-9)
            metric_list.append(2 * sum(tmp - np.log(tmp) - 1))
            # keep the final-iteration out-of-fold predictions to reuse as init_score
            if num_iter == num_iters:
                cv_preds.append(cv_preds_tmp)
        list_metrics.append(np.mean(metric_list))        
    
    # re-order the concatenated out-of-fold predictions back to the original row order
    cv_preds = (
        pd.DataFrame(
            {
                "idx": np.concatenate([idx[1] for idx in custom_folds]),
                "cv_pred": np.concatenate(
                    cv_preds
                ),
            }
        )
        .sort_values(by=["idx"])
        .reset_index(drop=True)
        .pop("cv_pred")
    )
        
    print(list_metrics)
    
    return cv_preds

cv_preds = replicate_metrics(len(cv_results["valid gamma_deviance-mean"]), cv_results["cvbooster"])

# second model: the gamma objective works on the log link, so init_score is
# supplied as the log of the first model's out-of-fold predictions
dataset2 = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
    init_score=np.log(cv_preds),
)

cv_results2 = lgb.cv(
    params={
        "objective": "gamma", 
        "boosting_type": "gbdt", 
        "n_estimators": total_iters - num_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset2,
    folds=custom_folds,
    stratified=False,
    return_cvbooster=True,
)  
print(cv_results2["valid gamma_deviance-mean"])
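
For comparison with the Query above, here is a minimal sketch of the single 100-tree baseline. It reuses data, dataset, custom_folds and total_iters from the script above and simply runs one cv call with no init_score; it is only meant to illustrate the comparison:

cv_results_baseline = lgb.cv(
    params={
        "objective": "gamma",
        "boosting_type": "gbdt",
        "n_estimators": total_iters,
        "early_stopping": 5,
        "metric": "gamma_deviance",
    },
    train_set=dataset,
    folds=custom_folds,
    stratified=False,
)
print(cv_results_baseline["valid gamma_deviance-mean"][-1])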

Environment info

Package versions:
LightGBM: 4.5.0
numpy: 1.22.3
pandas: 1.4.1
sklearn: 1.1.1

Command(s) you used to install LightGBM

python -m venv venv
venv\Scripts\activate.ps1
python -m pip install -r requirements.txt
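
The requirements.txt itself isn't shown here; a hypothetical version pinning the versions listed under "Package versions" above would be roughly:

# hypothetical requirements.txt, pinned to the versions listed above
lightgbm==4.5.0
numpy==1.22.3
pandas==1.4.1
scikit-learn==1.1.1
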
@dwayne298 (Author)

I think my initial reasoning was fundamentally wrong: the init_score I'm feeding in is essentially out-of-fold ("test") predictions, rather than the in-sample scores a single 100-tree model would see on its training rows. That leads to the question: is it possible to feed in a different init_score for each fold?
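
For context, the only workaround I can currently see is to skip lgb.cv and run the folds manually with lgb.train, so that each fold's training and validation Datasets can carry their own init_score. Below is a rough sketch of that idea; fold_init_scores_train and fold_init_scores_valid are hypothetical lists of per-fold score arrays on the log scale, and I haven't verified that this matches what lgb.cv does internally:

# sketch of a possible per-fold init_score setup (not maintainer-confirmed):
# run the folds manually with lgb.train so each fold gets its own init_score
features = data.drop(["target"], axis=1)

boosters = []
for fold_idx, (train_idx, valid_idx) in enumerate(custom_folds):
    train_set = lgb.Dataset(
        data=features.iloc[train_idx],
        label=data["target"].iloc[train_idx],
        init_score=fold_init_scores_train[fold_idx],  # hypothetical per-fold scores (log scale)
    )
    valid_set = lgb.Dataset(
        data=features.iloc[valid_idx],
        label=data["target"].iloc[valid_idx],
        init_score=fold_init_scores_valid[fold_idx],  # hypothetical per-fold scores (log scale)
        reference=train_set,
    )
    booster = lgb.train(
        params={
            "objective": "gamma",
            "metric": "gamma_deviance",
            "early_stopping_round": 5,
        },
        train_set=train_set,
        num_boost_round=total_iters - num_iters,
        valid_sets=[valid_set],
    )
    boosters.append(booster)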
