I need to use init_score to provide a prior model, but I'm seeing some behaviour I don't understand.
Setup
build a cv model with 20 trees
build a second cv model with 80 trees, using the same folds as the first model and feeding the first model's cv predictions in as init_score
use early stopping to avoid overfitting (the second model stops at 2 trees)
first model alone has performance (cv metric) 2197.57
resulting combined performance is 2194.64
repeat the above, but with 10 trees in the first model and 90 trees in the second
the second model stops at 15 trees
first model alone has performance 2483.26
resulting combined performance is 2205.03
Query
I would have expected the two situations to give similar performance: since both models use the same folds, each setup should be equivalent to building a single model with 100 trees. But I have seen over multiple examples that limiting the trees in the first model leads to worse results. Is there any reason or intuition for why this is the case?
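To make that expectation concrete, here is a minimal sketch (single train set, no CV, synthetic stand-in data rather than the data below) of the equivalence I was assuming: training 80 more trees with init_score set to the first model's raw predictions should be the same as continuing the first model, because the second model's gradients are computed from the same raw scores the 21st tree would have seen.

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.random(1_000), "b": rng.random(1_000)})
y = np.exp(1 + X["a"] - X["b"] + rng.gamma(0.1, 1, 1_000))

params = {"objective": "gamma", "metric": "gamma_deviance", "verbosity": -1}

# one model with 100 trees
full = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

# 20 trees, then 80 more trees continued via init_score
stage1 = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)
# init_score must be on the raw-score (log) scale for the gamma objective
init = stage1.predict(X, raw_score=True)
stage2 = lgb.train(
    params,
    lgb.Dataset(X, label=y, init_score=init),
    num_boost_round=80,
)

# predict() does not add the training init_score back in, so combine manually;
# as I understand it, this should match full.predict(X) almost exactly
combined = np.exp(init + stage2.predict(X, raw_score=True))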
Reproducible example
import lightgbm as lgb
import pandas as pd
import numpy as np
import sklearn.model_selection as skms

# RUN SCRIPT SECOND TIME BUT CHANGING num_iters TO 10
total_iters = 100
num_iters = 20

# create data
np.random.seed(5)
data = pd.DataFrame({
    "a": np.random.random(10_000),
    "b": np.random.random(10_000),
    "c": np.random.random(10_000),
    "d": np.random.random(10_000),
})
data["target"] = np.exp(
    5 + 3 * data["a"] + data["b"] - 2 * data["c"] + 1.5 * data["d"]
    + np.random.gamma(0.1, 1, 10_000)
)
# build first cv model
dataset = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
)
kf = skms.KFold(n_splits=3, shuffle=True, random_state=309)
kf_splits = kf.split(np.zeros(len(data)))
custom_folds = list()
for train_idx, test_idx in kf_splits:
    custom_folds.append((train_idx, test_idx))
cv_results = lgb.cv(
params={
"objective": "gamma",
"boosting_type": "gbdt",
"n_estimators": num_iters,
"early_stopping": 5,
"metric": "gamma_deviance",
},
train_set=dataset,
folds=custom_folds,
stratified=False,
return_cvbooster=True,
)
# need cv preds to feed into second model - check my cv preds give same metric as lightgbm
print(cv_results["valid gamma_deviance-mean"])
def replicate_metrics(num_iters, model):
    list_metrics = []
    cv_preds = []
    for num_iter in range(1, num_iters + 1):
        metric_list = []
        for cv_idx, cv_fold in enumerate(custom_folds):
            mdl_temp = model.boosters[cv_idx]
            # predict from booster
            cv_preds_tmp = mdl_temp.predict(
                dataset.get_data().loc[cv_fold[1]],
                num_iteration=num_iter,
            )
            tmp = data["target"].loc[cv_fold[1]] / (cv_preds_tmp + 1.0e-9)
            # gamma deviance: 2 * sum(y/mu - log(y/mu) - 1)
            metric_list.append(
                2 * sum(tmp - np.log(tmp) - 1)
            )
            if num_iter == num_iters:
                cv_preds.append(cv_preds_tmp)
        list_metrics.append(np.mean(metric_list))
    cv_preds = (
        pd.DataFrame(
            {
                "idx": np.concatenate([idx[1] for idx in custom_folds]),
                "cv_pred": np.concatenate(cv_preds),
            }
        )
        .sort_values(by=["idx"])
        .reset_index(drop=True)
        .pop("cv_pred")
    )
    print(list_metrics)
    return cv_preds


cv_preds = replicate_metrics(len(cv_results["valid gamma_deviance-mean"]), cv_results["cvbooster"])
# second model
dataset2 = lgb.Dataset(
    data=data.drop(["target"], axis=1),
    label=data["target"],
    free_raw_data=False,
    # init_score expects the raw (log-scale) score for the gamma objective
    init_score=np.log(cv_preds),
)
cv_results2 = lgb.cv(
params={
"objective": "gamma",
"boosting_type": "gbdt",
"n_estimators": total_iters-num_iters,
"early_stopping": 5,
"metric": "gamma_deviance",
},
train_set=dataset2,
folds=custom_folds,
stratified=False,
return_cvbooster=True,
)
print(cv_results2["valid gamma_deviance-mean"])
I think my initial reasoning was fundamentally wrong, as the init_score I'm feeding in is essentially "test" (out-of-fold) predictions rather than the scores each fold's booster produced on its own training rows. That leads to the question: is it possible to feed in a different init_score for each fold?
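For what it's worth, one workaround I can think of (a sketch, assuming CVBooster.boosters is ordered the same way as custom_folds, and untested beyond that) is to replace the second lgb.cv call with a manual loop over the folds, so each fold's second-stage model gets its own init_score on the raw (log) scale:

X = data.drop(["target"], axis=1)
y = data["target"]
params2 = {
    "objective": "gamma",
    "boosting_type": "gbdt",
    "metric": "gamma_deviance",
}
boosters2 = []
for cv_idx, (train_idx, test_idx) in enumerate(custom_folds):
    booster1 = cv_results["cvbooster"].boosters[cv_idx]
    train_ds = lgb.Dataset(
        X.iloc[train_idx],
        label=y.iloc[train_idx],
        # each fold continues from its own first-stage booster's raw scores
        init_score=booster1.predict(X.iloc[train_idx], raw_score=True),
    )
    valid_ds = lgb.Dataset(
        X.iloc[test_idx],
        label=y.iloc[test_idx],
        init_score=booster1.predict(X.iloc[test_idx], raw_score=True),
        reference=train_ds,
    )
    boosters2.append(
        lgb.train(
            params2,
            train_set=train_ds,
            num_boost_round=total_iters - num_iters,
            valid_sets=[valid_ds],
            callbacks=[lgb.early_stopping(5)],
        )
    )

One difference from the script above: each fold then early-stops independently, rather than on the fold-averaged metric as lgb.cv does, so the stopping points may not match the numbers I reported.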
Environment info
Package versions:
LightGBM: 4.5.0
numpy: 1.22.3
pandas: 1.4.1
sklearn: 1.1.1
Command(s) you used to install LightGBM