The task that led me to this issue is roughly the following: say I have three data sets D1, D2, D3. I train my first model xgb_1 on D1, refresh its leaves with D2, and then incrementally grow trees with D2 to get a model xgb_2. When I then try to refresh xgb_2 with D3, it raises an error. I did some digging and suspect it has something to do with node pruning during refresh.
To better illustrate what I ran into, I made a toy example without incremental training. I tried a few parameters so that the example shows my case more clearly. I saw the same behavior under versions 0.72.1, 0.80, and 0.90.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
param_dict = {'booster': 'gbtree',
              'objective': 'binary:logistic',
              'silent': 0,
              'eta': 0.05,
              'max_depth': 2,
              'min_child_weight': 1,
              'subsample': 1,
              'colsample_bytree': 1,
              'tree_method': 'exact',
              'eval_metric': 'logloss'}
refresh_dict = {'booster': 'gbtree',
                'objective': 'binary:logistic',
                'silent': 0,
                'eta': 0.05,
                'eval_metric': 'logloss',
                'process_type': 'update',
                'updater': 'refresh,prune'}
data = load_breast_cancer()
X = pd.DataFrame(data["data"])
X.columns = data["feature_names"]
y = data["target"]
DM_0 = xgb.DMatrix(X, label = y)
xgb_0 = xgb.train(params=param_dict,
dtrain=DM_0,
num_boost_round=5)
xgb_0.dump_model("xgb_0.txt")
I trained a model with only 5 trees of depth 2 so that we can inspect it easily:
Now if I refresh the model with exactly the data I trained it on, no pruning will happen given the hyperparameters I set ('subsample': 1, 'colsample_bytree': 1), and in that case you can refresh it as many times as you want. However, if we take a subset of the training data so that tree pruning does happen (in this specific case, the first pruning happens on the 4th tree) and then try to refresh the pruned model again:
Here is the error you will run into (note that this second refresh uses the same data as the first refresh, so this run should involve no pruning at all):
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-22-de83107b3495> in <module>()
2 dtrain = DM_1,
3 num_boost_round=5,
----> 4 xgb_model = xgb_1)
/usr/lib/python2.7/site-packages/xgboost/training.pyc in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, learning_rates)
214 evals=evals,
215 obj=obj, feval=feval,
--> 216 xgb_model=xgb_model, callbacks=callbacks)
217
218
/usr/lib/python2.7/site-packages/xgboost/training.pyc in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
72 # Skip the first update if it is a recovery step.
73 if version % 2 == 0:
---> 74 bst.update(dtrain, i, obj)
75 bst.save_rabit_checkpoint()
76 version += 1
/usr/lib/python2.7/site-packages/xgboost/core.pyc in update(self, dtrain, iteration, fobj)
1107 if fobj is None:
1108 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, ctypes.c_int(iteration),
-> 1109 dtrain.handle))
1110 else:
1111 pred = self.predict(dtrain)
/usr/lib/python2.7/site-packages/xgboost/core.pyc in _check_call(ret)
174 """
175 if ret != 0:
--> 176 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
177
178
XGBoostError: [00:29:27] /workspace/include/xgboost/./tree_model.h:234: Check failed: nodes_[nodes_[rid].LeftChild() ].IsLeaf():
Stack trace:
[bt] (0) /usr/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f0ae3aeecb4]
[bt] (1) /usr/xgboost/libxgboost.so(xgboost::tree::TreePruner::DoPrune(xgboost::RegTree&)+0x4ce) [0x7f0ae3c441ae]
[bt] (2) /usr/xgboost/libxgboost.so(xgboost::tree::TreePruner::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x87) [0x7f0ae3c47a57]
[bt] (3) /usr/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xaeb) [0x7f0ae3b747fb]
[bt] (4) /usr/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0xd65) [0x7f0ae3b75c95]
[bt] (5) /usr/xgboost/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x396) [0x7f0ae3b88556]
[bt] (6) /usr/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f0ae3aebaa5]
[bt] (7) /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0b611aadcc]
[bt] (8) /lib64/libffi.so.6(ffi_call+0x1f5) [0x7f0b611aa6f5]
The reason I believe this is caused by pruning is that the first three trees are unpruned compared to the original trees, and if we refresh only the first three trees, everything works fine:
So this is the error source I have been able to locate so far. Any ideas?
hjh1011 changed the title from "Can't refresh twice or more once pruning happens in the first time" to "Can't refresh twice or more once pruning happens in the first refresh" on Feb 10, 2020.