Can't refresh twice or more once pruning happens in the first refresh #5297

Closed
hjh1011 opened this issue Feb 10, 2020 · 0 comments · Fixed by #5335

hjh1011 commented Feb 10, 2020

The task that led me to this issue is something like this: say I have three data sets D1, D2, D3. I train my first model xgb_1 with D1, refresh its leaves with D2, and then incrementally grow trees with D2 to get a model xgb_2. When I try to refresh xgb_2 with D3, it raises an error. I did some digging and suspect it has something to do with node pruning during refresh.
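For reference, here is roughly that workflow, sketched with hypothetical slices of one data set standing in for D1, D2, D3 (not my exact code, but it shows the shape of it):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data["data"], data["target"]

# Hypothetical stand-ins for D1, D2, D3: three slices of one data set.
DM_D1 = xgb.DMatrix(X[:200], label=y[:200])
DM_D2 = xgb.DMatrix(X[200:400], label=y[200:400])
DM_D3 = xgb.DMatrix(X[400:], label=y[400:])

train_params = {'objective': 'binary:logistic', 'max_depth': 2, 'eta': 0.05}
refresh_params = dict(train_params,
                      process_type='update',
                      updater='refresh,prune')

# Train xgb_1 on D1.
xgb_1 = xgb.train(train_params, DM_D1, num_boost_round=5)
# Refresh the leaves of xgb_1 with D2 (5 existing trees -> 5 update rounds).
xgb_1_refreshed = xgb.train(refresh_params, DM_D2, num_boost_round=5,
                            xgb_model=xgb_1)
# Incrementally grow 5 more trees with D2 to get xgb_2 (10 trees total).
xgb_2 = xgb.train(train_params, DM_D2, num_boost_round=5,
                  xgb_model=xgb_1_refreshed)
# Refreshing xgb_2 with D3 is the step that errors out for me
# (whenever pruning happened in the earlier refresh).
xgb_3 = xgb.train(refresh_params, DM_D3, num_boost_round=10,
                  xgb_model=xgb_2)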

To better illustrate what I ran into, I put together a toy example without incremental training. I tuned a few parameters so that the example demonstrates the problem clearly. I saw the same behavior under versions 0.72.1, 0.80, and 0.90.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer


param_dict = {'booster': 'gbtree',
              'objective': 'binary:logistic',
              'silent': 0,
              'eta': 0.05,
              'max_depth': 2,
              'min_child_weight': 1,
              'subsample': 1,
              'colsample_bytree': 1,
              'tree_method': 'exact',
              'eval_metric': 'logloss'}

refresh_dict = {'booster': 'gbtree',
                'objective': 'binary:logistic',
                'silent': 0,
                'eta': 0.05,
                'eval_metric': 'logloss',
                'process_type': 'update',
                'updater': 'refresh,prune'}

data = load_breast_cancer()

X = pd.DataFrame(data["data"])
X.columns = data["feature_names"]
y = data["target"]

DM_0 = xgb.DMatrix(X, label=y)

xgb_0 = xgb.train(params=param_dict,
                  dtrain=DM_0,
                  num_boost_round=5)

xgb_0.dump_model("xgb_0.txt")

I trained a model with only 5 trees of depth 2 so that we can inspect it easily:

booster[0]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.135800004] yes=3,no=4,missing=3
		3:leaf=0.0958456993
		4:leaf=-0.0200000014
	2:[mean texture<16.1100006] yes=5,no=6,missing=5
		5:leaf=0.00476190494
		6:leaf=-0.095480226
booster[1]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.135800004] yes=3,no=4,missing=3
		3:leaf=0.0913208351
		4:leaf=-0.0190817993
	2:[worst texture<19.9099998] yes=5,no=6,missing=5
		5:leaf=0.00504727243
		6:leaf=-0.0910744742
booster[2]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.160299987] yes=3,no=4,missing=3
		3:leaf=0.0828778669
		4:leaf=-0.0650591701
	2:[mean concavity<0.0721400008] yes=5,no=6,missing=5
		5:leaf=-0.00284839468
		6:leaf=-0.0886432752
booster[3]:
0:[worst concave points<0.142349988] yes=1,no=2,missing=1
	1:[worst area<957.450012] yes=3,no=4,missing=3
		3:leaf=0.0806099474
		4:leaf=-0.0552328601
	2:[worst area<729.549988] yes=5,no=6,missing=5
		5:leaf=0.0112194559
		6:leaf=-0.0840485916
booster[4]:
0:[worst perimeter<105.949997] yes=1,no=2,missing=1
	1:[worst concave points<0.158899993] yes=3,no=4,missing=3
		3:leaf=0.0797721893
		4:leaf=-0.0421395414
	2:[mean concave points<0.0488649979] yes=5,no=6,missing=5
		5:leaf=0.0206357557
		6:leaf=-0.0770028159

Now, if I refresh the model with exactly the data I trained it on, no pruning will happen given the hyperparameters I set ('subsample': 1, 'colsample_bytree': 1), and in that case you can refresh it as many times as you want:

xgb_1 = xgb.train(params=refresh_dict,
                  dtrain=DM_0,
                  num_boost_round=5,
                  xgb_model=xgb_0)

xgb_2 = xgb.train(params=refresh_dict,
                  dtrain=DM_0,
                  num_boost_round=5,
                  xgb_model=xgb_1)

However, if we take a subset of the training data such that tree pruning happens:

DM_1 = DM_0.slice(list(range(350, 450)))
xgb_1 = xgb.train(params=refresh_dict,
                  dtrain=DM_1,
                  num_boost_round=5,
                  xgb_model=xgb_0)

xgb_1.dump_model("xgb_1.txt")

In this specific case, the first pruning happens in the 4th tree (booster[3]):

booster[0]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.135800004] yes=3,no=4,missing=3
		3:leaf=0.0942857191
		4:leaf=0.00909090973
	2:[mean texture<16.1100006] yes=5,no=6,missing=5
		5:leaf=0
		6:leaf=-0.0785714313
booster[1]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.135800004] yes=3,no=4,missing=3
		3:leaf=0.0900324881
		4:leaf=0.00880176853
	2:[worst texture<19.9099998] yes=5,no=6,missing=5
		5:leaf=0
		6:leaf=-0.0761579573
booster[2]:
0:[worst radius<16.7950001] yes=1,no=2,missing=1
	1:[worst concave points<0.160299987] yes=3,no=4,missing=3
		3:leaf=0.0844815597
		4:leaf=0
	2:[mean concavity<0.0721400008] yes=5,no=6,missing=5
		5:leaf=0
		6:leaf=-0.0791416466
booster[3]:
0:[worst concave points<0.142349988] yes=1,no=2,missing=1
	1:leaf=0.0791857168
	2:[worst area<729.549988] yes=5,no=6,missing=5
		5:leaf=0
		6:leaf=-0.0771616027
booster[4]:
0:[worst perimeter<105.949997] yes=1,no=2,missing=1
	1:[worst concave points<0.158899993] yes=3,no=4,missing=3
		3:leaf=0.078185834
		4:leaf=0
	2:[mean concave points<0.0488649979] yes=5,no=6,missing=5
		5:leaf=0.0301180389
		6:leaf=-0.0634451061

So now, if you try to refresh the model again with:

xgb_2 = xgb.train(params=refresh_dict,
                  dtrain=DM_1,
                  num_boost_round=5,
                  xgb_model=xgb_1)

Here is the error you will run into (note that we are refreshing with the same data we used in the first refresh, so this run should involve no pruning at all):

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-22-de83107b3495> in <module>()
      2                   dtrain = DM_1,
      3                   num_boost_round=5,
----> 4                   xgb_model = xgb_1)

/usr/lib/python2.7/site-packages/xgboost/training.pyc in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, learning_rates)
    214                            evals=evals,
    215                            obj=obj, feval=feval,
--> 216                            xgb_model=xgb_model, callbacks=callbacks)
    217 
    218 

/usr/lib/python2.7/site-packages/xgboost/training.pyc in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
     72         # Skip the first update if it is a recovery step.
     73         if version % 2 == 0:
---> 74             bst.update(dtrain, i, obj)
     75             bst.save_rabit_checkpoint()
     76             version += 1

/usr/lib/python2.7/site-packages/xgboost/core.pyc in update(self, dtrain, iteration, fobj)
   1107         if fobj is None:
   1108             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, ctypes.c_int(iteration),
-> 1109                                                     dtrain.handle))
   1110         else:
   1111             pred = self.predict(dtrain)

/usr/lib/python2.7/site-packages/xgboost/core.pyc in _check_call(ret)
    174     """
    175     if ret != 0:
--> 176         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    177 
    178 

XGBoostError: [00:29:27] /workspace/include/xgboost/./tree_model.h:234: Check failed: nodes_[nodes_[rid].LeftChild() ].IsLeaf(): 
Stack trace:
  [bt] (0) /usr/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f0ae3aeecb4]
  [bt] (1) /usr/xgboost/libxgboost.so(xgboost::tree::TreePruner::DoPrune(xgboost::RegTree&)+0x4ce) [0x7f0ae3c441ae]
  [bt] (2) /usr/xgboost/libxgboost.so(xgboost::tree::TreePruner::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x87) [0x7f0ae3c47a57]
  [bt] (3) /usr/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xaeb) [0x7f0ae3b747fb]
  [bt] (4) /usr/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*)+0xd65) [0x7f0ae3b75c95]
  [bt] (5) /usr/xgboost/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x396) [0x7f0ae3b88556]
  [bt] (6) /usr/xgboost/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f0ae3aebaa5]
  [bt] (7) /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0b611aadcc]
  [bt] (8) /lib64/libffi.so.6(ffi_call+0x1f5) [0x7f0b611aa6f5]

The reason I believe this is caused by pruning is that the first 3 trees are unpruned compared to the original trees, and if we refresh only the first 3 trees, everything works fine:

xgb_2 = xgb.train(params=refresh_dict,
                  dtrain=DM_1,
                  num_boost_round=3,
                  xgb_model=xgb_1)

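(For reference, I located the pruned trees by comparing the text dumps of the two models. A quick sketch of that check, assuming xgb_0 and xgb_1 from above:)

# Each entry of get_dump() is one tree's text dump, one node per line,
# so a tree whose dump got shorter lost nodes, i.e. it was pruned.
dump_before = xgb_0.get_dump()
dump_after = xgb_1.get_dump()
for i, (before, after) in enumerate(zip(dump_before, dump_after)):
    if after.count('\n') < before.count('\n'):
        print("tree %d was pruned: %d -> %d nodes"
              % (i, before.count('\n'), after.count('\n')))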
So far, this is the closest I can get to locating the source of the error. Any ideas?
