
[FEATURE] convert dask -> non-dask model #6547

Closed
pseudotensor opened this issue Dec 22, 2020 · 10 comments · Fixed by #6582

Comments

@pseudotensor
Contributor

rapidsai/cuml#3140 (comment)

Related to this general issue, I think the xgboost team does a much better job here and ensures that Dask models can be pickled, etc.

But I encountered some differences between non-dask and dask trees that make reading them incompatible. Specifically, we get poor accuracy when reading the tree structure to do predictions, but only for dask models.

I know xgboost can be used with treelite etc. Is it expected that dask models can be used too?

In general, if there is any complication, is there a way to convert a dask model (scikit-learn and raw API) into non-dask form so that non-dask tools can be used?

@trivialfis

@hcho3
Collaborator

hcho3 commented Dec 22, 2020

@pseudotensor I don't see any technical hurdle in converting a Dask model to an ordinary XGBoost model JSON file. We already have a callback function to serialize the model at every boosting iteration, and the callback should work with Dask XGBoost:

@pytest.mark.skipif(**tm.no_sklearn())
def test_callback(self, client):
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    X, y = da.from_array(X), da.from_array(y)
    cls = xgb.dask.DaskXGBClassifier(objective='binary:logistic',
                                     tree_method='hist',
                                     n_estimators=10)
    cls.client = client
    with tempfile.TemporaryDirectory() as tmpdir:
        cls.fit(X, y, callbacks=[xgb.callback.TrainingCheckPoint(
            directory=tmpdir, iterations=1, name='model')])
        for i in range(1, 10):
            assert os.path.exists(
                os.path.join(tmpdir, 'model_' + str(i) + '.json'))

@hcho3 hcho3 changed the title [FEATURE} convert dask -> non-dask model [FEATURE] convert dask -> non-dask model Dec 22, 2020
@trivialfis
Member

Could you please share the script where the predictions differ?

@trivialfis
Member

A prediction difference is a serious bug; so far I haven't been able to reproduce it. But if you have an MRE (minimal reproducible example), I will not hesitate to fix it and push another patch release.

@pseudotensor
Contributor Author

Yes, @trivialfis. I'm not saying that the predictions themselves are off; rather, the way we access the tree structure via the internal format leads to very different predictions between dask and non-dask. I'll get a repro of some kind ASAP.

@trivialfis
Member

To answer the original question: the booster returned by the dask train function is exactly the same as the single-node one. But if you want to convert a pickled model between the scikit-learn interfaces of dask and single node, I don't think that's possible at this point. They are different Python classes, and pickle records the class identity.

@pseudotensor
Contributor Author

I think it was deduced that the primary problem here is that the dask model does not use ntree_limit, so predictions can be off when trying to select the best_iteration model. If that is solved, this issue is no longer needed, so I'll close. Thanks!

@pseudotensor
Contributor Author

Actually, re-opening. The dask interface is heavily feature-incomplete, especially for the scikit-learn API. Most critically, dask cannot do pred_contribs etc., so it would be good to be able to convert to a non-dask model so that normal operations can be applied at prediction time.

@hcho3 , thanks, I'll try your suggestion.

@pseudotensor
Contributor Author

Also, the non-sklearn dask API fails for pred_contribs etc., even though it is supposed to work, so conversion also becomes important for getting that working.

@pseudotensor
Contributor Author

pseudotensor commented Jan 3, 2021

FYI this seems to work:

import pandas as pd
import xgboost as xgb
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            import pickle
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))
            X = pd.read_csv("creditcard.csv")
            print(X.columns)
            target = 'default payment next month'
            y = X[target]
            X = X.drop(target, axis=1)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            valid_X = kwargs['eval_set'][0][0]
            valid_y = kwargs['eval_set'][0][1]
            valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
            #kwargs['eval_set'] = [(valid_X, valid_y)]
            kwargs.pop('eval_set', None)
            kwargs.pop('eval_metric', None)
            model = xgb.dask.DaskXGBClassifier(**model.get_params())
            model.fit(X, y, **kwargs)
            model.get_booster().save_model('0001.model')
            bst = xgb.Booster()
            bst.load_model('0001.model')
            dpredict = xgb.DMatrix(X.compute())
            preds_contribs = bst.predict(dpredict, pred_contribs=True)
            print(preds_contribs.shape)

if __name__ == '__main__':
    fun()

and other variations of this work too, e.g. creating a sklearn model and having it load the booster:

import pandas as pd
import xgboost as xgb
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            import pickle
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))
            X = pd.read_csv("creditcard.csv")
            print(X.columns)
            target = 'default payment next month'
            y = X[target]
            X = X.drop(target, axis=1)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            valid_X = kwargs['eval_set'][0][0]
            valid_y = kwargs['eval_set'][0][1]
            valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
            #kwargs['eval_set'] = [(valid_X, valid_y)]
            kwargs.pop('eval_set', None)
            kwargs.pop('eval_metric', None)
            model = xgb.dask.DaskXGBClassifier(**model.get_params())
            model.fit(X, y, **kwargs)
            model.get_booster().save_model('0001.model')
            bst = xgb.Booster()
            bst.load_model('0001.model')
            dpredict = xgb.DMatrix(X.compute())
            preds_contribs = bst.predict(dpredict, pred_contribs=True)
            print(preds_contribs.shape)

            model2 = xgb.XGBClassifier(**model.get_params())
            model2._Booster = bst
            preds_contribs2 = model2.predict(X.compute(), pred_contribs=True)
            print(preds_contribs2.shape)

            print("here")

if __name__ == '__main__':
    fun()

@trivialfis
Member

Could you please help take a look at #6582?
