
[FEATURE] convert dask -> non-dask model #6547

Closed
pseudotensor opened this issue Dec 22, 2020 · 10 comments · Fixed by #6582

Comments

@pseudotensor
Contributor

rapidsai/cuml#3140 (comment)

Related to this general issue, I think the xgboost team does a much better job here and ensures that Dask models can be pickled, etc.

But I encountered some differences between non-dask and dask trees that make reading them incompatible. Specifically, we get poor accuracy when reading the tree structure to do predictions, but only for dask models.

I know xgboost can be used with treelite etc. Is it expected that dask models can be used too?

In general, if there is any complication, is there a way to convert a dask model (scikit-learn and raw API) into non-dask form so that non-dask tools can be used?

@trivialfis

@hcho3
Collaborator

hcho3 commented Dec 22, 2020

@pseudotensor I don't see any technical hurdle in converting a Dask model to an ordinary XGBoost model JSON file. We already have a callback function to serialize the model at every boosting iteration, and the callback should work with Dask XGBoost:

@pytest.mark.skipif(**tm.no_sklearn())
def test_callback(self, client):
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    X, y = da.from_array(X), da.from_array(y)
    cls = xgb.dask.DaskXGBClassifier(objective='binary:logistic',
                                     tree_method='hist',
                                     n_estimators=10)
    cls.client = client
    with tempfile.TemporaryDirectory() as tmpdir:
        cls.fit(X, y, callbacks=[xgb.callback.TrainingCheckPoint(
            directory=tmpdir, iterations=1, name='model')])
        for i in range(1, 10):
            assert os.path.exists(
                os.path.join(tmpdir, 'model_' + str(i) + '.json'))

@hcho3 hcho3 changed the title [FEATURE} convert dask -> non-dask model [FEATURE] convert dask -> non-dask model Dec 22, 2020
@trivialfis
Member

Could you please share the script where the predictions differ?

@trivialfis
Member

A prediction difference is a serious bug; so far I haven't been able to reproduce it. But if you have an MRE (minimal reproducible example), I will not hesitate to fix it and push another patch release.

@pseudotensor
Contributor Author

Yes, @trivialfis. I'm not saying that the predictions themselves are off; rather, the way we access the tree structure via the internal format leads to very different predictions between dask and non-dask. I'll get a repro of some kind ASAP.

@trivialfis
Member

To answer the original question: the booster returned by the dask train function is exactly the same as the single-node one. But if you want to convert a pickled model between the scikit-learn interfaces of dask and single node, I don't think that's possible at this point. They are different Python classes, and pickle records the class identity.

@pseudotensor
Contributor Author

I think it was deduced that the primary problem here is that the dask model does not use ntree_limit, so predictions can be off when trying to select the best_iteration model. If that is solved, this issue is no longer needed, so I'll close. Thanks!

@pseudotensor
Contributor Author

Actually, re-opening. The dask interface is heavily feature-incomplete, especially for the scikit-learn API. Most critically, dask cannot do pred_contribs etc., so it would be good to be able to convert to a non-dask model so that normal operations can be applied at prediction time.

@hcho3 , thanks, I'll try your suggestion.

@pseudotensor
Contributor Author

Also, the non-sklearn dask API fails for pred_contribs etc., even though it is supposed to work, so conversion also becomes important for getting that working.

@pseudotensor
Contributor Author

pseudotensor commented Jan 3, 2021

FYI this seems to work:

import pandas as pd
import xgboost as xgb
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            import pickle
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))
            X = pd.read_csv("creditcard.csv")
            print(X.columns)
            target = 'default payment next month'
            y = X[target]
            X = X.drop(target, axis=1)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            valid_X = kwargs['eval_set'][0][0]
            valid_y = kwargs['eval_set'][0][1]
            valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
            #kwargs['eval_set'] = [(valid_X, valid_y)]
            kwargs.pop('eval_set', None)
            kwargs.pop('eval_metric', None)
            model = xgb.dask.DaskXGBClassifier(**model.get_params())
            model.fit(X, y, **kwargs)
            model.get_booster().save_model('0001.model')
            bst = xgb.Booster()
            bst.load_model('0001.model')
            dpredict = xgb.DMatrix(X.compute())
            preds_contribs = bst.predict(dpredict, pred_contribs=True)
            print(preds_contribs.shape)

if __name__ == '__main__':
    fun()

and other variations of this work too, e.g. creating a sklearn model and having it load the booster:

import pandas as pd
import xgboost as xgb
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            import pickle
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))
            X = pd.read_csv("creditcard.csv")
            print(X.columns)
            target = 'default payment next month'
            y = X[target]
            X = X.drop(target, axis=1)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            valid_X = kwargs['eval_set'][0][0]
            valid_y = kwargs['eval_set'][0][1]
            valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
            #kwargs['eval_set'] = [(valid_X, valid_y)]
            kwargs.pop('eval_set', None)
            kwargs.pop('eval_metric', None)
            model = xgb.dask.DaskXGBClassifier(**model.get_params())
            model.fit(X, y, **kwargs)
            model.get_booster().save_model('0001.model')
            bst = xgb.Booster()
            bst.load_model('0001.model')
            dpredict = xgb.DMatrix(X.compute())
            preds_contribs = bst.predict(dpredict, pred_contribs=True)
            print(preds_contribs.shape)

            model2 = xgb.XGBClassifier(**model.get_params())
            model2._Booster = bst
            preds_contribs2 = model2.predict(X.compute(), pred_contribs=True)
            print(preds_contribs2.shape)

            print("here")

if __name__ == '__main__':
    fun()

@trivialfis
Member

Could you please help take a look at #6582?
