ntree_limit not used in dask scikit-learn prediction. #6553

Closed
pseudotensor opened this issue Dec 25, 2020 · 12 comments · Fixed by #6668
Comments

@pseudotensor (Contributor)

Started from discussion here: #6547

This is just the start of a concern that dask and non-dask do not agree. I don't even know whether, even if they did agree, using the internal xgb tree structure would give the same results. So far they do not.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask import dataframe as dd
import xgboost as xgb
import numpy as np
import random


def main(client):

    # data
    import pickle
    params, X, y = pickle.load(open("bad_dask.pkl", "rb"))
    for k in ["accuracy", "debug_verbose", "disable_gpus", "early_stopping_rounds", "early_stopping_threshold",
              "ensemble_level", "interpretability", "lossguide", "model_class_name",
              "monotonicity_constraints", "score_f_name", "silent", "time_tolerance",
              "train_shape", "valid_shape"]:
        del params[k]
    print(params)
    print(X.shape)
    target = "__TARGET__"
    X[target] = y

    # dask
    np.random.seed(1234)
    random.seed(1234)
    dask_df = dd.from_pandas(X, chunksize=5000).persist()
    X = X.drop(target, axis=1)
    yd = dask_df[target]
    Xd = dask_df.drop(target, axis=1)

    dtrain = xgb.dask.DaskDMatrix(client, Xd, yd)
    output = xgb.dask.train(client,
                            params,
                            dtrain,
                            num_boost_round=params['n_estimators'])
    preds = xgb.dask.predict(client, output, dtrain)
    print(preds.compute()[0:10])

    # non-dask
    np.random.seed(1234)
    random.seed(1234)
    dtrain = xgb.DMatrix(X, y)
    output = xgb.train(params,
                       dtrain,
                       num_boost_round=params['n_estimators'])
    preds = output.predict(dtrain)
    print(preds[0:10])

    model = xgb.XGBRegressor(**params)
    model.fit(X, y)
    preds = model.predict(X)
    print(preds[0:10])


if __name__ == '__main__':
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            main(client)

gives

/home/jon/minicondadai/bin/python /home/jon/h2oai.fullcondatest/bad_dask.py
{'base_score': 0.5041183531371466, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.8, 'gamma': 0, 'gpu_id': -1, 'importance_type': 'gain', 'interaction_constraints': '', 'learning_rate': 0.05, 'max_delta_step': 0, 'max_depth': 6, 'min_child_weight': 1.0, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 30, 'n_jobs': 1, 'num_parallel_tree': 1, 'objective': 'reg:squarederror', 'random_state': 1234, 'reg_alpha': 0.0, 'reg_lambda': 1.0, 'scale_pos_weight': 1, 'subsample': 0.7, 'tree_method': 'hist', 'validate_parameters': 1, 'verbosity': None, 'num_class': 1, 'labels': None, 'time_column': None, 'encoder': None, 'tgc': None, 'pred_gap': None, 'pred_periods': None, 'target': None, 'tsp': None, 'max_bin': 256, 'grow_policy': 'depthwise', 'max_leaves': 0, 'eval_metric': 'rmse', 'seed': 1234}
(1000, 10)
[01:41:43] task [xgboost.dask]:tcp://127.0.0.1:32817 got new rank 0
[01:41:43] WARNING: /workspace/xgboost/src/learner.cc:547: 
Parameters: { importance_type, missing, n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0.43222576 0.39690694 0.5354113  0.50603116 0.5120257  0.41515645
 0.4671617  0.4762869  0.5519314  0.42856684]
[01:41:43] WARNING: /workspace/xgboost/src/learner.cc:547: 
Parameters: { importance_type, missing, n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0.46900997 0.3888386  0.5265289  0.5107001  0.55268747 0.4718765
 0.47028947 0.4690606  0.56090456 0.39229804]
[0.46900997 0.3888386  0.5265289  0.5107001  0.55268747 0.4718765
 0.47028947 0.4690606  0.56090456 0.39229804]

bad_dask.pkl.zip

So you can see that the non-dask raw API and sklearn API agree, but dask does not. I didn't show it, but I get the same result with the sklearn API for dask as with the raw API for dask.

@pseudotensor (Contributor, Author)

FYI @trivialfis

@pseudotensor (Contributor, Author) commented Dec 25, 2020

For some context, setting aside the wrong predictions for a moment: if we just read from the model state what the predictions should be based on the tree structure, that does not agree with the predictions made by dask. So the predict code itself is probably wrong.
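
Below is a minimal sketch of one way to make that comparison on a single node: reconstruct the predictions from the dumped tree structure and check them against Booster.predict. Here bst, dmat and base_score are placeholders for a trained Booster, a DMatrix and the model's base_score, and the sketch assumes the reg:squarederror objective, where the prediction is just base_score plus the sum of the leaf values (which appear in the 'Gain' column of trees_to_dataframe for leaf nodes).

import numpy as np

# Leaf index per tree for every row, plus a (tree, node) -> leaf value lookup.
leaf_idx = bst.predict(dmat, pred_leaf=True)   # shape: (n_rows, n_trees)
tree_df = bst.trees_to_dataframe()             # leaf values live in the 'Gain' column
leaves = tree_df[tree_df["Feature"] == "Leaf"].set_index(["Tree", "Node"])["Gain"]

# Reconstruct the prediction for each row directly from the tree structure.
manual = np.array([
    base_score + sum(leaves[(tree, int(node))] for tree, node in enumerate(row))
    for row in leaf_idx
])
print(manual[:10])   # compare against bst.predict(dmat)[:10]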

@pseudotensor (Contributor, Author)

But looking at our results, it seems both things are true. The tree structure is different for whatever reason, and the predictions are not consistent.

E.g. with sample weights (not the same as the above test, but similar), if one reads the tree structure for dask one gets:

  0 |  0.486334
  1 |  0.413638
  2 |  0.493245
  3 |  0.504709
  4 |  0.459897
  5 |  0.416565
  6 |  0.512489
  7 |  0.502452
  8 |  0.608652
  9 |  0.44373 
 10 |  0.645711

but python gets:

  0 | 0.485168
  1 | 0.412472
  2 | 0.48073 
  3 | 0.50981 
  4 | 0.456628
  5 | 0.415668
  6 | 0.50897 
  7 | 0.507552
  8 | 0.607485
  9 | 0.441894
 10 | 0.650524

while for non-dask they perfectly agree:

tree structure read:

  0 |  0.465502
  1 |  0.454047
  2 |  0.500453
  3 |  0.504128
  4 |  0.482939
  5 |  0.423499
  6 |  0.412935
  7 |  0.498929
  8 |  0.635505
  9 |  0.35679 
 10 |  0.57153 

python:

  0 | 0.465502
  1 | 0.454047
  2 | 0.500453
  3 | 0.504128
  4 | 0.482939
  5 | 0.423499
  6 | 0.412935
  7 | 0.498929
  8 | 0.635505
  9 | 0.35679 
 10 | 0.57153 

@pseudotensor (Contributor, Author) commented Dec 25, 2020

What I found from the above is that if I only choose the last tree (i.e. don't pass ntree_limit, or set it to 0, to self.best_iteration + 1, or to what I set as n_estimators), then I can get the tree structure and python to agree.

So while for non-dask every tree seems to agree, for dask only the last (i.e. fully cumulative) tree predictions agree between the structure of the tree and what dask says the predictions are.

I still can't get dask and non-dask to agree with each other, but at least as long as I only choose the last ntree_limit I can make dask consistent with itself. So there seems to be some problem with requesting intermediate trees, in addition to the disagreement with non-dask.
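
For reference, on the single-node API the intermediate (cumulative) predictions can be queried like this, and there they do line up with the tree structure (bst, dmat and num_round are placeholders for a trained Booster, a DMatrix and the number of boosting rounds used):

# ntree_limit=k restricts prediction to the first k trees; 0 (the default) uses all trees.
for k in range(1, num_round + 1):
    print(k, bst.predict(dmat, ntree_limit=k)[:3])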

@trivialfis (Member)

I meant prediction on the same model. For distributed training there will be slight differences caused by a few sources of error. I will try to look into your issue more deeply and draw a conclusion on whether it's a bug. But so far you don't have to worry about it too much.

@pseudotensor (Contributor, Author) commented Dec 25, 2020

A couple of things:

  1. Here it is just one worker, not distributed, right? Why shouldn't that behave like normal non-dask?

  2. There is still wrong behavior when asking for any subset of trees other than all of them. It gives wrong results, which means, e.g., that if one uses early stopping and gets best_iteration, the predictions will be wrong because that is not all trees (see the sketch below).
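
For context on point 2, here is a hedged sketch of what the non-dask sklearn wrapper effectively does when early stopping was used, and which the dask wrapper currently skips (model and X are placeholders for a fitted XGBRegressor and its input data):

import xgboost as xgb

# best_ntree_limit is set by fit() when early_stopping_rounds is used; the
# single-node wrapper falls back to it when no explicit ntree_limit is given.
limit = getattr(model, 'best_ntree_limit', 0)   # 0 means "use all trees"
preds = model.get_booster().predict(xgb.DMatrix(X), ntree_limit=limit)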

@trivialfis (Member)

@pseudotensor How do you access a subset of trees using dask?

@trivialfis (Member)

There's an inconsistency between skl and dask skl due to the use of ntree_limit: ntree_limit is not enabled in the dask interface, while it's used automatically in the non-dask interface. Right now a quick workaround is to define the EarlyStopping callback yourself and use the save_best option.
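
A hedged sketch of that workaround, assuming a client and dask collections Xd/yd plus held-out Xd_valid/yd_valid like the reproducer above, and that the installed version forwards evals and callbacks through xgb.dask.train to the underlying xgb.train:

import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, Xd, yd)
dvalid = xgb.dask.DaskDMatrix(client, Xd_valid, yd_valid)   # placeholder validation data
early_stop = xgb.callback.EarlyStopping(rounds=10, save_best=True)
output = xgb.dask.train(client,
                        {'objective': 'reg:squarederror', 'tree_method': 'hist'},
                        dtrain,
                        num_boost_round=500,
                        evals=[(dvalid, 'valid')],
                        callbacks=[early_stop])
# With save_best=True the returned booster keeps only the trees up to the best
# iteration, so plain prediction over all remaining trees no longer depends on
# ntree_limit being honored.
preds = xgb.dask.predict(client, output, dtrain)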

@trivialfis changed the title from "prediction difference between dask and non-dask" to "ntree_limit not used in dask scikit-learn prediction." on Dec 29, 2020
@trivialfis (Member)

Keeping the various behaviors consistent with skl is quite difficult for me... I think there are lots of heuristics there that I don't know about or simply can't remember.

@pseudotensor (Contributor, Author)

Ya, I know it's difficult. That's why I suggested avoiding continuing down the road where all the APIs are distinct. Better if there was not a special dask API and everything was handled internally. This would also force easier/sooner feature parity.

@trivialfis (Member) commented Jan 9, 2021

Better if there was not a special dask API and everything was handled internally. This would also force easier/sooner feature parity.

Yeah, I absolutely agree with you and wanted that so badly when I was building the dask API. However, there were a number of obstacles, from how to conditionally import dask features to supporting dask-specific features. The conditional import is easy to understand, but for distributed data structures the handling is much more complicated than on a single node.

If you look into the current dask interface code, a large amount of it is devoted to obtaining the local data without making copies; another issue is keeping the data and meta info like labels and weights consistent in both partition size and order. None of these things is present in single-node computation. The actual interface for dask skl that you use is quite thin. Another part is handling the async dask client, which means not only that there is an additional parameter single-node training doesn't need to care about, but also that there are API-specific ways of performing computation, like client.compute(data): data.compute() doesn't work in an async environment, and even if it did, the extra compute and future handling would still count.
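
A small sketch of that last point, following the async-client pattern (fit_async, dask_X and dask_y are placeholder names):

import xgboost as xgb
from dask.distributed import Client

async def fit_async(scheduler_address, dask_X, dask_y):
    async with Client(scheduler_address, asynchronous=True) as client:
        dtrain = await xgb.dask.DaskDMatrix(client, dask_X, dask_y)
        output = await xgb.dask.train(client, {'tree_method': 'hist'}, dtrain)
        preds = await xgb.dask.predict(client, output, dtrain)
        # In an async environment the computation has to go through the client;
        # preds.compute() does not work here.
        return await client.compute(preds)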

For a bit more history: I made the mistake of having DaskDMatrix at one point hold a reference to the input dask data's shape, so that I could call the shape attribute for debugging purposes and align better with the local API's num_row()/num_col(). Everything went fine until one day we discovered that, because shape[0] (the number of rows) of dask data is a lazy object, when it was passed along with DaskDMatrix via a closure the whole data set got copied and pulled into every worker, without any warning or error. So after that I have to be very careful about defining object attributes. This is also the cause of that very ugly function create_fn_args.

Another possible way forward is defining yet another interface that includes the current interfaces as different backends, with the newly added global_config for backend control. But as you may see, this might turn into an even messier state, as the input space for a single interface grows exponentially due to incompatible input data structures. Even if we could define a policy on how to handle the incompatibility, the unified interface would still require us to finish the current interfaces first.

Right now I don't have any concrete plan for bringing them together other than reusing as much code and as many tests as possible, and hoping to pass the skl estimator check in the near future. The current state is that, for the skl estimator class methods, the accepted arguments are a 1-to-1 match with single-node computation. Feel free to ping me if you have suggestions; I'm open to lengthy discussion, online and offline.

@trivialfis (Member) commented Jan 9, 2021

where all the APIs are distinct.

This is the part that I'm currently addressing by aligning the dask interface with the single-node interface. I have also abstracted some configurations into reusable functions that don't touch the input data, so they can be reused.
