ntree_limit not used in dask scikit-learn prediction. #6553

Closed
pseudotensor opened this issue Dec 25, 2020 · 12 comments · Fixed by #6668
Comments

@pseudotensor (Contributor)

Started from discussion here: #6547

This is just the start of a concern that dask and non-dask do not agree. I don't even know whether, even if they did agree, using the internal xgb tree structure would give the same results. So far they do not.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask import dataframe as dd
import xgboost as xgb
import numpy as np
import random


def main(client):

    # data
    import pickle
    params, X, y = pickle.load(open("bad_dask.pkl", "rb"))
    for k in ["accuracy", "debug_verbose", "disable_gpus", "early_stopping_rounds", "early_stopping_threshold",
              "ensemble_level", "interpretability", "lossguide", "model_class_name",
              "monotonicity_constraints", "score_f_name", "silent", "time_tolerance",
              "train_shape", "valid_shape"]:
        del params[k]
    print(params)
    print(X.shape)
    target = "__TARGET__"
    X[target] = y

    # dask
    np.random.seed(1234)
    random.seed(1234)
    dask_df = dd.from_pandas(X, chunksize=5000).persist()
    X = X.drop(target, axis=1)
    yd = dask_df[target]
    Xd = dask_df.drop(target, axis=1)

    dtrain = xgb.dask.DaskDMatrix(client, Xd, yd)
    output = xgb.dask.train(client,
                            params,
                            dtrain,
                            num_boost_round=params['n_estimators'])
    preds = xgb.dask.predict(client, output, dtrain)
    print(preds.compute()[0:10])

    # non-dask
    np.random.seed(1234)
    random.seed(1234)
    dtrain = xgb.DMatrix(X, y)
    output = xgb.train(params,
                       dtrain,
                       num_boost_round=params['n_estimators'])
    preds = output.predict(dtrain)
    print(preds[0:10])

    model = xgb.XGBRegressor(**params)
    model.fit(X, y)
    preds = model.predict(X)
    print(preds[0:10])


if __name__ == '__main__':
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            main(client)

gives

/home/jon/minicondadai/bin/python /home/jon/h2oai.fullcondatest/bad_dask.py
{'base_score': 0.5041183531371466, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.8, 'gamma': 0, 'gpu_id': -1, 'importance_type': 'gain', 'interaction_constraints': '', 'learning_rate': 0.05, 'max_delta_step': 0, 'max_depth': 6, 'min_child_weight': 1.0, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 30, 'n_jobs': 1, 'num_parallel_tree': 1, 'objective': 'reg:squarederror', 'random_state': 1234, 'reg_alpha': 0.0, 'reg_lambda': 1.0, 'scale_pos_weight': 1, 'subsample': 0.7, 'tree_method': 'hist', 'validate_parameters': 1, 'verbosity': None, 'num_class': 1, 'labels': None, 'time_column': None, 'encoder': None, 'tgc': None, 'pred_gap': None, 'pred_periods': None, 'target': None, 'tsp': None, 'max_bin': 256, 'grow_policy': 'depthwise', 'max_leaves': 0, 'eval_metric': 'rmse', 'seed': 1234}
(1000, 10)
[01:41:43] task [xgboost.dask]:tcp://127.0.0.1:32817 got new rank 0
[01:41:43] WARNING: /workspace/xgboost/src/learner.cc:547: 
Parameters: { importance_type, missing, n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0.43222576 0.39690694 0.5354113  0.50603116 0.5120257  0.41515645
 0.4671617  0.4762869  0.5519314  0.42856684]
[01:41:43] WARNING: /workspace/xgboost/src/learner.cc:547: 
Parameters: { importance_type, missing, n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0.46900997 0.3888386  0.5265289  0.5107001  0.55268747 0.4718765
 0.47028947 0.4690606  0.56090456 0.39229804]
[0.46900997 0.3888386  0.5265289  0.5107001  0.55268747 0.4718765
 0.47028947 0.4690606  0.56090456 0.39229804]

bad_dask.pkl.zip

So you can see that the non-dask raw API and sklearn API agree, but dask does not. I didn't show it, but I get the same result with the sklearn API for dask as with the raw API for dask.

@pseudotensor (Contributor, Author)

FYI @trivialfis

@pseudotensor (Contributor, Author) commented Dec 25, 2020

For some context, setting aside the wrong predictions for a moment: if we just read from the model state what the predictions should be based on the tree structure, that does not agree with the predictions made by dask. So the predict code itself is probably wrong.
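
Below is a minimal sketch of one way to make that comparison on a single node: reconstruct the predictions from the dumped tree structure and check them against Booster.predict. Here bst, dmat and base_score are placeholders for a trained Booster, a DMatrix and the model's base_score, and the sketch assumes the reg:squarederror objective, where the prediction is just base_score plus the sum of the leaf values (which appear in the 'Gain' column of trees_to_dataframe for leaf nodes).

import numpy as np

# Leaf index per tree for every row, plus a (tree, node) -> leaf value lookup.
leaf_idx = bst.predict(dmat, pred_leaf=True)   # shape: (n_rows, n_trees)
tree_df = bst.trees_to_dataframe()             # leaf values live in the 'Gain' column
leaves = tree_df[tree_df["Feature"] == "Leaf"].set_index(["Tree", "Node"])["Gain"]

# Reconstruct the prediction for each row directly from the tree structure.
manual = np.array([
    base_score + sum(leaves[(tree, int(node))] for tree, node in enumerate(row))
    for row in leaf_idx
])
print(manual[:10])   # compare against bst.predict(dmat)[:10]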

@pseudotensor (Contributor, Author)

But looking at our results, it seems both things are true. The tree structure is different for whatever reason, and the predictions are not consistent.

E.g. with sample weights (not the same as the above test, but similar), if one reads the tree structure for dask one gets:

  0 |  0.486334
  1 |  0.413638
  2 |  0.493245
  3 |  0.504709
  4 |  0.459897
  5 |  0.416565
  6 |  0.512489
  7 |  0.502452
  8 |  0.608652
  9 |  0.44373 
 10 |  0.645711

but python gets:

  0 | 0.485168
  1 | 0.412472
  2 | 0.48073 
  3 | 0.50981 
  4 | 0.456628
  5 | 0.415668
  6 | 0.50897 
  7 | 0.507552
  8 | 0.607485
  9 | 0.441894
 10 | 0.650524

while for non-dask they perfectly agree:

tree structure read:

  0 |  0.465502
  1 |  0.454047
  2 |  0.500453
  3 |  0.504128
  4 |  0.482939
  5 |  0.423499
  6 |  0.412935
  7 |  0.498929
  8 |  0.635505
  9 |  0.35679 
 10 |  0.57153 

python:

  0 | 0.465502
  1 | 0.454047
  2 | 0.500453
  3 | 0.504128
  4 | 0.482939
  5 | 0.423499
  6 | 0.412935
  7 | 0.498929
  8 | 0.635505
  9 | 0.35679 
 10 | 0.57153 

@pseudotensor (Contributor, Author) commented Dec 25, 2020

What I found from the above is that if I only choose the last tree (i.e. don't pass ntree_limit, or set it to 0, to self.best_iteration + 1, or to what I set as n_estimators), then I can get the tree structure and python to agree.

So while for non-dask every tree seems to agree, for dask only the last (i.e. fully cumulative) tree predictions agree between the structure of the tree and what dask says the predictions are.

I still can't get dask and non-dask to agree with each other, but at least as long as I only choose the last ntree_limit I can make dask consistent with itself. So there seems to be some problem with requesting intermediate trees, in addition to the disagreement with non-dask.
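
For reference, on the single-node API the intermediate (cumulative) predictions can be queried like this, and there they do line up with the tree structure (bst, dmat and num_round are placeholders for a trained Booster, a DMatrix and the number of boosting rounds used):

# ntree_limit=k restricts prediction to the first k trees; 0 (the default) uses all trees.
for k in range(1, num_round + 1):
    print(k, bst.predict(dmat, ntree_limit=k)[:3])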

@trivialfis (Member)

I meant prediction on the same model. For distributed training there will be slight differences caused by a few sources of error. I will try to look into your issue more deeply and draw a conclusion on whether it's a bug. But so far you don't have to worry about it too much.

@pseudotensor (Contributor, Author) commented Dec 25, 2020

A couple of things:

  1. Here it is just one worker, not distributed, right? Why shouldn't that behave like normal non-dask?

  2. There is still wrong behavior when asking for any subset of trees other than all of them. It gives wrong results, which means, e.g., that if one uses early stopping and gets best_iteration, the predictions will be wrong because that is not all trees (see the sketch below).
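
For context on point 2, here is a hedged sketch of what the non-dask sklearn wrapper effectively does when early stopping was used, and which the dask wrapper currently skips (model and X are placeholders for a fitted XGBRegressor and its input data):

import xgboost as xgb

# best_ntree_limit is set by fit() when early_stopping_rounds is used; the
# single-node wrapper falls back to it when no explicit ntree_limit is given.
limit = getattr(model, 'best_ntree_limit', 0)   # 0 means "use all trees"
preds = model.get_booster().predict(xgb.DMatrix(X), ntree_limit=limit)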

@trivialfis (Member)

@pseudotensor How do you access a subset of trees using dask?

@trivialfis (Member)

There's an inconsistency between skl and dask skl due to the use of ntree_limit: ntree_limit is not enabled in the dask interface, while it's used automatically in the non-dask interface. Right now a quick workaround is to define the EarlyStopping callback yourself and use the save_best option.
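
A hedged sketch of that workaround, assuming a client and dask collections Xd/yd plus held-out Xd_valid/yd_valid like the reproducer above, and that the installed version forwards evals and callbacks through xgb.dask.train to the underlying xgb.train:

import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, Xd, yd)
dvalid = xgb.dask.DaskDMatrix(client, Xd_valid, yd_valid)   # placeholder validation data
early_stop = xgb.callback.EarlyStopping(rounds=10, save_best=True)
output = xgb.dask.train(client,
                        {'objective': 'reg:squarederror', 'tree_method': 'hist'},
                        dtrain,
                        num_boost_round=500,
                        evals=[(dvalid, 'valid')],
                        callbacks=[early_stop])
# With save_best=True the returned booster keeps only the trees up to the best
# iteration, so plain prediction over all remaining trees no longer depends on
# ntree_limit being honored.
preds = xgb.dask.predict(client, output, dtrain)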

@trivialfis changed the title from "prediction difference between dask and non-dask" to "ntree_limit not used in dask scikit-learn prediction." on Dec 29, 2020
@trivialfis (Member)

Keeping the various behaviors consistent with skl is quite difficult for me... I think there are lots of heuristics there that I don't know about or simply can't remember.

@pseudotensor (Contributor, Author)

Ya, I know it's difficult. That's why I suggested avoiding continuing down the road where all the APIs are distinct. Better if there was not a special dask API and everything was handled internally. This would also force easier/sooner feature parity.

@trivialfis (Member) commented Jan 9, 2021

Better if there was not a special dask API and everything was handled internally. This would also force easier/sooner feature parity.

Yeah, I absolutely agree with you and wanted that so badly when I was building the dask API. However, there were a number of obstacles, from how to conditionally import dask features to supporting dask-specific features. The conditional import is easy to understand, but for distributed data structures the handling is much more complicated than on a single node.

If you look into the current dask interface code, a large amount of it is devoted to obtaining the local data without making copies; another issue is keeping the data and meta info like labels and weights consistent in both partition size and order. None of these things is present in single-node computation. The actual interface for dask skl that you use is quite thin. Another part is handling the async dask client, which means not only that there is an additional parameter single-node training doesn't need to care about, but also that there are API-specific ways of performing computation, like client.compute(data): data.compute() doesn't work in an async environment, and even if it did, the extra compute and future handling would still count.
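
A small sketch of that last point, following the async-client pattern (fit_async, dask_X and dask_y are placeholder names):

import xgboost as xgb
from dask.distributed import Client

async def fit_async(scheduler_address, dask_X, dask_y):
    async with Client(scheduler_address, asynchronous=True) as client:
        dtrain = await xgb.dask.DaskDMatrix(client, dask_X, dask_y)
        output = await xgb.dask.train(client, {'tree_method': 'hist'}, dtrain)
        preds = await xgb.dask.predict(client, output, dtrain)
        # In an async environment the computation has to go through the client;
        # preds.compute() does not work here.
        return await client.compute(preds)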

For a bit more history: I made the mistake of having DaskDMatrix at one point hold a reference to the input dask data's shape, so that I could call the shape attribute for debugging purposes and align better with the local API's num_row()/num_col(). Everything went fine until one day we discovered that, because shape[0] (the number of rows) of dask data is a lazy object, when it was passed along with DaskDMatrix via a closure the whole data set got copied and pulled into every worker, without any warning or error. So after that I have to be very careful about defining object attributes. This is also the cause of that very ugly function create_fn_args.

Another possible way forward is defining yet another interface that includes the current interfaces as different backends, with the newly added global_config for backend control. But as you may see, this might turn into an even messier state, as the input space for a single interface grows exponentially due to incompatible input data structures. Even if we could define a policy on how to handle the incompatibility, the unified interface would still require us to finish the current interfaces first.

Right now I don't have any concrete plan for bringing them together other than reusing as much code and as many tests as possible, and hoping to pass the skl estimator check in the near future. The current state is that, for the skl estimator class methods, the accepted arguments are a 1-to-1 match with single-node computation. Feel free to ping me if you have suggestions; I'm open to lengthy discussion, online and offline.

@trivialfis (Member) commented Jan 9, 2021

where all the APIs are distinct.

This is the part that I'm currently addressing by aligning the dask interface with the single-node interface. I have also abstracted some configurations into reusable functions that don't touch the input data, so they can be reused.
