ntree_limit not used in dask scikit-learn prediction #6553
Comments
FYI @trivialfis
For some context, setting aside the wrong predictions themselves: if we just read from the model state what the predictions should be based on the tree structure, that does not agree with the predictions made by dask. So probably the predict code itself is wrong.
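As a concrete illustration of what "reading from the model state" can look like, here is a minimal sketch (not the script used in this issue) that reconstructs the margin from the dumped tree structure plus the leaf assignments and compares it with predict(). `bst` and `X` are assumed placeholders for a trained regression Booster and its feature matrix, and the JSON path used for base_score is an assumption.

```python
import json
import numpy as np
import xgboost as xgb

# bst: a trained xgboost.Booster with a reg:squarederror objective; X: feature
# matrix. Both are placeholders, not objects from this issue.
dmat = xgb.DMatrix(X)

# Leaf index of every sample in every tree, shape (n_samples, n_trees).
leaf_ids = bst.predict(dmat, pred_leaf=True).astype(int)

# Leaf values keyed by "tree-node" ID, taken from the dumped tree structure
# (for leaf rows, trees_to_dataframe() stores the leaf value in the Gain column).
df = bst.trees_to_dataframe()
leaf_value = {row.ID: row.Gain for row in df.itertuples() if row.Feature == "Leaf"}

# base_score read from the saved config; the exact JSON path is an assumption.
cfg = json.loads(bst.save_config())
base_score = float(cfg["learner"]["learner_model_param"]["base_score"])

# Sum the value of the leaf each sample lands in, over all trees, plus base_score.
manual_margin = np.array(
    [sum(leaf_value[f"{t}-{n}"] for t, n in enumerate(row)) for row in leaf_ids]
) + base_score

# For squared error this should match the booster's own margin output.
print(np.allclose(manual_margin, bst.predict(dmat, output_margin=True)))
```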
But looking at our results, it seems both things are true. The tree structure is different for whatever reason, and the predictions are not consistent. E.g. with sample weight (not like the above test, but similar): if one reads the tree structure for dask one gets one answer, but the python prediction gives another, while for non-dask they perfectly agree, i.e. the tree structure read and the python prediction match.
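A rough sketch of how such a structure comparison can be made, assuming `dask_clf` and `local_clf` are placeholder names for the dask-trained and locally trained estimators, both of which expose the underlying Booster via get_booster():

```python
# Hedged sketch: diff the dumped tree structure of a dask-trained model against
# a locally trained one. dask_clf and local_clf are placeholder names.
dask_dump = dask_clf.get_booster().get_dump(with_stats=True)
local_dump = local_clf.get_booster().get_dump(with_stats=True)

for i, (d, l) in enumerate(zip(dask_dump, local_dump)):
    if d != l:
        print(f"tree {i} differs")
        print("dask:\n", d)
        print("local:\n", l)
```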
What I found from the above is that if I only choose the last tree (i.e. don't pass ntree_limit, or set it to 0, or to self.best_iteration + 1, or to what I set as n_estimators), then I can get the tree structure and the python predictions to agree. So while for non-dask every tree seems to agree, for dask only the last (i.e. cumulative) tree predictions agree between the structure of the tree and what dask says the predictions are. I still don't get dask and non-dask to agree with each other, but at least as long as I only choose the last ntree_limit, dask is consistent with itself. So there seems to be some problem when requesting intermediate trees, and those do not agree with non-dask.
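A sketch of the kind of check being described, assuming `dask_clf`/`local_clf` are the two estimators and `X` the evaluation data (all placeholders); ntree_limit is used as in the xgboost version discussed here, whereas newer releases use iteration_range instead:

```python
import numpy as np
import xgboost as xgb

# Hedged sketch: compare predictions tree-by-tree between the boosters pulled
# out of the dask and non-dask estimators. dask_clf, local_clf, X are placeholders.
dmat = xgb.DMatrix(X)
dask_bst = dask_clf.get_booster()
local_bst = local_clf.get_booster()

n_trees = len(local_bst.get_dump())
for k in range(1, n_trees + 1):
    p_dask = dask_bst.predict(dmat, ntree_limit=k)
    p_local = local_bst.predict(dmat, ntree_limit=k)
    print(k, np.abs(p_dask - p_local).max())

# ntree_limit=0 (or omitting it) means "use all trees"; per the observation
# above, only this full-ensemble case was self-consistent for the dask model.
print(np.abs(dask_bst.predict(dmat) - local_bst.predict(dmat)).max())
```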
I meant prediction on the same model. For distributed training there will be slight differences caused by a few sources of error. I will try to look into your issue more deeply and draw a conclusion on whether it's a bug. But so far you don't have to worry about it too much.
A couple of things:
@pseudotensor How do you access a subset of trees using dask?
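For context, whether the dask scikit-learn wrapper honours ntree_limit at predict time is exactly what this issue questions. One workaround sketch (placeholder names, data gathered locally) is to pull the booster out of the dask estimator and predict locally with an explicit tree limit:

```python
import xgboost as xgb

# Hedged workaround sketch: extract the trained Booster from the dask estimator
# and predict on locally gathered data with an explicit tree limit.
# dask_clf and X_local are placeholders, not objects from this issue.
bst = dask_clf.get_booster()
preds_first_5_trees = bst.predict(xgb.DMatrix(X_local), ntree_limit=5)
```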
There's an inconsistency between skl and dask skl due to the use of ntree_limit (this issue: ntree_limit not used in dask scikit-learn prediction).
Keeping the various behaviors consistent with skl is quite difficult for me ... I think there are lots of heuristics there that I don't know about or simply can't remember.
Ya, I know it's difficult. That's why I suggested avoiding continuing down the road where all the APIs are distinct. It would be better if there were not a special dask API and everything was handled internally. This would also force easier/sooner feature parity.
Yeah, I absolutely agree with you and wanted that so badly when I was building the dask API. However, there were a number of obstacles, from how to conditionally import dask features to supporting dask-specific features. The conditional import is easy to understand, but for distributed data structures the handling is much more complicated than on a single node. If you look into the current dask interface code, a large amount of code is devoted to how to obtain the local data without making copies; another issue is keeping the data and meta info like labels and weights consistent in both partition size and order. None of these issues is present in single-node computation. The actual interface for dask skl that you use is quite thin.

Another part is handling the async dask client, which means not only that there is an additional parameter single-node training doesn't need to care about, but also that there are API-specific ways of performing computation (like ...).

For a bit more history, I made a mistake that ... Another possible way to go forward is defining yet another interface that can include the current interfaces as different backends, with the newly added ...

Right now I don't have any concrete plan on how to bring them together other than reusing as much code and tests as possible, and hoping to pass the skl estimator checks in the near future. The current state is that, for the skl estimator class methods, the accepted arguments are a one-to-one match with single-node computation. Feel free to ping me if you have suggestions; I'm open to lengthy discussion, online and offline.
This is the part that I'm currently addressing by aligning the dask interface with the single-node interface. I also abstracted some configurations into reusable functions that don't touch the input data, so they can be reused by both.
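A purely hypothetical sketch of that pattern (the function name and behaviour are illustrative, not xgboost's actual internals): the prediction configuration is resolved by a small pure helper that never touches the data, so both the single-node and the dask estimators could call it.

```python
# Hypothetical helper, not actual xgboost code: translate user-facing arguments
# into a booster-level tree range without touching the input data.
def _resolve_iteration_range(best_iteration, ntree_limit):
    if ntree_limit is not None and ntree_limit > 0:
        return (0, ntree_limit)
    if best_iteration is not None:
        return (0, best_iteration + 1)
    return (0, 0)  # (0, 0) is taken here to mean "use all trees"

# Both the single-node and the dask predict paths could call this same helper,
# keeping the two interfaces consistent by construction.
```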
Started from discussion here: #6547
This is just the start of the concern that dask and non-dask do not agree. Even if they did agree, I don't know whether using the internal xgb tree structure would give the same results. So far it does not.
gives
bad_dask.pkl.zip
So you can see that the non-dask raw API and sklearn API agree, but dask does not. I didn't show it, but I get the same result with the sklearn API for dask as with the raw API for dask.
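Since the reproduction script itself did not survive here, below is a hedged minimal sketch of the kind of comparison being described, using synthetic data and a LocalCluster; the names, sizes, and hyperparameters are illustrative, not the original ones.

```python
import numpy as np
import xgboost as xgb
from xgboost import dask as dxgb
from dask import array as da
from dask.distributed import Client, LocalCluster

# Hedged minimal sketch, not the original reproduction script.
rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = rng.randn(1000)

with Client(LocalCluster(n_workers=2)) as client:
    dX = da.from_array(X, chunks=(250, 10))
    dy = da.from_array(y, chunks=(250,))

    dask_reg = dxgb.DaskXGBRegressor(n_estimators=10)
    dask_reg.fit(dX, dy)
    p_dask = dask_reg.predict(dX).compute()

local_reg = xgb.XGBRegressor(n_estimators=10)
local_reg.fit(X, y)
p_local = local_reg.predict(X)
# ntree_limit restricts prediction to the first trees in the non-dask sklearn
# API of the xgboost version discussed here (newer releases use iteration_range);
# the complaint is that the dask wrapper does not honour the equivalent.
p_local_subset = local_reg.predict(X, ntree_limit=5)

print("dask vs non-dask (all trees):", np.abs(p_dask - p_local).max())
print("effect of ntree_limit locally:", np.abs(p_local_subset - p_local).max())
```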