[REVIEW][PROPOSAL] Add tags and preferred memory order tags to estimators #3113
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #3113 +/- ##
===============================================
+ Coverage 70.68% 70.94% +0.26%
===============================================
Files 197 197
Lines 15564 16092 +528
===============================================
+ Hits 11001 11417 +416
- Misses 4563 4675 +112
I really like the idea! Two things:
- Not sure it should be documented in every single init, which seems repetitive
- We have one currently frustrating exception: RF has different preferred inputs for fit and predict due to its usage of FIL (row-wise)... I would love to change that in the future but it's not there yet.
Also, we should add a note about this to the estimator guide in #3040 when both are in.
@dantegd Is this similar to SkLearn's estimator tags? If so, it might be better to do a larger tag system similar to their design (or at least make this PR compatible with the tag system design). I know many of our tests could benefit from the sklearn tag system (and we would have fewer tests that are hardcoded to skip particular estimators).
@mdemoret-nv thanks for the comment! I wasn't aware of the tag addition in Scikit 0.21, so this was immensely helpful to know. I think their implementation is very solid and very useful for our needs, at least for my purposes here with the order attribute. Right now just added @JohnZed @mdemoret-nv thoughts?
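To make the test-skipping point above concrete, here is a minimal sketch of tag-driven test selection. It only mimics the `_more_tags()` / `_get_tags()` pattern from scikit-learn; `TinyBase`, the two estimator classes, `estimators_to_test`, and the default tag values are hypothetical stand-ins and not code from this PR or from scikit-learn.

```python
# Hypothetical sketch: select estimators for a test by tag instead of
# hardcoding skips. None of these classes exist in cuML or scikit-learn.

_DEFAULT_TAGS = {"preferred_input_order": None, "requires_y": False}


class TinyBase:
    def _more_tags(self):
        return {}

    def _get_tags(self):
        tags = dict(_DEFAULT_TAGS)
        # Walk the MRO from the base class to the most derived class so
        # that subclasses override the tags set by their parents.
        for klass in reversed(type(self).__mro__):
            more = klass.__dict__.get("_more_tags")
            if more is not None:
                tags.update(more(self))
        return tags


class ColumnMajorEstimator(TinyBase):
    def _more_tags(self):
        return {"preferred_input_order": "F"}


class SupervisedEstimator(ColumnMajorEstimator):
    def _more_tags(self):
        return {"requires_y": True}


def estimators_to_test(estimators, **required):
    """Keep only estimators whose tags match every required key/value."""
    return [
        est for est in estimators
        if all(est._get_tags().get(k) == v for k, v in required.items())
    ]
```

A test that only applies to unsupervised estimators could then call `estimators_to_test(all_estimators, requires_y=False)` rather than naming the estimators to skip.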
Looks good! Needs a simple test. And I think the preferred order may be off for kneighbors classifier? Otherwise great
    def _more_tags(self):
        return {
            'preferred_input_order': 'F'
F for fit, C for predict (awful, I know)... does that meet the tag definition?
That's an excellent question. I guess for estimators that have discrepancies like this we should leave it as None. What do you think?
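One way a caller could honor a `None` value is sketched below, assuming tags arrive as a plain dict and that `None` means "no single preferred layout, so do not convert". The helper name `to_preferred_order` is hypothetical and is not part of this PR; NumPy stands in for whatever array library the caller uses.

```python
import numpy as np


def to_preferred_order(X, tags):
    """Return X in the layout named by tags['preferred_input_order'].

    'C' or 'F' requests row- or column-major layout. None (or a missing
    tag) is treated as "no preference", so X is returned untouched.
    """
    order = tags.get("preferred_input_order")
    if order is None:
        return X
    # np.asarray only copies when X is not already in the requested order.
    return np.asarray(X, order=order)
```

With this convention, an estimator whose fit and predict disagree simply opts out of any up-front conversion, and each method converts as it needs to.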
Looks great. Found only minor things, mostly in the estimator guide.
    def _more_tags(self):
        return {
            'preferred_input_order': 'C'
Just leaving a small note (more for myself) that t_sne & UMAP both could probably accept 'F' now that the underlying KNN prim can accept it.
    def _more_tags(self):
        return {
            # fit and predict require conflicting memory layouts
            'preferred_input_order': None
I think this could be fixed but I'll need to look into it. Created #3153
@@ -4,15 +4,36 @@ This guide is meant to help developers follow the correct patterns when creating

**Note:** This guide is long, because it includes internal details on how cuML manages input and output types for advanced use cases. But for the vast majority of estimators, the requirements are very simple and can follow the example patterns shown below in the [Quick Start Guide](#quick-start-guide).

## Table of Contents

- [Recommended Scikit-Learn Documentation](#recommended-scikit-learn-documentation)
Oooh, I like this
] | ||
``` | ||

7. Implement `_more_tags()` if any of the [default tags]() need to be overriden for the new estimator:
Should this be linking to something?
Found one more small thing as I investigated #3153. I'm going to go ahead and close that issue.
    def _more_tags(self):
        return {
            # fit and predict require conflicting memory layouts
            'preferred_input_order': None
I looked more closely at the kneighbors variants and they do actually require `order='F'` as input. We can safely set this to `F` here and in the `kneighbors_classifier`.
Co-authored-by: Corey J. Nolet <[email protected]>
Changes LGTM!
rerun tests
Benchmarking the upcoming general SHAP implementations for cuML models shows a non-trivial penalty, in both memory and time, when data is generated in the opposite order from the one a model requires. The same is true of things like HPO and pipelines.
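The memory side of that penalty can be illustrated with a small host-side sketch. NumPy is used here purely as a stand-in for device arrays, and the array shape is arbitrary; this is an illustration of layout conversion in general, not a measurement of cuML.

```python
import numpy as np

# Row-major (C order) data, as a generator might produce it.
X_c = np.ones((1000, 50), dtype=np.float32)

# Converting to column-major (F order) allocates a second, full-size buffer.
X_f = np.asfortranarray(X_c)
conversion_copied = not np.shares_memory(X_c, X_f)
extra_bytes = X_f.nbytes

# Data that is already in the requested order is passed through without a copy.
X_f_again = np.asfortranarray(X_f)
no_copy_when_matching = np.shares_memory(X_f, X_f_again)
```

Generating the data in the estimator's preferred order up front avoids both the extra buffer and the time spent filling it.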
This PR adopts the Scikit-learn tag system (https://scikit-learn.org/stable/developers/develop.html#estimator-tags) and adds cuML-specific tags:
- `preferred_input_order`: whether column-major or row-major input is preferred by the estimator
- `X_types_gpu`: similar to `X_types` of the standard Scikit-learn tags, but for specifying acceptable input types to an algo.