Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators #3113

Merged
merged 27 commits into from
Nov 20, 2020

Conversation

dantegd
Copy link
Member

@dantegd dantegd commented Nov 4, 2020

While benchmarking the upcoming general SHAP implementations in cuML models, there is a non trivial penalty, both in memory and time, that occurs if data is generated in the opposite order that models require. This is also true of things like HPO and pipelines.

This PR adds the adoption of Scikit-learn tag system https://scikit-learn.org/stable/developers/develop.html#estimator-tags as well as adding cuML specific tags:

  • preferred_input_order - whether column or row major order input is preferred by the estimator
  • X_types_gpu - similar to X_types of the standard Scikit-lean tags, but for specifying acceptable input types to an algo.

@dantegd dantegd added 3 - Ready for Review Ready for review by team proposal Change current process or code Cython / Python Cython or Python issue labels Nov 4, 2020
@dantegd dantegd requested a review from a team as a code owner November 4, 2020 05:10
@GPUtester
Copy link
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov-io
Copy link

codecov-io commented Nov 4, 2020

Codecov Report

Merging #3113 (17d0ff6) into branch-0.17 (b205e8f) will increase coverage by 0.26%.
The diff coverage is 96.26%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.17    #3113      +/-   ##
===============================================
+ Coverage        70.68%   70.94%   +0.26%     
===============================================
  Files              197      197              
  Lines            15564    16092     +528     
===============================================
+ Hits             11001    11417     +416     
- Misses            4563     4675     +112     
Impacted Files Coverage Δ
python/cuml/neighbors/nearest_neighbors.pyx 90.45% <ø> (ø)
python/cuml/neighbors/kneighbors_classifier.pyx 94.64% <33.33%> (-1.69%) ⬇️
python/cuml/neighbors/kneighbors_regressor.pyx 92.75% <33.33%> (-2.71%) ⬇️
python/cuml/cluster/dbscan.pyx 100.00% <100.00%> (ø)
python/cuml/cluster/kmeans.pyx 81.81% <100.00%> (+0.27%) ⬆️
python/cuml/common/base.pyx 80.34% <100.00%> (+4.46%) ⬆️
python/cuml/decomposition/pca.pyx 92.19% <100.00%> (+0.14%) ⬆️
python/cuml/decomposition/tsvd.pyx 97.05% <100.00%> (+0.06%) ⬆️
python/cuml/ensemble/randomforestclassifier.pyx 75.47% <100.00%> (+2.11%) ⬆️
python/cuml/ensemble/randomforestregressor.pyx 73.56% <100.00%> (+3.22%) ⬆️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b205e8f...17d0ff6. Read the comment docs.

Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea! Two things:

  • Not sure it should be documented in every single init, whic seems repetitive
  • We have one currently frustrating exception that RF has different preferred inptus for fit and predict due to usage of FIL (row-wise)... I would love to change that in the future but it's not there yet.

python/cuml/solvers/qn.pyx Outdated Show resolved Hide resolved
@JohnZed
Copy link
Contributor

JohnZed commented Nov 5, 2020

Also, we should add a note about this to the estimator guide in #3040 when both are in.

@mdemoret-nv
Copy link
Contributor

@dantegd Is this similar to SkLearn's estimator tags? If so, it might be better to do a larger tag system similar to their design (or at least make this PR compatible with the tag system design). I know many of our tests could benefit from the sklearn tag system (and we would have less tests that are hardcoded to skip particular estimators).

@dantegd
Copy link
Member Author

dantegd commented Nov 8, 2020

@mdemoret-nv thanks for the comment! I wasn't aware of the tag addition in Scikit 0.21, so this was immensely helpful to know. I think their implementation is very solid and very useful for our needs, at least for my purposes here with the order attribute.

Right now just added preferred_input_order for the discussion, which is the tag that will imediately be used, and then we can discuss more tags as the need arises.

@JohnZed @mdemoret-nv thoughts?

@dantegd dantegd changed the title [REVIEW][PROPOSAL] Add prefered memory order class attribute [REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators Nov 9, 2020
Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Needs a simple test. And I think the preferred order may be off for kneighbors classifier? Otherwise great

python/cuml/common/base.pyx Show resolved Hide resolved
python/cuml/linear_model/elastic_net.pyx Show resolved Hide resolved

def _more_tags(self):
return {
'preferred_input_order': 'F'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F for fit, C for predict (awful, I know)... does that meet the tag definition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's an excellent question, I guess for estimators that have discrepancies like this we should leave it as None, what do you think?

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Found only minor things, mostly in the estimator guide.


def _more_tags(self):
return {
'preferred_input_order': 'C'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving a small note (more for myself) that t_sne & UMAP both could probably accept 'F' now that the underlying KNN prim can accept it.

def _more_tags(self):
return {
# fit and predict require conflicting memory layouts
'preferred_input_order': None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be fixed but I'll need to look into it. Created #3153

@@ -4,15 +4,36 @@ This guide is meant to help developers follow the correct patterns when creating

**Note:** This guide is long, because it includes internal details on how cuML manages input and output types for advanced use cases. But for the vast majority of estimators, the requirements are very simple and can follow the example patterns shown below in the [Quick Start Guide](#quick-start-guide).

## Table of Contents

- [Recommended Scikit-Learn Documentation](#recommended-scikit-learn-documentation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, I like this

wiki/python/ESTIMATOR_GUIDE.md Outdated Show resolved Hide resolved
]
```

7. Implement `_more_tags()` if any of the [default tags]() need to be overriden for the new estimator:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be linking to something?

wiki/python/ESTIMATOR_GUIDE.md Outdated Show resolved Hide resolved
wiki/python/ESTIMATOR_GUIDE.md Outdated Show resolved Hide resolved
wiki/python/ESTIMATOR_GUIDE.md Outdated Show resolved Hide resolved
wiki/python/ESTIMATOR_GUIDE.md Outdated Show resolved Hide resolved
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one more small thing as I investigated #3153. I'm going to go ahead and close that issue.

def _more_tags(self):
return {
# fit and predict require conflicting memory layouts
'preferred_input_order': None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked more closely at the kneighbors variants and they do actually require order='F' as input. We can safely set this F here and in the kneighbors_classifier.

@dantegd dantegd changed the title [REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators [REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators [skip-ci] Nov 18, 2020
@dantegd dantegd added 4 - Waiting on Reviewer Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels Nov 19, 2020
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM!

@dantegd dantegd changed the title [REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators [skip-ci] [REVIEW][PROPOSAL] Add tags and prefered memory order tags to estimators Nov 19, 2020
@dantegd dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Waiting on Reviewer Waiting for reviewer to review or respond labels Nov 19, 2020
@dantegd
Copy link
Member Author

dantegd commented Nov 19, 2020

rerun tests

@dantegd
Copy link
Member Author

dantegd commented Nov 20, 2020

rerun tests

@dantegd dantegd merged commit b3e4827 into rapidsai:branch-0.17 Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Cython / Python Cython or Python issue proposal Change current process or code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants