Transforms RandomForest estimators' non-consecutive labels to consecutive labels where appropriate #4780
Conversation
raise ValueError("The labels need "
                 "to be consecutive values from "
                 "0 to the number of unique label values")
self.classes_unorder = cp.unique(y_m).tolist()
We should be reusing existing primitives where at all possible and using the make_monotonic primitive to do this. That allows us to optimize this specific operation once and have it benefit all uses.
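For illustration, here is a minimal NumPy sketch of what a make_monotonic-style primitive does: map arbitrary label values onto consecutive integers in [0, n). This is a stand-in, not cuML's implementation — the actual primitive runs on the GPU, and the function name here is hypothetical.

```python
import numpy as np


def make_monotonic_sketch(labels):
    """Return (mapped_labels, classes) with mapped_labels in [0, n)."""
    classes = np.unique(labels)                # sorted unique label values
    mapped = np.searchsorted(classes, labels)  # index of each label in classes
    return mapped, classes


mapped, classes = make_monotonic_sketch(np.array([10, -5, 0, 10, 5]))
# classes -> [-5, 0, 5, 10]; mapped -> [3, 0, 1, 3, 2]
```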
Looks good, just had a couple of comments
@pytest.mark.parametrize("datatype", [np.float32, np.float64])
@pytest.mark.parametrize("max_features", [1.0, "auto", "log2", "sqrt"])
@pytest.mark.parametrize("b", [0, 5, -5, 10])
@pytest.mark.parametrize("a", [1, 2, 3])
I don't think we need this full matrix of tests for the monotonic case; wouldn't one combination for each datatype be enough?
Sure, I will fix one value for a, b and max_features.
y_m, _ = make_monotonic(y_m)
break
I wonder if the logic for the loop might be better to have in make_monotonic, perhaps with a parameter like make_monotonic(array, check_already_monotonic=True), so that every use of the prim is cleaner? What do you think?
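The parameter suggested above could look like the following. This is a hypothetical NumPy sketch of the proposed API, not the actual cuML/RAFT primitive: when the labels are already consecutive integers starting at 0, the remap is skipped entirely.

```python
import numpy as np


def make_monotonic(labels, check_already_monotonic=True):
    """Map labels into [0, n); optionally skip the remap if already there."""
    classes = np.unique(labels)
    if check_already_monotonic and np.array_equal(
            classes, np.arange(classes.size)):
        return labels, classes  # already consecutive in [0, n); nothing to do
    return np.searchsorted(classes, labels), classes
```

With this, callers no longer need their own "is it already monotonic?" loop before invoking the prim.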
I'd vote we make that change in RAFT, plumb raft::label::make_monotonic to Python (rapidsai/raft#640), and then make a follow-up PR to use that in cuML and remove this prim.
Context: #4478 (comment)
What do you think? With that said, we could just do both :)
I did not modify the loop because I wanted to make the fewest possible changes to the code. However, we can use the check_labels primitive to see whether the labels are already monotonic. I can make that change for a cleaner implementation.
@beckernick I am on board with that idea.
rerun tests
Codecov Report
@@            Coverage Diff             @@
##           branch-22.08   #4780   +/- ##
=============================================
  Coverage        77.62%   77.62%
=============================================
  Files              180      180
  Lines            11382    11384      +2
=============================================
+ Hits              8835     8837      +2
  Misses            2547     2547
Is this ready for another round of reviews?
It's ready!
LGTM
rerun tests
Changing the title before merging, as this PR only applies this change to random forest models.
This will also close #691
@gpucibot merge
…ive labels where appropriate (rapidsai#4780)

This PR closes rapidsai#4478 by transforming non-consecutive labels outside of [0, n) to consecutive labels inside [0, n), similar to what Scikit-learn does under the hood.

Closes rapidsai#691

Authors:
- https://github.com/VamsiTallam95

Approvers:
- Micka (https://github.com/lowener)
- Dante Gama Dessavre (https://github.com/dantegd)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4780
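For reference, the remapping described above — non-consecutive labels mapped into [0, n) the way scikit-learn effectively does it under the hood — can be reproduced in one NumPy call. This is only an illustrative sketch of the behavior, not the cuML code:

```python
import numpy as np

# Labels like [0, 5, -5, 10] are outside the consecutive range [0, n).
y = np.array([0, 5, -5, 10, 5])

# np.unique with return_inverse gives the sorted classes and, for each
# element of y, its index into those classes -- i.e. labels in [0, n).
classes, y_mapped = np.unique(y, return_inverse=True)
# classes  -> [-5, 0, 5, 10]
# y_mapped -> [1, 2, 0, 3, 2]
```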