[FEA] cuML estimators should support non-consecutive labels outside of [0, n) where appropriate #4478
Comments
I ran into this while working with pandas DataFrames. After using imblearn's RandomUnderSampler, my binary labels were in order, i.e. all 0s followed by all 1s. I used DataFrame.sample to shuffle the data, and I had to reset the indices to make the error go away. Here is the snippet:
Generalizing from @jhancock1975's code snippet, this error can occur even with correctly formatted data during cross-validation (if a fold doesn't get the right subset of labels).

import pandas as pd
import cuml
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "x1": [0.0, 1, 2, 3, 3, 4],
    "x2": [-3, 2.0, 5, 5, 3, 2],
    "y": [0, 1, 1, 1, 2, 2]
})

clf = cuml.ensemble.RandomForestClassifier()
cross_val_score(
    clf,
    df[["x1", "x2"]],
    df["y"],
    cv=2,
    error_score="raise"
)
# Throws the error linked above
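To make the failure condition concrete, here is a minimal sketch of how a training subset can lose the [0, n) property even though the full label column has it. The helper `is_consecutive_from_zero` and the contiguous half-and-half split are assumptions made for illustration; they are not part of cuML or scikit-learn, and the splitter that `cross_val_score` actually uses may partition the rows differently.

```python
import numpy as np


def is_consecutive_from_zero(labels):
    """Return True if the distinct labels are exactly 0, 1, ..., n-1."""
    classes = np.unique(labels)  # sorted distinct labels
    return bool(np.array_equal(classes, np.arange(len(classes))))


# Same label column as in the reproduction above.
y = np.array([0, 1, 1, 1, 2, 2])

# Hypothetical 2-fold split on contiguous halves (an assumption, not
# necessarily what cross_val_score does internally).
train_when_first_half_held_out = y[3:]   # labels 1, 2, 2
train_when_second_half_held_out = y[:3]  # labels 0, 1, 1

print(is_consecutive_from_zero(y))
print(is_consecutive_from_zero(train_when_first_half_held_out))
print(is_consecutive_from_zero(train_when_second_half_held_out))
```

In this sketch the full column and the second training subset pass the check, while the first training subset contains only the labels 1 and 2, which is the situation described in this comment.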
Scikit-learn does this under the hood here:
It should be possible to do this using This function looks like it might have an expensive JIT cost, though. Perhaps cuml/cpp/src_prims/label/classlabels.cuh Line 164 in 768a4ed
@beckernick, yep, this is indeed something we had written originally in C++ in cuml for DBSCAN and have since moved to RAFT (pending removal from cuml, the raft version is more up to date). If desired, this could also be a good reason to expose through pylibraft and use in cuml (reusable and very clean separation of implementation details).
That sounds like it could be a good solution. I suspect this non-consecutive label issue will keep popping up. Will file a new issue on RAFT, cross-link, and mark it as a good first issue based on your description.
…ive labels where appropriate (#4780)
This PR closes #4478 by transforming non-consecutive labels outside of [0,n) to consecutive labels inside [0,n), similar to what Scikit-learn does under the hood. Closes #691
Authors:
- https://github.com/VamsiTallam95
Approvers:
- Micka (https://github.com/lowener)
- Dante Gama Dessavre (https://github.com/dantegd)
- Corey J. Nolet (https://github.com/cjnolet)
URL: #4780
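For readers who need a workaround on a version without this change, the remapping described above can be sketched on the host with NumPy. This is an illustration only, not the code from the PR: the helper names `encode_labels` and `decode_labels` are hypothetical, and the sketch assumes a one-dimensional label array that fits in host memory.

```python
import numpy as np


def encode_labels(y):
    """Map arbitrary labels to consecutive integers in [0, n).

    Returns the sorted distinct labels and the encoded array, so that
    classes[encoded] reproduces the original labels.
    """
    classes, encoded = np.unique(y, return_inverse=True)
    return classes, encoded


def decode_labels(classes, encoded):
    """Map encoded predictions back to the original label values."""
    return classes[encoded]


# Hypothetical non-consecutive labels outside of [0, n).
y = np.array([10, 30, 30, 20, 10])

classes, y_enc = encode_labels(y)
print(classes)  # sorted distinct labels: 10, 20, 30
print(y_enc)    # encoded labels: 0, 2, 2, 1, 0

# After fitting on y_enc, predictions can be mapped back the same way.
print(decode_labels(classes, y_enc))
```

Keeping the `classes` array around is the important design point: the estimator only ever sees labels in [0, n), and the caller translates predictions back to the original values afterwards.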
As noted in rapidsai/cudf#10024 , cuML RandomForestClassifier will throw an error if the target column has non-consecutive labels outside of the [0, n) range. This does not occur in scikit-learn, perhaps due to label encoding happening under the hood.
This may occur with other estimators as well.