[BUG] How To Pass cuDF Dataframe to cuML.ensemble.RandomForestClassifier? #4480

Closed
mexicantexan opened this issue Jan 12, 2022 · 2 comments
Labels: ? - Needs Triage (Need team to review and classify), question (Further information is requested)

mexicantexan commented Jan 12, 2022

What is your question?
I'm trying to fit data with cuml.ensemble.RandomForestClassifier and I keep getting the error: "The labels need to be consecutive values from 0 to the number of unique label values"

I'm passing cudf.DataFrame objects into the function. They have the same number of rows but different numbers of columns. The column labels start at 0 and step by 1 up to the final column (108 in the example below). What am I doing wrong? Below is a printout of the DataFrames I'm passing in, along with some code for context:

clf1 = modelClass(max_depth=D1, random_state=random.randrange(0, 1024, 1),
                  n_bins=15, n_streams=4, split_criterion=criterion,
                  bootstrap=bootstrap, n_estimators=trs1)

clf1.fit(X1, Y1)

X1's DataFrame looks like this:

0 1 2 ... 107 108
0 1.000000e-11 1.000000e-11 1.647421e-01 ... 1.000000e-11 1.647421e-01
1 1.000000e-11 1.000000e-11 1.760000e-02 ... 1.000000e-11 1.760000e-02
2 1.000000e-11 1.000000e-11 -1.772000e-01 ... 1.000000e-11 -1.772000e-01
3 1.000000e-11 1.000000e-11 8.254000e-01 ... 1.000000e-11 8.254000e-01
4 1.000000e-11 1.000000e-11 2.587000e-01 ... 1.000000e-11 2.587000e-01
... ... ... ... ... ... ...
5402 1.000000e-11 1.000000e-11 1.704444e-01 ... 1.000000e-11 1.704444e-01
5403 1.000000e-11 1.000000e-11 -1.860000e-01 ... 1.000000e-11 -1.860000e-01
5404 0.000000e+00 1.000000e-11 1.229714e-01 ... 1.000000e-11 1.229714e-01
5405 1.000000e-11 1.959500e-01 1.984667e-01 ... 1.959500e-01 1.984667e-01
5406 1.000000e-11 1.000000e-11 1.000000e-11 ... 1.000000e-11 1.000000e-11

[5407 rows x 109 columns]; dtype=('0', dtype('float64')); <cudf.core.dataframe._DataFrameLocIndexer object at 0x7f9c3d0f3070>

Y1's DataFrame looks like this:

0
0 -2
1 4
2 -3
3 1
4 0
... ...
5402 0
5403 -2
5404 0
5405 0
5406 0

[5407 rows x 1 columns]; dtype=('0', dtype('int32')); <cudf.core.dataframe._DataFrameLocIndexer object at 0x7f9c1b847b50>

System Information: Ubuntu 20.04, Titan RTX, CUDA 11.5, Rapids 21.12 built-in Conda, Python 3.8

@mexicantexan mexicantexan added ? - Needs Triage Need team to review and classify question Further information is requested labels Jan 12, 2022
@mexicantexan
Author

I've spent 14 hours trying different DataFrames/Series, pandas DataFrames, NumPy arrays, and different data types, and I can't get the RandomForestClassifier to fit. It always comes back to: "The labels need to be consecutive values from 0 to the number of unique label values"

I've manually gone through and adjusted each label to a number, iterated a for loop over the labels so they start at 0 and go up to the max number of columns, saved everything to an Excel sheet, and triple-checked that the labels are correct and that there's no missing data.

Any help would be appreciated.

@mexicantexan mexicantexan changed the title [QST] How To Pass cuDF Dataframe to cuML.ensemble.RandomForestClassifier? [BUG] How To Pass cuDF Dataframe to cuML.ensemble.RandomForestClassifier? Jan 13, 2022
@beckernick
Member

Your y column is not made up of consecutive values in [0, n), so you are hitting this bug: https://github.com//issues/4478 . You need to encode your y column to be in that range.
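
One way to confirm this diagnosis is to compare the unique values of y against 0..n-1. The sketch below runs on a host copy of the labels using NumPy; `labels_are_consecutive` is a hypothetical helper name, not part of cuML, and the sample values are the first five rows of the Y1 printout in the question. A cuDF column would first need to be copied to the host, for example with `to_pandas()`.

```python
import numpy as np

def labels_are_consecutive(y):
    """Return True if the unique values of y are exactly 0, 1, ..., n-1."""
    uniques = np.unique(np.asarray(y))
    return np.array_equal(uniques, np.arange(len(uniques)))

# First five labels from the Y1 printout in the question
y_sample = np.array([-2, 4, -3, 1, 0], dtype=np.int32)

print(labels_are_consecutive(y_sample))         # False: negative values and gaps
print(labels_are_consecutive([0, 1, 2, 1, 0]))  # True
```

Note that the check concerns the class values stored in y, not the column names of the DataFrame, so renumbering the columns from 0 does not affect it.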

Starting from the example in that issue, you could do the following (as one example):

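import cudf
import cuml

df = cudf.DataFrame({
    "x1": [0.0, 1, 2],
    "x2": [-3, 2.0, 5],
    "y": [-3, 0, 4.0]
})

# Classifier as in the linked issue (default parameters assumed here)
clf = cuml.ensemble.RandomForestClassifier()

# Map the labels onto consecutive integers 0..n-1
enc = cuml.preprocessing.LabelEncoder()
df["y_consecutive"] = enc.fit_transform(df.y)

print(clf.fit(df[["x1", "x2"]], df["y_consecutive"]))
# Output: RandomForestClassifier()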
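
After fitting on encoded labels, predictions come back in the encoded space and need to be mapped back to the original values. The following is a minimal CPU-side sketch of that round trip using NumPy's `np.unique` in place of the GPU encoder; `pred_encoded` is a made-up stand-in for model output, not a real prediction.

```python
import numpy as np

# Original labels, including negative values and gaps
y = np.array([-3, 0, 4, 0, -3])

# classes holds the sorted original labels;
# y_encoded holds, for each label, its index into classes
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)    # [-3  0  4]
print(y_encoded)  # [0 1 2 1 0]

# Hypothetical predictions in the encoded space
pred_encoded = np.array([2, 0, 1])

# Indexing into classes recovers the original label values
pred_original = classes[pred_encoded]
print(pred_original)  # [ 4 -3  0]
```

With the cuML encoder above, the fitted `enc` object should serve the same purpose through its `inverse_transform` method, assuming it behaves like its scikit-learn counterpart. For the DataFrames in the question, the same encoding would be applied to the single column of Y1 before calling `fit`.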

I would recommend we close this issue and continue discussion of the bug in the linked issue.
