[BUG] SVC with probability=True on cudf dataframes throws error #3090

Closed
aerdem4 opened this issue Oct 30, 2020 · 2 comments · Fixed by #3176

aerdem4 commented Oct 30, 2020

Describe the bug
I am not able to train SVC with probability=True on rapids-0.16.

Steps/Code to reproduce bug
I did a clean rapids-0.16 conda install on Ubuntu. The fit function then fails on cudf dataframes. The same code runs on rapids-0.15 without an issue. If I set probability=False, I don't get any issue on 0.16 either.

svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(train_df[features], train_df[target])

The error is on cudf indexing:
IndexError: Failed to convert index to appropirate row

aerdem4 added the "? - Needs Triage" and "bug" labels Oct 30, 2020
tfeher self-assigned this Oct 30, 2020
tfeher removed the "? - Needs Triage" label Oct 30, 2020

tfeher commented Oct 30, 2020

Indeed this fails with 0.16, thanks @aerdem4 for reporting! Here is a complete reproducer:

import cupy as cp
import cuml
import cudf

X = cp.random.randint(2000, size=(100,4)).astype(cp.float32)
X_df = cudf.DataFrame(data=X, columns=['One', 'Two', 'Three', 'Target'])
X_df['Target'] = X_df['Target'] % 2

features = ['One', 'Two', 'Three']
target = 'Target'

svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(X_df[features], X_df[target])

Note that setting probability=True means that under the hood we use scikit-learn's CalibratedClassifierCV to fit cuml's SVM and then calibrate the probabilities. Because sklearn is involved, the data is first converted to numpy and passed to sklearn, which in turn calls cuml with the numpy data (more about this in issue #2608). The problem is probably in the initial step: the conversion to numpy is not performed correctly.
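
To make the "calibrate the probability" step more concrete, here is a minimal pure-Python sketch of sigmoid (Platt-style) calibration, which is the default method of CalibratedClassifierCV. It is not cuml's or scikit-learn's actual code: the function names fit_sigmoid and calibrated_proba, the toy scores and the plain gradient-descent fit are all assumptions made for illustration (scikit-learn uses a different optimizer and regularized targets).

```python
import math

def fit_sigmoid(scores, labels, lr=0.1, epochs=2000):
    """Fit p = 1 / (1 + exp(-(a*s + b))) to 0/1 labels by gradient descent on log loss.

    Hypothetical helper: a simplified stand-in for sigmoid calibration.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            # For log loss, d(loss)/d(a*s + b) = p - y
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated_proba(score, a, b):
    """Map a raw decision score to a probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy decision-function outputs (as an SVM might produce) and their true labels.
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_sigmoid(scores, labels)
for s in (-2.0, 0.0, 2.0):
    print(s, calibrated_proba(s, a, b))
```

In the real pipeline the scores would come from the SVM's decision function on held-out folds, which is why the input data has to pass through sklearn (and therefore through the numpy conversion) in the first place.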

As a workaround, you can use numpy input data:

X_np = X_df[features].to_pandas().to_numpy()
svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(X_np, X_df[target])

We do not test probabilistic SVM with cuDF input data, which is why this passed CI. I will have a look at the exact cause of this problem.


tfeher commented Nov 23, 2020

I have tested this with the nightly Ubuntu 18.04 images, and the error has disappeared. I suspect that the CumlArrayDescriptor-related changes (#3040) fixed the problem. I have improved the unit tests in #3176 to catch such errors in the future.
