[BUG] SVC with probability=True on cudf dataframes throws error #3090

aerdem4 · 2020-10-30T08:43:56Z

Describe the bug
I am not able to train SVC with probability=True on rapids-0.16.

Steps/Code to reproduce bug
I did a clean rapids-0.16 conda install on Ubuntu. Calling fit on cudf dataframes then fails. The same code runs on rapids-0.15 without an issue, and with probability=False it also runs fine on 0.16.

svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(train_df[features], train_df[target])

The error is on cudf indexing:
IndexError: Failed to convert index to appropirate row


tfeher · 2020-10-30T14:22:20Z

Indeed this fails with 0.16, thanks @aerdem4 for reporting! Here is a complete reproducer:

import cupy as cp
import cuml
import cudf

X = cp.random.randint(2000, size=(100,4)).astype(cp.float32)
X_df = cudf.DataFrame(data=X, columns=['One', 'Two', 'Three', 'Target'])
X_df['Target'] = X_df['Target'] % 2

features = ['One', 'Two', 'Three']
target = 'Target'

svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(X_df[features], X_df[target])

Note that with probability=True we use scikit-learn's CalibratedClassifierCV under the hood to fit cuml's SVM and then calibrate the probabilities. Because sklearn is involved, the data is first converted to numpy and passed to sklearn, which then calls cuml with the numpy data (more about this in issue #2608). The problem is probably in the initial step: the conversion to numpy is not performed correctly.
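For readers unfamiliar with the calibration step: scikit-learn's CalibratedClassifierCV defaults to sigmoid (Platt) calibration, which fits a two-parameter sigmoid that maps the classifier's decision scores to probabilities. The following is only a minimal pure-Python sketch of that idea, not the code used by cuml or sklearn. The function names `fit_platt` and `platt_proba` and the toy scores are made up for illustration.

```python
import math

def platt_proba(score, a, b):
    """Map a decision score to a probability with p = 1 / (1 + exp(a*score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit the sigmoid parameters (a, b) by gradient descent on the log loss.

    Hypothetical helper for illustration; real implementations use a more
    robust optimizer and held-out (cross-validated) decision scores.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_proba(s, a, b)
            # With z = a*s + b, the derivative of the log loss w.r.t. z is (y - p).
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy decision scores (e.g. signed distances from an SVM hyperplane) and labels.
# The classes overlap on purpose so that the fit stays finite.
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
p_high = platt_proba(2.0, a, b)   # calibrated probability for a confident positive score
p_low = platt_proba(-2.0, a, b)   # calibrated probability for a confident negative score
```

In this sketch a larger decision score yields a larger probability, so the fitted `a` comes out negative. The real CalibratedClassifierCV additionally fits the base classifier on cross-validation folds, which is the part that hands the data to sklearn and back.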

As a workaround, you can use numpy input data:

X_np = X_df[features].to_pandas().to_numpy()
svc_model = cuml.SVC(C=100.0, cache_size=3000.0, probability=True)
svc_model.fit(X_np, X_df[target])

We do not test probabilistic SVM with cuDF input data, which is why this passed CI. I will have a look at the exact cause of this problem.

tfeher · 2020-11-23T21:08:03Z

I have tested this with the nightly Ubuntu 18.04 images, and the error has disappeared. I suspect that the CumlArrayDescriptor-related changes (#3040) fixed the problem. I have improved the unit tests in #3176 to catch such errors in the future.

aerdem4 added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Oct 30, 2020

tfeher self-assigned this Oct 30, 2020

tfeher removed the "? - Needs Triage" (Need team to review and classify) label Oct 30, 2020

tfeher mentioned this issue Nov 23, 2020

[REVIEW] Add probabilistic SVM tests with various input array types #3176

Merged

JohnZed closed this as completed in #3176 Nov 24, 2020
