
[BUG] cuml.dask.datasets.classification.make_classification creates y value that reports wrong dtype #2898

Closed
ResidentMario opened this issue Oct 1, 2020 · 10 comments · Fixed by #2940
Labels: bug (Something isn't working), Cython / Python (Cython or Python issue), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features)


ResidentMario commented Oct 1, 2020

Describe the bug

The distributed linear regression algorithm throws when fitted on a simple make_classification example due to mismatched types. This seems to indicate an error in the implementation; the exact error is:

RuntimeError: 1 of 1 worker jobs failed: Expected input to be of type in [dtype('float32')] but got int64

But the data being fed to the algorithm is float32:

(screenshot: X and y both report dtype float32)

Steps/Code to reproduce bug

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from types import SimpleNamespace

# Benchmark parameters; only dataset_size and chunk_size are used below
args = SimpleNamespace(
    n_workers=1,
    threads_per_worker=1,
    dataset_size='10 GB',
    worker_memory_limit='12 GB',
    chunk_size=5950000,
)

cluster = LocalCUDACluster()
client = Client(cluster)

from cuml.dask.datasets.classification import make_classification

MAGIC_N_SAMPLES = {
    '1 MB': 5950,
    '10 MB': 59500,
    '100 MB': 595000,
    '1 GB': 5950000,
    '10 GB': 59500000,
    '100 GB': 595000000
}
if args.dataset_size != '':
    args.n_samples = MAGIC_N_SAMPLES[args.dataset_size]

X, y = make_classification(
    n_samples=args.n_samples, n_features=20, n_informative=2, n_redundant=2, n_classes=2,
    n_parts=args.n_samples // args.chunk_size
)

from cuml.dask.linear_model import LinearRegression
clf = LinearRegression()

import time
start_time = time.time()

clf.fit(X, y)

print(f"Fitting done in {str(time.time() - start_time)} seconds.")

Environment details (please complete the following information):
The error occurs on a g4dn.xlarge instance on AWS (1x T4 GPU) with conda and the following environment.yml:

name: rapids
channels:
  - rapidsai
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cuml=0.15
  - cudf=0.15
  - cudatoolkit=10.1
  - dask-ml=1.7.0
  - pip:
     - dask_ml
     - dask_cuda

OS is Ubuntu 18.04.

@ResidentMario ResidentMario added ? - Needs Triage Need team to review and classify bug Something isn't working labels Oct 1, 2020
ResidentMario (Author) commented

After some tinkering, my guess is that the y created by make_classification is actually int-typed, even though it self-reports as float32?

Using cuml.make_classification, which returns numpy arrays, I get an int dtype for y. If I then wrap this output using dask_cudf.from_cudf(cudf.DataFrame(...)), fitting works as expected.


divyegala commented Oct 2, 2020

Thanks @ResidentMario for bringing this to our notice!

Looks like this is a bug that slipped through. We are indeed creating int64 labels, but since we set our own meta for the dask.array, it incorrectly reports float32. I will work on a fix for this soon.
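The mismatch divyegala describes can be reproduced on CPU with plain dask.array and numpy (a sketch of the general mechanism, not cuml's actual code): dask trusts a hand-supplied meta for dtype reporting, even when the computed chunks hold a different dtype.

```python
import numpy as np
import dask.array as da

# Start from a float32 dask array.
x = da.ones(4, chunks=2, dtype="float32")

# Produce int64 chunks, but hand dask a float32 meta on purpose.
labels = x.map_blocks(
    lambda block: (block > 0).astype("int64"),
    meta=np.empty((0,), dtype="float32"),
)

print(labels.dtype)            # float32 -- what the metadata claims
print(labels.compute().dtype)  # int64   -- what the chunks actually hold
```

This is exactly the shape of the reported bug: the array self-reports float32 until a worker touches the real int64 data.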

@divyegala divyegala added Cython / Python Cython or Python issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. and removed ? - Needs Triage Need team to review and classify labels Oct 2, 2020
@divyegala divyegala self-assigned this Oct 2, 2020
ResidentMario (Author) commented

👀 Ok great, let me know when this bug gets resolved; I look forward to continuing work on the benchmark script that uses this code.

divyegala (Member) commented

@ResidentMario apart from that, is there any reason you are using make_classification instead of make_regression for LinearRegression? I think our cuml.dask.make_regression will generate labels of the correct dtype for this use case.

ResidentMario (Author) commented

The trouble is that I am already using make_classification elsewhere. I could rewrite that code to use a regression example, but this seems like a trivial fix to make. Once this is fixed I can just pull cuml from nightly, which would save me a rewrite.


divyegala commented Oct 6, 2020

@ResidentMario while I could fix the meta so that labels generated by cuml.dask.make_classification display the correct type, I don't believe it would be correct for us to generate float32/float64 labels here. scikit-learn also generates int64 labels, and any of our classification algorithms would work well with that label type. So I think the correct fix on my side is to correct the metadata for the labels while still generating int, both to stick to the scikit-learn-style API and to generate correct input for our classification algorithms. Unfortunately, I would have to request that you try cuml.dask.make_regression for your use case.

ResidentMario (Author) commented

Hmm, I'm not sure we're disagreeing here.

I'm planning on creating a dataset using cuml.dask.make_classification and then consuming it with one of the cuml.dask.* learners. So:

correctly generate for our classification algorithms

Is the behavior I'm interested in.

The other benchmarks I've done so far use cuml.make_classification and sklearn.datasets.make_classification. What I am saying is that if I wanted to use make_regression, I would have to rewrite those benchmarks to do regression instead.

I do not have any dependencies on the current behavior of cuml.dask.make_classification.

divyegala (Member) commented

@ResidentMario sorry if we're diverging a bit in our understanding of the issue at hand. Your dependency on cuml.dask.make_classification is that all cuml.dask.* learners should work with it out of the box? So a conversion from int to float should be handled on our side, without you having to worry about it from a user perspective?

divyegala (Member) commented

@ResidentMario thinking more on this... would it help if I added a parameter like label_dtype='int64' so that labels can be generated in any given dtype? There is value in not having to create a copy of the data, which is what converting the labels to another dtype under the hood would require. Generating labels directly in the required dtype avoids keeping that extra copy in GPU memory.
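A rough sketch of what such a label_dtype parameter could look like (hypothetical helper with a numpy stand-in; the real cuml generator works on GPU memory, and neither the name make_labels nor this signature exists in cuml):

```python
import numpy as np

def make_labels(n_samples, n_classes=2, label_dtype="int64", seed=None):
    """Hypothetical helper: generate class labels directly in the
    requested dtype, so no second copy is needed for a later cast."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n_samples)
    # copy=False skips the copy when the dtype already matches.
    return y.astype(label_dtype, copy=False)

print(make_labels(5, label_dtype="float32").dtype)  # float32
print(make_labels(5).dtype)                         # int64
```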

@ResidentMario ResidentMario changed the title [BUG] cuml.dask.linear_model.LinearRegression throws on cuml.dask.datasets.classification.make_classification dataset [BUG] cuml.dask.datasets.classification.make_classification creates y value that reports wrong dtype Oct 8, 2020
ResidentMario (Author) commented

I actually just took up your suggestion and switched to using make_regression and rolled onwards. 😄

I do not advocate diverging cuml's implementation of make_classification from the sklearn one. Like you, I think they should be kept as similar as possible to minimize user surprise. I would prioritize that over out-of-the-box classifiers (it should be pretty trivial to convert y to the right dtype as a preprocessing step anyway).

I think the scope of work for this issue is limited to fixing make_classification to report the dtype of its y column correctly. I've updated the title of this issue to reflect that!
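The preprocessing cast mentioned above really is a one-liner; a minimal numpy illustration (the same .astype call works on dask arrays):

```python
import numpy as np

# int64 labels, as sklearn-style make_classification returns them
y = np.array([0, 1, 1, 0], dtype=np.int64)

# Cast once before handing y to a float32-only regressor
y_float = y.astype(np.float32)

print(y_float.dtype)  # float32
```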
