
[BUG] cuml.dask.datasets.classification.make_classification creates y value that reports wrong dtype #2898

Closed
ResidentMario opened this issue Oct 1, 2020 · 10 comments · Fixed by #2940
Labels: bug (Something isn't working), Cython / Python (Cython or Python issue), Dask / cuml.dask (Issue/PR related to Python level dask or cuml.dask features)


ResidentMario commented Oct 1, 2020

Describe the bug

The distributed linear regression algorithm throws when fitted on a simple make_classification example due to mismatched types. This seems to indicate an error in the implementation; the exact error is:

RuntimeError: 1 of 1 worker jobs failed: Expected input to be of type in [dtype('float32')] but got int64

But the data being fed to the algorithm is float32:

(screenshot: X and y both report dtype float32)

Steps/Code to reproduce bug

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from types import SimpleNamespace

# Benchmark parameters; only dataset_size and chunk_size are used below
args = SimpleNamespace(
    n_workers=1,
    threads_per_worker=1,
    dataset_size='10 GB',
    worker_memory_limit='12 GB',
    chunk_size=5950000,
)

cluster = LocalCUDACluster()
client = Client(cluster)

from cuml.dask.datasets.classification import make_classification

MAGIC_N_SAMPLES = {
    '1 MB': 5950,
    '10 MB': 59500,
    '100 MB': 595000,
    '1 GB': 5950000,
    '10 GB': 59500000,
    '100 GB': 595000000
}
if args.dataset_size != '':
    args.n_samples = MAGIC_N_SAMPLES[args.dataset_size]

X, y = make_classification(
    n_samples=args.n_samples, n_features=20, n_informative=2, n_redundant=2, n_classes=2,
    n_parts=args.n_samples // args.chunk_size
)

from cuml.dask.linear_model import LinearRegression
clf = LinearRegression()

import time
start_time = time.time()

clf.fit(X, y)

print(f"Fitting done in {str(time.time() - start_time)} seconds.")

Environment details (please complete the following information):
The error occurs on a g4dn.xlarge instance on AWS (1x T4 GPU) with conda and the following environment.yml:

name: rapids
channels:
  - rapidsai
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cuml=0.15
  - cudf=0.15
  - cudatoolkit=10.1
  - dask-ml=1.7.0
  - pip:
     - dask_ml
     - dask_cuda

OS is Ubuntu 18.04.

@ResidentMario ResidentMario added ? - Needs Triage Need team to review and classify bug Something isn't working labels Oct 1, 2020
ResidentMario (Author) commented

After some tinkering, my guess is that the y created by make_classification is actually int-typed, even though it self-reports as float32?

Using cuml.make_classification, which returns numpy arrays, I get an int dtype for y. If I then wrap this output using dask_cudf.from_cudf(cudf.DataFrame(...)), fitting works as expected.


divyegala commented Oct 2, 2020

Thanks @ResidentMario for bringing this to our notice!

Looks like this is a bug that slipped through. We are indeed creating int64 labels, but since we set our own meta for the dask.array, it incorrectly reports float32. I will work on a fix for this soon.
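The mismatch divyegala describes can be reproduced on CPU with plain dask.array and numpy (a sketch of the general mechanism, not cuml's actual code): dask trusts a hand-supplied meta for dtype reporting, even when the computed chunks hold a different dtype.

```python
import numpy as np
import dask.array as da

# Start from a float32 dask array.
x = da.ones(4, chunks=2, dtype="float32")

# Produce int64 chunks, but hand dask a float32 meta on purpose.
labels = x.map_blocks(
    lambda block: (block > 0).astype("int64"),
    meta=np.empty((0,), dtype="float32"),
)

print(labels.dtype)            # float32 -- what the metadata claims
print(labels.compute().dtype)  # int64   -- what the chunks actually hold
```

This is exactly the shape of the reported bug: the array self-reports float32 until a worker touches the real int64 data.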

@divyegala divyegala added Cython / Python Cython or Python issue Dask / cuml.dask Issue/PR related to Python level dask or cuml.dask features. and removed ? - Needs Triage Need team to review and classify labels Oct 2, 2020
@divyegala divyegala self-assigned this Oct 2, 2020
ResidentMario (Author) commented

👀 Ok great, let me know when this bug gets resolved; I look forward to continuing work on the benchmark script that uses this code.

divyegala (Member) commented

@ResidentMario apart from that, is there any reason you are using make_classification instead of make_regression for LinearRegression? I think our cuml.dask.make_regression will generate labels of the correct dtype for this use case.

ResidentMario (Author) commented

The trouble is that I am already using make_classification elsewhere. I could rewrite that code to use a regression example, but this seems like a trivial fix to make. Once this is fixed I can just pull cuml from nightly, which would save me a rewrite.


divyegala commented Oct 6, 2020

@ResidentMario while I could fix the meta so that labels generated by cuml.dask.make_classification display the correct type, I don't believe it would be correct for us to generate float32/float64 labels here. scikit-learn also generates int64 labels, and any of our classification algorithms would work well with that label type. So I think the correct fix on my side is to correct the metadata for the labels while still generating int, both to stick to the scikit-learn-style API and to generate correct input for our classification algorithms. Unfortunately, I would have to request that you try cuml.dask.make_regression for your use case.

ResidentMario (Author) commented

Hmm, I'm not sure we're disagreeing here.

I'm planning on creating a dataset using cuml.dask.make_classification and then consuming it with one of the cuml.dask.* learners. So:

correctly generate for our classification algorithms

Is the behavior I'm interested in.

The other benchmarks I've done so far use cuml.make_classification and sklearn.datasets.make_classification. What I am saying is that if I wanted to use make_regression, I would have to rewrite those benchmarks to do regression instead.

I do not have any dependencies on the current behavior of cuml.dask.make_classification.

divyegala (Member) commented

@ResidentMario sorry if we're diverging a bit in our understanding of the issue at hand. Your dependency on cuml.dask.make_classification is that all cuml.dask.* learners should work with it out of the box? So a conversion from int to float should be handled on our side, without you having to worry about it from a user perspective?

divyegala (Member) commented

@ResidentMario thinking more on this... would it help if I added a parameter like label_dtype='int64' so that labels can be generated in any given dtype? There is value in not having to create a copy of the data, which is what converting the labels to another dtype under the hood would require. Generating labels directly in the required dtype avoids keeping that extra copy in GPU memory.
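A rough sketch of what such a label_dtype parameter could look like (hypothetical helper with a numpy stand-in; the real cuml generator works on GPU memory, and neither the name make_labels nor this signature exists in cuml):

```python
import numpy as np

def make_labels(n_samples, n_classes=2, label_dtype="int64", seed=None):
    """Hypothetical helper: generate class labels directly in the
    requested dtype, so no second copy is needed for a later cast."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n_samples)
    # copy=False skips the copy when the dtype already matches.
    return y.astype(label_dtype, copy=False)

print(make_labels(5, label_dtype="float32").dtype)  # float32
print(make_labels(5).dtype)                         # int64
```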

@ResidentMario ResidentMario changed the title [BUG] cuml.dask.linear_model.LinearRegression throws on cuml.dask.datasets.classification.make_classification dataset [BUG] cuml.dask.datasets.classification.make_classification creates y value that reports wrong dtype Oct 8, 2020
ResidentMario (Author) commented

I actually just took up your suggestion and switched to using make_regression and rolled onwards. 😄

I do not advocate diverging cuml's implementation of make_classification from the sklearn one. Like you, I think they should be kept as similar as possible to minimize user surprise. I would prioritize that over out-of-the-box classifiers (it should be pretty trivial to convert y to the right dtype as a preprocessing step anyway).

I think the scope of work for this issue is limited to fixing make_classification to report the dtype of its y column correctly. I've updated the title of this issue to reflect that!
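The preprocessing cast mentioned above really is a one-liner; a minimal numpy illustration (the same .astype call works on dask arrays):

```python
import numpy as np

# int64 labels, as sklearn-style make_classification returns them
y = np.array([0, 1, 1, 0], dtype=np.int64)

# Cast once before handing y to a float32-only regressor
y_float = y.astype(np.float32)

print(y_float.dtype)  # float32
```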
