[BUG] Loading a file with null data with dask gives 'cudaErrorInvalidValue invalid argument' #7572

Closed
rommelDB opened this issue Mar 11, 2021 · 5 comments
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@rommelDB

Describe the bug
In a Dask environment with two workers, the following script fails with a fatal error while reading and processing a file with null data. The same issue occurs with other formats, such as CSV or ORC files.

Steps/Code to reproduce bug

import cudf

from distributed import Client
dask_client = Client('tcp://127.0.0.1:8786')

dir_data_lc = "/home/workspace/tpch-with-nulls/"

table = "customer"

df = cudf.read_parquet(dir_data_lc + table + "_0_0.parquet")

import dask_cudf
gdf = dask_cudf.from_cudf(df, npartitions=2)
gdf.to_csv("*.csv")

The current log is:

Traceback (most recent call last):
  File "bug-dask.py", line 16, in <module>
    gdf.to_csv("*.csv")
  File "/home/workspace/lib/python3.7/site-packages/dask/dataframe/core.py", line 1459, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/home/workspace/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 871, in to_csv
    delayed(values).compute(**compute_kwargs)
  File "/home/workspace/lib/python3.7/site-packages/dask/base.py", line 281, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/workspace/lib/python3.7/site-packages/dask/base.py", line 563, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/workspace/lib/python3.7/site-packages/distributed/client.py", line 2655, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/workspace/lib/python3.7/site-packages/distributed/client.py", line 1970, in gather
    asynchronous=asynchronous,
  File "/home/workspace/lib/python3.7/site-packages/distributed/client.py", line 839, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/workspace/lib/python3.7/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/workspace/lib/python3.7/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/workspace/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/workspace/lib/python3.7/site-packages/distributed/client.py", line 1829, in _gather
    raise exception.with_traceback(traceback)
  File "/home/workspace/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 685, in _write_csv
    df.to_csv(f, **kwargs)
  File "/home/workspace/lib/python3.7/site-packages/cudf/core/dataframe.py", line 7390, in to_csv
    **kwargs,
  File "/home/workspace/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/workspace/lib/python3.7/site-packages/cudf/io/csv.py", line 209, in to_csv
    index=index,
  File "cudf/_lib/csv.pyx", line 418, in cudf._lib.csv.write_csv
  File "cudf/_lib/csv.pyx", line 486, in cudf._lib.csv.write_csv
RuntimeError: CUDA error at: /home/workspace/include/rmm/device_buffer.hpp:445: cudaErrorInvalidValue invalid argument

If the same file is loaded with pandas instead of cuDF, the script runs without errors.

import dask.dataframe
import pandas as pd

from distributed import Client
dask_client = Client('tcp://127.0.0.1:8786')

dir_data_lc = "/home/workspace/tpch-with-nulls/"

table = "customer"

pdf = pd.read_parquet(dir_data_lc + table + "_0_0.parquet")
gdf = dask.dataframe.from_pandas(pdf, npartitions=2)
gdf.to_csv("*.csv")

Expected behavior
Loading and processing a file with null data should complete without errors.
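For anyone without access to the original Parquet file, a CPU-only baseline can help pin down what "run without errors" should look like. The sketch below is an assumption-laden stand-in, not the reporter's data: the column values are made up, and `make_frame_with_nulls` and `write_partitions` are hypothetical helpers. It splits a small table containing nulls into two partitions, writes each to its own CSV (mirroring the `0.csv`, `1.csv` naming that `to_csv("*.csv")` produces), and checks that the null counts survive the round trip.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def make_frame_with_nulls() -> pd.DataFrame:
    """Build a tiny stand-in table with nulls in a string and a float column.

    The values are invented for illustration; they are not taken from the
    file attached to this issue.
    """
    return pd.DataFrame(
        {
            "c_custkey": [1, 2, 3, 4],
            "c_name": ["Customer#1", None, "Customer#3", None],
            "c_acctbal": [711.56, np.nan, 7498.12, np.nan],
        }
    )


def write_partitions(df: pd.DataFrame, out_dir: Path, npartitions: int = 2) -> list:
    """Write contiguous row slices to 0.csv, 1.csv, ... and return the paths."""
    step = -(-len(df) // npartitions)  # ceiling division
    paths = []
    for i in range(npartitions):
        path = out_dir / f"{i}.csv"
        df.iloc[i * step:(i + 1) * step].to_csv(path, index=False)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    source = make_frame_with_nulls()
    parts = write_partitions(source, Path(tmp))
    # Read every partition back and compare per-column null counts.
    restored = pd.concat([pd.read_csv(p) for p in parts], ignore_index=True)
    null_counts_match = source.isna().sum().equals(restored.isna().sum())
```

If the same shape of data fails only on the cuDF path, that narrows the problem to the GPU writer rather than to the partitioning or the null values themselves.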

Environment overview

  • Environment location: Bare-metal
  • Method of cuDF install: conda nightly v0.19

@rommelDB added the "Needs Triage" and "bug" labels on Mar 11, 2021.
@kkraus14 added the "cuIO" and "libcudf" labels and removed the "Needs Triage" label on Mar 26, 2021.

@vuule
Contributor

vuule commented Mar 29, 2021

@rommelDB can you please share the input Parquet file?

@rommelDB
Author

@vuule
Contributor

vuule commented Mar 29, 2021

Thank you. Will take a look today.

@vuule
Contributor

vuule commented Mar 30, 2021

@rommelDB, I could not repro the issue on ToT 0.19:

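    import cudf
    import dask_cudf
    from distributed import Client

    # No scheduler address is given, so this starts a local cluster with process-based workers
    dask_client = Client(processes=True, asynchronous=False)

    df = cudf.read_parquet("customer_0_0.parquet")

    gdf = dask_cudf.from_cudf(df, npartitions=2)
    gdf.to_csv("*.csv")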

Tried both to_csv and to_orc.
Is the issue specific to a dask configuration?

@rommelDB
Author

Hi @vuule, it seems that the issue has gone away. I just tried with the latest 0.19 nightly and I couldn't reproduce it either. I appreciate your help.
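Since the failure disappeared between nightlies, it can help to record the exact package versions whenever a report like this is filed. The following is a minimal sketch using only the standard library (Python 3.8+). `collect_versions` is a hypothetical helper, and the default distribution names are assumptions about how the packages are registered in a given environment.

```python
from importlib import metadata


def collect_versions(packages=("cudf", "dask-cudf", "rmm", "dask", "distributed")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed (or is registered under another name).
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

With a connected `distributed.Client`, passing the function to `Client.run(collect_versions)` should return one such dictionary per worker, which makes it easier to spot a worker running an older build than the client. `Client.get_versions(check=True)` offers a related built-in consistency check.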
