Describe the bug
I'm trying to load some data from JSONL files using the cudf backend, but I get the errors "ValueError: Metadata inference failed in read_single_partition." and "RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')".
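For context, the failing call looks roughly like the following. This is a minimal sketch: the DocumentDataset.read_json entry point and its backend= keyword are assumed from NeMo Curator's documented usage, and the file path is a placeholder for a single JSONL file larger than ~2.1 GB.

```python
# Sketch of the failing read (assumed API; path is a placeholder).
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json(
    "data/large_corpus.jsonl",  # hypothetical single shard > 2.1 GB
    backend="cudf",             # the same call with backend="pandas" loads fine
)
print(dataset.df.head())
```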
ValueError: Metadata inference failed in `read_single_partition`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')
Traceback:
---------
File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_erroryield
File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/core.py", line 7175, in _emulatereturn func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
df = read_f(file, **read_kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/io/json.py", line 96, in read_json
df = libjson.read_json(
File "json.pyx", line 45, in cudf._lib.json.read_json
File "json.pyx", line 137, in cudf._lib.json.read_json
Steps/Code to reproduce bug

Expected behavior
I expect the data to load just as it does with the pandas backend, only "faster", making use of GPU parallelization.

Environment overview (please complete the following information)

Additional logs
Paraphrasing some points after discussing with @ayushdg: cuDF had an issue where it could not read individual JSONL files larger than 2.1 GB. See rapidsai/cudf#16138.
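If you want to confirm that this limit is what you're hitting, a quick size check over the input files is enough. A minimal sketch; the "data" directory is a placeholder for your input path:

```python
from pathlib import Path

LIMIT_BYTES = 2_100_000_000  # roughly the ~2.1 GB cuDF JSON reader limit described above

# "data" is a placeholder for your input directory.
for path in sorted(Path("data").glob("*.jsonl")):
    size = path.stat().st_size
    flag = "too large for the cuDF reader" if size > LIMIT_BYTES else "ok"
    print(f"{path}: {size / 1e9:.2f} GB ({flag})")
```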
cuDF fixed this in rapidsai/cudf#16162, which should be part of next week's release. A couple of steps:
For the time being, we recommend splitting the large file into smaller chunks using NeMo Curator's make_data_shards functionality (a generic splitting sketch follows below).
We'll test whether the cuDF fix works as expected and suggest using the latest nightly containers once the new cuDF version is out in a week. We'll update this issue then.
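If you'd rather not use the make_data_shards utility, the workaround is conceptually just resharding: stream the oversized JSONL and write it back out as smaller files. A minimal stand-in sketch (the output naming and the 1 GB target size are arbitrary choices, not NeMo Curator defaults):

```python
from pathlib import Path

def split_jsonl(src: str, out_dir: str, max_bytes: int = 1_000_000_000) -> None:
    """Split one large .jsonl file into shards of at most max_bytes each."""
    src_path = Path(src)
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    shard_idx, written, out_file = 0, 0, None
    with src_path.open("rb") as f:
        for line in f:  # JSONL is line-delimited, so splitting on line boundaries is safe
            if out_file is None or written + len(line) > max_bytes:
                if out_file is not None:
                    out_file.close()
                out_file = (out_path / f"{src_path.stem}_{shard_idx:05d}.jsonl").open("wb")
                shard_idx, written = shard_idx + 1, 0
            out_file.write(line)
            written += len(line)
    if out_file is not None:
        out_file.close()

# Example usage (placeholder paths):
# split_jsonl("data/large_corpus.jsonl", "data/shards", max_bytes=1_000_000_000)
```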