
Given JSON input is too large when using "cudf" backend #174

Closed
Edfame opened this issue Aug 1, 2024 · 3 comments
Labels: bug (Something isn't working)

Edfame commented Aug 1, 2024

Describe the bug

I'm trying to load some data from JSONL files using the cudf backend, but the read fails with "ValueError: Metadata inference failed in read_single_partition.", whose underlying cause is "RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')" (full traceback below).

Steps/Code to reproduce bug

import numpy as np
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

files = get_all_files_paths_under("/path/data")  # ['/path/data/file1.jsonl', ..., '/path/data/fileN.jsonl']
meta = {
    "field1": np.dtype("int64"),
    "fieldN": np.dtype("O"),
}

dataset = DocumentDataset.read_json(
    input_files=files, backend="cudf", add_filename=True, input_meta=meta
)

Expected behavior

I expect the data to load just as it does with the pandas backend, only faster, taking advantage of GPU parallelization.
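For comparison, the same call with backend="pandas" is what currently loads these files successfully. A minimal sketch, assuming the pandas path accepts the same keyword arguments as the cudf one:

# CPU fallback: identical call, pandas backend instead of cudf.
# Avoids the cuDF JSON reader's file-size limit at the cost of GPU speed.
dataset_cpu = DocumentDataset.read_json(
    input_files=files, backend="pandas", add_filename=True, input_meta=meta
)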

Environment overview

  • Environment location: enroot + pyxis on a Slurm cluster
  • Method of NeMo-Curator install: NeMo Framework Docker image (24.05.llama3.1), NeMo Curator 0.4.0

Additional logs

ValueError: Metadata inference failed in `read_single_partition`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')

Traceback:
---------
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/core.py", line 7175, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/io/json.py", line 96, in read_json
    df = libjson.read_json(
  File "json.pyx", line 45, in cudf._lib.json.read_json
  File "json.pyx", line 137, in cudf._lib.json.read_json
Edfame added the bug label on Aug 1, 2024
ryantwolf (Collaborator) commented

Paraphrasing some points after discussing with @ayushdg: cuDF had an issue where it could not read individual JSONL files larger than 2.1 GB. See rapidsai/cudf#16138.
cuDF fixed this in rapidsai/cudf#16162, which should be part of the release next week. A couple of steps:

  1. For the time being, we recommend splitting the large file into smaller chunks using NeMo Curator's make_data_shards functionality (see the sketch after this list).
  2. We'll test whether the cuDF fix works as expected and suggest using the latest nightly containers once the new cuDF version is out in a week. We'll update this issue then.
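For reference, a minimal sketch of the resharding workaround in step 1. It assumes the reshard_jsonl helper in nemo_curator.utils.file_utils (the function behind make_data_shards); the exact name and signature may vary between NeMo Curator versions, so check your install:

from nemo_curator.utils.file_utils import reshard_jsonl

# Split every .jsonl under input_dir into roughly output_file_size-sized shards,
# keeping each shard comfortably below cuDF's ~2.1 GB per-file limit.
# output_dir and the size string are illustrative placeholders.
reshard_jsonl(
    input_dir="/path/data",
    output_dir="/path/data_sharded",
    output_file_size="100M",
)

After resharding, point DocumentDataset.read_json at /path/data_sharded instead of /path/data.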

Edfame (Author) commented Aug 2, 2024

Thank you @ryantwolf! For the time being I'll try to use make_data_shards and wait for the dev build next week!

ryantwolf (Collaborator) commented

Should be working now. Let us know if you run into this issue again.
