
Given JSON input is too large when using "cudf" backend #174

Closed
Edfame opened this issue Aug 1, 2024 · 3 comments
Labels: bug (Something isn't working)

Edfame commented Aug 1, 2024

Describe the bug

I'm trying to load some data from JSONL files using the cudf backend, but the read fails with "ValueError: Metadata inference failed in read_single_partition.", whose underlying cause is "RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')" (full traceback below).

Steps/Code to reproduce bug

import numpy as np
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

files = get_all_files_paths_under("/path/data")  # ['/path/data/file1.jsonl', ..., '/path/data/fileN.jsonl']
meta = {
    "field1": np.dtype("int64"),
    "fieldN": np.dtype("O"),
}

dataset = DocumentDataset.read_json(
    input_files=files, backend="cudf", add_filename=True, input_meta=meta
)

Expected behavior

I expect the data to load just as it does with the pandas backend, only faster, taking advantage of GPU parallelization.
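For comparison, the same call with backend="pandas" is what currently loads these files successfully. A minimal sketch, assuming the pandas path accepts the same keyword arguments as the cudf one:

# CPU fallback: identical call, pandas backend instead of cudf.
# Avoids the cuDF JSON reader's file-size limit at the cost of GPU speed.
dataset_cpu = DocumentDataset.read_json(
    input_files=files, backend="pandas", add_filename=True, input_meta=meta
)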

Environment overview

  • Environment location: enroot + pyxis on a Slurm cluster
  • Method of NeMo-Curator install: NeMo Framework Docker image (24.05.llama3.1), NeMo Curator 0.4.0

Additional logs

ValueError: Metadata inference failed in `read_single_partition`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')

Traceback:
---------
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/core.py", line 7175, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/io/json.py", line 96, in read_json
    df = libjson.read_json(
  File "json.pyx", line 45, in cudf._lib.json.read_json
  File "json.pyx", line 137, in cudf._lib.json.read_json
Edfame added the bug label on Aug 1, 2024
ryantwolf (Collaborator) commented

Paraphrasing some points after discussing with @ayushdg: cuDF had an issue where it could not read individual JSONL files larger than 2.1 GB. See rapidsai/cudf#16138.
cuDF fixed this in rapidsai/cudf#16162, which should be part of the release next week. A couple of steps:

  1. For the time being, we recommend splitting the large file into smaller chunks using NeMo Curator's make_data_shards functionality (see the sketch after this list).
  2. We'll test whether the cuDF fix works as expected and suggest using the latest nightly containers once the new cuDF version is out in a week. We'll update this issue then.
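For reference, a minimal sketch of the resharding workaround in step 1. It assumes the reshard_jsonl helper in nemo_curator.utils.file_utils (the function behind make_data_shards); the exact name and signature may vary between NeMo Curator versions, so check your install:

from nemo_curator.utils.file_utils import reshard_jsonl

# Split every .jsonl under input_dir into roughly output_file_size-sized shards,
# keeping each shard comfortably below cuDF's ~2.1 GB per-file limit.
# output_dir and the size string are illustrative placeholders.
reshard_jsonl(
    input_dir="/path/data",
    output_dir="/path/data_sharded",
    output_file_size="100M",
)

After resharding, point DocumentDataset.read_json at /path/data_sharded instead of /path/data.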

Edfame (Author) commented Aug 2, 2024

Thank you @ryantwolf! For the time being I'll try to use make_data_shards and wait for the dev build next week!

ryantwolf (Collaborator) commented

Should be working now. Let us know if you run into this issue again.
