[BUG] Limit JSON reader input to 2GiB #17017

Open
karthikeyann opened this issue Oct 8, 2024 · 3 comments
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@karthikeyann
Contributor

karthikeyann commented Oct 8, 2024

Describe the bug
Due to limitations of certain algorithms such as cub::ExclusiveScan, and of the JSON tree implementation, the input JSON cannot exceed 2 GiB (INT_MAX bytes).
The json_read_data_type INTEGRAL benchmark fails to run with an out_of_memory allocation error because the input JSON size exceeds the 2 GiB limit.

json_in.size() = 2167967896
total_temp_storage_bytes: 18446744073707336447
Fail: Unexpected error: std::bad_alloc: out_of_memory: RMM failure at:/home/coder/.conda/envs/rapids/include/rmm/mr/device/pool_memory_resource.hpp:277: Maximum pool size exceeded
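
The input size in the log above, 2167967896 bytes, is larger than INT_MAX (2147483647). The following is a minimal sketch of what happens when such a size is stored in a signed 32-bit value and later widened to an unsigned 64-bit byte count. The helper names `fits_in_int32` and `wrap_to_int32` are hypothetical and are not part of libcudf or CUB. The issue does not say exactly how the `total_temp_storage_bytes` figure was produced, so this only illustrates the general class of wraparound, not that specific number.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a byte count fits in a signed 32-bit integer.
constexpr bool fits_in_int32(std::uint64_t n) {
  return n <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: the value a two's-complement 32-bit signed integer
// would hold after truncating n. Modeled in 64-bit arithmetic so that the
// sketch does not rely on implementation-defined narrowing conversions.
constexpr std::int64_t wrap_to_int32(std::uint64_t n) {
  return (n & 0xFFFFFFFFULL) >= 0x80000000ULL
             ? static_cast<std::int64_t>(n & 0xFFFFFFFFULL) - 0x100000000LL
             : static_cast<std::int64_t>(n & 0xFFFFFFFFULL);
}

// Size reported as json_in.size() in the log.
constexpr std::uint64_t json_size = 2167967896ULL;

static_assert(!fits_in_int32(json_size), "the input is larger than INT_MAX");
static_assert(wrap_to_int32(json_size) == -2126999400LL,
              "truncation to 32 bits yields a negative count");
// A negative count reinterpreted as an unsigned 64-bit size becomes enormous,
// which is the kind of request a pool allocator cannot satisfy.
static_assert(static_cast<std::uint64_t>(wrap_to_int32(json_size)) ==
                  18446744071582552216ULL,
              "negative count widened to an unsigned 64-bit size");
```

A byte count of that magnitude is consistent with the "Maximum pool size exceeded" failure, although the precise path through the reader is an assumption here.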

Steps/Code to reproduce bug
./benchmarks/JSON_READER_NVBENCH --profile --benchmark json_read_data_type --axis data_type=INTEGRAL

Expected behavior
Inputs larger than 2 GiB should either be read in multiple batches by read_json, or read_json should fail with an error stating that sources larger than 2 GiB are not supported.
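
The second option, erroring out early, could look roughly like the sketch below. `check_json_input_size` is a hypothetical helper written for illustration; it is not the check libcudf uses, and the exception type and message are assumptions. It assumes a platform where `std::size_t` is 64 bits wide.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical guard, not a libcudf API: rejects inputs whose size in bytes
// cannot be represented as a non-negative int.
inline void check_json_input_size(std::size_t num_bytes) {
  constexpr auto max_bytes =
      static_cast<std::size_t>(std::numeric_limits<int>::max());
  if (num_bytes > max_bytes) {
    throw std::invalid_argument(
        "JSON input of " + std::to_string(num_bytes) +
        " bytes exceeds the supported maximum of " +
        std::to_string(max_bytes) + " bytes");
  }
}
```

Calling the guard before any temporary storage is sized would turn the out_of_memory failure into a clear message about the unsupported input size.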

Additional context
#16978

@elstehle
Contributor

Due to limitations of certain algorithms such cub::ExclusiveScan [...] the input json can not exceed limit 2GiB. (INT_MAX)

We recently merged NVIDIA/cccl#2171 to add support for larger-than-INT_MAX number of items in cub::DeviceScan.

rapids-bot bot pushed a commit that referenced this issue Oct 14, 2024
…n `INT_MAX` bytes (#17057)

Addresses #17017 

Libcudf does not support parsing regular JSON inputs of size greater than `INT_MAX` bytes. Note that the batched reader can only be used for JSON lines inputs.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #17057
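
The commit message notes that batching only applies to JSON Lines input. One reason is that JSON Lines records end at newlines, so a large buffer can be cut at record boundaries. The sketch below plans such cuts; `plan_jsonl_batches` is a hypothetical helper, not the libcudf batched reader, and it assumes every newline in the buffer is a record delimiter.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: returns [begin, end) byte ranges that cover `data`,
// each at most `max_batch_bytes` long and each ending on a newline
// (except possibly the last range, which ends at the end of the buffer).
inline std::vector<std::pair<std::size_t, std::size_t>> plan_jsonl_batches(
    std::string_view data, std::size_t max_batch_bytes) {
  if (max_batch_bytes == 0) {
    throw std::invalid_argument("max_batch_bytes must be positive");
  }
  std::vector<std::pair<std::size_t, std::size_t>> batches;
  std::size_t begin = 0;
  while (begin < data.size()) {
    std::size_t end = data.size();
    if (end - begin > max_batch_bytes) {
      // Last newline inside the window [begin, begin + max_batch_bytes).
      auto const nl = data.rfind('\n', begin + max_batch_bytes - 1);
      if (nl == std::string_view::npos || nl < begin) {
        throw std::invalid_argument("a single JSON line exceeds the batch size");
      }
      end = nl + 1;  // keep the newline together with its record
    }
    batches.emplace_back(begin, end);
    begin = end;
  }
  return batches;
}
```

A regular (non-lines) JSON document has no such guaranteed split points, which is consistent with the commit's choice to reject oversized regular JSON rather than batch it.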
@ttnghia
Contributor

ttnghia commented Oct 15, 2024

Then should we revert #17057?

@vyasr
Contributor

vyasr commented Oct 16, 2024

Not until we update to the latest CCCL. This feature hasn't even been released yet, so we have a ways to go.

@karthikeyann karthikeyann added this to the Nested JSON reader milestone Nov 12, 2024