Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass batch size to JSON reader using environment variable #16502

Merged
merged 17 commits into from
Aug 12, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Aug 6, 2024

Description

The JSON reader set the batch size to INT_MAX bytes since the motivation for implementing a batched JSON reader was to parse source files whose total size is larger than INT_MAX (#16138, #16162). However, we can use a much smaller batch size to evaluate the correctness of the reader and speed up tests significantly.
This PR focuses on reducing runtime of the batched reader test by setting the batch size to be used by the reader as an environment variable.
The runtime of JsonLargeReaderTest.MultiBatch in LARGE_STRINGS_TEST gtest drops from ~52s to ~3s.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shrshi shrshi requested review from a team as code owners August 6, 2024 23:38
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 6, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 6, 2024
@shrshi shrshi requested a review from davidwendt August 6, 2024 23:44
@shrshi
Copy link
Contributor Author

shrshi commented Aug 7, 2024

/ok to test

@shrshi shrshi requested a review from ttnghia August 7, 2024 20:44
Co-authored-by: Nghia Truong <[email protected]>
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I did not realize the environment variable is just for the test. Thanks for doing that. What is the execution time for the test now? Can you try it with compute-sanitizer too?

Also, just a nit, can you change plain size_t to std::size_t in places where this PR changed code?

@shrshi
Copy link
Contributor Author

shrshi commented Aug 12, 2024

This looks good. I did not realize the environment variable is just for the test. Thanks for doing that. What is the execution time for the test now? Can you try it with compute-sanitizer too?

The execution time has dropped from 52s to 3s now. With compute-sanitizer, the runtime is 123s.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake approval.

Copy link

copy-pr-bot bot commented Aug 12, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shrshi
Copy link
Contributor Author

shrshi commented Aug 12, 2024

/ok to test

@shrshi shrshi added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 12, 2024
@shrshi
Copy link
Contributor Author

shrshi commented Aug 12, 2024

/merge

@rapids-bot rapids-bot bot merged commit cce00c0 into rapidsai:branch-24.10 Aug 12, 2024
80 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants