Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing multi-source byte range reading in JSON reader #15396

Merged
merged 48 commits into from
Apr 30, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Mar 26, 2024

Description

This piece of work seeks to achieve two goals - (i) reducing repeated reading of byte range chunks in the JSON reader, and (ii) enabling multi-source byte range reading for chunks spanning sources.

  • We expand on the idea outlined in Reduce IO when byte_range option is used in read_json #15185 to reduce the repeated reading of follow-on chunks while searching for the end of the last row in the requested chunk. After the requested chunk, the following chunks are divided into subchunks, and read until the delimiter character is reached.
  • We estimate the buffer size needed for the entire byte range, and compute offsets per source into the buffer.

Visualization of the performance improvement with this optimization

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 26, 2024
@shrshi shrshi added feature request New feature or request Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed feature request New feature or request labels Mar 26, 2024
@shrshi shrshi marked this pull request as ready for review March 27, 2024 16:02
@shrshi shrshi requested a review from a team as a code owner March 27, 2024 16:03
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestions with a side of scope creep :D

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
Copy link

copy-pr-bot bot commented Apr 5, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shrshi
Copy link
Contributor Author

shrshi commented Apr 5, 2024

/ok to test

@shrshi shrshi marked this pull request as draft April 5, 2024 10:21
@shrshi
Copy link
Contributor Author

shrshi commented Apr 5, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 5, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 5, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 23, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 24, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 25, 2024

/ok to test

Comment on lines 322 to 323
datasource::owning_buffer<rmm::device_uvector<char>> outdata(std::move(outbuf));
std::swap(indata, outdata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's significant, but I didn't grasp it.
What's the advantage of swapping in place over, say, assigning to indata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning here was that after the RMM buffer in indata datasource has been normalized, we can discard the buffer. Rather than copying outdata to indata with the implicitly generated copy assignment operator in owning_buffer, I thought swapping it would be faster.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks, and a clarifying question.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits and questions

cpp/src/io/json/json_normalization.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
// of subchunks.
size_t buffer_size =
reader_compression != compression_type::NONE
? total_source_size * compression_ratio + 4096
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding 4096? Is that for headers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 4096 is for the headers. I've used the uncompressed buffer size estimate from

uncomp_len = comp_len * 4 + 4096; // In case uncompressed size isn't known in advance, assume

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have this guess defined instead of a hard-coded value. I'm ok either way though.

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Apr 29, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 29, 2024

/ok to test

@shrshi shrshi requested a review from hyperbolic2346 April 29, 2024 23:16
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nits from me this time.

cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Apr 30, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Apr 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit f3206ea into rapidsai:branch-24.06 Apr 30, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants