Optimizing multi-source byte range reading in JSON reader #15396
suggestions with a side of scope creep :D
/ok to test
…byte-range-improvement
/ok to test
datasource::owning_buffer<rmm::device_uvector<char>> outdata(std::move(outbuf));
std::swap(indata, outdata);
This looks like it's significant, but I didn't grasp it.
What's the advantage of swapping in place over, say, assigning to indata?
My reasoning here was that after the RMM buffer in the indata datasource has been normalized, we can discard the buffer. Rather than copying outdata to indata with the implicitly generated copy assignment operator in owning_buffer, I thought swapping it would be faster.
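To make the trade-off in this exchange concrete, here is a minimal host-only sketch. `owning_buffer_sketch`, `replace_by_swap`, and `replace_by_move` are hypothetical names, and the class is a stand-in that owns a `std::vector<char>`; it is not the cuDF `owning_buffer`. Under the assumption that the payload is move-only, neither swapping nor assigning from an rvalue copies any bytes. The practical difference is when the old storage is released and what the source object holds afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an owning buffer; NOT the cuDF class.
// Copying is deleted to mimic a move-only payload (an assumption of this sketch).
class owning_buffer_sketch {
 public:
  explicit owning_buffer_sketch(std::vector<char>&& data) : _data(std::move(data)) {}
  owning_buffer_sketch(owning_buffer_sketch const&)            = delete;
  owning_buffer_sketch& operator=(owning_buffer_sketch const&) = delete;
  owning_buffer_sketch(owning_buffer_sketch&&)                 = default;
  owning_buffer_sketch& operator=(owning_buffer_sketch&&)      = default;

  char const* data() const { return _data.data(); }
  std::size_t size() const { return _data.size(); }

 private:
  std::vector<char> _data;
};

// Swap: `in` takes over the new storage, and `out` keeps the old storage
// until `out` is destroyed by the caller.
inline void replace_by_swap(owning_buffer_sketch& in, owning_buffer_sketch& out)
{
  std::swap(in, out);
}

// Move assignment: `in` takes over the new storage and releases its old
// storage right away; `out` is left in a valid but unspecified state.
inline void replace_by_move(owning_buffer_sketch& in, owning_buffer_sketch&& out)
{
  in = std::move(out);
}
```

In both helpers the storage pointer of the new buffer is handed over unchanged, so the choice between them is mostly about the lifetime of the old allocation rather than copy cost.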
Some minor nitpicks, and a clarifying question.
A few nits and questions
cpp/src/io/json/read_json.cu
// of subchunks.
size_t buffer_size =
  reader_compression != compression_type::NONE
    ? total_source_size * compression_ratio + 4096
Why are we adding 4096? Is that for headers?
Yes, 4096 is for the headers. I've used the uncompressed buffer size estimate from cudf/cpp/src/io/comp/uncomp.cpp, line 361 in 064dd7b:
uncomp_len = comp_len * 4 + 4096; // In case uncompressed size isn't known in advance, assume
Would be nice to have this guess defined instead of a hard-coded value. I'm ok either way though.
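One way to act on this suggestion, sketched under assumptions: the constant and function names below are hypothetical and do not come from the cuDF sources. The values mirror the estimate quoted above (`comp_len * 4 + 4096`), with the 4096 treated as the header allowance mentioned in the reply.

```cpp
#include <cassert>
#include <cstddef>

// Assumed names for illustration; only the values (4 and 4096) come from the
// estimate quoted in the discussion.
constexpr std::size_t estimated_compression_ratio = 4;
constexpr std::size_t header_size_allowance       = 4096;

// Guess the uncompressed size when it is not known in advance.
constexpr std::size_t estimate_uncompressed_size(std::size_t compressed_size)
{
  return compressed_size * estimated_compression_ratio + header_size_allowance;
}

// Compile-time check that the named constants reproduce the hard-coded guess.
static_assert(estimate_uncompressed_size(0) == 4096, "header allowance only");
```

Naming the two values keeps the guess in one place, so the JSON reader and the decompression code could share it instead of repeating the literals.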
/ok to test
Just nits from me this time.
…byte-range-improvement
/ok to test
/merge
Description
This PR has two goals: (i) reducing repeated reading of byte range chunks in the JSON reader, and (ii) enabling multi-source byte range reading for chunks that span sources.
byte_range option is used in read_json #15185 to reduce the repeated reading of follow-on chunks while searching for the end of the last row in the requested chunk. After the requested chunk, the following chunks are divided into subchunks and read until the delimiter character is reached.
Figure: Visualization of the performance improvement with this optimization
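The subchunk search described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the reader's implementation: `read_range` and `read_chunk_to_row_end` are hypothetical names, sources are plain in-memory strings treated as one contiguous byte stream, and skipping a partial first row at the start of the chunk is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy up to `size` bytes starting at global `offset`, crossing source
// boundaries as needed. Returns fewer bytes at the end of the input.
std::string read_range(std::vector<std::string> const& sources,
                       std::size_t offset,
                       std::size_t size)
{
  std::string out;
  std::size_t source_begin = 0;
  for (auto const& src : sources) {
    std::size_t const source_end = source_begin + src.size();
    if (size > 0 && offset < source_end) {
      std::size_t const local = offset - source_begin;
      std::size_t const n     = std::min(size, src.size() - local);
      out.append(src, local, n);
      offset += n;  // continue at the start of the next source
      size -= n;
    }
    source_begin = source_end;
  }
  return out;
}

// Read the requested chunk once, then read small subchunks until the
// delimiter that ends the last row is found.
std::string read_chunk_to_row_end(std::vector<std::string> const& sources,
                                  std::size_t chunk_offset,
                                  std::size_t chunk_size,
                                  std::size_t subchunk_size,
                                  char delimiter = '\n')
{
  assert(subchunk_size > 0);
  std::string buffer = read_range(sources, chunk_offset, chunk_size);
  // Simplification of this sketch: a chunk that already ends on a delimiter
  // needs no follow-on reads.
  if (!buffer.empty() && buffer.back() == delimiter) { return buffer; }
  std::size_t next = chunk_offset + buffer.size();
  while (true) {
    std::string const sub = read_range(sources, next, subchunk_size);
    if (sub.empty()) { break; }  // end of input, last row has no delimiter
    auto const pos = sub.find(delimiter);
    if (pos != std::string::npos) {
      buffer.append(sub, 0, pos + 1);  // keep bytes through the delimiter
      break;
    }
    buffer += sub;
    next += sub.size();
  }
  return buffer;
}
```

Each byte after the requested chunk is read at most once, and at most one subchunk's worth of bytes past the row end is fetched, which is the saving over re-reading a full-size follow-on chunk.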