Reduce IO when byte_range option is used in read_json #15185

Closed
vuule opened this issue Feb 29, 2024 · 2 comments
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@vuule

vuule commented Feb 29, 2024

When reading a byte range from a file, the JSON reader has to read data beyond the requested range to get the remainder of the last row that starts within it. To find the end of this row, the current implementation reads the entire next byte range. Once the reader finds the delimiter, it reads the full required range of data, discarding everything it read previously.
In most cases this means about 3x the requested data is read (the requested range, the next range, and then the full required range again):
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/json/read_json.cu#L162-L202
We could instead read the additional data in smaller chunks and keep the byte range that was initially read. Once we find the next delimiter, we can concatenate all required data into a single buffer. This does add a D2D copy of all the data, but that is much faster than the IO or the H2D copy that we make now. With this change, IO is limited to the requested byte range plus a few extra KB.
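
To make the proposal above concrete, here is a minimal host-side sketch in Python. It is not the libcudf implementation: `read_range_with_tail`, its parameters, and the 4 KiB default subchunk size are all hypothetical, and plain file reads stand in for the datasource reads and device copies. It only illustrates keeping the initial range and extending it in small subchunks until the delimiter is found. Skipping a partial first row at the start of the range is out of scope here.

```python
import io


def read_range_with_tail(f, offset, size, delimiter=b"\n", subchunk_size=4096):
    """Read [offset, offset + size) plus the remainder of the last row.

    Hypothetical helper for illustration only.
    Returns (data, bytes_read), where bytes_read counts every byte pulled
    from the file, including bytes past the delimiter in the last subchunk.
    """
    f.seek(offset)
    first = f.read(size)
    pieces = [first]          # the initial range is kept, never re-read
    bytes_read = len(first)

    # Nothing to extend if the range is empty or already ends on a row boundary.
    if not first or first.endswith(delimiter):
        return first, bytes_read

    # Read small subchunks until the delimiter shows up or we hit end of file.
    while True:
        sub = f.read(subchunk_size)
        if not sub:
            break
        bytes_read += len(sub)
        end = sub.find(delimiter)
        if end != -1:
            pieces.append(sub[: end + 1])  # keep up to and including the delimiter
            break
        pieces.append(sub)

    # Single concatenation, standing in for the D2D copy mentioned above.
    return b"".join(pieces), bytes_read


data = b'{"a":1}\n{"a":22}\n{"a":333}\n'
rows, bytes_read = read_range_with_tail(io.BytesIO(data), 0, 10, subchunk_size=4)
print(rows, bytes_read)
```

In this sketch the extra IO is bounded by the length of the row tail plus at most one subchunk, rather than by the size of the next byte range.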

@vuule vuule added cuIO cuIO issue Performance Performance related issue labels Feb 29, 2024
@vuule

vuule commented Feb 29, 2024

An interesting interaction with #15186: the datasource can prevent the next byte range from being read in full by limiting the returned data to the mapped range. This means that fixing #15186 would increase the IO overhead described in this issue.

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Feb 29, 2024
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Feb 29, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 30, 2024
This work has two goals: (i) reducing repeated reads of byte range chunks in the JSON reader, and (ii) enabling multi-source byte range reading for chunks that span sources.
- We expand on the idea outlined in #15185 to reduce repeated reads of follow-on chunks while searching for the end of the last row in the requested chunk. The chunks that follow the requested chunk are divided into subchunks, which are read until the delimiter character is reached.
- We estimate the buffer size needed for the entire byte range and compute per-source offsets into the buffer.
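
One way to picture the per-source offset computation in the second bullet is the Python sketch below. It is an illustration under assumptions, not the code from #15396: `plan_source_reads` and the tuple layout are hypothetical, and the sources are treated as if they were concatenated end to end with no delimiter inserted between them.

```python
from itertools import accumulate


def plan_source_reads(source_sizes, range_offset, range_size):
    """Map a byte range over concatenated sources to per-source reads.

    Hypothetical helper for illustration only.
    Returns (total_bytes, plan), where each plan entry is
    (source_index, local_offset, length, buffer_offset).
    """
    # Global start offset of each source, plus the total size as the last entry.
    starts = [0, *accumulate(source_sizes)]
    range_end = min(range_offset + range_size, starts[-1])

    plan = []
    buffer_offset = 0
    for i in range(len(source_sizes)):
        lo = max(range_offset, starts[i])   # overlap of the range with source i
        hi = min(range_end, starts[i + 1])
        if lo < hi:
            plan.append((i, lo - starts[i], hi - lo, buffer_offset))
            buffer_offset += hi - lo
    return buffer_offset, plan


total, plan = plan_source_reads([10, 20, 30], range_offset=5, range_size=30)
print(total, plan)
```

Here `total` is the buffer size needed for the range, and each `buffer_offset` says where that source's slice would land in the single destination buffer.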
 
[Visualization of the performance improvement with this optimization](#15396 (comment))

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15396
@vuule

vuule commented Nov 5, 2024

Fixed by #15396.

@vuule vuule closed this as completed Nov 5, 2024