Reduce IO when `byte_range` option is used in `read_json` #15185

vuule · 2024-02-29T02:13:59Z

When reading a byte range from a file, the JSON reader has to read data beyond the requested range to get the remainder of the last row that starts within the range. To find the end of this row, the current implementation reads the entire next byte range. Once the reader finds the delimiter, it re-reads the full required range of data, discarding everything previously read.
This results in 3x the data being read in most cases:
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/json/read_json.cu#L162-L202
We could instead implement a solution that reads the additional data in smaller chunks and does not discard the byte range that was initially read. Once we find the next delimiter, we can concatenate all required data into a single buffer. This does include a D2D copy of all data, but that will be a lot faster than the IO or the H2D copy that we make now. With this, we can limit the IO to just the requested byte range plus a few extra KB.
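To make the difference in IO concrete, here is a minimal host-side sketch in Python. It is not the libcudf code: the function names, the `subchunk` parameter, and the use of a seekable file object are all assumptions for illustration. It also assumes the range is always extended to the first delimiter found at or after its end, and it leaves out the handling of a partial first row and all device copies.

```python
import io


def read_range_naive(src, offset, size, delim=b"\n"):
    """Simplified model of the behaviour described above (hypothetical helper).

    Reads the range, reads the whole next range to locate the delimiter,
    then re-reads everything in one go. Returns (data, bytes_read).
    """
    src.seek(offset)
    first = src.read(size)
    nxt = src.read(size)
    bytes_read = len(first) + len(nxt)
    pos = nxt.find(delim)
    tail = len(nxt) if pos < 0 else pos + 1
    src.seek(offset)
    data = src.read(size + tail)  # both earlier reads are discarded
    return data, bytes_read + len(data)


def read_range_incremental(src, offset, size, subchunk=4096, delim=b"\n"):
    """Sketch of the proposal (hypothetical helper).

    Keeps the initially read range and extends it with small subchunks
    until the delimiter is found. Returns (data, bytes_read).
    """
    src.seek(offset)
    parts = [src.read(size)]
    bytes_read = len(parts[0])
    while True:
        chunk = src.read(subchunk)
        bytes_read += len(chunk)
        if not chunk:  # reached the end of the source
            break
        pos = chunk.find(delim)
        if pos >= 0:
            parts.append(chunk[: pos + 1])  # keep bytes up to the delimiter
            break
        parts.append(chunk)
    # Single concatenation; stands in for the D2D copy mentioned above.
    return b"".join(parts), bytes_read


data = b'{"a":1}\n{"a":22}\n{"a":333}\n'
rows_naive, io_naive = read_range_naive(io.BytesIO(data), 0, 10)
rows_incr, io_incr = read_range_incremental(io.BytesIO(data), 0, 10, subchunk=4)
```

Both calls return the same two rows for the 10-byte range, but the naive model reads 37 bytes while the incremental one reads 18. The overhead of the incremental version is bounded by the length of the trailing row rounded up to a subchunk, instead of growing with the size of the byte range.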

vuule · 2024-02-29T02:23:33Z

An interesting interaction with #15186: the datasource can prevent the next byte range from being read in full by limiting the returned data to the mapped range. This means that fixing #15186 would increase the IO overhead described in this issue.

This piece of work seeks to achieve two goals: (i) reducing repeated reading of byte range chunks in the JSON reader, and (ii) enabling multi-source byte range reading for chunks spanning sources.
- We expand on the idea outlined in #15185 to reduce the repeated reading of follow-on chunks while searching for the end of the last row in the requested chunk. After the requested chunk, the following chunks are divided into subchunks, and read until the delimiter character is reached.
- We estimate the buffer size needed for the entire byte range, and compute offsets per source into the buffer.

[Visualization of the performance improvement with this optimization](#15396 (comment))

Authors:
- Shruti Shivakumar (https://github.com/shrshi)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- MithunR (https://github.com/mythrocks)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #15396
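The per-source offset computation can be pictured with a small Python sketch. This is an illustration only, not the code from #15396: the function name and the tuple layout are hypothetical, and it assumes the sources are treated as one plain concatenation with nothing inserted between them.

```python
def plan_multi_source_read(source_sizes, range_offset, range_size):
    """Map a byte range over concatenated sources to per-source reads.

    Hypothetical helper. Returns (plan, buffer_size), where each plan entry is
    (source_index, start_in_source, length, offset_in_buffer).
    """
    plan = []
    buffer_offset = 0
    range_end = range_offset + range_size
    source_start = 0  # position of the current source in the concatenation
    for idx, size in enumerate(source_sizes):
        source_end = source_start + size
        lo = max(range_offset, source_start)
        hi = min(range_end, source_end)
        if lo < hi:  # the range overlaps this source
            plan.append((idx, lo - source_start, hi - lo, buffer_offset))
            buffer_offset += hi - lo
        source_start = source_end
    return plan, buffer_offset


# Three sources of 10, 5 and 20 bytes; the range covers logical bytes [8, 20).
plan, buffer_size = plan_multi_source_read([10, 5, 20], 8, 12)
```

Here the range touches all three sources: the last 2 bytes of the first, all 5 bytes of the second, and the first 5 bytes of the third, landing at buffer offsets 0, 2 and 7. Knowing the offsets up front lets each source be read directly into its slot of a single buffer.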

vuule · 2024-11-05T19:21:01Z

fixed by #15396

vuule added the cuIO (cuIO issue) and Performance (Performance related issue) labels Feb 29, 2024

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Feb 29, 2024

GregoryKimball added this to libcudf Feb 29, 2024

GregoryKimball added this to the Nested JSON reader milestone Feb 29, 2024

This was referenced Mar 12, 2024

Introduce benchmark suite for JSON reader options #15124

Merged

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

shrshi mentioned this issue Mar 26, 2024

Optimizing multi-source byte range reading in JSON reader #15396

Merged

3 tasks

vuule closed this as completed Nov 5, 2024
