Reduce IO when byte_range option is used in read_json #15185
When reading a byte range from a file, the JSON reader has to read data beyond the requested range to get the remainder of the last row that starts within the range. To find the end of this row, the current implementation reads the entire next byte range. Once the reader finds the delimiter, it re-reads the full required range of data in one pass and discards everything it read previously.
In most cases this means roughly 3x the requested data is read:
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/json/read_json.cu#L162-L202
We could instead implement a solution that reads the additional data in smaller chunks and keeps the byte range that was initially read. Once we find the next delimiter, we can concatenate all required data into a single buffer. This does require a D2D copy of all the data, but that is much faster than the IO or the H2D copy we perform now. With this change, we can limit the IO to just the requested byte range plus a few extra KB.
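The proposal above can be sketched on the host side as follows. This is only an illustration in Python, not the libcudf implementation: the function name `read_range_with_tail`, the `chunk_size` parameter, and the byte counter are all hypothetical, and `b"".join` stands in for the D2D concatenation. It covers only the tail extension; skipping a partial first row that belongs to the previous range is not handled.

```python
import io


def read_range_with_tail(src, offset, size, delimiter=b"\n", chunk_size=4096):
    """Read [offset, offset + size) and extend it to the end of the last row.

    Hypothetical helper: returns (data, bytes_read), where bytes_read counts
    every byte pulled from `src`, so the IO cost can be inspected.
    """
    src.seek(offset)
    initial = src.read(size)  # the requested range is read once and kept
    parts = [initial]
    bytes_read = len(initial)

    # A short read means end of input; a trailing delimiter means the last
    # row is already complete. In both cases no extra IO is needed.
    if len(initial) < size or initial.endswith(delimiter):
        return b"".join(parts), bytes_read

    while True:
        chunk = src.read(chunk_size)  # small incremental read past the range
        if not chunk:
            break  # end of input: the last row has no trailing delimiter
        bytes_read += len(chunk)
        pos = chunk.find(delimiter)
        if pos != -1:
            parts.append(chunk[: pos + 1])  # keep up to and including the delimiter
            break
        parts.append(chunk)

    # Single concatenation of the initial range and the tail chunks
    # (the analogue of the D2D copy mentioned above).
    return b"".join(parts), bytes_read


if __name__ == "__main__":
    src = io.BytesIO(b'{"a":1}\n{"a":2}\n{"a":3}\n')
    data, n = read_range_with_tail(src, offset=0, size=10, chunk_size=4)
    print(data, n)
```

With this shape, the extra IO is bounded by the length of the trailing row rounded up to `chunk_size`, rather than by the size of the next byte range.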