Ignore byte_range in read_json when the size is not smaller than the input data (#15180)

Deduce that the entire file will be loaded when `byte_range` starts at offset 0 and its size is not smaller than the input size, and use the faster "no byte_range" path in that case.

This avoids the double IO that happens on the regular `byte_range` code path.
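
The check described above can be sketched in isolation. This is only an illustration, not the cudf code itself: `byte_range_opts` is a hypothetical stand-in for the two byte-range fields of `json_reader_options`, and the source size is passed as a plain number rather than read from a `datasource`.

```cpp
#include <cstddef>

// Hypothetical stand-in for the byte-range fields of cudf's json_reader_options.
struct byte_range_opts {
  std::size_t offset;  // 0 means "start of the source"
  std::size_t size;    // 0 means "no size limit"
};

// Same condition as the predicate in this commit, applied to the simplified
// struct above: the whole source is loaded when reading starts at offset 0
// and the requested size is either unlimited or covers the entire source.
constexpr bool should_load_whole_source(byte_range_opts const& opts, std::size_t source_size)
{
  return opts.offset == 0 and (opts.size == 0 or opts.size >= source_size);
}

// Compile-time checks against a 100-byte source.
static_assert(should_load_whole_source(byte_range_opts{0, 0}, 100), "no byte range given");
static_assert(should_load_whole_source(byte_range_opts{0, 100}, 100), "range equals the source size");
static_assert(should_load_whole_source(byte_range_opts{0, 250}, 100), "range exceeds the source size");
static_assert(not should_load_whole_source(byte_range_opts{0, 50}, 100), "range is a strict prefix");
static_assert(not should_load_whole_source(byte_range_opts{10, 0}, 100), "non-zero offset");
```

Before this change only the first case took the fast path. The second and third cases went through the regular `byte_range` path even though they end up reading the whole input.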

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15180
vuule authored Mar 5, 2024
1 parent 3ea947a commit 2d1e3c7
Showing 1 changed file with 6 additions and 5 deletions.
11 changes: 6 additions & 5 deletions cpp/src/io/json/read_json.cu
@@ -140,10 +140,11 @@ size_type find_first_delimiter_in_chunk(host_span<std::unique_ptr<cudf::io::data
   return find_first_delimiter(buffer, delimiter, stream);
 }

-bool should_load_whole_source(json_reader_options const& reader_opts)
+bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
 {
-  return reader_opts.get_byte_range_offset() == 0 and //
-    reader_opts.get_byte_range_size() == 0;
+  auto const range_offset = opts.get_byte_range_offset();
+  auto const range_size = opts.get_byte_range_size();
+  return range_offset == 0 and (range_size == 0 or range_size >= source_size);
 }

 /**
@@ -168,7 +169,7 @@ auto get_record_range_raw_input(host_span<std::unique_ptr<datasource>> sources,
     reader_opts.get_byte_range_offset(),
     reader_opts.get_byte_range_size(),
     stream);
-  if (should_load_whole_source(reader_opts)) return buffer;
+  if (should_load_whole_source(reader_opts, sources[0]->size())) return buffer;
   auto first_delim_pos =
     reader_opts.get_byte_range_offset() == 0 ? 0 : find_first_delimiter(buffer, '\n', stream);
   if (first_delim_pos == -1) {
@@ -212,7 +213,7 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
     return legacy::read_json(sources, reader_opts, stream, mr);
   }

-  if (not should_load_whole_source(reader_opts)) {
+  if (reader_opts.get_byte_range_offset() != 0 or reader_opts.get_byte_range_size() != 0) {
     CUDF_EXPECTS(reader_opts.is_enabled_lines(),
                  "Specifying a byte range is supported only for JSON Lines");
     CUDF_EXPECTS(sources.size() == 1,
