Ignore `byte_range` in `read_json` when the size is not smaller than the input data #15180

vuule · 2024-02-28T20:28:16Z

Description

Deduce that the entire file will be loaded when byte_range is not smaller than the input size, and use the faster "no byte_range" path.

This avoids the double IO that happens with the regular byte_range code path.
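A minimal sketch of the check described above, not the code from this PR. `byte_range_options` is a hypothetical stand-in for `json_reader_options` that exposes only the two getters seen in the diff, and treating a size of 0 as "no size limit" is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-in for json_reader_options (assumption, not the cudf type).
struct byte_range_options {
  std::size_t offset;
  std::size_t size;  // assumed: 0 means "no size limit"
  std::size_t get_byte_range_offset() const { return offset; }
  std::size_t get_byte_range_size() const { return size; }
};

// True when the requested range covers the whole source, so the faster
// "no byte_range" path can be taken.
bool should_load_whole_source(byte_range_options const& opts, std::size_t source_size)
{
  auto const range_offset = opts.get_byte_range_offset();
  auto const range_size   = opts.get_byte_range_size();
  return range_offset == 0 and (range_size == 0 or range_size >= source_size);
}
```

With a 100-byte source, a range of offset 0 and size 100 (or larger) takes the whole-source path, while any non-zero offset or a smaller size does not.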

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

shrshi · 2024-02-29T22:02:51Z

cpp/src/io/json/read_json.cu

@@ -212,7 +213,7 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
    return legacy::read_json(sources, reader_opts, stream, mr);
  }

-  if (not should_load_whole_source(reader_opts)) {
+  if (reader_opts.get_byte_range_offset() != 0 or reader_opts.get_byte_range_size() != 0) {


Do we not need a range_size < source_size check here?

vuule

In that case we would allow users to pass a giant byte_range when reading non-JSONLines files. I'm not sure this is something we want to do, since failing with byte_range + no JSONLines would become less obvious ("but it works sometimes!").
The change here keeps the original behavior.

shrshi

Makes sense, thanks for clarifying!

vuule

should_load_whole_source changed meaning from "is byte_range used" to "does byte_range have any impact" but the name stayed the same, so I understand the confusion.
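To make the two meanings concrete, here is a rough sketch with two hypothetical helpers (neither name exists in the PR). The first mirrors the condition in the diff above; the second assumes that a size of 0 means "no size limit".

```cpp
#include <cstddef>

// "Is byte_range used?" Mirrors the guard in the diff for non-JSONLines input.
bool is_byte_range_used(std::size_t offset, std::size_t size)
{
  return offset != 0 or size != 0;
}

// "Does byte_range have any impact?" Assumed logic: the range matters only if
// it skips the start of the source or ends before the end of the source.
bool does_byte_range_have_impact(std::size_t offset, std::size_t size, std::size_t source_size)
{
  return offset != 0 or (size != 0 and size < source_size);
}
```

For a 100-byte source and a range of offset 0 and size 1000, the first helper returns true and the second returns false. Keeping the first form in the non-JSONLines guard means such a request is still rejected consistently rather than working only when the range happens to be large enough.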

shrshi

Thank you, this looks good to me!

hyperbolic2346

Small nit about const, but looks good to me.

hyperbolic2346 · 2024-03-05T19:56:15Z

cpp/src/io/json/read_json.cu

@@ -140,10 +140,11 @@ size_type find_first_delimiter_in_chunk(host_span<std::unique_ptr<cudf::io::data
  return find_first_delimiter(buffer, delimiter, stream);
 }

-bool should_load_whole_source(json_reader_options const& reader_opts)
+bool should_load_whole_source(json_reader_options const& opts, size_t source_size)


Any reason this shouldn't be const?

Suggested change

-bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
+bool should_load_whole_source(json_reader_options const& opts, size_t const source_size)

vuule

We don't usually mark parameters passed by value as const, since it does not impact the caller in any way.
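As a small illustration of why this is invisible to callers: in C++, a top-level const on a by-value parameter is not part of the function's type, so both spellings declare the same function. The function name `fits` below is made up for the example.

```cpp
#include <cstddef>
#include <type_traits>

// These two declarations refer to the same function; the second is a
// redeclaration, not an overload.
bool fits(std::size_t source_size);
bool fits(std::size_t const source_size);

// In the definition, const only prevents the body from modifying its local copy.
bool fits(std::size_t const source_size) { return source_size > 0; }

// The function types are identical once the top-level const is dropped.
static_assert(std::is_same<bool(std::size_t), bool(std::size_t const)>::value,
              "top-level const on a by-value parameter does not change the signature");
```

The const therefore only affects the function body, which is why it is a style choice rather than an API change.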

vuule · 2024-03-05T21:10:37Z

/merge

load whole source when range size too big

25b5692

vuule self-assigned this Feb 28, 2024

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2024

vuule added the cuIO, Performance, non-breaking, and improvement labels and removed the libcudf label Feb 28, 2024

Merge branch 'branch-24.04' into bug-json-load-whole-file

6a42feb

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2024

Merge branch 'branch-24.04' into bug-json-load-whole-file

15f7054

shrshi reviewed Feb 29, 2024

Merge branch 'branch-24.04' into bug-json-load-whole-file

5eacd55

vuule marked this pull request as ready for review March 1, 2024 17:29

vuule requested a review from a team as a code owner March 1, 2024 17:29

vuule requested review from hyperbolic2346 and shrshi March 1, 2024 17:29

shrshi approved these changes Mar 1, 2024

vuule changed the title ~~Ignore byte_range in read_json when the size is not smaller than the input data~~ Ignore byte_range in read_json when the size is not smaller than the input data Mar 1, 2024

Merge branch 'branch-24.04' into bug-json-load-whole-file

b468820

hyperbolic2346 approved these changes Mar 5, 2024

rapids-bot bot merged commit 2d1e3c7 into rapidsai:branch-24.04 Mar 5, 2024
74 checks passed

vuule deleted the bug-json-load-whole-file branch March 5, 2024 21:10
