Ignore byte_range in read_json when the size is not smaller than the input data #15180
@@ -212,7 +213,7 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
     return legacy::read_json(sources, reader_opts, stream, mr);
   }

-  if (not should_load_whole_source(reader_opts)) {
+  if (reader_opts.get_byte_range_offset() != 0 or reader_opts.get_byte_range_size() != 0) {
Do we not need a range_size < source_size check here?
In that case we would allow users to pass a giant byte_range when reading non-JSONLines files. I'm not sure this is something we want to do, since the failure when byte_range is combined with non-JSONLines input would become less obvious ("but it works sometimes!").
The change here keeps the original behavior.
Makes sense, thanks for clarifying!
should_load_whole_source changed meaning from "is byte_range used" to "does byte_range have any impact" but the name stayed the same, so I understand the confusion.
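For readers following the thread, here is a minimal sketch of the two meanings being contrasted. It is not the cuDF implementation: byte_range_opts, byte_range_is_set and byte_range_has_impact are hypothetical names, and the sketch assumes that a byte range size of 0 means "no limit".

```cpp
#include <cstddef>

// Hypothetical stand-in for the byte-range fields of json_reader_options.
// Assumption: size == 0 means "read to the end of the source".
struct byte_range_opts {
  std::size_t offset;
  std::size_t size;
};

// Old meaning: "is byte_range used", i.e. did the caller set either field?
bool byte_range_is_set(byte_range_opts const& opts)
{
  return opts.offset != 0 || opts.size != 0;
}

// New meaning: "does byte_range have any impact", i.e. would it exclude any
// part of a source that is source_size bytes long?
bool byte_range_has_impact(byte_range_opts const& opts, std::size_t source_size)
{
  return opts.offset != 0 || (opts.size != 0 && opts.size < source_size);
}
```

With a 100-byte source and a byte range of {0, 500}, the first predicate is true and the second is false, which is the "giant byte_range" case discussed above.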
Thank you, this looks good to me!
Small nit about const, but looks good to me.
@@ -140,10 +140,11 @@ size_type find_first_delimiter_in_chunk(host_span<std::unique_ptr<cudf::io::data
   return find_first_delimiter(buffer, delimiter, stream);
 }

-bool should_load_whole_source(json_reader_options const& reader_opts)
+bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
Any reason this shouldn't be const?
-bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
+bool should_load_whole_source(json_reader_options const& opts, size_t const source_size)
We don't usually mark parameters passed by value as const, since it does not impact the caller in any way.
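As a small illustration of that point (a sketch with a hypothetical function named fits, not code from this PR): top-level const on a by-value parameter is not part of the function type in C++, so both spellings declare the same function and callers see no difference.

```cpp
#include <cstddef>
#include <type_traits>

// Both lines declare the same function; the second is a redeclaration,
// not an overload, because top-level const on a by-value parameter is
// dropped from the function type.
bool fits(std::size_t source_size);
bool fits(std::size_t const source_size);

static_assert(std::is_same<bool(std::size_t), bool(std::size_t const)>::value,
              "top-level const does not change the function type");

// In the definition, const only stops the body from modifying its own copy.
bool fits(std::size_t const source_size)
{
  return source_size <= 1024;  // 1024 is an arbitrary threshold for the example
}
```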
/merge
Description
Deduce that the entire file will be loaded when byte_range is not smaller than the input size, and use the faster "no byte_range" path.
Avoids the double IO that happens with the regular byte_range code path.
Checklist