Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore byte_range in read_json when the size is not smaller than the input data #15180

Merged
merged 5 commits into from
Mar 5, 2024

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Feb 28, 2024

Description

Deduce that the entire file will the loaded when byte_range is not smaller than the input size and use the faster "no byte_range" path.

Avoids double IO that happens with regular byte_range code path.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule self-assigned this Feb 28, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2024
@vuule vuule added cuIO cuIO issue Performance Performance related issue non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. labels Feb 28, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2024
@@ -212,7 +213,7 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
return legacy::read_json(sources, reader_opts, stream, mr);
}

if (not should_load_whole_source(reader_opts)) {
if (reader_opts.get_byte_range_offset() != 0 or reader_opts.get_byte_range_size() != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need a range_size < source_size check here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we would allow users to pass giant byte_range when reading non-JSONLines files. I'm not sure this is something we want to do, since failing with byte_range + no JSONLines would become less obvious ("but it works sometimes!").
The change here keeps the original behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for clarifying!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should_load_whole_source changed meaning from "is byte_range used" to "does byte_range have any impact" but the name stayed the same, so I understand the confusion.

@vuule vuule marked this pull request as ready for review March 1, 2024 17:29
@vuule vuule requested a review from a team as a code owner March 1, 2024 17:29
@vuule vuule requested review from hyperbolic2346 and shrshi March 1, 2024 17:29
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this looks good to me!

@vuule vuule changed the title Ignore byte_range in read_json when the size is not smaller than the input data Ignore byte_range in read_json when the size is not smaller than the input data Mar 1, 2024
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit about const, but looks good to me.

@@ -140,10 +140,11 @@ size_type find_first_delimiter_in_chunk(host_span<std::unique_ptr<cudf::io::data
return find_first_delimiter(buffer, delimiter, stream);
}

bool should_load_whole_source(json_reader_options const& reader_opts)
bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this shouldn't be const?

Suggested change
bool should_load_whole_source(json_reader_options const& opts, size_t source_size)
bool should_load_whole_source(json_reader_options const& opts, size_t const source_size)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't usually mark parameters passed by value as const, since it does not impact the caller in any way.

@vuule
Copy link
Contributor Author

vuule commented Mar 5, 2024

/merge

@rapids-bot rapids-bot bot merged commit 2d1e3c7 into rapidsai:branch-24.04 Mar 5, 2024
74 checks passed
@vuule vuule deleted the bug-json-load-whole-file branch March 5, 2024 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants