Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in recovering invalid lines in JSONL inputs #17098

Merged
merged 41 commits into from
Oct 30, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Oct 16, 2024

Description

Addresses #16999

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shrshi shrshi added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Oct 16, 2024
Copy link

copy-pr-bot bot commented Oct 16, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shrshi shrshi added the non-breaking Non-breaking change label Oct 16, 2024
Comment on lines 228 to 229
if (last_char != '\n') {
last_char = '\n';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hardcoded to \n, should it be delimiter mentioned in json reader options instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Qn: Does this force reader to do extra copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I've changed the hardcoded \n to the reader options delimiter. Also fixed it in a few other places.

Yes, we now have two extra copies between host and device each of size 1 byte. We also perform a stream sync between the copies. I'll run the JSON benchmarks to see what the impact of this change is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we merge #17028, we need to check the last character in the buffer only when both nullify_empty_rows and recover_with_null are enabled. Otherwise, I think we can always add a delimiter since empty rows are anyway ignored.

@shrshi
Copy link
Contributor Author

shrshi commented Oct 16, 2024

/ok to test

@shrshi shrshi marked this pull request as ready for review October 16, 2024 15:51
@shrshi shrshi requested a review from a team as a code owner October 16, 2024 15:51
@shrshi shrshi requested review from mythrocks and lamarrr October 16, 2024 15:51

auto const shift_for_nonzero_offset = std::min<std::int64_t>(chunk_offset, 1);
auto const first_delim_pos =
chunk_offset == 0 ? 0 : find_first_delimiter(readbufspan, '\n', stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean it was a long-standing bug until now? Since we already supported customized delimiter for a long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this has been bug until now. I suspect that when we enable recover_with_null, the FST removing excess characters after the delimiter in each line fixes the error in partial lines read due to the hard-coded \n delimiter, preventing us from encountering an error. But I think this bug would have caused lines in the input spanning byte ranges to be skipped.
Also, if the size of the input file is less than 2GB and we always read the whole file i.e. not in byte ranges, then again we would not encounter this bug.

@ttnghia
Copy link
Contributor

ttnghia commented Oct 21, 2024

/ok to test

@karthikeyann karthikeyann added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 21, 2024
@shrshi shrshi added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Oct 29, 2024
@ttnghia
Copy link
Contributor

ttnghia commented Oct 29, 2024

/ok to test

cpp/tests/io/json/json_test.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/read_json.cu Outdated Show resolved Hide resolved
@shrshi shrshi added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Oct 29, 2024
@shrshi shrshi removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 29, 2024
@shrshi
Copy link
Contributor Author

shrshi commented Oct 30, 2024

/ok to test

@shrshi
Copy link
Contributor Author

shrshi commented Oct 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 0b9277b into rapidsai:branch-24.12 Oct 30, 2024
103 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants