Fix bug in recovering invalid lines in JSONL inputs #17098
Merged
rapids-bot merged 41 commits into rapidsai:branch-24.12 from shrshi:json-quote-char-parsing-fix on Oct 30, 2024
Commits
9ff3129 add option to nullify empty lines (karthikeyann)
624743b printf debugging (shrshi)
bcecb25 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
f9b7e08 Merge branch 'enh-json_nullify_empty_lines' into json-quote-char-pars… (shrshi)
55c13a0 added test; fixed small bug in nullifying empty rows (shrshi)
9d2a2f0 formatting (shrshi)
3d0a51d removing from modifications to dfa (shrshi)
911e065 remove hardcoding of delimiter (shrshi)
ab7659b Merge branch 'branch-24.12' into enh-json_nullify_empty_lines (karthikeyann)
0ef5108 Merge branch 'branch-24.12' into enh-json_nullify_empty_lines (shrshi)
1dffbf0 Merge branch 'enh-json_nullify_empty_lines' of github.com:karthikeyan… (shrshi)
293521f Update cpp/tests/io/json/json_test.cpp (shrshi)
ca8ee32 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (ttnghia)
ebc5275 pre-process concat (shrshi)
679833b formatting (shrshi)
b192fd2 Merge branch 'branch-24.12' into enh-json_nullify_empty_lines (shrshi)
31d5cab some logic fixes (shrshi)
7c3e0f0 formatting (shrshi)
35b7177 test (shrshi)
9370dc5 formatting (shrshi)
6d87031 test cleanup (shrshi)
b9005ae formatting (shrshi)
4382ef8 pr reviews (shrshi)
f75d8ee formatting (shrshi)
bb9584e formatting fix (shrshi)
6ad06ca Merge branch 'branch-24.12' into enh-json_nullify_empty_lines (shrshi)
424f90f pr reviews (shrshi)
8b48297 Merge branch 'enh-json_nullify_empty_lines' of github.com:karthikeyan… (shrshi)
f651087 merge (shrshi)
dfba4cd Merge branch 'json-quote-char-parsing-fix' of github.com:shrshi/cudf … (shrshi)
eb82450 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
d3193e3 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
18f1a6e Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
96dce9d pr reviews (shrshi)
f8c5de3 formatting (shrshi)
c0d0b3e Merge branch 'json-quote-char-parsing-fix' of github.com:shrshi/cudf … (shrshi)
234c19d Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
77b2f99 oops, undoing accidental merge (shrshi)
2e37ed4 Merge branch 'json-quote-char-parsing-fix' of github.com:shrshi/cudf … (shrshi)
3784be9 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
f351242 Merge branch 'branch-24.12' into json-quote-char-parsing-fix (shrshi)
Does this mean this has been a long-standing bug? We have supported custom delimiters for a long time.
Yes, this has been a bug until now. I suspect that when we enable `recover_with_null`, the FST that removes excess characters after the delimiter in each line repairs the partial lines read because of the hard-coded `\n` delimiter, so we never hit an error. But I think this bug would have caused input lines that span byte ranges to be skipped. Also, if the input file is smaller than 2GB and we always read the whole file, i.e. not in byte ranges, we would again not encounter this bug.
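To make the failure mode above concrete, here is a minimal toy model of one way a hard-coded delimiter could drop a record that spans two byte ranges. It is not libcudf's reader: `read_range`, its parameters, and the ownership convention (a range skips its partial first line and completes its last line by searching past the end of the range) are all assumptions made for illustration. The toy also assumes the input ends with a delimiter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only; NOT libcudf code. All names here are hypothetical.
// Returns the complete records owned by byte range [begin, end).
//   delim      : the character that actually separates records in `data`
//   tail_delim : the character the reader searches for past `end` in order
//                to complete the last record (the "hard-coded" one if buggy)
std::vector<std::string> read_range(std::string const& data,
                                    std::size_t begin,
                                    std::size_t end,
                                    char delim,
                                    char tail_delim)
{
  std::size_t first = begin;
  if (begin > 0) {
    // The partial first line belongs to the previous range: skip past it.
    auto const pos = data.find(delim, begin);
    if (pos == std::string::npos) { return {}; }
    first = pos + 1;
  }

  std::size_t last = end;
  if (end < data.size()) {
    // Complete the last line by searching past `end` for the tail delimiter.
    // If it is never found, the chunk stays truncated at `end`.
    auto const pos = data.find(tail_delim, end);
    if (pos != std::string::npos) { last = pos + 1; }
  }

  // Split [first, last) on the real delimiter. An unterminated trailing
  // fragment is dropped, standing in for a partial line being discarded.
  std::vector<std::string> records;
  std::size_t cur = first;
  while (cur < last) {
    auto const pos = data.find(delim, cur);
    if (pos == std::string::npos || pos >= last) { break; }
    records.push_back(data.substr(cur, pos - cur));
    cur = pos + 1;
  }
  return records;
}

// Reads `data` as two byte ranges split at `mid` and counts the records.
std::size_t count_in_two_ranges(std::string const& data,
                                std::size_t mid,
                                char delim,
                                char tail_delim)
{
  return read_range(data, 0, mid, delim, tail_delim).size() +
         read_range(data, mid, data.size(), delim, tail_delim).size();
}
```

With the 24-byte input `{"a":1};{"a":2};{"a":3};` and `;` as the delimiter, splitting at byte 12 cuts the second record in half. When `tail_delim` matches the real delimiter, the two ranges together return all three records. When `tail_delim` is a hard-coded `'\n'`, the first range never completes the second record and the second range skips its partial head, so only two records come back. Reading the whole input as a single range returns all three records either way, which matches the observation that whole-file reads do not hit the bug.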