Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC for whitespace removal in input JSON data using FST #14931
POC for whitespace removal in input JSON data using FST #14931
Changes from 13 commits
5f47b0b
e3871de
302ec76
5c0ab9b
690a61a
8e5256d
529776d
4aaa610
58f697d
5957a71
9c51a21
78e47c0
dc0c75d
081c282
de4bd03
183e66b
55d3bc5
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example makes it sound like this FST does that transformation. Maybe write:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current FST, we would get the transformation described in the comment, but that is not the expected behaviour i.e. we should not remove whitespace characters within quotes. I think the following would make it clearer -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. We are documenting a known bug in the current implementation? Are we intending to fix this before merging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For compatibility with Spark, we don't need to consider newlines within strings as a part of the string. While reading from JSON lines with the option set to recover from invalid lines, I think newline characters present before the end of the record (like in the example
{"a":"x\n y"}
) will result in the parser treating it as an invalid line.I have added the note for the sake of completeness and to clarify the scope of the FST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is error state not expected to happen since we don't have a state for error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no error state for both the quote normalization and whitespace normalization. In case of invalid JSON inputs (such as the
GroundTruth_InvalidInput
test case), it processes them anyway and leaves the error-handling and recovery to the next parsing FST.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be
cudf::get_default_stream()
for tests?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea! It's better to call
cudf::test::get_default_stream()
here instead of creating a new stream. Fixed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (not change suggestion):
Why do some strings cases use raw string literal, but some cases are escaped strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With raw strings, it's hard to see the positions of spaces and tabs when they are next to each other, especially when editors map tabs to different number of spaces. With escaped strings, I think we have more control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really good answer! I hadn't considered that.