[BUG] \n is not considered whitespace when tokenizing JSON #16915

Closed
revans2 opened this issue Sep 25, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@revans2
Contributor

revans2 commented Sep 25, 2024

Describe the bug
cuDF supports setting a line delimiter when parsing/tokenizing JSON, but it appears that \n is not treated as valid whitespace, at least not when JSON Lines parsing is enabled.

Steps/Code to reproduce bug
I need to do some more work to get a repro case in C++. But if I try to parse/tokenize JSON with

{"a":
100}

with the delimiter set to 0, I get back a null indicating that the data is invalid. But if I put the value in quotes

{"a":"
100"}

it is parsed as expected.
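
For reference, the expected result can be sketched with Python's standard `json` module standing in for a spec-compliant parser. This is not the cuDF API: the helper `parse_records` is hypothetical, and splitting on the NUL character only mimics the delimiter setting described above. The quoted variant holds a raw newline inside a string, which Python rejects in strict mode, so the sketch passes `strict=False` to accept it the way the report describes.

```python
import json

DELIMITER = "\0"  # NUL (0x00) as the record delimiter, as in the report


def parse_records(buffer, delimiter=DELIMITER):
    """Split `buffer` on `delimiter` and parse each record (hypothetical helper).

    An invalid record yields None, mirroring the null described in the report.
    """
    parsed = []
    for record in buffer.split(delimiter):
        if not record.strip():
            continue  # ignore empty records, e.g. after a trailing delimiter
        try:
            # strict=False lets a raw newline appear inside a JSON string
            parsed.append(json.loads(record, strict=False))
        except json.JSONDecodeError:
            parsed.append(None)
    return parsed


unquoted = '{"a":\n100}'   # newline between the colon and the number
quoted = '{"a":"\n100"}'   # newline inside the string value

rows = parse_records(unquoted + DELIMITER + quoted + DELIMITER)
print(rows)
```

Under these assumptions neither record comes back as `None`, which is the behavior the report expects from the tokenizer when the delimiter is not a newline.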

Expected behavior
According to the JSON spec, whitespace includes \n, but the tokenizer does not appear to treat it that way.
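
As a quick check of that expectation, the sketch below lists the four whitespace characters that RFC 8259 allows between JSON tokens and confirms that a reference parser accepts each one between the colon and the value. Python's `json` module is used here only as a stand-in, and the helper name `accepts_between_tokens` is made up for illustration.

```python
import json

# Insignificant whitespace allowed between JSON tokens (RFC 8259, section 2).
JSON_WHITESPACE = {
    "space": "\x20",
    "horizontal tab": "\x09",
    "line feed": "\x0a",
    "carriage return": "\x0d",
}


def accepts_between_tokens(ws):
    """Return True if the parser accepts `ws` between ':' and the value."""
    try:
        return json.loads('{"a":' + ws + '100}') == {"a": 100}
    except json.JSONDecodeError:
        return False


results = {name: accepts_between_tokens(ws) for name, ws in JSON_WHITESPACE.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```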

revans2 added the bug label Sep 25, 2024
@shrshi
Contributor

shrshi commented Sep 25, 2024

Thank you for sharing the bug report, @revans2. To clarify, are you setting the delimiter to 0 or \0? If it is 0, then get_token_stream returns error tokens for the three invalid JSON lines.
Also, are you enabling white space normalization for this repro?
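
The difference between the two delimiters can be illustrated with a small sketch. It uses Python's `json` module rather than `get_token_stream`, and the helper `split_and_validate` is hypothetical; a real tokenizer may treat empty pieces differently.

```python
import json


def split_and_validate(buffer, delimiter):
    """Split on `delimiter` and report whether each piece is valid JSON."""
    report = []
    for piece in buffer.split(delimiter):
        try:
            json.loads(piece)
            report.append((piece, True))
        except json.JSONDecodeError:
            report.append((piece, False))
    return report


data = '{"a":\n100}'

by_digit = split_and_validate(data, "0")   # the character '0' (0x30)
by_nul = split_and_validate(data, "\0")    # the NUL character (0x00)

print(len(by_digit), [ok for _, ok in by_digit])
print(len(by_nul), [ok for _, ok in by_nul])
```

Splitting on the digit cuts the number `100` apart and leaves three pieces, none of which is valid JSON on its own, while splitting on NUL leaves the single record intact.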

@shrshi
Contributor

shrshi commented Sep 26, 2024

Opened #16923 to fix this bug.

@revans2
Contributor Author

revans2 commented Sep 27, 2024

I set the delimiter to the NUL character, 0x00. This was meant as a stopgap, because 0x00 is much less likely to show up in a JSON column than \n is.

rapids-bot bot pushed a commit that referenced this issue Sep 27, 2024
…ith non-newline delimiter (#16923)

Addresses #16915

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Basit Ayantunde (https://github.com/lamarrr)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #16923
shrshi added a commit to shrshi/cudf that referenced this issue Sep 27, 2024
raydouglass pushed a commit that referenced this issue Sep 30, 2024
…ith non-newline delimiter (#16950)

Backporting PR #16923: Parse newline as whitespace character while
tokenizing JSONL inputs

Addresses #16915
@ttnghia
Contributor

ttnghia commented Sep 30, 2024

Can this be closed, given that #16923 is merged?

@shrshi
Contributor

shrshi commented Sep 30, 2024

I believe this can be closed. @revans2 please feel free to re-open if any further issues are encountered.

shrshi closed this as completed Sep 30, 2024
3 participants