[BUG] JSON reader with recover_with_null fails to detect incomplete JSON as invalid #14227

Closed
andygrove opened this issue Sep 28, 2023 · 1 comment · Fixed by #14252
Labels: bug (Something isn't working), Spark (Functionality that helps Spark RAPIDS)

Comments

@andygrove
Contributor

Describe the bug
The following input has one invalid JSON record on line 3 (missing the closing }).

{ "number": 1 }
{ "number": 2 }
{ "number": 3
{ "number": 4 }

When reading this file with cuDF, specifying a schema in which column number has type int and also specifying recover_with_null, the returned column is a struct instead of an int. It has null values for the first two records, followed by a struct whose int child contains the number 4.

COLUMN input - STRUCT
0 NULL
1 NULL
COLUMN input:CHILD_0 - INT64
0 NULL
1 NULL
2 4

It appears that line 3 is treated as valid and multiline, and line 4 is read as a child.

Steps/Code to reproduce bug
I am testing via Spark RAPIDS, but I suspect that this issue could be reproduced by adding this scenario to the TEST_F(JsonReaderTest, JSONLinesRecovering) test in cpp/tests/io/json_test.cpp.

Expected behavior
I would expect line 3 to be treated as invalid and return a NULL.
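
To make the expected result concrete, here is a minimal reference sketch in plain Python. It is not cuDF code and does not use the cuDF API; it only assumes that each line is parsed independently and that an invalid line becomes a null. The function name `read_json_lines_recovering` is hypothetical.

```python
import json


def read_json_lines_recovering(text, column):
    """Parse each line on its own; an invalid line yields None (a null)."""
    values = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            values.append(None)  # invalid record -> null, later lines unaffected
            continue
        # Assumption: a valid line that is not an object also maps to null.
        values.append(record.get(column) if isinstance(record, dict) else None)
    return values


data = '{ "number": 1 }\n{ "number": 2 }\n{ "number": 3\n{ "number": 4 }\n'
print(read_json_lines_recovering(data, "number"))  # [1, 2, None, 4]
```

Under these assumptions the column would be 1, 2, NULL, 4 rather than a struct.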

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context

@andygrove added the bug, Needs Triage, and Spark labels Sep 28, 2023
@GregoryKimball added this to the Nested JSON reader milestone Oct 4, 2023
@elstehle self-assigned this Oct 4, 2023
@elstehle
Contributor

elstehle commented Oct 5, 2023

I've opened #14252 to address this issue. Feel free to test against the PR to verify whether the new behaviour matches Spark for JSON lines with incomplete records (or incomplete strings, field names, etc.).

rapids-bot bot pushed a commit that referenced this issue Oct 11, 2023
…bled (#14252)

Closes #14227. Adapts the behaviour of the JSON finite-state transducer (FST) when `recover_with_nulls` is `true` to be more strict and reject lines that contain incomplete JSON objects (aka records) or JSON arrays (aka lists).

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14252
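
As a rough illustration of what "reject lines that contain incomplete JSON objects or arrays" means, the sketch below checks a single line for unclosed objects, arrays, and strings. It is a plain-Python stand-in written for this example, not the FST used in cuDF, and `line_is_complete` is a hypothetical name. It checks completeness only, not full JSON validity.

```python
def line_is_complete(line):
    """Return True if every object, array and string opened on this line is closed on it."""
    stack = []          # closing brackets we still expect, innermost last
    in_string = False
    escaped = False
    for ch in line:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]":
            if not stack or stack.pop() != ch:
                return False         # unmatched or mismatched closing bracket
    # State is not carried over to the next line: anything left open is an error.
    return not stack and not in_string


lines = ['{ "number": 1 }', '{ "number": 3', '{ "list": [1, 2 }', '{ "name": "abc }']
print([line_is_complete(l) for l in lines])  # [True, False, False, False]
```

The key design point is that the nesting state is evaluated at the end of each line instead of being carried into the next one, so an unterminated record cannot absorb the line that follows it.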
@GregoryKimball removed this from libcudf Oct 26, 2023
@bdice removed the Needs Triage label Mar 4, 2024