[BUG] Table.readJson dropping valid JSON lines #14282
Comments
Thanks for reporting the issue, @andygrove. Unfortunately, I couldn't reproduce the issue. Could you share the full string you're trying to parse? It's surprising that line
I now have a repro case: #14291
Just to follow up, @andygrove, I have finally gotten to the bottom of the issue. The issue only occurs when I'm currently elaborating options for resolving the issue and will likely have a resolution by the end of the week.
@andygrove, I've put up #14309 to address this issue. Feel free to check if it properly addresses your issue.
Thank you @elstehle. I have confirmed that this resolves the issue.
@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in
Thanks for sharing, @andygrove! I'll look into it.
Thanks, Andy. I investigated the issue and it should only be an issue when the last line is both (a) incomplete, e.g.,
…JSON lines (#14309)

Addresses #14282.

For the JSON lines format that recovers after an invalid JSON line, we've had two issues when we were generating the stack context that is used downstream in the full JSON pushdown transducer. For that format, we need to make sure that we "reset" the stack context after each JSON line. That is,

1. We need to reset the stack to the empty stack after each JSON line, as the stack may not be empty after an erroneous JSON line. E.g. `{"this opening brace is never closed":123\n{"<=this brace should be on the empty stack":...}`
2. We need to reset that we are outside of a string: `{"no matching end-quote on this line\n{"<=this quote is the beginning of a field name, not the end of the previous line's field name"`

This fixes the above requirements as follows:

1. Was already implemented, but with an inappropriate scan operator that is not associative:
   ```
   StackLevelT new_level = (symbol_to_stack_op_type(rhs.value) == stack_op_type::RESET) ? 0 : (lhs.stack_level + rhs.stack_level);
   ```
   E.g., `{,n,{`, `},n,{`, `{,n,}`, and `},n,}` all fail the associativity test. This was replaced with a `ScanByKey` that would start with a "fresh" stack level with each new key segment.
2. Was addressed by changing the transition table of the finite-state transducer that filters out brackets and braces that are enclosed in quotes to go back to the `OOS` (`outside-of-string`) state after every newline.

This behaviour requires that _every_ newline character is treated as a delimiter of a JSON line. Spark Rapids, the targeted user of the recovery option, confirmed that this is the case.

Authors:
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

URL: #14309
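The associativity failure described in the commit message can be checked by hand with a small host-side sketch. Everything below is illustrative and not the cuDF code: `StackOp`, `to_op`, `reset_add`, and `scan_levels_by_line` are hypothetical names, the sketch assumes that an opening bracket or brace contributes +1, a closing one contributes -1, and a newline is the `RESET` symbol, and it assumes the combined result carries the right-hand symbol. The last function is a sequential stand-in for the idea behind a scan-by-key (each line is its own key segment), not the GPU `ScanByKey` itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a stack operation (not the cuDF type).
struct StackOp {
  int stack_level;  // per-symbol delta before scanning, running level after
  char value;       // the symbol this operation was derived from
};

// Assumption: a newline is the symbol that maps to the RESET operation.
constexpr bool is_reset(char c) { return c == '\n'; }

// Assumption: openers push (+1), closers pop (-1), everything else is neutral.
StackOp to_op(char c) {
  int delta = (c == '{' || c == '[') ? 1 : (c == '}' || c == ']') ? -1 : 0;
  return {delta, c};
}

// The scan operator quoted above, assumed to carry the right-hand symbol.
StackOp reset_add(StackOp lhs, StackOp rhs) {
  int new_level = is_reset(rhs.value) ? 0 : lhs.stack_level + rhs.stack_level;
  return {new_level, rhs.value};
}

// Compares (a + b) + c against a + (b + c) for three symbols.
bool is_associative_for(char a, char b, char c) {
  StackOp left  = reset_add(reset_add(to_op(a), to_op(b)), to_op(c));
  StackOp right = reset_add(to_op(a), reset_add(to_op(b), to_op(c)));
  return left.stack_level == right.stack_level;
}

// Sequential stand-in for a scan-by-key: the running level restarts at zero
// for the first symbol after each newline, so every line is its own segment.
std::vector<int> scan_levels_by_line(const std::string& symbols) {
  std::vector<int> levels;
  levels.reserve(symbols.size());
  int level = 0;
  for (char c : symbols) {
    level += to_op(c).stack_level;
    levels.push_back(level);
    if (is_reset(c)) { level = 0; }  // the next symbol starts a fresh segment
  }
  return levels;
}
```

For the triple `{`, newline, `{`, grouping from the left yields level 1 while grouping from the right yields level 2, which is why a parallel scan, free to group operands in any order, cannot rely on this operator. Restarting the sum per line avoids the issue because plain addition within a segment is associative.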
Hi @andygrove! Please let me know if the following issue for incomplete last lines can be addressed by appending a newline on the Spark side. If that should turn out to not be feasible please let me know.
Thanks @elstehle. Yes, I have confirmed that adding a newline at the end resolves the issue for us. Thanks for looking into this.
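The workaround agreed on in this exchange amounts to making sure the concatenated JSON lines buffer ends with a newline before it is handed to the reader. A minimal sketch of that idea follows; `with_trailing_newline` is a hypothetical helper written for illustration, and the actual change on the Spark side is not shown in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the JSON lines buffer with a terminating
// newline appended if the last line is not already newline-terminated.
std::string with_trailing_newline(std::string json_lines) {
  if (!json_lines.empty() && json_lines.back() != '\n') {
    json_lines.push_back('\n');
  }
  return json_lines;
}
```

With the terminator in place, an incomplete last line is delimited like every other line, which per the comments above is what allows it to be handled as an invalid line rather than dropped.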
Describe the bug
In the plugin, we have a ColumnVector containing JSON lines (concatenated together in one large string). Here is a sample:
We pass this to `Table.readJson` using the following code:
The resulting table is missing some values. Note that the entry for `{"teacher": "Spmydj","student": {"name": "Ggeyhv", "age": 16}}` is `NULL` here.
Steps/Code to reproduce bug
The repro case is in NVIDIA/spark-rapids#9423 in `test_from_json_struct_of_struct`.
Expected behavior
Data should not be dropped.
Environment overview (please complete the following information)
N/A
Environment details
N/A
Additional context