Fixes stack context for json lines format that recovers from invalid JSON lines #14309

elstehle · 2023-10-22T15:54:39Z

Description

Addresses #14282.

In the JSON lines format that recovers after an invalid JSON line, there were two issues in how we generated the stack context that is used downstream by the full JSON pushdown transducer.

For that format, we need to "reset" the stack context after each JSON line. That is:

1. We need to reset the stack to the empty stack after each JSON line, because the stack may not be empty after an erroneous JSON line. E.g. {"this opening brace is never closed":123\n{"<=this brace should be on the empty stack":...}
2. We need to reset the string state so that each line starts outside of a string. E.g. {"no matching end-quote on this line\n{"<=this quote is the beginning of a field name, not the end of the previous line's field name"

This PR addresses the above requirements as follows:

Resetting the stack after each JSON line was already implemented, but with a scan operator that is not associative:

StackLevelT new_level = (symbol_to_stack_op_type(rhs.value) == stack_op_type::RESET)
                              ? 0
                              : (lhs.stack_level + rhs.stack_level);

For example, the symbol triples ({, n, {), (}, n, {), ({, n, }), and (}, n, }), where n stands for a newline, all fail the associativity test. The operator was replaced with a ScanByKey that starts with a "fresh" stack level for each new key segment.
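The associativity failure can be reproduced on the host with a small sketch. Everything below is illustrative and is not the libcudf code: `StackOp`, `to_stack_op`, `reset_scan_op`, `line_keys`, and `inclusive_scan_by_key` are hypothetical names, the combined element is assumed to carry the right-hand symbol, and the newline is assumed to belong to the key segment of the line it terminates. The last function is a sequential stand-in for a scan-by-key with the line index as the key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scan element: a stack-level delta plus the symbol that produced it.
struct StackOp {
  int stack_level;
  char value;
};

// '{' pushes (+1), '}' pops (-1), everything else (including '\n') contributes 0.
inline StackOp to_stack_op(char c) { return {c == '{' ? 1 : (c == '}' ? -1 : 0), c}; }

// Mirrors the quoted operator: a RESET ('\n') on the right-hand side forces level 0.
// Assumption: the combined element keeps the right-hand symbol.
inline StackOp reset_scan_op(StackOp lhs, StackOp rhs)
{
  int const new_level = (rhs.value == '\n') ? 0 : (lhs.stack_level + rhs.stack_level);
  return {new_level, rhs.value};
}

// Compares (a op b) op c against a op (b op c) for one symbol triple.
inline bool is_associative_for(char a, char b, char c)
{
  StackOp const x = to_stack_op(a), y = to_stack_op(b), z = to_stack_op(c);
  int const left  = reset_scan_op(reset_scan_op(x, y), z).stack_level;
  int const right = reset_scan_op(x, reset_scan_op(y, z)).stack_level;
  return left == right;
}

// Key of each symbol is the index of the line it is on.
inline std::vector<int> line_keys(std::string const& input)
{
  std::vector<int> keys;
  int key = 0;
  for (char c : input) {
    keys.push_back(key);
    if (c == '\n') { ++key; }  // the next symbol starts a new key segment
  }
  return keys;
}

// Sequential inclusive scan-by-key: the running sum restarts whenever the key changes.
inline std::vector<int> inclusive_scan_by_key(std::vector<int> const& keys,
                                              std::vector<int> const& values)
{
  std::vector<int> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    bool const same_segment = (i > 0) && (keys[i] == keys[i - 1]);
    out[i] = (same_segment ? out[i - 1] : 0) + values[i];
  }
  return out;
}

// Stack level after each symbol, starting from the empty stack on every line.
inline std::vector<int> stack_levels_by_line(std::string const& input)
{
  std::vector<int> deltas;
  for (char c : input) { deltas.push_back(to_stack_op(c).stack_level); }
  return inclusive_scan_by_key(line_keys(input), deltas);
}
```

With plain integer addition as the per-segment operator, associativity holds inside each segment, and the reset is expressed through the keys rather than through the operator. For the input `{{\n{}`, the second line starts again at level 1 even though the first line left two braces open.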

Resetting the string state was addressed by changing the transition table of the finite-state transducer that filters out brackets and braces enclosed in quotes, so that it goes back to the OOS (outside-of-string) state after every newline. This behaviour requires that every newline character is treated as a delimiter of a JSON line. Spark RAPIDS, the target user of the recovery option, confirmed that this is the case.
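A minimal sketch of such a filter, assuming a simplified three-state machine, is shown below. The names `FilterState`, `next_state`, and `structural_symbols` are hypothetical, and the real transition table in libcudf may use different states and symbol groups. The sketch also assumes that newlines are passed through so that a later stage can use them as reset symbols.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified states of the bracket/brace filter.
enum class FilterState {
  OOS,  // outside of string
  STR,  // inside a string
  ESC   // inside a string, right after a backslash
};

// Transition function: a newline always returns to OOS, regardless of the current
// state, so an unterminated string cannot leak into the next JSON line.
inline FilterState next_state(FilterState s, char c)
{
  if (c == '\n') { return FilterState::OOS; }
  switch (s) {
    case FilterState::OOS: return (c == '"') ? FilterState::STR : FilterState::OOS;
    case FilterState::STR:
      if (c == '"') { return FilterState::OOS; }
      if (c == '\\') { return FilterState::ESC; }
      return FilterState::STR;
    case FilterState::ESC: return FilterState::STR;  // the escaped character is consumed
  }
  return FilterState::OOS;
}

// Emits brackets and braces that occur outside of strings, plus every newline.
inline std::string structural_symbols(std::string const& input)
{
  std::string out;
  FilterState s = FilterState::OOS;
  for (char c : input) {
    bool const is_bracket = (c == '{' || c == '}' || c == '[' || c == ']');
    if (c == '\n' || (s == FilterState::OOS && is_bracket)) { out.push_back(c); }
    s = next_state(s, c);
  }
  return out;
}
```

In the first line of the input `{"a{\n{"b":[1]}\n` the string is never closed, yet the brace that opens the second line is still emitted because the newline moved the machine back to OOS. Without that transition, the quote before `b` would be read as the end of the previous line's string.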

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule

looks good. just one question to confirm my understanding

karthikeyann

Looks good to me.

elstehle · 2023-10-31T07:09:45Z

/merge

fixes stack context for json lines recovering from errors

bc876cf

elstehle requested a review from a team as a code owner October 22, 2023 15:54

elstehle requested review from mythrocks and vuule October 22, 2023 15:54

github-actions bot added the libcudf label Oct 22, 2023

elstehle added the bug, 3 - Ready for Review, cuIO, and non-breaking labels Oct 22, 2023

elstehle mentioned this pull request Oct 22, 2023

[BUG] Table.readJson dropping valid JSON lines #14282

Closed

elstehle requested a review from karthikeyann October 22, 2023 16:05

addresses unused var for if constexpr

40c6b55

vuule approved these changes Oct 23, 2023

cpp/src/io/fst/logical_stack.cuh

andygrove mentioned this pull request Oct 24, 2023

Re-enable from_json / JsonToStructs NVIDIA/spark-rapids#9423

Merged

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

8bc594a

…nes-recovering-stack-ctx

elstehle changed the title ~~Fixes stack context for json lines format that recovers from invalid JSON lines~~ Fixes stack context for json lines format that recovers from invalid JSON lines Oct 26, 2023

GregoryKimball assigned elstehle Oct 26, 2023

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

ae1c62d

…nes-recovering-stack-ctx

karthikeyann reviewed Oct 30, 2023

cpp/src/io/fst/logical_stack.cuh (outdated)

elstehle added 3 commits October 30, 2023 04:11

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

b60b035

…nes-recovering-stack-ctx

fixes return type

7109874

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

95c4ddc

…nes-recovering-stack-ctx

elstehle requested a review from karthikeyann October 31, 2023 03:49

karthikeyann approved these changes Oct 31, 2023

rapids-bot bot merged commit 2abf9a6 into rapidsai:branch-23.12 Oct 31, 2023
54 checks passed
