[FEA] Add some config options to the JSON tokenizer #2031

revans2 · 2024-05-13T14:41:41Z

Is your feature request related to a problem? Please describe.

To support from_json, or the JSON input format/scan, with the same tokenizer that get_json_object currently uses, that tokenizer needs to support more configuration options, mostly because the defaults for from_json and for get_json_object are different.

This issue covers adding just enough configuration for the default from_json options to work.

allowNumericLeadingZeros: get_json_object has this off, but from_json has it enabled by default.
allowNonNumericNumbers: get_json_object has this off, but from_json has it enabled by default.
allowUnquotedControlChars: get_json_object has this on, but from_json has it off by default.
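
To make the difference between the two sets of defaults concrete, here is a minimal Python sketch of what a configurable tokenizer would have to decide. It is not the spark-rapids-jni API: `TokenizerOptions`, `is_valid_number`, and `is_valid_string_body` are hypothetical names, the default values are copied from the list above rather than checked against Spark, and the set of non-numeric tokens is a simplified assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerOptions:
    # Hypothetical option holder; field names mirror the Spark option names above.
    allow_numeric_leading_zeros: bool
    allow_non_numeric_numbers: bool
    allow_unquoted_control_chars: bool


# Defaults as described in this issue (not verified against Spark here).
GET_JSON_OBJECT_DEFAULTS = TokenizerOptions(False, False, True)
FROM_JSON_DEFAULTS = TokenizerOptions(True, True, False)

# Strict JSON number grammar: no leading zeros in the integer part.
_STRICT_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")
# Relaxed grammar: any run of digits is accepted as the integer part.
_RELAXED_NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")
# Simplified, illustrative subset of non-numeric number tokens.
_NON_NUMERIC = {"NaN", "Infinity", "-Infinity"}


def is_valid_number(token: str, opts: TokenizerOptions) -> bool:
    """Return True if an unquoted token is an acceptable number under opts."""
    if opts.allow_non_numeric_numbers and token in _NON_NUMERIC:
        return True
    pattern = _RELAXED_NUMBER if opts.allow_numeric_leading_zeros else _STRICT_NUMBER
    return pattern.fullmatch(token) is not None


def is_valid_string_body(body: str, opts: TokenizerOptions) -> bool:
    """Check the raw contents between the quotes; escape handling is omitted."""
    if opts.allow_unquoted_control_chars:
        return True
    # Raw characters below U+0020 must be escaped when the option is off.
    return all(ord(ch) >= 0x20 for ch in body)
```

With these profiles, `007` and `NaN` are accepted only under `FROM_JSON_DEFAULTS`, while a string containing a raw tab is accepted only under `GET_JSON_OBJECT_DEFAULTS`.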

maxNestingDepth is one that we need to handle, but we probably want to handle it very differently, so that work will be done as a separate issue.

The following options are not needed right now and might be added in follow-on work.

allowSingleQuotes: This is on by default for both, and we have not seen anyone disable it.
allowComments: This is off by default for both and until a customer asks for it I don't think we will try to support it.
allowUnquotedFieldNames: This is off by default in both and again until a customer asks for it we will not try to support it.
allowBackslashEscapingAnyCharacter: This is off by default in both and we will not support it until a customer asks for it.
maxNumLen: This is for newer versions of Spark and is a DoS fix that we don't need to worry about.
maxStringLen: Again, this is for newer versions of Spark and is a DoS fix that we don't need to worry about.

We need to be very careful as we do this work that we do not regress the performance of get_json_object. Adding more functionality will cause more registers to be used and might reduce occupancy, which is already low.

NVIDIA/spark-rapids#10803

revans2 added ? - Needs Triage feature request labels May 13, 2024

This was referenced May 13, 2024

[FEA] Move all JSON parsing to the same backend as get_json_object NVIDIA/spark-rapids#10804

Open

[FEA] Write a from_json implementation using the json_parser form spark-rapids-jni #2035

Open

revans2 removed the ? - Needs Triage label May 15, 2024
