Is your feature request related to a problem? Please describe.
To support from_json or the JSON input format/SCAN with the same tokenizer that get_json_object currently uses, that tokenizer needs to support more configuration options, mostly because the defaults for from_json and get_json_object are different.
This is to add in enough configuration that the default options for from_json will work.
allowNumericLeadingZeros: get_json_object has this off, but from_json has it enabled by default.
allowNonNumericNumbers: get_json_object has this off, but from_json has it enabled by default.
allowUnquotedControlChars: get_json_object has this on, but from_json has it off by default.
maxNestingDepth: We need to handle this one too, but we probably want to handle it very differently, so that work will be done as a separate issue.
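As a rough illustration of the options above, here is a minimal sketch of how they might be grouped, with one preset per function. The struct and function names are hypothetical and are not the tokenizer's actual API; the values simply mirror the defaults described in this issue.

```cpp
// Hypothetical option set for the tokenizer. The field names follow the
// Spark JSON option names listed above; nothing here is an existing API.
struct json_tokenizer_options {
  bool allow_numeric_leading_zeros;
  bool allow_non_numeric_numbers;
  bool allow_unquoted_control_chars;
};

// Preset matching what this issue describes for get_json_object:
// leading zeros off, non-numeric numbers off, unquoted control chars on.
inline json_tokenizer_options get_json_object_defaults() {
  json_tokenizer_options opts;
  opts.allow_numeric_leading_zeros  = false;
  opts.allow_non_numeric_numbers    = false;
  opts.allow_unquoted_control_chars = true;
  return opts;
}

// Preset matching what this issue describes for from_json:
// leading zeros on, non-numeric numbers on, unquoted control chars off.
inline json_tokenizer_options from_json_defaults() {
  json_tokenizer_options opts;
  opts.allow_numeric_leading_zeros  = true;
  opts.allow_non_numeric_numbers    = true;
  opts.allow_unquoted_control_chars = false;
  return opts;
}
```

A caller would pick the preset for its function and override individual fields when the user sets an option explicitly.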
The following options are not needed and might be added in follow-on work.
allowSingleQuotes: This is on by default for both, and we have not seen anyone disable it.
allowComments: This is off by default for both and until a customer asks for it I don't think we will try to support it.
allowUnquotedFieldNames: This is off by default in both and again until a customer asks for it we will not try to support it.
allowBackslashEscapingAnyCharacter: This is off by default in both and we will not support it until a customer asks for it.
maxNumLen: This is for newer versions of Spark and is a DoS fix that we don't need to worry about.
maxStringLen: Again, this is for newer versions of Spark and is a DoS fix that we don't need to worry about.
We need to be very careful as we do this work that we do not regress the performance of get_json_object. Adding more functionality will cause more registers to be used and might impact the occupancy, which is already bad.
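One possible way to limit the impact on get_json_object is to make the options compile-time parameters, so each caller gets its own instantiation and the compiler can drop the checks that do not apply. The sketch below shows the idea on a single check. The helper name is hypothetical, it is written as plain host C++ rather than device code, and whether this actually helps register usage or occupancy would have to be measured.

```cpp
#include <cstddef>

// Hypothetical helper: validates the integer part of a JSON number.
// AllowLeadingZeros is a template (compile-time) flag, so the leading-zero
// check below is a constant condition that the compiler can remove from
// the instantiation where it never applies.
template <bool AllowLeadingZeros>
inline bool is_valid_integer_part(const char* s, std::size_t len) {
  if (len == 0) { return false; }
  // Every character must be an ASCII digit.
  for (std::size_t i = 0; i < len; ++i) {
    if (s[i] < '0' || s[i] > '9') { return false; }
  }
  // Strict JSON: a multi-digit integer part must not start with '0'.
  if (!AllowLeadingZeros && len > 1 && s[0] == '0') { return false; }
  return true;
}
```

With this shape, `is_valid_integer_part<false>("007", 3)` rejects the input, while `is_valid_integer_part<true>("007", 3)` accepts it.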
NVIDIA/spark-rapids#10803