Enable JSON Scan and from_json by default (#11753)
Signed-off-by: Robert (Bobby) Evans <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>
revans2 and ttnghia authored Nov 25, 2024
1 parent 6cba00d commit 6539441
Showing 53 changed files with 151 additions and 176 deletions.
6 changes: 3 additions & 3 deletions docs/additional-functionality/advanced_configs.md
@@ -95,8 +95,8 @@ Name | Description | Default Value | Applicable at
<a name="sql.format.hive.text.write.enabled"></a>spark.rapids.sql.format.hive.text.write.enabled|When set to false disables Hive text table write acceleration|false|Runtime
<a name="sql.format.iceberg.enabled"></a>spark.rapids.sql.format.iceberg.enabled|When set to false disables all Iceberg acceleration|true|Runtime
<a name="sql.format.iceberg.read.enabled"></a>spark.rapids.sql.format.iceberg.read.enabled|When set to false disables Iceberg input acceleration|true|Runtime
<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|false|Runtime
<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|false|Runtime
<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|true|Runtime
<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|true|Runtime
<a name="sql.format.orc.enabled"></a>spark.rapids.sql.format.orc.enabled|When set to false disables all orc input and output acceleration|true|Runtime
<a name="sql.format.orc.floatTypesToString.enable"></a>spark.rapids.sql.format.orc.floatTypesToString.enable|When reading an ORC file, the source data schemas(schemas of ORC file) may differ from the target schemas (schemas of the reader), we need to handle the castings from source type to target type. Since float/double numbers in GPU have different precision with CPU, when casting float/double to string, the result of GPU is different from result of CPU spark. Its default value is `true` (this means the strings result will differ from result of CPU). If it's set `false` explicitly and there exists casting from float/double to string in the job, then such behavior will cause an exception, and the job will fail.|true|Runtime
<a name="sql.format.orc.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.|2147483647|Runtime
@@ -278,7 +278,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.IsNaN"></a>spark.rapids.sql.expression.IsNaN|`isnan`|Checks if a value is NaN|true|None|
<a name="sql.expression.IsNotNull"></a>spark.rapids.sql.expression.IsNotNull|`isnotnull`|Checks if a value is not null|true|None|
<a name="sql.expression.IsNull"></a>spark.rapids.sql.expression.IsNull|`isnull`|Checks if a value is null|true|None|
<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|false|This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case|
<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|true|None|
<a name="sql.expression.JsonTuple"></a>spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because Experimental feature that could be unstable or have performance issues.|
<a name="sql.expression.KnownFloatingPointNormalized"></a>spark.rapids.sql.expression.KnownFloatingPointNormalized| |Tag to prevent redundant normalization|true|None|
<a name="sql.expression.KnownNotNull"></a>spark.rapids.sql.expression.KnownNotNull| |Tag an expression as known to not be null|true|None|
161 changes: 69 additions & 92 deletions docs/compatibility.md
@@ -316,133 +316,110 @@ case.

## JSON

The JSON format read is an experimental feature which is expected to have some issues, so we disable
it by default. If you would like to test it, you need to enable `spark.rapids.sql.format.json.enabled` and
`spark.rapids.sql.format.json.read.enabled`.
JSON, despite being a standard format, leaves some room for ambiguity, and Spark also offers the ability to
parse some invalid JSON. We have tried to provide JSON parsing that is compatible with
what Apache Spark supports. Note that Spark's behavior has changed across releases, and we
try to call out the releases where our results differ. JSON parsing is enabled by default
except for date and timestamp types, where we still have work to complete. If you wish to disable
JSON Scan you can set `spark.rapids.sql.format.json.enabled` or
`spark.rapids.sql.format.json.read.enabled` to false. To disable `from_json` you can set
`spark.rapids.sql.expression.JsonToStructs` to false.
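
For reference, a minimal sketch of toggling these settings from a Scala session (it assumes the RAPIDS Accelerator plugin is already configured on the cluster):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a session that already has the RAPIDS Accelerator plugin loaded.
val spark = SparkSession.builder().getOrCreate()

// JSON Scan and from_json are now on by default; set these to "false" to opt out.
spark.conf.set("spark.rapids.sql.format.json.enabled", "false")
spark.conf.set("spark.rapids.sql.format.json.read.enabled", "false")
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", "false")
```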

### Invalid JSON
### Limits

In Apache Spark on the CPU, if a line in the JSON file is invalid, the entire row is considered
invalid and nulls are returned for all columns. A line is considered invalid if it
violates the JSON specification, subject to a few extensions.
In versions of Spark before 3.5.0 there is no maximum on how deeply nested JSON can be. In
3.5.0 and later the default maximum is 1,000. The current GPU implementation of JSON Scan and
`from_json` limits this to 254 no matter what version of Spark is used. If the nesting level
exceeds this limit the JSON is considered invalid and all values will be returned as nulls.
`get_json_object` and `json_tuple` have a maximum nesting depth of 64. An exception is thrown if
the nesting depth goes over the maximum.

* Single quotes are allowed to quote strings and keys
* Unquoted values like NaN and Infinity can be parsed as floating point values
* Control characters do not need to be replaced with the corresponding escape sequences in a
quoted string.
* Garbage at the end of a row, if there is valid JSON at the beginning of the row, is ignored.
Spark 3.5.0 and above enforce a maximum string length of 20,000,000 and a maximum number length of
1,000. We do not enforce either of these limits on the GPU.

The GPU implementation does the same kinds of validation, but much of it is done on a per-column
basis. This means, for example, that if a number is formatted incorrectly, it is likely that only that
value will be considered invalid and returned as null, rather than nulls being returned for the entire row.
Like Spark, we cannot support a JSON string that is larger than 2 GiB in size.

There are options that can be used to enable and disable many of these features, most of which are
listed below.
### JSON Validation

### JSON options
Spark supports the option `allowNonNumericNumbers`. Versions of Spark prior to 3.3.0 were inconsistent between
quoted and non-quoted values ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The
GPU implementation is consistent with 3.3.0 and above.

Spark supports passing options to the JSON parser when reading a dataset. In most cases, if the RAPIDS Accelerator
sees one of these options that it does not support, it will fall back to the CPU; in some cases it does not. The
relevant options are documented below, with a usage sketch after the list.
### JSON Floating Point Types

- `allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
When it is false Spark considers the JSON invalid if it encounters this type of number. The RAPIDS
Accelerator supports validating columns that are returned to the user with this option on or off.

- `allowUnquotedControlChars` - Allows JSON strings to contain unquoted control characters (ASCII characters with
value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
is provided while reading a JSON file, then this flag has no impact on the RAPIDS Accelerator as it always allows
unquoted control characters, but Spark sees these as invalid and returns nulls. However, if the schema is not provided
and this option is false, then the RAPIDS Accelerator's behavior is the same as Spark's, where an exception is thrown
as discussed in the `JSON Schema discovery` section.

- `allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.
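
A sketch of passing two of these options to a JSON read (the path and schema are placeholders, and a `spark` session is assumed):

```scala
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

// Hypothetical path and schema, shown only to illustrate how the options are passed.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("score", DoubleType)))

val df = spark.read
  .schema(schema)
  .option("allowNumericLeadingZeros", "true") // accept numbers such as 00012
  .option("allowNonNumericNumbers", "true")   // accept NaN and Infinity
  .json("/path/to/data.json")
```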

### Nesting
In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
3.5.0 this was updated to be 1000 by default. The current GPU implementation limits this to 254
no matter what version of Spark is used. If the nesting level is over this the JSON is considered
invalid and all values will be returned as nulls.

Mixed types can cause some problems. If a column being read has some lines that are arrays
and others that are structs/dictionaries, it is possible an error will be thrown.

Dates and Timestamps have some issues and may return values for technically invalid inputs.

Floating point numbers have the same general issues as in the rest of Spark: we can parse them into
valid floating point numbers, but the results might not match Spark's 100%.

Strings are supported, but the data returned might not be normalized in the same way as the CPU
implementation. Generally this comes down to the GPU not modifying the input, whereas Spark will
do things like remove extra white space and parse numbers before turning them back into a string.
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

### JSON Floating Point
### JSON Integral Types

Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
Versions of Spark prior to 3.3.0 would parse quoted integer values, like "1". But 3.3.0 and above consider
these to be invalid and will return `null` when they are parsed as an integral type. The GPU implementation
follows 3.3.0 and above.
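
As a small illustration of this behavior (made-up rows, assuming a `spark` session is available):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

// Made-up rows: an unquoted and a quoted integer.
val df = Seq("""{"a": 1}""", """{"a": "1"}""").toDF("js")

// On Spark 3.3.0+ (and on the GPU) the quoted "1" parses to null for an integral type.
df.select(from_json($"js", StructType(Seq(StructField("a", IntegerType)))).as("parsed"))
  .show(false)
```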

Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
versions prior to 3.3.0 only supported the `"Infinity"` and `"-Infinity"` representations of infinity and did not
support `"+INF"`, `"-INF"`, or `"+Infinity"`, which Spark considers valid when unquoted. The GPU JSON reader is
consistent with the behavior in Spark 3.3.0 and later.
### JSON Decimal Types

Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where
Spark will treat them as invalid inputs and will just return `null`.
Spark supports parsing decimal types formatted either as floating point numbers or as integral numbers, even
when the value is in a quoted string. If the value is in a quoted string, the locale of the JVM is used to determine
the number format. If the locale is not `US`, which is the default, we will fall back to the CPU because we do not
currently parse those numbers correctly. The `US` format removes all commas (',') from the quoted string.
As a part of this, though, non-Arabic numbers are also supported. We do not support parsing these numbers,
see [issue 10532](https://github.com/NVIDIA/spark-rapids/issues/10532).
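
For illustration, a small sketch of parsing a quoted, US-formatted decimal (the sample row is made up):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import spark.implicits._

// Made-up row: a quoted, US-formatted decimal. Under the default US format the
// commas are stripped before parsing; other locales currently fall back to the CPU.
Seq("""{"price": "1,234.56"}""").toDF("js")
  .select(from_json($"js", StructType(Seq(StructField("price", DecimalType(10, 2))))).as("parsed"))
  .show(false)
```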

### JSON Dates/Timestamps
### JSON Date/Timestamp Types

Dates and timestamps are not supported by default in the JSON parser, since the GPU implementation is not 100%
compatible with Apache Spark.
If needed, they can be turned on through the config `spark.rapids.sql.json.read.datetime.enabled`.
Once enabled, the JSON parser still does not support the `TimestampNTZ` type and will fall back to CPU
if `spark.sql.timestampType` is set to `TIMESTAMP_NTZ` or if an explicit schema is provided that
contains the `TimestampNTZ` type.
This config works for both JSON scan and `from_json`. Once enabled, the JSON parser still does
not support the `TimestampNTZ` type and will fall back to CPU if `spark.sql.timestampType` is set
to `TIMESTAMP_NTZ` or if an explicit schema is provided that contains the `TimestampNTZ` type.
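
A sketch of opting in (hypothetical schema and path; a `spark` session is assumed):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Opt in to GPU date/timestamp parsing for both JSON Scan and from_json.
spark.conf.set("spark.rapids.sql.json.read.datetime.enabled", "true")

// Hypothetical schema and path; a TimestampNTZ field here would fall back to the CPU.
val eventSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("ts", TimestampType)))

val events = spark.read.schema(eventSchema).json("/path/to/events.json")
```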

There is currently no support for reading numeric values as timestamps and null values are returned instead
([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast
to timestamp.
([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast to timestamp.
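
A sketch of that workaround (hypothetical field name and path):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType}

// Read the epoch-seconds field as a long, then cast it to a timestamp afterwards.
val schema = StructType(Seq(StructField("event_time", LongType)))
val df = spark.read.schema(schema).json("/path/to/events.json")
  .withColumn("event_time", col("event_time").cast(TimestampType))
```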

### JSON Schema discovery
### JSON Arrays and Structs with Overflowing Numbers

Spark SQL can automatically infer the schema of a JSON dataset if a schema is not provided explicitly. The CPU
handles schema discovery and there is no GPU acceleration of this. By default Spark will read/parse the entire
dataset to determine the schema. This means that some options/errors which are ignored by the GPU may still
result in an exception if used with schema discovery.
Spark is inconsistent between versions in how it handles overflowing numbers that are nested in either an array
or a non-top-level struct. In some versions only the value that overflowed is marked as null; in other versions the
wrapping array or struct is marked as null. We currently mark only the individual value as null. This matches
Spark versions 3.4.2 and above for structs. For arrays, most versions of Spark invalidate the entire array if
a single value within it overflows.
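
A tiny example of the struct case under the GPU behavior described above (made-up data; 300 overflows a `BYTE`):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ByteType, StructField, StructType}
import spark.implicits._

// 300 does not fit in a byte; with the behavior described above only s.b is
// nulled on the GPU, not the wrapping struct.
val schema = StructType(Seq(
  StructField("s", StructType(Seq(StructField("b", ByteType))))))

Seq("""{"s": {"b": 300}}""").toDF("js")
  .select(from_json($"js", schema).as("parsed"))
  .show(false)
```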

### `from_json` function
### Duplicate Struct Names

`JsonToStructs` or `from_json` is based on the same code as reading a JSON lines file, but there are
a few differences.
The JSON specification technically allows for duplicate keys in a struct, but does not explain what to
do with them. In Spark, which value wins is inconsistent between operators, and for `get_json_object`
it depends on the query being performed. We do not always match what Spark does. We do match it in many cases,
but we consider this enough of a corner case that we have not tried to make it work in all cases.

The `from_json` function is disabled by default because it is experimental and has some known
incompatibilities with Spark, and can be enabled by setting
`spark.rapids.sql.expression.JsonToStructs=true`. You don't need to set
`spark.rapids.sql.format.json.enabled` and `spark.rapids.sql.format.json.read.enabled` to true.
In addition, if the input schema contains date and/or timestamp types, an additional config
`spark.rapids.sql.json.read.datetime.enabled` also needs to be set to `true` in order
to enable this function on the GPU.
We also do not support schemas where there are duplicate column names. We just fall back to the CPU for those cases.

There is no schema discovery, as a schema is required as input to `from_json`.
### JSON Normalization (String Types)

In addition to `structs`, a top level `map` type is supported, but only if the key and value are
strings.
In versions of Spark prior to 4.0.0, input JSON strings were parsed to JSON tokens and then converted back to
strings. This effectively normalizes the output string, so single quotes are transformed into double
quotes, floating point numbers are parsed and converted back to strings (possibly changing the format), and
escaped characters are converted back to their simplest form. We try to support this on the GPU as well. Single quotes
will be converted to double quotes. Only `get_json_object` and `json_tuple` attempt to normalize floating point
numbers. There is no implementation on the GPU right now that tries to normalize escape characters.
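
A small illustration of the string normalization (made-up input; field `a` is read back as a plain string):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// Field "a" is declared as a STRING even though it holds nested JSON, so the parser
// re-emits the nested object as text: the single quotes come back as double quotes.
val schema = StructType(Seq(StructField("a", StringType)))

Seq("""{"a": {'b': 1}}""").toDF("js")
  .select(from_json($"js", schema).as("parsed"))
  .show(false)
```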

### `from_json` Function

`JsonToStructs`, or `from_json`, is based on the same code as reading a JSON lines file, but there are
a few differences.

### `to_json` function
The main difference is that `from_json` supports parsing Maps and Arrays directly from a JSON column, whereas
JSON Scan only supports parsing top level structs. The GPU implementation of `from_json` has support for parsing
a `MAP<STRING,STRING>` as a top level schema, but does not currently support arrays at the top level.
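
A sketch of the top-level map case (the sample row is made up):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}
import spark.implicits._

// A top-level MAP<STRING,STRING> schema; a top-level array schema would fall back to the CPU.
val mapSchema = MapType(StringType, StringType)

Seq("""{"make": "Ford", "model": "Bronco"}""").toDF("js")
  .select(from_json($"js", mapSchema).as("m"))
  .show(false)
```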

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.
### `to_json` Function

Known issues are:

- There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
- Not all JSON options are respected
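
If you want to experiment with it anyway, a minimal opt-in sketch (sample data is made up):

```scala
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

// to_json remains off by default; opt in explicitly before using it on the GPU.
spark.conf.set("spark.rapids.sql.expression.StructsToJson", "true")

Seq((1, 2.5)).toDF("a", "b")
  .select(to_json(struct($"a", $"b")).as("js"))
  .show(false)
```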

### get_json_object
### `get_json_object` Function

Known issue:
- [Floating-point number normalization error](https://github.com/NVIDIA/spark-rapids-jni/issues/1922). `get_json_object` floating-point number normalization on the GPU could sometimes return incorrect results if the string contains high-precision values, see the String to Float and Float to String section for more details.
4 changes: 2 additions & 2 deletions docs/supported_ops.md
@@ -9279,7 +9279,7 @@ are limited.
<td rowSpan="2">JsonToStructs</td>
<td rowSpan="2">`from_json`</td>
<td rowSpan="2">Returns a struct value with the given `jsonStr` and `schema`</td>
<td rowSpan="2">This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case</td>
<td rowSpan="2">None</td>
<td rowSpan="2">project</td>
<td>jsonStr</td>
<td> </td>
@@ -9320,7 +9320,7 @@ are limited.
<td> </td>
<td> </td>
<td><b>NS</b></td>
<td><em>PS<br/>MAP only supports keys and values that are of STRING type;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>MAP only supports keys and values that are of STRING type and is only supported at the top level;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
@@ -3780,7 +3780,8 @@ object GpuOverrides extends Logging {
ExprChecks.projectOnly(
TypeSig.STRUCT.nested(jsonStructReadTypes) +
TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP,
"MAP only supports keys and values that are of STRING type"),
"MAP only supports keys and values that are of STRING type " +
"and is only supported at the top level"),
(TypeSig.STRUCT + TypeSig.MAP + TypeSig.ARRAY).nested(TypeSig.all),
Seq(ParamCheck("jsonStr", TypeSig.STRING, TypeSig.STRING))),
(a, conf, p, r) => new UnaryExprMeta[JsonToStructs](a, conf, p, r) {
@@ -3821,10 +3822,7 @@ object GpuOverrides extends Logging {
override def convertToGpu(child: Expression): GpuExpression =
// GPU implementation currently does not support duplicated json key names in input
GpuJsonToStructs(a.schema, a.options, child, a.timeZoneId)
}).disabledByDefault("it is currently in beta and undergoes continuous enhancements."+
" Please consult the "+
"[compatibility documentation](../compatibility.md#json-supporting-types)"+
" to determine whether you can enable this configuration for your use case"),
}),
expr[StructsToJson](
"Converts structs to JSON text format",
ExprChecks.projectOnly(