Use new jni kernel for getJsonObject #10581

thirtiseven · 2024-03-13T10:35:26Z

Fixes #10218
Fixes #10212
Fixes #10194
Fixes #10196
Fixes #10537
Fixes #10216
Fixes #10217
Fixes #9033

This PR uses the new kernel from NVIDIA/spark-rapids-jni#1893 to replace the cudf implementation, so that the results match Spark's behavior.

This PR is ready for review, but some docs are out of date and will be updated soon.

Using the new kernel for json_tuple will be done in a separate PR.

Perf test:

val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.time(df.selectExpr("COUNT(get_json_object(a, '$.store.bicycle')) as pr0", "COUNT(get_json_object(a, '$.store.book[0].non_exist_key')) as pr2", "COUNT(get_json_object(a, '$.store.basket[0][*].b')) as pr3", "COUNT(get_json_object(a, '$.store.book[*].reader')) as pr4", "COUNT(get_json_object(a, '$.store.book[*].category')) as pr5", "COUNT(get_json_object(a, '$.store.basket[*]')) as pr6", "COUNT(get_json_object(a, '$.store.basket[0][*]')) as pr7", "COUNT(get_json_object(a, '$.store.basket[0][2].b')) as pr8", "COUNT(get_json_object(a, '$')) as pr9").show())

cpu: 10649ms
jni new kernel: 4820ms
cudf no fallback: 1527ms

Without nested paths, similar to the customer's usage:

spark.time(df.selectExpr("COUNT(get_json_object(a, '$.owner')) as pr0", "COUNT(get_json_object(a, '$.owner')) as pr2", "COUNT(get_json_object(a, '$.owner')) as pr3", "COUNT(get_json_object(a, '$.owner')) as pr4", "COUNT(get_json_object(a, '$.owner')) as pr5", "COUNT(get_json_object(a, '$.owner')) as pr6", "COUNT(get_json_object(a, '$.owner')) as pr7", "COUNT(get_json_object(a, '$.owner')) as pr8", "COUNT(get_json_object(a, '$.owner')) as pr9").show())

cpu: 1038 ms
jni new kernel: 626 ms
cudf no fallback: 381 ms
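
For a side-by-side view, the ratios between the runs can be worked out from the timings posted above. This is only arithmetic on those numbers; the dictionary keys and the `ratios` helper are made-up labels for illustration, not names from the plugin.

```python
# Timings in milliseconds, copied from the two runs reported above.
timings_ms = {
    "nested paths": {"cpu": 10649, "jni": 4820, "cudf": 1527},
    "single top-level key": {"cpu": 1038, "jni": 626, "cudf": 381},
}


def ratios(run):
    """Return (speedup of the JNI kernel over CPU, slowdown of JNI versus cudf)."""
    return run["cpu"] / run["jni"], run["jni"] / run["cudf"]


for name, run in timings_ms.items():
    vs_cpu, vs_cudf = ratios(run)
    print(f"{name}: {vs_cpu:.2f}x faster than CPU, {vs_cudf:.2f}x slower than cudf")
```

On these numbers the new kernel is roughly 2.2x faster than the CPU and 3.2x slower than cudf for the nested paths, and about 1.7x faster and 1.6x slower for the single-key case.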

Also closes NVIDIA/spark-rapids-jni#1894

Signed-off-by: Haoyang Li <[email protected]>

res-life · 2024-03-14T07:52:08Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala

-      case _ => false
+  def unzipInstruction(instruction: PathInstruction): (Int, String, Long) = {
+    instruction match {
+      case Subscript => (0, "", -1)


Use strings instead of ints for the types to improve readability.
In the JNI repo, magic integers are hard to understand.

Updated to enum now.

res-life · 2024-03-14T07:54:06Z

Please test this:
Add some printf calls in the JNI code to confirm that the JNI interface is working.

res-life · 2024-03-14T08:07:23Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala

-              GetJsonObjectOptions.builder().allowSingleQuotes(true).build())
+      case Some(instructions) => instructions match {
+        case (a: List[Int], b: List[String], c: List[Long]) => {
+          JSONUtils.getJsonObject(lhs.getBase, a.toArray, b.toArray, c.toArray)


Rename a, b, and c to more meaningful names.
Why not use GPU column views here for a, b, and c, as is done for lhs?
The GPU column views will stay alive until getJsonObject is done.
In the JNI repo, we can then safely refer to the string::view values in them.
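
To make the shape of the data in this hunk concrete, here is a rough model of how a parsed path could be flattened into three parallel lists with descriptive names. It is written in Python for brevity and is not the plugin's Scala code. The numeric type codes are hypothetical (only `Subscript => (0, "", -1)` appears in the diff above), as are the tuple representation and the names `path_types`, `path_names`, and `path_indexes`. The breakdown of `$.store.book[0]` into instructions is also an assumption about how the parser splits that path.

```python
# Hypothetical type codes. Only Subscript -> (0, "", -1) is confirmed by the diff above.
SUBSCRIPT, WILDCARD, KEY, INDEX, NAMED = range(5)


def unzip_instruction(instruction):
    """Flatten one path instruction, given as (kind,) or (kind, payload),
    into a (type, name, index) triple with "" and -1 as unused placeholders."""
    kind = instruction[0]
    if kind == NAMED:
        return (kind, instruction[1], -1)
    if kind == INDEX:
        return (kind, "", instruction[1])
    return (kind, "", -1)


def to_parallel_arrays(instructions):
    """Split the triples into three same-length lists, one per field."""
    triples = [unzip_instruction(i) for i in instructions]
    path_types = [t for t, _, _ in triples]
    path_names = [n for _, n, _ in triples]
    path_indexes = [i for _, _, i in triples]
    return path_types, path_names, path_indexes


# $.store.book[0], written out by hand as an instruction list (assumed breakdown).
path = [(KEY,), (NAMED, "store"), (KEY,), (NAMED, "book"), (SUBSCRIPT,), (INDEX, 0)]
path_types, path_names, path_indexes = to_parallel_arrays(path)
```

Each position across the three lists describes one instruction, which is why names such as these read more clearly than a, b, and c at the call site.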

res-life · 2024-03-25T02:39:18Z

Please help update the doc compatibility.md:

The following is a list of known differences.
  * [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
    is not valid JSON, Apache Spark returns a null result, but ours will still try to find a match.
  * [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
    When returning a result for a quoted string, Apache Spark will remove the quotes and replace
    any escape sequences with the proper characters. The escape sequence processing does not happen
    on the GPU.
  * [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
    If a JSON path is not valid, Apache Spark returns a null result, but ours may throw an exception
    and fail the query.
  * [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
    When returning a result for things other than strings, a number of things are normalized by
    Apache Spark, but are not normalized by the GPU, like removing unnecessary white space,
    parsing and then serializing floating point numbers, turning single quotes to double quotes,
    and removing unneeded escapes for single quotes.

The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
   * https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings

res-life · 2024-03-25T02:40:44Z

The JNI PR NVIDIA/spark-rapids-jni#1893 should be merged first, then this one.

res-life · 2024-03-25T02:46:38Z

integration_tests/src/main/python/json_matrix_test.py

@@ -60,12 +60,7 @@ def read_json_as_text(spark, data_path, column_name):
    'spark.rapids.sql.expression.JsonToStructs': 'true',
    'spark.rapids.sql.json.read.float.enabled': 'true',
    'spark.rapids.sql.json.read.double.enabled': 'true',
-    'spark.rapids.sql.json.read.decimal.enabled': 'true',
-    'spark.rapids.sql.json.read.mixedTypesAsString.enabled': 'true'


Why was the mixedTypesAsString config removed?

Good catch, it's a merge error; will revert.

res-life · 2024-03-25T02:52:07Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala

    if (path.isValid) {
      val pathStr = path.getValue.toString()
-      JsonPathParser.parse(pathStr).map(JsonPathParser.normalize)
+      JsonPathParser.parse(pathStr)
    } else {
      None
    }
  }

  override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {


How is an invalid JSON path string handled?
If the JSON path is invalid, this should return an all-null column vector.

It does so at line 170.

res-life · 2024-03-25T02:52:48Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala

-      case _ => false
+  def fallbackCheck(instructions: List[PathInstruction]): Boolean = {
+    // JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that
+    instructions.length > 16


The JNI is now using 32.

Done, but it may need more testing. Will test whether the kernel works well within the limit of 32 and whether the plugin falls back above 32.
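
The boundary being tested here can be sketched as follows. This is a Python model of the check for illustration, not the Scala `fallbackCheck` from the diff; the constant name `MAX_PATH_DEPTH` and the representation of instructions as a plain list are assumptions.

```python
# Hypothetical constant: per the comment above, the JNI kernel now supports 32 levels.
MAX_PATH_DEPTH = 32


def fallback_check(instructions):
    """Return True when the parsed path has more instructions than the kernel
    supports, meaning the expression should fall back to the CPU."""
    return len(instructions) > MAX_PATH_DEPTH


# A path of exactly 32 instructions stays on the GPU; 33 triggers the fallback.
at_limit = ["instruction"] * 32
over_limit = ["instruction"] * 33
print(fallback_check(at_limit), fallback_check(over_limit))
```

Because the comparison is strict, a path with exactly 32 instructions is still handled by the kernel, so both sides of the boundary are worth covering in tests.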

revans2 · 2024-03-25T15:27:29Z

json_tuple was not updated to use the new get_json_object kernel

revans2 · 2024-03-25T18:40:00Z

In my performance tests I see between a 3x and 6x reduction in performance compared to the old implementation, but it does do the right thing most of the time, so I am happy with the results.

thirtiseven · 2024-03-26T08:10:36Z

> json_tuple was not updated to use the new get_json_object kernel

I drafted a separate PR, #10635, so that it does not block this P0 issue; please take a look. All currently xfailed cases pass there, but I expect the performance to be poor. Will test soon.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2024-03-26T14:55:10Z

Depends on NVIDIA/spark-rapids-jni#1893
All current test cases pass now.

Tested locally that the cases from #10604 also pass.

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven · 2024-03-27T10:37:20Z

Also added some cases for the get_json_object logic, taken from the cases in NVIDIA/spark-rapids-jni#1893.

revans2 · 2024-03-27T19:03:03Z

integration_tests/src/main/python/get_json_test.py

-    data = [[r'''{'a':'A'}'''],
-            [r'''{'b':'"B'}'''],
-            [r'''{"c":"'C"}''']]
+    data = [[r'''{'a':'A'}''']]


Why did this data change? Are we dropping these from the tests?

revans2 · 2024-03-27T19:03:41Z

integration_tests/src/main/python/get_json_test.py


-@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")
+# @pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")


Can we delete this instead of commenting it out?

ttnghia · 2024-03-27T19:52:18Z

integration_tests/src/main/python/get_json_test.py

@@ -37,8 +37,7 @@ def test_get_json_object(json_str_pattern):
            'get_json_object(a, "$.store.fruit[0]")',
            'get_json_object(\'%s\', "$.store.fruit[0]")' % scalar_json,
            ),
-        conf={'spark.sql.parser.escapedStringLiterals': 'true',
-            'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        conf={'spark.sql.parser.escapedStringLiterals': 'true'})


GetJsonObject is now on by default. Is the option spark.rapids.sql.expression.GetJsonObject configured anywhere else, so that we should remove it there too? Or do we have to keep this option so that users can disable it?

Yes, the same option is still there, so users can disable it if they want. We no longer need to turn it on in tests because it is on by default now.

ttnghia · 2024-03-27T19:53:44Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala

+    // JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that
+    instructions.length > 32


Nit: The max length should be queried from JNI so that the two values always stay in sync. This can be addressed in follow-up work (which requires JNI to expose the value).

sameerz · 2024-03-27T21:09:38Z

build

thirtiseven · 2024-03-28T09:03:58Z

build

thirtiseven · 2024-03-28T12:47:08Z

I will file a follow-up issue and address comments soon. Merging it now…

Use new kernel for getJsonObject

b67d52b

Signed-off-by: Haoyang Li <[email protected]>

res-life reviewed Mar 14, 2024

Use table to pass parsed path

a5ea88b

Signed-off-by: Haoyang Li <[email protected]>

This was referenced Mar 19, 2024

[BUG] GetJsonObject should return null for invalid query instead of throwing an exception #10212

Closed

[FEA] Fix GetJsonObject #10254

Open

thirtiseven added 3 commits March 21, 2024 09:33

use list/vector of instruction objects

257a6d6

Signed-off-by: Haoyang Li <[email protected]>

fallback when nested too long

baad1e5

Signed-off-by: Haoyang Li <[email protected]>

cancel xfail cases

f74ac0f

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven self-assigned this Mar 25, 2024

Merge branch 'branch-24.04' into get-json-object-new-kernel

517e6d2

thirtiseven changed the title ~~WIP: Use new kernel for getJsonObject~~ Use new jni kernel for getJsonObject Mar 25, 2024

cancel xfail cases

02ff03a

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven marked this pull request as ready for review March 25, 2024 02:43

res-life reviewed Mar 25, 2024

generated and modified docs

bea3b45

Signed-off-by: Haoyang Li <[email protected]>

wip

d368eb8

Signed-off-by: Haoyang Li <[email protected]>

This was referenced Mar 26, 2024

[FEA] GetJsonObject: Implement get-json-object in JNI repo as Spark does NVIDIA/spark-rapids-jni#1823

Closed

[FEA] GetJsonObject: perf test for new version of GetJsonObject NVIDIA/spark-rapids-jni#1894

Closed

wip

310916f

Signed-off-by: Haoyang Li <[email protected]>

apply jni change and remove xpass

700cf5a

Signed-off-by: Haoyang Li <[email protected]>

sameerz added the task Work required that improves the product but is not user facing label Mar 26, 2024

GaryShen2008 requested review from revans2 and ttnghia March 27, 2024 00:26

Adds test cases

a1a4623

Signed-off-by: Haoyang Li <[email protected]>

res-life mentioned this pull request Mar 27, 2024

New implementation of getJsonObject NVIDIA/spark-rapids-jni#1893

Merged

revans2 approved these changes Mar 27, 2024

ttnghia reviewed Mar 27, 2024

thirtiseven merged commit d7942e2 into NVIDIA:branch-24.04 Mar 28, 2024
43 checks passed

thirtiseven mentioned this pull request Apr 1, 2024

[FOLLOWUP] Query max instructions length in getJsonObject from JNI #10646

Closed

thirtiseven deleted the get-json-object-new-kernel branch April 1, 2024 05:03

thirtiseven mentioned this pull request Apr 2, 2024

Add a config to switch back to old impl for getJsonObject #10654

Merged
