Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new jni kernel for getJsonObject #10581

Merged

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Mar 13, 2024

Fixes #10218
Fixes #10212
Fixes #10194
Fixes #10196
Fixes #10537
Fixes #10216
Fixes #10217
Fixes #9033

This PR uses new kernel from NVIDIA/spark-rapids-jni#1893 to replace the implementation in cudf to match Spark's behavior.

This PR is ready for review, but some docs are out of date, will be updated soon.

Use new kernel for json_tuple will be in a separate PR.

perf test

val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.time(df.selectExpr("COUNT(get_json_object(a, '$.store.bicycle')) as pr0", "COUNT(get_json_object(a, '$.store.book[0].non_exist_key')) as pr2", "COUNT(get_json_object(a, '$.store.basket[0][*].b')) as pr3", "COUNT(get_json_object(a, '$.store.book[*].reader')) as pr4", "COUNT(get_json_object(a, '$.store.book[*].category')) as pr5", "COUNT(get_json_object(a, '$.store.basket[*]')) as pr6", "COUNT(get_json_object(a, '$.store.basket[0][*]')) as pr7", "COUNT(get_json_object(a, '$.store.basket[0][2].b')) as pr8", "COUNT(get_json_object(a, '$')) as pr9").show())

cpu: 10649ms
jni new kernel: 4820ms
cudf no fallback: 1527ms

no nested path similar to customer's usage:

spark.time(df.selectExpr("COUNT(get_json_object(a, '$.owner')) as pr0", "COUNT(get_json_object(a, '$.owner')) as pr2", "COUNT(get_json_object(a, '$.owner')) as pr3", "COUNT(get_json_object(a, '$.owner')) as pr4", "COUNT(get_json_object(a, '$.owner')) as pr5", "COUNT(get_json_object(a, '$.owner')) as pr6", "COUNT(get_json_object(a, '$.owner')) as pr7", "COUNT(get_json_object(a, '$.owner')) as pr8", "COUNT(get_json_object(a, '$.owner')) as pr9").show())

cpu: 1038 ms
jni new kernel: 626 ms
cudf no fallback: 381 ms

also closes NVIDIA/spark-rapids-jni#1894

case _ => false
def unzipInstruction(instruction: PathInstruction): (Int, String, Long) = {
instruction match {
case Subscript => (0, "", -1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use string for types instead of int to improve readability.
In JNI repo, magic integer is hard to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to enum now.

@res-life
Copy link
Collaborator

Please test this:
In JNI code, add some printf to ensure the JNI interface is working.

GetJsonObjectOptions.builder().allowSingleQuotes(true).build())
case Some(instructions) => instructions match {
case (a: List[Int], b: List[String], c: List[Long]) => {
JSONUtils.getJsonObject(lhs.getBase, a.toArray, b.toArray, c.toArray)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename a b c to have more meaningful name.
Why not use GPU column view here for a b c as lhs did?
The GPU column view life will remain until getJsonObject is done.
In the JNI repo, we can safely refer to the string::view in them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@thirtiseven thirtiseven self-assigned this Mar 25, 2024
@thirtiseven thirtiseven changed the title WIP: Use new kernel for getJsonObject Use new jni kernel for getJsonObject Mar 25, 2024
Signed-off-by: Haoyang Li <[email protected]>
@res-life
Copy link
Collaborator

Help update doc compatibility.md

The following is a list of known differences.
  * [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
    is not valid JSON Apache Spark returns a null result, but ours will still try to find a match.
  * [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
    When returning a result for a quoted string Apache Spark will remove the quotes and replace
    any escape sequences with the proper characters. The escape sequence processing does not happen
    on the GPU.
  * [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212)
    If a JSON path is not valid Apache Spark returns a null result, but ours may throw an exception
    and fail the query.
  * [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218)
    When returning a result for things other than strings, a number of things are normalized by
    Apache Spark, but are not normalized by the GPU, like removing unnecessary white space,
    parsing and then serializing floating point numbers, turning single quotes to double quotes,
    and removing unneeded escapes for single quotes.

The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
   * https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings

@res-life
Copy link
Collaborator

Should first merge JNI PR NVIDIA/spark-rapids-jni#1893, then merge this.

@thirtiseven thirtiseven marked this pull request as ready for review March 25, 2024 02:43
@@ -60,12 +60,7 @@ def read_json_as_text(spark, data_path, column_name):
'spark.rapids.sql.expression.JsonToStructs': 'true',
'spark.rapids.sql.json.read.float.enabled': 'true',
'spark.rapids.sql.json.read.double.enabled': 'true',
'spark.rapids.sql.json.read.decimal.enabled': 'true',
'spark.rapids.sql.json.read.mixedTypesAsString.enabled': 'true'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this config: mixedTypesAsString?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, it's a merge error, will revert.

if (path.isValid) {
val pathStr = path.getValue.toString()
JsonPathParser.parse(pathStr).map(JsonPathParser.normalize)
JsonPathParser.parse(pathStr)
} else {
None
}
}

override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to handle invalid json path string?
If json path is invalid, should return all null column vector.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did so in line 170.

case _ => false
def fallbackCheck(instructions: List[PathInstruction]): Boolean = {
// JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that
instructions.length > 16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JNI now is using 32

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done but maybe needs more test on it. Will test: If kernel works good under 32, and will plugin fallback above 32.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@revans2
Copy link
Collaborator

revans2 commented Mar 25, 2024

json_tuple was not updated to use the new get_json_object kernel

@revans2
Copy link
Collaborator

revans2 commented Mar 25, 2024

In my performance tests I see between a 3x and 6x reduction in performance compared to the old implementation, but it does do the right thing most of the time, so I am happy with the results.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

json_tuple was not updated to use the new get_json_object kernel

I drafted another PR #10635 to not let it block this p0 issue, please take a look. All current xfailed cases got passed but I guess the performance will be bad. Will test soon.

@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Mar 26, 2024

Depends on NVIDIA/spark-rapids-jni#1893
All current test cases passed now.

Tested locally that cases from #10604 also passed.

@sameerz sameerz added the task Work required that improves the product but is not user facing label Mar 26, 2024
@GaryShen2008 GaryShen2008 requested review from revans2 and ttnghia March 27, 2024 00:26
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Mar 27, 2024

Also added some cases related to get json object logic cases from NVIDIA/spark-rapids-jni#1893

data = [[r'''{'a':'A'}'''],
[r'''{'b':'"B'}'''],
[r'''{"c":"'C"}''']]
data = [[r'''{'a':'A'}''']]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this data change? Are we dropping these from the tests??


@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")
# @pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this instead of commenting it out?

@@ -37,8 +37,7 @@ def test_get_json_object(json_str_pattern):
'get_json_object(a, "$.store.fruit[0]")',
'get_json_object(\'%s\', "$.store.fruit[0]")' % scalar_json,
),
conf={'spark.sql.parser.escapedStringLiterals': 'true',
'spark.rapids.sql.expression.GetJsonObject': 'true'})
conf={'spark.sql.parser.escapedStringLiterals': 'true'})
Copy link
Collaborator

@ttnghia ttnghia Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now GetJsonObject is on by default. Do we have the option spark.rapids.sql.expression.GetJsonObject configured somewhere so we will remove it too? Or do we have to leave this option so the user can disable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the same option is still there, users can disable it if they want. But we don’t need to set it on in tests because it is on by default now.

Comment on lines +88 to +89
// JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that
instructions.length > 32
Copy link
Collaborator

@ttnghia ttnghia Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The max length value should be queried from JNI so we always have such value sync. This can be addressed in the follow up work (which requires JNI to expose such value).

@sameerz
Copy link
Collaborator

sameerz commented Mar 27, 2024

build

1 similar comment
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Mar 28, 2024

I will file a follow-up issue and address comments soon. Merging it now…

@thirtiseven thirtiseven merged commit d7942e2 into NVIDIA:branch-24.04 Mar 28, 2024
43 checks passed
@thirtiseven thirtiseven deleted the get-json-object-new-kernel branch April 1, 2024 05:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment