-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use new jni kernel for getJsonObject #10581
Use new jni kernel for getJsonObject #10581
Conversation
Signed-off-by: Haoyang Li <[email protected]>
case _ => false | ||
def unzipInstruction(instruction: PathInstruction): (Int, String, Long) = { | ||
instruction match { | ||
case Subscript => (0, "", -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use string for types instead of int to improve readability.
In JNI repo, magic integer is hard to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to enum now.
Please test this: |
GetJsonObjectOptions.builder().allowSingleQuotes(true).build()) | ||
case Some(instructions) => instructions match { | ||
case (a: List[Int], b: List[String], c: List[Long]) => { | ||
JSONUtils.getJsonObject(lhs.getBase, a.toArray, b.toArray, c.toArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename a b c to have more meaningful name.
Why not use GPU column view here for a b c as lhs did?
The GPU column view life will remain until getJsonObject
is done.
In the JNI repo, we can safely refer to the string::view in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Help update doc compatibility.md
|
Should first merge JNI PR NVIDIA/spark-rapids-jni#1893, then merge this. |
@@ -60,12 +60,7 @@ def read_json_as_text(spark, data_path, column_name): | |||
'spark.rapids.sql.expression.JsonToStructs': 'true', | |||
'spark.rapids.sql.json.read.float.enabled': 'true', | |||
'spark.rapids.sql.json.read.double.enabled': 'true', | |||
'spark.rapids.sql.json.read.decimal.enabled': 'true', | |||
'spark.rapids.sql.json.read.mixedTypesAsString.enabled': 'true' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this config: mixedTypesAsString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, it's a merge error, will revert.
if (path.isValid) { | ||
val pathStr = path.getValue.toString() | ||
JsonPathParser.parse(pathStr).map(JsonPathParser.normalize) | ||
JsonPathParser.parse(pathStr) | ||
} else { | ||
None | ||
} | ||
} | ||
|
||
override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to handle invalid json path string?
If json path is invalid, should return all null column vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It did so in line 170.
case _ => false | ||
def fallbackCheck(instructions: List[PathInstruction]): Boolean = { | ||
// JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that | ||
instructions.length > 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JNI now is using 32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done but maybe needs more test on it. Will test: If kernel works good under 32, and will plugin fallback above 32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Haoyang Li <[email protected]>
json_tuple was not updated to use the new get_json_object kernel |
In my performance tests I see between a 3x and 6x reduction in performance compared to the old implementation, but it does do the right thing most of the time, so I am happy with the results. |
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
I drafted another PR #10635 to not let it block this p0 issue, please take a look. All current xfailed cases got passed but I guess the performance will be bad. Will test soon. |
Signed-off-by: Haoyang Li <[email protected]>
Depends on NVIDIA/spark-rapids-jni#1893 Tested locally that cases from #10604 also passed. |
Signed-off-by: Haoyang Li <[email protected]>
Also added some cases related to get json object logic cases from NVIDIA/spark-rapids-jni#1893 |
data = [[r'''{'a':'A'}'''], | ||
[r'''{'b':'"B'}'''], | ||
[r'''{"c":"'C"}''']] | ||
data = [[r'''{'a':'A'}''']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this data change? Are we dropping these from the tests??
|
||
@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218") | ||
# @pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delete this instead of commenting it out?
@@ -37,8 +37,7 @@ def test_get_json_object(json_str_pattern): | |||
'get_json_object(a, "$.store.fruit[0]")', | |||
'get_json_object(\'%s\', "$.store.fruit[0]")' % scalar_json, | |||
), | |||
conf={'spark.sql.parser.escapedStringLiterals': 'true', | |||
'spark.rapids.sql.expression.GetJsonObject': 'true'}) | |||
conf={'spark.sql.parser.escapedStringLiterals': 'true'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now GetJsonObject
is on by default. Do we have the option spark.rapids.sql.expression.GetJsonObject
configured somewhere so we will remove it too? Or do we have to leave this option so the user can disable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the same option is still there, users can disable it if they want. But we don’t need to set it on in tests because it is on by default now.
// JNI kernel has a limit of 16 nested nodes, fallback to CPU if we exceed that | ||
instructions.length > 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The max length value should be queried from JNI so we always have such value sync. This can be addressed in the follow up work (which requires JNI to expose such value).
build |
1 similar comment
build |
I will file a follow-up issue and address comments soon. Merging it now… |
Fixes #10218
Fixes #10212
Fixes #10194
Fixes #10196
Fixes #10537
Fixes #10216
Fixes #10217
Fixes #9033
This PR uses new kernel from NVIDIA/spark-rapids-jni#1893 to replace the implementation in cudf to match Spark's behavior.
This PR is ready for review, but some docs are out of date, will be updated soon.
Use new kernel for json_tuple will be in a separate PR.
perf test
cpu: 10649ms
jni new kernel: 4820ms
cudf no fallback: 1527ms
no nested path similar to customer's usage:
cpu: 1038 ms
jni new kernel: 626 ms
cudf no fallback: 381 ms
also closes NVIDIA/spark-rapids-jni#1894