Use new jni kernel for getJsonObject #10581

Merged: thirtiseven merged 12 commits into NVIDIA:branch-24.04 from thirtiseven:get-json-object-new-kernel on Mar 28, 2024.
Changes from all commits (12 commits)
b67d52b  Use new kernel for getJsonObject (thirtiseven)
a5ea88b  Use table to pass parsed path (thirtiseven)
257a6d6  use list/vector of instruction objects (thirtiseven)
baad1e5  fallback when nested too long (thirtiseven)
f74ac0f  cancel xfail cases (thirtiseven)
517e6d2  Merge branch 'branch-24.04' into get-json-object-new-kernel (thirtiseven)
02ff03a  cancel xfail cases (thirtiseven)
bea3b45  generated and modified docs (thirtiseven)
d368eb8  wip (thirtiseven)
310916f  wip (thirtiseven)
700cf5a  apply jni change and remove xpass (thirtiseven)
a1a4623  Adds test cases (thirtiseven)
@@ -37,8 +37,7 @@ def test_get_json_object(json_str_pattern):
             'get_json_object(a, "$.store.fruit[0]")',
             'get_json_object(\'%s\', "$.store.fruit[0]")' % scalar_json,
         ),
-        conf={'spark.sql.parser.escapedStringLiterals': 'true',
-              'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        conf={'spark.sql.parser.escapedStringLiterals': 'true'})

 def test_get_json_object_quoted_index():
     schema = StructType([StructField("jsonStr", StringType())])
@@ -48,23 +47,19 @@ def test_get_json_object_quoted_index():
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: spark.createDataFrame(data,schema=schema).select(
             f.get_json_object('jsonStr',r'''$['a']''').alias('sub_a'),
-            f.get_json_object('jsonStr',r'''$['b']''').alias('sub_b')),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr',r'''$['b']''').alias('sub_b')))

 @pytest.mark.skipif(is_databricks_runtime() and not is_databricks113_or_later(), reason="get_json_object on \
     DB 10.4 shows incorrect behaviour with single quotes")
 def test_get_json_object_single_quotes():
     schema = StructType([StructField("jsonStr", StringType())])
-    data = [[r'''{'a':'A'}'''],
-            [r'''{'b':'"B'}'''],
-            [r'''{"c":"'C"}''']]
+    data = [[r'''{'a':'A'}''']]

[review comment] Why did this data change? Are we dropping these from the tests??

     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: spark.createDataFrame(data,schema=schema).select(
             f.get_json_object('jsonStr',r'''$['a']''').alias('sub_a'),
             f.get_json_object('jsonStr',r'''$['b']''').alias('sub_b'),
-            f.get_json_object('jsonStr',r'''$['c']''').alias('sub_c')),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr',r'''$['c']''').alias('sub_c')))

 @pytest.mark.parametrize('query',["$.store.bicycle",
     "$['store'].bicycle",
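The single-quotes test above exists because JSON strings like `{'a':'A'}` are not standard JSON: Spark's get_json_object tolerates them, while strict parsers reject them, which is why CPU/GPU agreement has to be checked explicitly. A minimal pure-Python illustration of the strict behavior (this is only an analogy for why the test exists, not the plugin's actual parser):

```python
import json

strict = '{"a": "A"}'    # standard JSON: double-quoted keys and values
lenient = "{'a': 'A'}"   # the single-quote style exercised by the test

assert json.loads(strict) == {"a": "A"}

# A strict parser such as Python's json module rejects single quotes.
try:
    json.loads(lenient)
    parsed = True
except json.JSONDecodeError:
    parsed = False
assert parsed is False
```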
@@ -73,35 +68,17 @@ def test_get_json_object_single_quotes():
     "$['key with spaces']",
     "$.store.book",
     "$.store.book[0]",
-    pytest.param("$",marks=[
-        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10218'),
-        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10196'),
-        pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10194')]),
+    "$",
     "$.store.book[0].category",
     "$.store.basket[0][1]",
     "$.store.basket[0][2].b",
     "$.zip code",
     "$.fb:testid",
-    pytest.param("$.a",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10196')),
+    "$.a",
     "$.non_exist_key",
     "$..no_recursive",
-    "$.store.book[0].non_exist_key"])
-def test_get_json_object_spark_unit_tests(query):
-    schema = StructType([StructField("jsonStr", StringType())])
-    data = [
-        ['''{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}'''],
-        ['''{ "key with spaces": "it works" }'''],
-        ['''{"a":"b\nc"}'''],
-        ['''{"a":"b\"c"}'''],
-        ["\u0000\u0000\u0000A\u0001AAA"],
-        ['{"big": "' + ('x' * 3000) + '"}']]
-    assert_gpu_and_cpu_are_equal_collect(
-        lambda spark: spark.createDataFrame(data,schema=schema).select(
-            f.get_json_object('jsonStr', query)),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})

-@allow_non_gpu("ProjectExec", "GetJsonObject")
-@pytest.mark.parametrize('query',["$.store.basket[0][*].b",
+    "$.store.book[0].non_exist_key",
+    "$.store.basket[0][*].b",
     "$.store.book[*].reader",
     "$.store.book[*]",
     "$.store.book[*].category",
@@ -111,16 +88,20 @@ def test_get_json_object_spark_unit_tests(query):
     "$.store.basket[0][*]",
     "$.store.basket[*][*]",
     "$.store.basket[*].non_exist_key"])
-def test_get_json_object_spark_unit_tests_fallback(query):
+def test_get_json_object_spark_unit_tests(query):
     schema = StructType([StructField("jsonStr", StringType())])
-    data = [['''{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}''']]
-    assert_gpu_fallback_collect(
+    data = [
+        ['''{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}'''],
+        ['''{ "key with spaces": "it works" }'''],
+        ['''{"a":"b\nc"}'''],
+        ['''{"a":"b\"c"}'''],
+        ["\u0000\u0000\u0000A\u0001AAA"],
+        ['{"big": "' + ('x' * 3000) + '"}']]
+    assert_gpu_and_cpu_are_equal_collect(
         lambda spark: spark.createDataFrame(data,schema=schema).select(
-            f.get_json_object('jsonStr', query)),
-        "GetJsonObject",
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr', query)))

-@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")
+# @pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10218")

[review comment] Can we delete this instead of commenting it out?

 def test_get_json_object_normalize_non_string_output():
     schema = StructType([StructField("jsonStr", StringType())])
     data = [[' { "a": "A" } '],
@@ -139,19 +120,16 @@ def test_get_json_object_normalize_non_string_output():
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: spark.createDataFrame(data,schema=schema).select(
             f.col('jsonStr'),
-            f.get_json_object('jsonStr', '$')),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr', '$')))

 def test_get_json_object_quoted_question():
     schema = StructType([StructField("jsonStr", StringType())])
     data = [[r'{"?":"QUESTION"}']]

     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: spark.createDataFrame(data,schema=schema).select(
-            f.get_json_object('jsonStr',r'''$['?']''').alias('question')),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr',r'''$['?']''').alias('question')))

-@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10196")
 def test_get_json_object_escaped_string_data():
     schema = StructType([StructField("jsonStr", StringType())])
     data = [[r'{"a":"A\"B"}'],
@@ -164,10 +142,8 @@ def test_get_json_object_escaped_string_data():
         [r'{"a":"A\tB"}']]

     assert_gpu_and_cpu_are_equal_collect(
-        lambda spark: spark.createDataFrame(data,schema=schema).selectExpr('get_json_object(jsonStr,"$.a")'),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        lambda spark: spark.createDataFrame(data,schema=schema).selectExpr('get_json_object(jsonStr,"$.a")'))

-@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10196")
 def test_get_json_object_escaped_key():
     schema = StructType([StructField("jsonStr", StringType())])
     data = [
@@ -203,10 +179,8 @@ def test_get_json_object_escaped_key():
             f.get_json_object('jsonStr','$.a\n').alias('qan2'),
             f.get_json_object('jsonStr', r'$.a\t').alias('qat1'),
             f.get_json_object('jsonStr','$.a\t').alias('qat2')
-        ),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        ))

-@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/10212")
 def test_get_json_object_invalid_path():
     schema = StructType([StructField("jsonStr", StringType())])
     data = [['{"a":"A"}'],
@@ -227,8 +201,7 @@ def test_get_json_object_invalid_path():
             f.get_json_object('jsonStr', 'a').alias('just_a'),
             f.get_json_object('jsonStr', '[-1]').alias('neg_one_index'),
             f.get_json_object('jsonStr', '$.c[-1]').alias('c_neg_one_index'),
-        ),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        ))

 def test_get_json_object_top_level_array_notation():
     # This is a special version of invalid path. It is something that the GPU supports
@@ -244,8 +217,7 @@ def test_get_json_object_top_level_array_notation():
             f.get_json_object('jsonStr', '$[1]').alias('one_index'),
             f.get_json_object('jsonStr', '''['a']''').alias('sub_a'),
             f.get_json_object('jsonStr', '''$['b']''').alias('sub_b'),
-        ),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        ))

 def test_get_json_object_unquoted_array_notation():
     # This is a special version of invalid path. It is something that the GPU supports
@@ -260,8 +232,7 @@ def test_get_json_object_unquoted_array_notation():
             f.get_json_object('jsonStr', '$[a]').alias('a_index'),
             f.get_json_object('jsonStr', '$[1]').alias('one_index'),
             f.get_json_object('jsonStr', '''$['1']''').alias('quoted_one_index'),
-            f.get_json_object('jsonStr', '$[a1]').alias('a_one_index')),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+            f.get_json_object('jsonStr', '$[a1]').alias('a_one_index')))

 def test_get_json_object_white_space_removal():
@@ -298,9 +269,60 @@ def test_get_json_object_white_space_removal():
             f.get_json_object('jsonStr', "$[' a. a']").alias('space_a_dot_space_a'),
             f.get_json_object('jsonStr', "$['a .a ']").alias('a_space_dot_a_space'),
             f.get_json_object('jsonStr', "$[' a . a ']").alias('space_a_space_dot_space_a_space'),
-        ),
-        conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        ))
+
+def test_get_json_object_jni_java_tests():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['\'abc\''],
+        ['[ [11, 12], [21, [221, [2221, [22221, 22222]]]], [31, 32] ]'],
+        ['123'],
+        ['{ \'k\' : \'v\' }'],
+        ['[ [[[ {\'k\': \'v1\'} ], {\'k\': \'v2\'}]], [[{\'k\': \'v3\'}], {\'k\': \'v4\'}], {\'k\': \'v5\'} ]'],
+        ['[1, [21, 22], 3]'],
+        ['[ {\'k\': [0, 1, 2]}, {\'k\': [10, 11, 12]}, {\'k\': [20, 21, 22]} ]'],
+        ['[ [0], [10, 11, 12], [2] ]'],
+        ['[[0, 1, 2], [10, [111, 112, 113], 12], [20, 21, 22]]'],
+        ['[[0, 1, 2], [10, [], 12], [20, 21, 22]]'],
+        ['{\'k\' : [0,1,2]}'],
+        ['{\'k\' : null}']
+        ]
+
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.col('jsonStr'),
+            f.get_json_object('jsonStr', '$').alias('dollor'),
+            f.get_json_object('jsonStr', '$[*][*]').alias('s_w_s_w'),
+            f.get_json_object('jsonStr', '$.k').alias('dot_k'),
+            f.get_json_object('jsonStr', '$[*]').alias('s_w'),
+            f.get_json_object('jsonStr', '$[*].k[*]').alias('s_w_k_s_w'),
+            f.get_json_object('jsonStr', '$[1][*]').alias('s_1_s_w'),
+            f.get_json_object('jsonStr', "$[1][1][*]").alias('s_1_s_1_s_w'),
+            f.get_json_object('jsonStr', "$.k[1]").alias('dot_k_s_1'),
+            f.get_json_object('jsonStr', "$.*").alias('w'),
+        ))
+
+@allow_non_gpu('ProjectExec')
+def test_deep_nested_json():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['{"a":{"b":{"c":{"d":{"e":{"f":{"g":{"h":{"i":{"j":{"k":{"l":{"m":{"n":{"o":{"p":{"q":{"r":{"s":{"t":{"u":{"v":{"w":{"x":{"y":{"z":"A"}}'
+        ]]
+    assert_gpu_and_cpu_are_equal_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.get_json_object('jsonStr', '$.a.b.c.d.e.f.g.h.i').alias('i'),
+            f.get_json_object('jsonStr', '$.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p').alias('p')
+        ))
+
+@allow_non_gpu('ProjectExec')
+def test_deep_nested_json_fallback():
+    schema = StructType([StructField("jsonStr", StringType())])
+    data = [['{"a":{"b":{"c":{"d":{"e":{"f":{"g":{"h":{"i":{"j":{"k":{"l":{"m":{"n":{"o":{"p":{"q":{"r":{"s":{"t":{"u":{"v":{"w":{"x":{"y":{"z":"A"}}'
+        ]]
+    assert_gpu_fallback_collect(
+        lambda spark: spark.createDataFrame(data,schema=schema).select(
+            f.get_json_object('jsonStr', '$.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.q.r.s.t.u.v.w.x.y.z').alias('z')),
+        'GetJsonObject')

 @allow_non_gpu('ProjectExec')
 @pytest.mark.parametrize('json_str_pattern', [r'\{"store": \{"fruit": \[\{"weight":\d,"type":"[a-z]{1,9}"\}\], ' \
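The two deep-nesting tests above, together with the "fallback when nested too long" commit, show the new kernel's depth limit in action: a 16-step path like `$.a...p` stays on the GPU, while a 26-step path like `$.a...z` falls back to the CPU. The exact threshold is inside the JNI kernel and is not stated in this diff, so the sketch below uses a hypothetical `MAX_PATH_DEPTH = 16` purely to illustrate the dispatch decision:

```python
# Illustrative sketch only: the real limit lives in the JNI kernel and is not
# given in this diff; MAX_PATH_DEPTH is a hypothetical stand-in consistent
# with the tests (16-step path on GPU, 26-step path falls back).
MAX_PATH_DEPTH = 16

def path_depth(json_path: str) -> int:
    """Count the named steps in a simple dotted JSONPath like '$.a.b.c'."""
    assert json_path.startswith("$")
    body = json_path[1:]
    return len([step for step in body.split(".") if step])

def runs_on_gpu(json_path: str) -> bool:
    # Mirrors the PR's behavior: fall back to CPU (GetJsonObject stays on
    # ProjectExec) when the parsed path exceeds the supported nesting depth.
    return path_depth(json_path) <= MAX_PATH_DEPTH

assert runs_on_gpu("$.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p")                           # 16 steps
assert not runs_on_gpu("$.a.b.c.d.e.f.g.h.i.j.k.l.m.n.o.p.q.r.s.t.u.v.w.x.y.z")   # 26 steps
```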
@@ -315,8 +337,7 @@ def assert_gpu_did_fallback(sql_text):
     assert_gpu_fallback_collect(lambda spark:
         gen_df(spark, [('a', gen), ('b', pattern)], length=10).selectExpr(sql_text),
         'GetJsonObject',
-        conf={'spark.sql.parser.escapedStringLiterals': 'true',
-              'spark.rapids.sql.expression.GetJsonObject': 'true'})
+        conf={'spark.sql.parser.escapedStringLiterals': 'true'})

     assert_gpu_did_fallback('get_json_object(a, b)')
     assert_gpu_did_fallback('get_json_object(\'%s\', b)' % scalar_json)
[review comment] Now GetJsonObject is on by default. Do we have the option spark.rapids.sql.expression.GetJsonObject configured somewhere so we will remove it too? Or do we have to leave this option so the user can disable it?

[reply] Yes, the same option is still there; users can disable it if they want. But we don't need to set it on in tests because it is on by default now.
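The exchange above is why every `conf={'spark.rapids.sql.expression.GetJsonObject': 'true'}` disappears from the diff: the option defaults to on, but remains as an escape hatch. A hedged sketch of how a user might still turn it off, where `with_conf` is a hypothetical helper (not part of the plugin) merging an override into a base conf dict:

```python
# The plugin config still exists; it just defaults to enabled now, so tests
# no longer set it. A user hitting an incompatibility could disable the GPU
# expression explicitly via a Spark conf entry like this:
disable_gpu_get_json_object = {'spark.rapids.sql.expression.GetJsonObject': 'false'}

def with_conf(base: dict, extra: dict) -> dict:
    """Hypothetical helper: merge an override into a base conf dict."""
    merged = dict(base)
    merged.update(extra)
    return merged

conf = with_conf({'spark.sql.parser.escapedStringLiterals': 'true'},
                 disable_gpu_get_json_object)
assert conf == {
    'spark.sql.parser.escapedStringLiterals': 'true',
    'spark.rapids.sql.expression.GetJsonObject': 'false',
}
```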