Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new getJsonObject kernel for json_tuple #10635

Merged
merged 9 commits into from
Apr 25, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Mar 26, 2024

This PR updates json_tuple with new getJsonObject kernel.

All current xfailed cases got passed:

./integration_tests/run_pyspark_from_build.sh -s -k json_tuple
......
============= 16 passed, 13 xpassed, 8 warnings in 49.36s ============

I think the performance will not be good because it calls getJsonObject kernel many times, which is not very fast by itself.

With the new json_parser in jni, I think we can implement a kernel for json_tuple to get much higher performance by passing all fields in one pass. So this PR will be a short-term workaround, even if it gets merged.

Depends on NVIDIA/spark-rapids-jni#1893

@thirtiseven
Copy link
Collaborator Author

a quick perf test:

val data = Seq.fill(3000000)("""{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"basket":[[1,2,{"b":"y","a":"x"}],[3,4],[5,6]],"book":[{"author":"Nigel Rees","title":"Sayings of the Century","category":"reference","price":8.95},{"author":"Herman Melville","title":"Moby Dick","category":"fiction","price":8.99,"isbn":"0-553-21311-3"},{"author":"J. R. R. Tolkien","title":"The Lord of the Rings","category":"fiction","reader":[{"age":25,"name":"bob"},{"age":26,"name":"jack"}],"price":22.99,"isbn":"0-395-19395-8"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy","zip code":"94025","fb:testid":"1234"}""")

import spark.implicits._
data.toDF("a").write.mode("overwrite").parquet("JSON")

val df = spark.read.parquet("JSON")

spark.conf.set("spark.rapids.sql.expression.JsonTuple", true)

spark.time(df.selectExpr("json_tuple(a, 'store', 'reader', 'bicycle', 'owner', 'zip code', 'email')").count())

6 fields:
CPU: 65216 ms
GPU: 7851 ms

1 fields:
CPU: 68050 ms
GPU: 2070 ms

Wow, so it is actually quite fast. Not sure if I tested it right.

@revans2
Copy link
Collaborator

revans2 commented Mar 26, 2024

Wow, so it is actually quite fast. Not sure if I tested it right.

A bit of feedback on the quick test.

  1. Looks like you only ran it once. Cold runs are usually a lot slower than hot runs, but even then.
  2. All of your data is the same. Which means that there is no thread divergence in the GPU.
  3. You don't mention what CPU/system was used so it is hard to tell if it is a fair comparison or not.
  4. It would be nice to see how fast the parquet read is compared to the CPU. all of the performance gains might be in that, just because it is a single string column repeated 3,000,000 times.

revans2
revans2 previously approved these changes Mar 26, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also be nice to have a follow on issue to see if we can drop the special field name checks.

@thirtiseven thirtiseven self-assigned this Mar 27, 2024
@thirtiseven thirtiseven marked this pull request as ready for review March 27, 2024 06:34
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Mar 27, 2024

Might also be nice to have a follow on issue to see if we can drop the special field name checks.

Updated, the special field name checks are safe to drop.

@sameerz sameerz added the task Work required that improves the product but is not user facing label Mar 27, 2024
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven thirtiseven changed the base branch from branch-24.04 to branch-24.06 April 2, 2024 02:27
@thirtiseven thirtiseven requested a review from revans2 April 24, 2024 09:36
revans2
revans2 previously approved these changes Apr 24, 2024
@thirtiseven
Copy link
Collaborator Author

build

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Verified again to generate doc. Seems that ./build/buildall only generates docs per shims, not the main ones.

@thirtiseven
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Apr 24, 2024
@thirtiseven
Copy link
Collaborator Author

build

@revans2 revans2 merged commit 63088f1 into NVIDIA:branch-24.06 Apr 25, 2024
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants