[BUG] java gateway crashed due to hash_aggregate_test case intermittently #7092
Comments
FYI, the recent failed runs are based on a JNI built w/ rapidsai/cudf@c574ddf
Signed-off-by: Peixin Li <[email protected]>
Confirmed this is reproducible locally w/ smaller parallelism (spark 312 + TITAN V 12GB on my dev machine).
The issue also causes hash_aggregate_test to hang forever in an xdist run. Attached the log here.
This reverts commit c67927b.
… for #7092 [databricks] (#7102)
* Temporarily lower parallelism for #7092
  Signed-off-by: Peixin Li <[email protected]>
* Revert "Temporarily lower parallelism for #7092"
  This reverts commit c67927b.
* XFAIL test_hash_groupby_collect_partial_replace_with_distinct_fallback
  Signed-off-by: Peixin Li <[email protected]>
* skip the case instead of xfail
* xfail new case failure
  Signed-off-by: Peixin Li <[email protected]>
Signed-off-by: Peixin Li <[email protected]>
test_hash_groupby_collect_with_single_distinct also failed intermittently in https://github.com/NVIDIA/spark-rapids/actions/runs/3495199975/jobs/5851695883
Note: this failure is from the 23.02 branch, so the cudf commit may not be the same as the 22.12 one.
We will also keep monitoring this after new cudf commits get in.
I was able to hit an error in
…fallback for NVIDIA#7092 [databricks] (NVIDIA#7102)" This reverts commit c70bcfb. Signed-off-by: Jason Lowe <[email protected]>
#7142)
* Revert "Skip test_hash_groupby_collect_with_single_distinct (#7107)"
  This reverts commit d5ba2e2.
  Signed-off-by: Jason Lowe <[email protected]>
* Revert "Skip test_hash_groupby_collect_partial_replace_with_distinct_fallback for #7092 [databricks] (#7102)"
  This reverts commit c70bcfb.
  Signed-off-by: Jason Lowe <[email protected]>
Signed-off-by: Jason Lowe <[email protected]>
Describe the bug
Tests failed to connect to the Java gateway.
The Java gateway crashes intermittently (60%+ of the time), and all following cases that belong to that gateway then fail.
One of the failed runs printed a debug message.
It seems that some case running in the gateway (the failures mostly start at different hash_aggregate_test cases) fails to get more GPU memory, which shuts down the gateway.
This could also be related to a recent cudf change.
Steps/Code to reproduce bug
The failure was found in the Databricks IT running in parallel (parallelism of 4) w/ a Tesla V100 (16GB).
I also saw it, less frequently, in normal Spark IT parallel runs (parallelism of 4) w/ a TITAN V GPU (12GB).
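
For a rough sense of the memory pressure in the two setups above, here is a minimal back-of-the-envelope sketch. It assumes the GPU memory is split evenly across the parallel test workers, which may not match how the test framework actually allocates memory; the helper name `per_worker_gpu_budget_gb` and the `reserve_gb` parameter are hypothetical and only for illustration.

```python
def per_worker_gpu_budget_gb(total_gb: float, parallelism: int, reserve_gb: float = 0.0) -> float:
    """Estimate GPU memory per parallel test worker.

    Assumption (not taken from the test framework): memory left after an
    optional reserve is divided evenly among the workers.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    usable_gb = total_gb - reserve_gb
    if usable_gb < 0:
        raise ValueError("reserve_gb exceeds total_gb")
    return usable_gb / parallelism


# The two configurations mentioned in this report, both at parallelism 4.
for gpu_name, total_gb in (("Tesla V100", 16), ("TITAN V", 12)):
    budget = per_worker_gpu_budget_gb(total_gb, parallelism=4)
    print(f"{gpu_name}: ~{budget:.1f} GB per worker")
```

Under that assumption each worker gets roughly 4 GB on the V100 and 3 GB on the TITAN V. Since the TITAN V run fails less frequently despite the smaller share, the per-worker budget alone likely does not explain the crash.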