Fix the broadcast joins issues caused by InputFileBlockRule[databricks] #9673

firestarman · 2023-11-10T04:10:34Z

InputFileBlockRule may change the meta of a broadcast join and its child plans, and this change can violate the requirement for running a broadcast join on the GPU, leading to errors. A GPU broadcast join requires its build-side BroadcastExchangeExec to run on the GPU; likewise, if the BroadcastExchangeExec runs on the CPU, the broadcast join must also run on the CPU.
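To make the constraint concrete, here is a minimal Python sketch of it, not the plugin's Scala code. `Meta`, `will_not_work_on_gpu`, and `align_join_with_build_side` are hypothetical names used only for illustration; the sketch models just the one direction stated above (a CPU build-side exchange forces the join onto the CPU).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Meta:
    """Toy stand-in for a plan node's tagging metadata (hypothetical)."""
    name: str
    reasons: List[str] = field(default_factory=list)

    @property
    def on_gpu(self) -> bool:
        # A node stays on the GPU only while no fallback reason is recorded.
        return not self.reasons

    def will_not_work_on_gpu(self, reason: str) -> None:
        self.reasons.append(reason)


def align_join_with_build_side(join: Meta, build_exchange: Meta) -> None:
    """If the build-side exchange is on the CPU, move the join to the CPU too."""
    if join.on_gpu and not build_exchange.on_gpu:
        join.will_not_work_on_gpu("the broadcast for this join must be on the GPU too")
```

Running such a check after any rule that can change tagging keeps the join and its build-side exchange on the same side.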

Changes made:

- Optimize the InputFileBlockRule by skipping BroadcastExchangeLike, because the file info cannot come from a broadcast. (This idea is from "[BUG] InputFileBlock walks through broadcasts and does not deal with mismatched broadcasts" #9473.)
- Check the tagging for broadcast joins again after applying the InputFileBlockRule, to fix any mismatch it may have introduced.
- Some API refactoring: move all input-file-related methods into the InputFileBlockRule object.
- Add tests.

I also tested the user case from the linked issue locally, and it passes with this fix.
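The idea of skipping the broadcast exchange during the plan walk can be sketched as follows. This is a simplified Python model under stated assumptions, not the rule's actual implementation: `Node`, its flags, and `scans_visible_to` are hypothetical, and the real rule works on plan metas rather than on a bare tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy plan node; every field here is a hypothetical simplification."""
    name: str
    children: List["Node"] = field(default_factory=list)
    is_broadcast_exchange: bool = False
    is_file_scan: bool = False


def scans_visible_to(node: Node) -> List[str]:
    """Collect the file scans whose file info could reach `node`.

    The walk does not descend through a broadcast exchange, because
    file info cannot come from a broadcast.
    """
    if node.is_broadcast_exchange:
        return []
    if node.is_file_scan:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(scans_visible_to(child))
    return found


# A join whose stream side reads a file directly and whose build side
# arrives through a broadcast exchange.
stream_scan = Node("stream_scan", is_file_scan=True)
build_scan = Node("build_scan", is_file_scan=True)
exchange = Node("broadcast_exchange", [build_scan], is_broadcast_exchange=True)
join = Node("broadcast_hash_join", [stream_scan, exchange])
project = Node("project_with_input_file_name", [join])

print(scans_visible_to(project))
```

In this model only the stream-side scan is reported, so the build side behind the broadcast is left alone by the walk.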

Signed-off-by: Firestarman <[email protected]>

winningsix · 2023-11-10T07:53:35Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/InputFileBlockRule.scala

+    case _: InputFileName => true
+    case _: InputFileBlockStart => true
+    case _: InputFileBlockLength => true
+    case _: GpuInputFileName => true


Hmm, why do we still need to return true when the expression has already been converted to its GPU version? The reason given above is that GPU plans may get an incorrect file name, block start, or block length from a CPU scan.

This is used in two stages of the overriding process. The stage that runs after the row/columnar transitions are inserted may see either an InputFileName or a GpuInputFileName.

For the issue in question, we will never get a GpuInputFileName, since plan conversion has not happened yet.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsMeta.scala

Signed-off-by: Firestarman <[email protected]>

firestarman · 2023-11-10T12:14:28Z

build

firestarman · 2023-11-13T02:05:49Z

build

Signed-off-by: Firestarman <[email protected]>

firestarman · 2023-11-13T02:37:43Z

build

firestarman · 2023-11-14T02:01:16Z

Hi @revans2, could you take a look at this? Thanks.

winningsix

LGTM

winningsix · 2023-11-15T09:59:37Z

...ugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastHashJoinExecBase.scala

@@ -78,6 +78,22 @@ abstract class GpuBroadcastHashJoinMetaBase(
    }
  }

+  // Called in runAfterTagRules for a special post tagging for this broadcast join.
+  def checkTagForBuildSide(): Unit = {


Would it make more sense to move this into GpuBroadcastJoinMeta?

I did not do that because there are 4 shims for GpuBroadcastJoinMeta, which means I would need to duplicate this code 4 times. The current option looks much simpler: the code is duplicated only twice.

revans2 · 2023-11-15T16:33:24Z

integration_tests/src/main/python/join_test.py

+                         ids=["GpuParquetScan", "ParquetScan"])
+@pytest.mark.parametrize("is_gpu_broadcast", [True, False],
+                         ids=["GpuBroadcastExchange", "BroadcastExchange"])
+def test_broadcast_hash_join_fix_fallback_by_inputfile(spark_tmp_path, is_gpu_parquet,


I ran these tests on the current 23.12. test_broadcast_hash_join_fix_fallback_by_inputfile[BroadcastExchange-ParquetScan] produced the wrong answer, while test_broadcast_hash_join_fix_fallback_by_inputfile[GpuBroadcastExchange-ParquetScan] failed because it did not fall back as expected:

java.lang.AssertionError: assertion failed: Could not find BroadcastHashJoinExec in the Spark plan

test_broadcast_nested_join_fix_fallback_by_inputfile passed in all cases and none of them triggered the error as described in #9469

Can we please add in a test that is the same as #9469 so we can be sure that it is fixed?

The case in #9469 requires Iceberg to run, so we cannot test it on Spark 330+. Is that OK?

I updated the tests; they now reproduce the same error as #9469 on the current 23.12.

E   raise Py4JJavaError(
        "An error occurred while calling {0}{1}{2}.\n".
>       format(target_id, ".", name), value)
E   py4j.protocol.Py4JJavaError: An error occurred while calling o659.collectToPython.
E   : java.lang.IllegalStateException: the broadcast must be on the GPU too
E       at com.nvidia.spark.rapids.shims.GpuBroadcastJoinMeta.verifyBuildSideWasReplaced(GpuBroadcastJoinMeta.scala:72)
E       at org.apache.spark.sql.rapids.execution.GpuBroadcastNestedLoopJoinMeta.convertToGpu(GpuBroadcastNestedLoopJoinExec.scala:59)
E       at org.apache.spark.sql.rapids.execution.GpuBroadcastNestedLoopJoinMeta.convertToGpu(GpuBroadcastNestedLoopJoinExec.scala:45)
E       at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:799)

    raise Py4JJavaError(
        "An error occurred while calling {0}{1}{2}.\n".
>       format(target_id, ".", name), value)
E   py4j.protocol.Py4JJavaError: An error occurred while calling o663.collectToPython.
E   : java.lang.IllegalStateException: the broadcast must be on the GPU too
E       at com.nvidia.spark.rapids.shims.GpuBroadcastJoinMeta.verifyBuildSideWasReplaced(GpuBroadcastJoinMeta.scala:72)
E       at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinMeta.convertToGpu(GpuBroadcastHashJoinExec.scala:63)
E       at org.apache.spark.sql.rapids.execution.GpuBroadcastHashJoinMeta.convertToGpu(GpuBroadcastHashJoinExec.scala:44)
E       at com.nvidia.spark.rapids.SparkPlanMeta.convertIfNeeded(RapidsMeta.scala:799)

@revans2 Could you take a look again? Thx in advance.

Signed-off-by: Firestarman <[email protected]>

firestarman · 2023-11-16T03:12:23Z

build

firestarman · 2023-11-16T06:06:21Z

The failing test is unrelated; trying again.

firestarman · 2023-11-16T06:06:26Z

build

revans2

This looks better. I have not manually run the tests yet. But it looks correct.

fix broadcast join issues caused by InputFileBlockRule

2c742c0

Signed-off-by: Firestarman <[email protected]>

firestarman requested review from winningsix and revans2 November 10, 2023 04:11

firestarman changed the title ~~Fix the broadcast join issues caused by InputFileBlockRule~~ Fix the broadcast join issues caused by InputFileBlockRule[databricks] Nov 10, 2023

avoid duplicate override message

314e8ce

Signed-off-by: Firestarman <[email protected]>

winningsix reviewed Nov 10, 2023

also fix for BroadcastNestedLoopJoin

2ab7ad7

Signed-off-by: Firestarman <[email protected]>

firestarman changed the title ~~Fix the broadcast join issues caused by InputFileBlockRule[databricks]~~ Fix the broadcast joins issues caused by InputFileBlockRule[databricks] Nov 10, 2023

firestarman added 2 commits November 10, 2023 19:54

add tests

6cb8167

Signed-off-by: Firestarman <[email protected]>

Add tests for broadcast nested join

1da5d52

Signed-off-by: Firestarman <[email protected]>

doc update

24e9b7f

Signed-off-by: Firestarman <[email protected]>

sameerz added the bug Something isn't working label Nov 14, 2023

winningsix previously approved these changes Nov 15, 2023

revans2 reviewed Nov 15, 2023

fix an typo error in test

69f05f1

Signed-off-by: Firestarman <[email protected]>

firestarman dismissed winningsix’s stale review via 69f05f1 November 16, 2023 01:12

Update tests

6907c38

Signed-off-by: Firestarman <[email protected]>

revans2 approved these changes Nov 17, 2023

firestarman merged commit 9ed98c8 into NVIDIA:branch-23.12 Nov 20, 2023
37 checks passed

firestarman deleted the fix-join-inputfile branch November 20, 2023 02:19
