Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use inputRDD to get outputPartitions in CometScanExec #1162

Merged

Conversation

parthchandra
Copy link
Contributor

outputPartitioning for CometScanExec was incorrectly returning 0 causing unit tests to fail

@parthchandra
Copy link
Contributor Author

@viirya could you review?
Without this unit tests were failing with Can't zip RDDs with unequal numbers of partitions:... (in CometNativeExec.doExecuteColumnar)

@@ -132,7 +132,7 @@ case class CometScanExec(
lazy val bucketedScan: Boolean = wrapped.bucketedScan

override lazy val (outputPartitioning, outputOrdering): (Partitioning, Seq[SortOrder]) =
(wrapped.outputPartitioning, wrapped.outputOrdering)
(UnknownPartitioning(wrapped.inputRDD.getNumPartitions), wrapped.outputOrdering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the output partitioning is when it fails the test? I assume that the outputPartitioning should have correct partition number but not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were seeing zero partitions. Here is some debug output from testing:

firstNonBroadcastPlan = Some((CometScan parquet [_1#6,_2#7] Batched: true, DataFilters: [isnotnull(_1#6)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-a21845bf-72dc-49d1-94f0-f785bb6f2f18], PartitionFilters: [], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:int,_2:int>
,0))

firstNonBroadcastPlanNumPartitions = Some(0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. wrapped.outputPartitioning was zero

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somehow weird. It means the outputPartitioning info is not correct in original ScanExec and not matched the output RDD's partition number. 🤔

@@ -132,7 +132,7 @@ case class CometScanExec(
lazy val bucketedScan: Boolean = wrapped.bucketedScan

override lazy val (outputPartitioning, outputOrdering): (Partitioning, Seq[SortOrder]) =
(wrapped.outputPartitioning, wrapped.outputOrdering)
(UnknownPartitioning(wrapped.inputRDD.getNumPartitions), wrapped.outputOrdering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I think we should keep original partition instead a hard-coded UnknownPartitioning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this based on what we do in CometNativeScanExec as well as the original output partition. The original wrapped.outputPartitioning for CometScanExec inherits from FileSourceScanLike and this returns a hardcoded UnknownPartitioning(0) for non-bucketed scan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@andygrove andygrove merged commit b63570b into apache:comet-parquet-exec Dec 11, 2024
14 of 75 checks passed
andygrove added a commit that referenced this pull request Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants