
feat: add scalar subquery pushdown to scan #678

Merged

Conversation

parthchandra
Contributor

Which issue does this PR close?

Part of #372 and #551

Rationale for this change

With Spark 4.0, the `SubquerySuite` in Spark fails because Comet scan did not support the scalar subquery feature.

What changes are included in this PR?

Adds support for scalar subquery pushdown into the Comet scan.

How are these changes tested?

Existing Spark SQL unit tests in `SubquerySuite`.

@parthchandra parthchandra marked this pull request as draft July 17, 2024 18:37
@parthchandra
Contributor Author

Currently a DRAFT to make sure CI for older versions passes.

Note that the shims for the older versions had to be refactored, as this required a change specific to Spark 3.3 that differed from the change required for Spark 3.4 and above.

@parthchandra parthchandra marked this pull request as ready for review July 17, 2024 20:37
@parthchandra
Contributor Author

@kazuyukitanimura Ready for your review.

dev/diffs/4.0.0-preview1.diff

protected def isFileSourceConstantMetadataAttribute(attr: Attribute): Boolean = {
  attr.getClass.getName match {
    case "org.apache.spark.sql.catalyst.expressions.FileSourceConstantMetadataAttribute" => true
Contributor

I think for Spark 3.4+ we can do a real class match instead of String?
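The point can be illustrated with a self-contained sketch, using stand-in classes rather than Spark's: matching on the type is checked by the compiler, whereas matching on `getClass.getName` silently returns false on any typo in the string.

```scala
// Self-contained sketch of the suggestion above (stand-in classes, not
// Spark's FileSourceConstantMetadataAttribute).
object ClassMatchSketch {
  sealed trait Attribute
  final case class ConstantMetadataAttribute(name: String) extends Attribute
  final case class PlainAttribute(name: String) extends Attribute

  // Fragile: a typo in the string is a silent bug, not a compile error.
  def byName(attr: Attribute): Boolean =
    attr.getClass.getName.endsWith("ConstantMetadataAttribute")

  // Robust: the compiler checks the case against the sealed hierarchy.
  def byType(attr: Attribute): Boolean = attr match {
    case _: ConstantMetadataAttribute => true
    case _ => false
  }
}
```

The string comparison remains useful only where the class may be absent from the classpath at compile time, which is why a shim per Spark version lets the 3.4+ build use the type-safe form.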

Comment on lines 64 to 71

case 6 =>
  c.newInstance(
    fsRelation.sparkSession,
    readFunction,
    filePartitions,
    readSchema,
    fileConstantMetadataColumns,
    options)
Contributor

(Optional) I think we can remove this reflection because the argument count is always 5 for Spark 3.3.
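For context, a self-contained sketch of what constructor-arity reflection does (`Widget` is a stand-in, not Spark's `FileScanRDD`): a constructor is selected by parameter count at runtime. If a given Spark version only ever exposes one arity, that branch can be replaced by a direct `new` in the version's shim.

```scala
object ArityReflectionSketch {
  // Stand-in class with two constructors of different arity, mimicking an
  // API whose constructor signature changed between versions.
  class Widget(val a: Int, val b: Int) {
    def this(a: Int, b: Int, extra: Int) = this(a, b + extra)
  }

  // Pick a constructor by parameter count and invoke it reflectively,
  // as the snippet above does for FileScanRDD.
  def buildByArity(arity: Int): Widget = {
    val ctor = classOf[Widget].getConstructors.find(_.getParameterCount == arity).get
    val args: Array[AnyRef] = Array.fill(arity)(Int.box(1))
    ctor.newInstance(args: _*).asInstanceOf[Widget]
  }
}
```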

// TODO: remove after dropping Spark 3.3 support and directly call
// QueryExecutionErrors.SparkException
protected def invalidBucketFile(path: String, sparkVersion: String): Throwable = {
  val messageParameters = if (sparkVersion >= "3.4") Map("path" -> path) else Array(path)
Contributor

(Optional) This can be optimized as well, like the `if (sparkVersion >= "3.4")` branch.

@@ -94,7 +95,7 @@ case class CometScanExec(
     val startTime = System.nanoTime()
     val ret =
       relation.location.listFiles(partitionFilters.filterNot(isDynamicPruningFilter), dataFilters)
-    setFilesNumAndSizeMetric(ret, true)
+    setFilesNumAndSizeMetric(collection.immutable.Seq(ret: _*), true)
Contributor

Hmm, what would happen if we do not do this?
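A self-contained sketch of the cross-version issue behind this copy (not Comet's exact types): on Scala 2.12 the default `Seq` alias is `scala.collection.Seq`, while on 2.13 it is `scala.collection.immutable.Seq`, so code cross-compiled for both can receive a general `collection.Seq` where an `immutable.Seq` is required. Copying with `immutable.Seq(xs: _*)` typechecks under either version.

```scala
object SeqCompatSketch {
  import scala.collection.immutable

  // Stand-in for a method whose parameter is pinned to immutable.Seq,
  // as setFilesNumAndSizeMetric is on the 2.13 build.
  def takesImmutable(xs: immutable.Seq[Int]): Int = xs.sum

  def demo(): Int = {
    // A general Seq backed by a mutable buffer, as 2.12-era APIs may return.
    val general: scala.collection.Seq[Int] =
      scala.collection.mutable.ArrayBuffer(1, 2, 3)
    // Without the copy, this call would not typecheck on Scala 2.13.
    takesImmutable(immutable.Seq(general: _*))
  }
}
```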

Comment on lines 160 to 161

private lazy val pushedDownFilters =
  translateToV1Filters(dataFilters, q => convertScalarSubqueryToLiteral(q))
Contributor

Is it possible to define pushedDownFilters in the Shims instead? We can keep the old way for Spark 3.x.

For Spark 4.0, we can then avoid reflection such as convertScalarSubqueryToLiteral.
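A minimal sketch of the shim layout proposed here (trait and object names are illustrative, not Comet's actual classes): common code programs against a trait, and each Spark-version source set supplies one implementation, so the 4.0 build can call new APIs directly instead of going through reflection.

```scala
// Filters are modeled as plain strings to keep the sketch self-contained;
// real code would use data-source filter types.
trait ShimPushedFilters {
  def pushedDownFilters(dataFilters: Seq[String]): Seq[String]
}

// Hypothetical Spark 3.x shim: the pre-existing translation path.
object Spark3xShims extends ShimPushedFilters {
  def pushedDownFilters(dataFilters: Seq[String]): Seq[String] =
    dataFilters.map(f => s"v1($f)")
}

// Hypothetical Spark 4.0 shim: scalar subqueries folded to literals first,
// via a direct call rather than reflection.
object Spark40Shims extends ShimPushedFilters {
  private def convertScalarSubqueryToLiteral(f: String): String = s"lit($f)"

  def pushedDownFilters(dataFilters: Seq[String]): Seq[String] =
    dataFilters.map(f => s"v1(${convertScalarSubqueryToLiteral(f)})")
}
```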

@codecov-commenter

codecov-commenter commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.81%. Comparing base (de8c55e) to head (231d0e5).
Report is 6 commits behind head on main.

Files Patch % Lines
...ala/org/apache/spark/sql/comet/CometScanExec.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #678      +/-   ##
============================================
+ Coverage     33.69%   33.81%   +0.11%     
+ Complexity      840      839       -1     
============================================
  Files           109      109              
  Lines         42527    42527              
  Branches       9343     9343              
============================================
+ Hits          14331    14381      +50     
+ Misses        25245    25186      -59     
- Partials       2951     2960       +9     


@parthchandra parthchandra left a comment

@kazuyukitanimura Completely refactored this, so the additional shim classes for pre-3.5 are gone and the change is now really simple.

dev/diffs/4.0.0-preview1.diff
@kazuyukitanimura kazuyukitanimura left a comment

LGTM

@comphead comphead left a comment

lgtm thanks @parthchandra

@parthchandra
Contributor Author

@kazuyukitanimura @andygrove @comphead Can we merge this?

@kazuyukitanimura kazuyukitanimura merged commit 5806b82 into apache:main Jul 19, 2024
74 checks passed
@kazuyukitanimura
Contributor

Merged, thanks @parthchandra @andygrove @comphead

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024