
feat: add scalar subquery pushdown to scan #678

Merged

Conversation

parthchandra
Contributor

Which issue does this PR close?

Part of #372 and #551

Rationale for this change

With Spark 4.0, the `SubquerySuite` in Spark fails because Comet scan did not support the scalar subquery feature.

What changes are included in this PR?

Adds support for scalar subquery pushdown into the Comet scan.

How are these changes tested?

Existing Spark SQL unit tests in `SubquerySuite`.

@parthchandra parthchandra marked this pull request as draft July 17, 2024 18:37
@parthchandra
Contributor Author

Currently a DRAFT to make sure CI for older versions passes.

Note that the shims for the older versions had to be refactored, as this required a change specific to Spark 3.3 that differed from the change required for Spark 3.4 and above.

@parthchandra parthchandra marked this pull request as ready for review July 17, 2024 20:37
@parthchandra
Contributor Author

@kazuyukitanimura Ready for your review.

dev/diffs/4.0.0-preview1.diff

protected def isFileSourceConstantMetadataAttribute(attr: Attribute): Boolean = {
  attr.getClass.getName match {
    case "org.apache.spark.sql.catalyst.expressions.FileSourceConstantMetadataAttribute" => true
Contributor

I think for Spark 3.4+ we can do a real class match instead of String?
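The point can be illustrated with a self-contained sketch, using stand-in classes rather than Spark's: matching on the type is checked by the compiler, whereas matching on `getClass.getName` silently returns false on any typo in the string.

```scala
// Self-contained sketch of the suggestion above (stand-in classes, not
// Spark's FileSourceConstantMetadataAttribute).
object ClassMatchSketch {
  sealed trait Attribute
  final case class ConstantMetadataAttribute(name: String) extends Attribute
  final case class PlainAttribute(name: String) extends Attribute

  // Fragile: a typo in the string is a silent bug, not a compile error.
  def byName(attr: Attribute): Boolean =
    attr.getClass.getName.endsWith("ConstantMetadataAttribute")

  // Robust: the compiler checks the case against the sealed hierarchy.
  def byType(attr: Attribute): Boolean = attr match {
    case _: ConstantMetadataAttribute => true
    case _ => false
  }
}
```

The string comparison remains useful only where the class may be absent from the classpath at compile time, which is why a shim per Spark version lets the 3.4+ build use the type-safe form.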

Comment on lines 64 to 71

case 6 =>
  c.newInstance(
    fsRelation.sparkSession,
    readFunction,
    filePartitions,
    readSchema,
    fileConstantMetadataColumns,
    options)
Contributor

(Optional) I think we can remove this reflection because the argument count is always 5 for Spark 3.3.
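For context, a self-contained sketch of what constructor-arity reflection does (`Widget` is a stand-in, not Spark's `FileScanRDD`): a constructor is selected by parameter count at runtime. If a given Spark version only ever exposes one arity, that branch can be replaced by a direct `new` in the version's shim.

```scala
object ArityReflectionSketch {
  // Stand-in class with two constructors of different arity, mimicking an
  // API whose constructor signature changed between versions.
  class Widget(val a: Int, val b: Int) {
    def this(a: Int, b: Int, extra: Int) = this(a, b + extra)
  }

  // Pick a constructor by parameter count and invoke it reflectively,
  // as the snippet above does for FileScanRDD.
  def buildByArity(arity: Int): Widget = {
    val ctor = classOf[Widget].getConstructors.find(_.getParameterCount == arity).get
    val args: Array[AnyRef] = Array.fill(arity)(Int.box(1))
    ctor.newInstance(args: _*).asInstanceOf[Widget]
  }
}
```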

// TODO: remove after dropping Spark 3.3 support and directly call
// QueryExecutionErrors.SparkException
protected def invalidBucketFile(path: String, sparkVersion: String): Throwable = {
  val messageParameters = if (sparkVersion >= "3.4") Map("path" -> path) else Array(path)
Contributor

(Optional) This can be optimized as well, like the `if (sparkVersion >= "3.4")` branch.

@@ -94,7 +95,7 @@ case class CometScanExec(
     val startTime = System.nanoTime()
     val ret =
       relation.location.listFiles(partitionFilters.filterNot(isDynamicPruningFilter), dataFilters)
-    setFilesNumAndSizeMetric(ret, true)
+    setFilesNumAndSizeMetric(collection.immutable.Seq(ret: _*), true)
Contributor

Hmm, what would happen if we do not do this?
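A self-contained sketch of the cross-version issue behind this copy (not Comet's exact types): on Scala 2.12 the default `Seq` alias is `scala.collection.Seq`, while on 2.13 it is `scala.collection.immutable.Seq`, so code cross-compiled for both can receive a general `collection.Seq` where an `immutable.Seq` is required. Copying with `immutable.Seq(xs: _*)` typechecks under either version.

```scala
object SeqCompatSketch {
  import scala.collection.immutable

  // Stand-in for a method whose parameter is pinned to immutable.Seq,
  // as setFilesNumAndSizeMetric is on the 2.13 build.
  def takesImmutable(xs: immutable.Seq[Int]): Int = xs.sum

  def demo(): Int = {
    // A general Seq backed by a mutable buffer, as 2.12-era APIs may return.
    val general: scala.collection.Seq[Int] =
      scala.collection.mutable.ArrayBuffer(1, 2, 3)
    // Without the copy, this call would not typecheck on Scala 2.13.
    takesImmutable(immutable.Seq(general: _*))
  }
}
```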

Comment on lines 160 to 161

private lazy val pushedDownFilters =
  translateToV1Filters(dataFilters, q => convertScalarSubqueryToLiteral(q))
Contributor

Is it possible to define pushedDownFilters in the Shims instead? We can keep the old way for Spark 3.x.

For Spark 4.0, we can then avoid reflection such as convertScalarSubqueryToLiteral.
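A minimal sketch of the shim layout proposed here (trait and object names are illustrative, not Comet's actual classes): common code programs against a trait, and each Spark-version source set supplies one implementation, so the 4.0 build can call new APIs directly instead of going through reflection.

```scala
// Filters are modeled as plain strings to keep the sketch self-contained;
// real code would use data-source filter types.
trait ShimPushedFilters {
  def pushedDownFilters(dataFilters: Seq[String]): Seq[String]
}

// Hypothetical Spark 3.x shim: the pre-existing translation path.
object Spark3xShims extends ShimPushedFilters {
  def pushedDownFilters(dataFilters: Seq[String]): Seq[String] =
    dataFilters.map(f => s"v1($f)")
}

// Hypothetical Spark 4.0 shim: scalar subqueries folded to literals first,
// via a direct call rather than reflection.
object Spark40Shims extends ShimPushedFilters {
  private def convertScalarSubqueryToLiteral(f: String): String = s"lit($f)"

  def pushedDownFilters(dataFilters: Seq[String]): Seq[String] =
    dataFilters.map(f => s"v1(${convertScalarSubqueryToLiteral(f)})")
}
```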

@codecov-commenter

codecov-commenter commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.81%. Comparing base (de8c55e) to head (231d0e5).
Report is 6 commits behind head on main.

Files Patch % Lines
...ala/org/apache/spark/sql/comet/CometScanExec.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #678      +/-   ##
============================================
+ Coverage     33.69%   33.81%   +0.11%     
+ Complexity      840      839       -1     
============================================
  Files           109      109              
  Lines         42527    42527              
  Branches       9343     9343              
============================================
+ Hits          14331    14381      +50     
+ Misses        25245    25186      -59     
- Partials       2951     2960       +9     


@parthchandra parthchandra left a comment

@kazuyukitanimura Completely refactored this, so the additional shim classes for pre-3.5 are gone and the change is now really simple.

dev/diffs/4.0.0-preview1.diff
@kazuyukitanimura kazuyukitanimura left a comment

LGTM

@comphead comphead left a comment

lgtm thanks @parthchandra

@parthchandra
Contributor Author

@kazuyukitanimura @andygrove @comphead Can we merge this?

@kazuyukitanimura kazuyukitanimura merged commit 5806b82 into apache:main Jul 19, 2024
74 checks passed
@kazuyukitanimura
Contributor

Merged, thanks @parthchandra @andygrove @comphead

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024