[1571] Fix COUNT(*) aggregate queries to use metadata-only optimization for .show() command #1643
}
}

protected def getDeltaScanGenerator(index: TahoeLogFileIndex): DeltaScanGenerator

/** Return the number of rows in the table or `None` if we cannot calculate it from stats */
private def extractGlobalCount(tahoeLogFileIndex: TahoeLogFileIndex): Option[Long] = {
copied verbatim from below
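For readers without the full diff, the contract in the doc comment on `extractGlobalCount` (a row count, or `None` when stats cannot supply one) can be sketched in plain Scala. This is only an illustration under stated assumptions, not the code from this PR: `FileStats` and `GlobalCountSketch` are hypothetical stand-ins, and the sketch assumes each file carries an optional record count.

```scala
// Hypothetical stand-in for a file entry; `numRecords` is None when the file has no stats.
final case class FileStats(path: String, numRecords: Option[Long])

object GlobalCountSketch {
  /** Sum the per-file row counts, or return None if any file lacks stats. */
  def extractGlobalCount(files: Seq[FileStats]): Option[Long] =
    files.foldLeft(Option(0L)) { (acc, f) =>
      // The for-comprehension yields None once either side is None.
      for (total <- acc; n <- f.numRecords) yield total + n
    }

  def main(args: Array[String]): Unit = {
    val complete = Seq(FileStats("part-0", Some(3L)), FileStats("part-1", Some(4L)))
    val partial = complete :+ FileStats("part-2", None)
    println(extractGlobalCount(complete)) // Some(7)
    println(extractGlobalCount(partial))  // None
  }
}
```

Returning `None` instead of a partial sum lets the caller fall back to a regular scan when the stats are incomplete.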
LGTM with a few minor test comments.
core/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala
val showPlans = DeltaTestUtils.withLogicalPlansCaptured(spark, optimizedPlan = true) {
  spark.sql(s"SELECT COUNT(*) FROM $testTableName").show()
}
assert(showPlans.collect { case x: LocalRelation => x }.size === 1)
ditto
core/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Previously, metadata-only aggregate pushdown only worked for `COUNT(*)` queries when you collected the result, as opposed to calling `.show()`. This PR fixes that bug.

Added a UT that captures the optimized logical plan and checks that it uses the LocalRelation created by OptimizeMetadataOnlyDeltaQuery.

Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM <delta-table>").show()`:
- master took ~161 seconds.
- this PR took ~16 seconds.

Thus, this is a ~10x improvement.

Resolves delta-io#1571.

Closes delta-io#1643

Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: e266e5d82220ca331e117f202abc6f085a99448c
(cherry picked from commit 48388b9)
Description
Resolves #1571. Previously, metadata-only aggregate pushdown was only working for `COUNT(*)` queries when you were collecting the result, as opposed to calling `.show()`. This PR fixes that bug.
How was this patch tested?
Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery.
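The shape of that check can be illustrated with a toy plan tree. This is a sketch under stated assumptions, not Spark or Delta code: `Plan`, `LocalRel`, `Scan`, `Agg`, `Wrap`, and `PlanCheckSketch` are hypothetical stand-ins for Catalyst nodes, and `Wrap` stands for whatever operator ends up above the aggregate.

```scala
// Hypothetical stand-ins for logical plan nodes; none of these are Spark classes.
sealed trait Plan { def children: Seq[Plan] }
final case class LocalRel(rows: Seq[Long]) extends Plan { val children: Seq[Plan] = Nil }
final case class Scan(table: String) extends Plan { val children: Seq[Plan] = Nil }
final case class Agg(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
final case class Wrap(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }

object PlanCheckSketch {
  /** Pre-order traversal that keeps every node the partial function accepts. */
  def collect[A](plan: Plan)(pf: PartialFunction[Plan, A]): Seq[A] =
    pf.lift(plan).toSeq ++ plan.children.flatMap(c => collect(c)(pf))

  def main(args: Array[String]): Unit = {
    // Optimized: the aggregate over the scan was replaced by a precomputed local relation.
    val optimized = Wrap(LocalRel(Seq(100L)))
    // Not optimized: the aggregate still reads the table.
    val unoptimized = Wrap(Agg(Scan("t")))
    assert(collect(optimized) { case x: LocalRel => x }.size == 1)
    assert(collect(unoptimized) { case x: LocalRel => x }.isEmpty)
  }
}
```

Counting matching nodes anywhere in the tree, rather than inspecting only the root, keeps the check valid even when another operator sits above the rewritten node.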
Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM <delta-table>").show()`