Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1571] Fix COUNT(*) aggregate queries to use metadata-only optimization for .show() command #1643

Conversation

scottsand-db
Copy link
Collaborator

@scottsand-db scottsand-db commented Mar 9, 2023

Description

Resolves #1571. Previously, metadata-only aggregate pushdown was only working for COUNT(*) queries when you were collecting the result, as opposed to calling .show(). This PR fixes that bug.

How was this patch tested?

Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery.

Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query sql("SELECT COUNT(*) FROM <delta-table>").show()

  • master took ms ~161 seconds.
  • this PR took ~16 seconds. Thus, this is a ~10x improvement.

}
}

protected def getDeltaScanGenerator(index: TahoeLogFileIndex): DeltaScanGenerator

/** Return the number of rows in the table or `None` if we cannot calculate it from stats */
private def extractGlobalCount(tahoeLogFileIndex: TahoeLogFileIndex): Option[Long] = {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied verbatim from below

@scottsand-db scottsand-db self-assigned this Mar 9, 2023
@scottsand-db scottsand-db requested a review from tdas March 9, 2023 21:19
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with few minor test comments.

val showPlans = DeltaTestUtils.withLogicalPlansCaptured(spark, optimizedPlan = true) {
spark.sql(s"SELECT COUNT(*) FROM $testTableName").show()
}
assert(showPlans.collect { case x: LocalRelation => x }.size === 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

allisonport-db pushed a commit to allisonport-db/delta that referenced this pull request Mar 21, 2023
Previously, metadata-only aggregate pushdown was only working for `COUNT(*)` queries when you were collecting the result, as opposed to calling `.show()`. This PR fixes that bug.

Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery.

Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM <delta-table>").show()`
- master took ms ~161 seconds.
- this PR took ~16 seconds. Thus, this is a ~10x improvement.

Resolves delta-io#1571.
Closes delta-io#1643

Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: e266e5d82220ca331e117f202abc6f085a99448c
(cherry picked from commit 48388b9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fix COUNT(*) aggregate pushdown with the .show() command
3 participants