Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix COUNT(*) aggregate pushdown with the .show() command #1571

Closed
scottsand-db opened this issue Jan 19, 2023 · 5 comments
Closed

[BUG] Fix COUNT(*) aggregate pushdown with the .show() command #1571

scottsand-db opened this issue Jan 19, 2023 · 5 comments
Assignees
Labels
backport-2.2.x bug Something isn't working good first issue Good for newcomers
Milestone

Comments

@scottsand-db
Copy link
Collaborator

Bug

Describe the problem

This commit 0c349da8 added support for COUNT(*) aggregate pushdown. For example, this command spark.sql("SELECT COUNT(*) FROM <my-table>").collect() is able to use just the metadata in the delta log to perform this count query, instead of scanning the parquet files.

However, this command spark.sql("SELECT COUNT(*) FROM <my-table>").show() does NOT perform the aggregate pushdown. The aggregate plan that is matched inside of OptimizeMetadataOnlyDeltaQuery does not match the plan used during the .show() command, which performs a CAST.

Expected results

COUNT(*) aggregate pushdown should also occur when you do .show(), just like how it works with .collect().

All you need to do is update OptimizeMetadataOnlyDeltaQuery and add a test.

Environment information

  • Delta Lake version: 2.2.0
  • Spark version: 3.3.1
@scottsand-db scottsand-db added bug Something isn't working good first issue Good for newcomers labels Jan 19, 2023
@fuzzy-memory
Copy link
Contributor

Can I take this up?

@scottsand-db
Copy link
Collaborator Author

@fuzzy-memory for sure!

@felipepessoto
Copy link
Contributor

Should we wait for my other PR #1525?

It rewrites most of the query plan transformations.

@scottsand-db
Copy link
Collaborator Author

@felipepessoto - yes we should. Will get to that soon. Sorry for the delay.

@scottsand-db
Copy link
Collaborator Author

Hi @felipepessoto - FYI we got some community asks to get this in for 2.3 so I opened up a PR here: #1643

When we go to merge your Min/Max aggregate PR, we can just remove the additions I've added here, so long as that PR also handles the .show() case.

allisonport-db pushed a commit to allisonport-db/delta that referenced this issue Mar 21, 2023
Previously, metadata-only aggregate pushdown was only working for `COUNT(*)` queries when you were collecting the result, as opposed to calling `.show()`. This PR fixes that bug.

Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery.

Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM <delta-table>").show()`
- master took ms ~161 seconds.
- this PR took ~16 seconds. Thus, this is a ~10x improvement.

Resolves delta-io#1571.
Closes delta-io#1643

Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: e266e5d82220ca331e117f202abc6f085a99448c
(cherry picked from commit 48388b9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-2.2.x bug Something isn't working good first issue Good for newcomers
Projects
None yet
3 participants