-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Fix COUNT(*)
aggregate pushdown with the .show()
command
#1571
Comments
Can I take this up? |
@fuzzy-memory for sure! |
Should we wait for my other PR #1525? It rewrites most of the query plan transformations. |
@felipepessoto - yes we should. Will get to that soon. Sorry for the delay. |
Hi @felipepessoto - FYI we got some community asks to get this in for 2.3 so I opened up a PR here: #1643 When we go to merge your Min/Max aggregate PR, we can just remove the additions I've added here, so long as that PR also handles the |
Previously, metadata-only aggregate pushdown was only working for `COUNT(*)` queries when you were collecting the result, as opposed to calling `.show()`. This PR fixes that bug. Added a UT that captures the optimized logical plan and checks that it is using the LocalRelation created by OptimizeMetadataOnlyDeltaQuery. Also did a performance test locally. Created a table with 100M rows and 100K files and ran the query `sql("SELECT COUNT(*) FROM <delta-table>").show()` - master took ms ~161 seconds. - this PR took ~16 seconds. Thus, this is a ~10x improvement. Resolves delta-io#1571. Closes delta-io#1643 Signed-off-by: Scott Sandre <[email protected]> GitOrigin-RevId: e266e5d82220ca331e117f202abc6f085a99448c (cherry picked from commit 48388b9)
Bug
Describe the problem
This commit 0c349da8 added support for
COUNT(*)
aggregate pushdown. For example, this commandspark.sql("SELECT COUNT(*) FROM <my-table>").collect()
is able to use just the metadata in the delta log to perform this count query, instead of scanning the parquet files.However, this command
spark.sql("SELECT COUNT(*) FROM <my-table>").show()
does NOT perform the aggregate pushdown. The aggregate plan that is matched inside of OptimizeMetadataOnlyDeltaQuery does not match the plan used during the.show()
command, which performs a CAST.Expected results
COUNT(*)
aggregate pushdown should also occur when you do.show()
, just like how it works with.collect()
.All you need to do is update OptimizeMetadataOnlyDeltaQuery and add a test.
Environment information
The text was updated successfully, but these errors were encountered: