Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark ARRAY_AGG(DISTINCT ...) not implemented #1534

Merged
merged 1 commit into from
Jan 11, 2022

Conversation

james727
Copy link
Contributor

@james727 james727 commented Jan 9, 2022

Which issue does this PR close?

This is the last change required to close #1512

Rationale for this change

Right now array_agg(distinct ...) doesn't work. The physical plan construction logic uses the non-distinct array_agg whether or not distinct was specified. Interestingly enough it still works correctly under certain conditions, due to the SingleDistinctToGroupBy optimizer rule.

As an example, consider the following queries:

--- Works, since logical plan is rewritten with a subquery and non-distinct agg. Will return:
--- +------------------------------------------+
--- | ARRAYAGG(DISTINCT aggregate_test_100.c2) |
--- +------------------------------------------+
--- | [2, 3, 5, 1, 4]                          |
--- +------------------------------------------+
SELECT array_agg(DISTINCT c2) FROM aggregate_test_100;

--- Returns incorrect results, since SingleDistinctToGroupBy optimizer rule does not apply:
--- +--------------------------------------------------------------------------+
--- | ARRAYAGG(DISTINCT aggregate_test_100.c2)      | COUNT(DISTINCT UInt8(1)) |
--- +--------------------------------------------------------------------------+
--- | [2, 5, 1, 1, 5, 4, 3, 3, 1, 4, 1, 4, 3, ...]  | 1                        |
--- +--------------------------------------------------------------------------+
SELECT array_agg(DISTINCT c2), count(distinct 1) FROM aggregate_test_100;

After this change distinct array agg will throw an error when SingleDistinctToGroupBy does not apply.

I'm planning on working on actually implementing distinct array_agg after this, but figured this was worth fixing for now.

What changes are included in this PR?

This marks the aggregation not implemented, and adds a block for testing count/approx distinct/array agg with distinct = true.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 9, 2022
@james727 james727 force-pushed the array_distinct_notimpl branch from 33a5b1a to dda9b5a Compare January 9, 2022 20:02
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @james727

do distinct aggregates make sense for array aggregates? If they make sense could you possibly file at ticket to track implementing them?

@alamb alamb merged commit b05feda into apache:master Jan 11, 2022
@james727 james727 deleted the array_distinct_notimpl branch January 11, 2022 19:50
@james727
Copy link
Contributor Author

@alamb yep - they do! I think #1323 covers it but I can add another ticket and close the set_agg one if easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distinct aggregations work only partially
2 participants