Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for GROUP BY * aggregation #18390

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jul 24, 2023

Description

This syntax allows omitting column positions or names after GROUP BY.
For instance, SELECT name, count(1) FROM GROUP BY * is equivalent to SELECT name, count(1) FROM GROUP BY name

This PR changes ALL as a reserved keyword to distinguish it from column names.

References in other database/query engines

Release notes

(x) Release notes are required, with the following suggested text:

# General
* Add support for `GROUP BY *` aggregation. ({issue}`18390`)

@cla-bot cla-bot bot added the cla-signed label Jul 24, 2023
@github-actions github-actions bot added the docs label Jul 24, 2023
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from fc01df4 to e809697 Compare July 25, 2023 00:47
@github-actions github-actions bot added tests:hive hive Hive connector labels Jul 25, 2023
@ebyhr ebyhr self-assigned this Jul 25, 2023
@ilfrin ilfrin requested a review from martint July 25, 2023 16:26
@electrum
Copy link
Member

This is great. I fully support adding this syntax.

@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch 2 times, most recently from 48bb65e to 36bff45 Compare July 31, 2023 02:34
Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conflicts with the standard syntax for GROUP BY:

 GROUP BY [ <set quantifier> ] <grouping element list>

where <set quantifier> can be ALL or DISTINCT and defaults to ALL if omitted. The quantifier affects the semantics for queries involving grouping sets. Overloading the meaning to indicate how the keys are selected instead of how the rows in the result are de-duplicated is confusing and error prone.

Before we could consider such syntax, we'd need to define the precise semantics and how it interacts and relates to the broader GROUP BY feature.

@martint
Copy link
Member

martint commented Jul 31, 2023

In particular, here are some inconsistencies:

SELECT k, count(*) 
FROM (VALUES 1,2,3) t(k) 
GROUP BY ALL

produces a result, but

SELECT k, count(*) 
FROM (VALUES 1,2,3) t(k) 
GROUP BY

fails with 'k' must be an aggregate expression or appear in GROUP BY clause, even though ALL is implied if missing, per the SQL standard.

SELECT k, count(*) 
FROM (VALUES 1,2,3) t(k) 
GROUP BY DISTINCT

fails with groupingElements must not be empty when type is DISTINCT. Presumably, it should also work. But, what would the meaning of such a query?

Without precise semantics, it's hard to tell what's the behavior of these queries after this change:

SELECt k, count(*) 
FROM (VALUES 1,2,3) t(k) 
GROUP BY ALL k
SELECT k, v, count(*) 
FROM (VALUES (1,1),(2,2),(3,3)) t(k, v) 
GROUP BY ALL
SELECT k, v, count(*) 
FROM (VALUES (1,1),(2,2),(3,3)) t(k, v) 
GROUP BY ALL k
SELECT k, v, count(*) 
FROM (VALUES (1,1),(2,2),(3,3)) t(k, v) 
GROUP BY ALL k, v
SELECT count(*) 
FROM (VALUES 1,2,3) t(k) 
GROUP BY ALL ()

@ebyhr
Copy link
Member Author

ebyhr commented Aug 1, 2023

  1. Since GROUP BY ALL without grouping element list isn't defined in standard, we don't need to default to ALL in a such case and the failure is expected for me.
  2. I don't think it should work. GROUP BY DISTINCT without grouping element list was disallowed even before this change. Also, the standard also disallows it in my understanding. The unclear semantics you mentioned implies the current behavior of throwing an exception is better.
  3. Only 2nd example should be affected in this PR because it's the only example which uses GROUP BY ALL without grouping element list. Other examples should keep the original behavior. This rule is clear to me.

@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 36bff45 to 9153730 Compare August 1, 2023 06:17
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 9153730 to 9882169 Compare August 17, 2023 08:46
@martint
Copy link
Member

martint commented Aug 18, 2023

To be clear, the ALL and DISTINCT qualifiers control whether grouping sets that have the same combination of keys are deduped.

Therefore, the only way this syntax makes sense is if we give meaning to omitting the grouping set specification. Specifically, to be equivalent to having a single grouping set composed of all the expressions in the group by clause that don’t contain aggregations.

In that case, the qualifier is orthogonal to such feature. Allowing one but not the other looks arbitrary and introduces cognitive load for a user who has to understand that they are somehow connected even though intuitively they should not be.

Another aspect that complicates issues conceptually is that the GROUP BY operation occurs before the SELECT clause is computed, so it’s a chicken-and-egg problem to determine which columns are grouping keys and which ones are derived. Also, the GROUP BY clause operates on input columns (those coming from the FROM clause) not on those from the SELECT clause. The implication arrow goes the other way: an expression in the SELECT clause is valid if it’s functionally dependent on the input columns used for computing the grouping sets.

@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 9882169 to 2d40f4a Compare November 1, 2023 01:53
@ebyhr
Copy link
Member Author

ebyhr commented Mar 1, 2024

@martint Thank you for your detailed explanation. Can you suggest alternative syntax? Or we don't want to add this feature?

@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 2d40f4a to 7dc90bb Compare March 14, 2024 01:52
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 7dc90bb to e2b781e Compare March 24, 2024 23:22
@ebyhr ebyhr marked this pull request as ready for review April 17, 2024 22:18
@github-actions github-actions bot added the stale label May 10, 2024
@ebyhr ebyhr added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label May 10, 2024
@trinodb trinodb deleted a comment from github-actions bot May 15, 2024
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch 2 times, most recently from 8890f18 to 0b7ccc1 Compare November 8, 2024 12:02
@ebyhr ebyhr changed the title Add support for GROUP BY ALL aggregation Add support for GROUP BY * aggregation Nov 8, 2024
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 0b7ccc1 to 98f0ff2 Compare November 12, 2024 01:54
@ebyhr ebyhr force-pushed the ebi/core-group-by-all branch from 98f0ff2 to 4353674 Compare January 7, 2025 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs hive Hive connector stale stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. syntax-needs-review
Development

Successfully merging this pull request may close these issues.

4 participants