Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ROLLUP and CUBE grouping expansion #782

Open
yongchul opened this issue Jan 31, 2025 · 3 comments
Open

Support ROLLUP and CUBE grouping expansion #782

yongchul opened this issue Jan 31, 2025 · 3 comments

Comments

@yongchul
Copy link

yongchul commented Jan 31, 2025

Motivation

ROLLUP and CUBE are established way to specify multiple groupings. In many databases, ROLLUP and CUBE can appear as a grouping expression so that one does not have to tediously expand all the combinations of the grouping expressions (see the survey of state of RDBMS).

In Substrait, Aggregate rel supports grouping sets. ROLLUP and CUBE can be spelled out by expanding all possible combinations in the plan. The only problem is that CUBE will expand to exponential number of groupings, and sending the fully expanded form is not compact and not aligned well with other parts of Substrait, where compact representation is preferred (e.g., anchor and referring by indexes). Also, fully expanded groupings are hard to grasp the actual operation, thus not helping readability.

The proposal is to support ROLLUP and CUBE as an expansion method of Grouping when specified. The default is point, thus backward compatible, and rollup and cube will specify how to expand the specified sequence of grouping expression. This not only prevents exponential expansions in the plan and is easier to understand the higher-level semantic.

State of the Art

Following databases (but not limited to) all support ROLLUP and CUBE as valid expansions of grouping.

Proposed Change

The proposed changes are two fold.

  1. Add an optional grouping expansion enum field to Grouping to describe the type of expansion. If omitted, compatible to prior semantic.
  2. Add repeated field to specify grouping of expressions (e.g., CUBE( (a, b), c, (d, e)) where (a, b) and (d, e) are treated as a unit in expansion).
enum GroupingExpansion {
  // Default.
  Unspecified = 0;

  // The N grouped expressions are expanded to N + 1 groupings.
  ROLLUP = 1;
  // The N grouped expressions are expanded to 2^N  groupings.
  CUBE = 2;
}

message Grouping {
  repeated uint32 expression_references = 2;

  // Denotes which expression references are bundled for ROLLUP and CUBE.
  // 0 means the expression reference is single element.
  // For all other values, expressions with the same non-zero values are treated as one grouping unit. 
  //
  // For example, for groupings [ a, b, c, d, e] and the reference groupings [1, 1, 0, 2, 2] with CUBE expansion,
  // it is effectively CUBE( (a, b), c, (d, e)).
  // If expansion method is unspecified, the field is ignored.
  // If the field have fewer elements than expression_references, 0 is assumed for those missing elements. 
  repeated uint32 expression_reference_groupings = 3;

  // Specifies how the expression_references are expanded.
  optional GroupingExpansion expansion = 4; // default is Unspecified.
}

If we can get to consensus, I am happy to contribute the change!

@yongchul
Copy link
Author

Please let me know what do you think. I want to contribute more to the project! :)

@jacques-n
Copy link
Contributor

jacques-n commented Feb 4, 2025

I haven't spent a lot of time thinking about this but my inclination is that we can already represent both of these concepts in aggregaterel via the use of grouping sets. We're just doing so as explicitly expanded whereas you're requesting an implicitly expanded representation. In Substrait, we're okay having multiple ways to represent an operation. Additionally, we like to maintain context where it might be helpful to execution systems. As such, I'm not against adding abstractions that specifically support cube and rollup. However, if we were to do that, I'm be inclined to just create new relations rather than overloading aggregaterel.

Two questions:

  1. Do you agree that this is about convenience, not expressibility? (E.g. aggregaterel can already express the semantics of cube and rollup.)
  2. If you agree to one, what are your thoughts on keeping these as separate relation(s) as opposed to adding them to aggregate? Pros/Cons?

Part of the reason I argue for separate relations is that the current proposal in the attached PR actually creates things that I don't think really exist. (For example, a combination of groupings of multiple GroupingExpansion types).

@yongchul
Copy link
Author

yongchul commented Feb 4, 2025

@jacques-n thank you for feedback!

  1. Do you agree that this is about convenience, not expressibility? (E.g. aggregaterel can already express the semantics of cube and rollup.)

Yes, I agree about the expressibility. I'm not claiming that CUBE and ROLLUP can't be represented today. The second paragraph of the Motivation states that "the main problem is an exponential expansion of CUBE if you spell out as grouping sets".

  1. If you agree to one, what are your thoughts on keeping these as separate relation(s) as opposed to adding them to aggregate? Pros/Cons?

As stated in state of the art, in many RDBMSes, CUBE and ROLLUP can be specified as a part of group by or grouping sets (i.e., you can mix and match grouping columns, grouping sets, ROLLUP, and CUBE in GROUP BY GROUPING SETS. I wish I have SQL standard at my disposal for this...) So I believe extending "grouping spec", which is really what ROLLUP and CUBE is about, is the right place to specify expansion rather than a dedicated CUBE or ROLLUP rels, which is clearly restricted than what many systems do support.

Part of the reason I argue for separate relations is that the current proposal in the attached PR actually creates things that I don't think really exist. (For example, a combination of groupings of multiple GroupingExpansion types).

As stated above, systems do support these. Please let me know if there are other concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants