[FEA] Add per algorithm `aggregation` derived types #7106

jrhemstad · 2021-01-08T22:05:10Z

Is your feature request related to a problem? Please describe.

libcudf uses a common aggregation polymorphic base class for uniformly specifying the requested aggregation to APIs like cudf::reduce, cudf::scan, cudf::groupby, etc. All of these APIs take a base class pointer aggregation* which can then be cast to a more derived type like min_aggregation.

The problem is, not all APIs that take an aggregation* support all possible aggregation kinds. For example, cudf::scan doesn't support a LEAD aggregation as LEAD is specific to rolling. However, there is nothing to tell a user of cudf::scan that attempting to call:

cudf::scan(..., make_lead_aggregation(...));

wouldn't work. The only recourse a user has is documentation (which is currently lacking for aggregations) or attempting to call an invalid combination and (hopefully) getting a runtime error in the form of an exception.

Describe the solution you'd like

We should use the type system to make it more explicit what aggregations are supported by an algorithm like scan or reduce.

One way I've thought of doing this is introducing per-algorithm tag types that can be selectively inherited from depending on if an aggregation is supported by a particular algorithm. For example,

class aggregation {}; 

class scan_aggregation : public virtual aggregation{};

class rolling_aggregation : public virtual aggregation{};

class min_aggregation : public scan_aggregation, public rolling_aggregation{}; // min works in both rolling and scan

class lead_aggregation : public rolling_aggregation {}; // lead only works in rolling

(note the virtual inheritance is important here to avoid the diamond problem. Multiple inheritance can be tricky. See https://isocpp.org/wiki/faq/multiple-inheritance)

We'd need to modify the aggregation factories to specify which base to use:

template <typename Base = aggregation>
std::unique_ptr<Base> make_min_aggregation(){
    return std::make_unique<min_aggregation>();
}

template <typename Base = aggregation>
std::unique_ptr<Base> make_lead_aggregation(){
    return std::make_unique<lead_aggregation>();
}

Using it would then look like:

    scan( *make_min_aggregation<scan_aggregation>() );

    rolling( *make_lead_aggregation<rolling_aggregation>() );

    // scan( *make_lead_aggregation<scan_aggregation>() ); // this won't work because `lead` doesn't derive from `scan_aggregation`

Here's a working example: https://godbolt.org/z/6b8KjY

The text was updated successfully, but these errors were encountered:

karthikeyann · 2021-02-12T13:57:27Z

Present code offers to throw an error if the aggregation is not supported for the respective algorithm. if support is added, it starts working without change to user of the code.

The suggested solution adds an extra cost for user of libcudf APIs (python and spark), to maintain a matrix of supported operations each in code. This matrix have to be updated simultaneously whenever new support at libcudf is added.
And for dynamically typed languages, it will still have to throw an runtime error. (also needs a documentation)

Which is preferred? getting a runtime error or maintaining a matrix of supported APIs at each user.

revans2 · 2021-02-12T15:00:01Z

Which is preferred? getting a runtime error or maintaining a matrix of supported APIs at each user.

The lazy part of me says runtime exceptions because we have code that works now and I am too lazy to change it.
From an API standpoint I think it would make for better code quality overall, both for the java cudf API and for the end user's code. However it would require a non-trivial amount of work and it would be a breaking change to the java APIs. I am not worried about the ongoing maintenance burden. Our end users already need to know if an aggregation is going to work or not, and if it is not going to work so it is really shifting that burden to some degree from them to us (although in this case I am also the main customer of the java API so the net work saved is a little questionable).

jrhemstad · 2021-02-12T17:07:33Z

Present code offers to throw an error if the aggregation is not supported for the respective algorithm. if support is added, it starts working without change to user of the code.

There are some aggregations that will never be supported by a particular algorithm and the current hierarchy makes it seem as if they should be supported. This is confusing at best and misleading at worst.

When it's possible to do so, detecting an error at compile time is always preferable to detecting an error at runtime.

The suggested solution adds an extra cost for user of libcudf APIs (python and spark), to maintain a matrix of supported operations each in code.

That's a feature, not a bug. The fact that the current system shares a common, single set of types was never something I was happy about. Users have absolutely no clue about what aggregations are supported by a particular algorithm without checking documentation (which gets stale) or simply trying all the possible aggs and seeing what works. What's more, when new agg support is added to an algorithm, it still requires callers to update their code to know that they can pass the new agg to libcudf.

As aggregations support becomes more complex, the current system is not going to be sustainable.

revans2 · 2021-02-16T17:03:52Z

As aggregations support becomes more complex, the current system is not going to be sustainable.

I totally agree and the longer we wait the harder it will be to fix it.

github-actions · 2021-03-18T17:32:45Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jrhemstad · 2021-03-18T17:47:15Z

Still relevant.

github-actions · 2021-04-17T19:07:53Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

…ement. (#8052) Partially addresses #7106 Fundamentally, this changes the aggregation class hierarchy in the following ways: - The base `aggregation` class becomes abstract, with the `clone()` and` finalize()` functions being pure virtual. - Every aggregation type now has a concrete class associated with it, derived from `aggregation`. - "Intermediate" classes such as `rolling_aggregation` are used to allow individual algorithms to only accept aggregation types that are valid for it (as opposed to enforcing this internally at runtime). All of the rolling_window interfaces have been updated to take a `rolling_aggregation`. Other algorithms such as groupby are not yet converted and still take generic `aggregation` objects. Marking this as Do Not Merge for now since this is a breaking change with immediately implications for Spark. Authors: - https://github.com/nvdbaranec Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Mike Wilson (https://github.com/hyperbolic2346) - Vyas Ramasubramani (https://github.com/vyasr) - Ashwin Srinath (https://github.com/shwina) URL: #8052

nvdbaranec · 2021-05-18T15:33:39Z

Done

vyasr · 2021-07-30T18:26:05Z

@nvdbaranec should this issue still be open? It's not quite clear how much of it was addressed in #8052, what's being done now in #8906, and whether there's further work after #8906 is merged (that perhaps should be in a different issue?).

…e their usage. (#8906) Followup to #8052 Partially addresses #7106 Adds the `groupby_aggregation` class and forces usage of that type when calling `groupby::aggregate()` Adds the `groupby_scan_aggregation` class and forces usage of that type when calling `groupby::scan()` Authors: - https://github.com/nvdbaranec Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Jake Hemstad (https://github.com/jrhemstad) - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) - Nghia Truong (https://github.com/ttnghia) - Devavret Makkar (https://github.com/devavret) URL: #8906

jrhemstad added feature request New feature or request Needs Triage Need team to review and classify labels Jan 8, 2021

jrhemstad added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 8, 2021

jrhemstad mentioned this issue Jan 26, 2021

Implement COLLECT rolling window aggregation #7189

Merged

harrism mentioned this issue Mar 3, 2021

[FEA] Refactor aggregation APIs #7456

Closed

github-actions bot added the inactive-30d label Mar 18, 2021

github-actions bot removed the inactive-30d label Mar 18, 2021

karthikeyann mentioned this issue Apr 16, 2021

Add groupby scan aggregation to cudf #7759

Merged

3 tasks

github-actions bot added the inactive-30d label Apr 17, 2021

jrhemstad removed the inactive-30d label Apr 17, 2021

jrhemstad assigned nvdbaranec Apr 17, 2021

nvdbaranec mentioned this issue Apr 24, 2021

Change aggregation class hierarchy to allow per-algorithm type enforcement. #8052

Merged

nvdbaranec closed this as completed May 18, 2021

nvdbaranec mentioned this issue Jul 29, 2021

Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. #8906

Merged

jrhemstad mentioned this issue Feb 18, 2022

Support segmented reductions and null mask reductions #9621

Merged

bdice mentioned this issue Feb 26, 2022

Add scan_aggregation and reduce_aggregation derived types. #10357

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Add per algorithm `aggregation` derived types #7106

[FEA] Add per algorithm `aggregation` derived types #7106

jrhemstad commented Jan 8, 2021

karthikeyann commented Feb 12, 2021

revans2 commented Feb 12, 2021

jrhemstad commented Feb 12, 2021 •

edited

Loading

revans2 commented Feb 16, 2021

github-actions bot commented Mar 18, 2021

jrhemstad commented Mar 18, 2021

github-actions bot commented Apr 17, 2021

nvdbaranec commented May 18, 2021

vyasr commented Jul 30, 2021

[FEA] Add per algorithm aggregation derived types #7106

[FEA] Add per algorithm aggregation derived types #7106

Comments

jrhemstad commented Jan 8, 2021

karthikeyann commented Feb 12, 2021

revans2 commented Feb 12, 2021

jrhemstad commented Feb 12, 2021 • edited Loading

revans2 commented Feb 16, 2021

github-actions bot commented Mar 18, 2021

jrhemstad commented Mar 18, 2021

github-actions bot commented Apr 17, 2021

nvdbaranec commented May 18, 2021

vyasr commented Jul 30, 2021

[FEA] Add per algorithm `aggregation` derived types #7106

[FEA] Add per algorithm `aggregation` derived types #7106

jrhemstad commented Feb 12, 2021 •

edited

Loading