Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add per algorithm aggregation derived types #7106

Closed
jrhemstad opened this issue Jan 8, 2021 · 9 comments
Closed

[FEA] Add per algorithm aggregation derived types #7106

jrhemstad opened this issue Jan 8, 2021 · 9 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

libcudf uses a common aggregation polymorphic base class for uniformly specifying the requested aggregation to APIs like cudf::reduce, cudf::scan, cudf::groupby, etc. All of these APIs take a base class pointer aggregation* which can then be cast to a more derived type like min_aggregation.

The problem is, not all APIs that take an aggregation* support all possible aggregation kinds. For example, cudf::scan doesn't support a LEAD aggregation as LEAD is specific to rolling. However, there is nothing to tell a user of cudf::scan that attempting to call:

cudf::scan(..., make_lead_aggregation(...));

wouldn't work. The only recourse a user has is documentation (which is currently lacking for aggregations) or attempting to call an invalid combination and (hopefully) getting a runtime error in the form of an exception.

Describe the solution you'd like

We should use the type system to make it more explicit what aggregations are supported by an algorithm like scan or reduce.

One way I've thought of doing this is introducing per-algorithm tag types that can be selectively inherited from depending on if an aggregation is supported by a particular algorithm. For example,

class aggregation {}; 

class scan_aggregation : public virtual aggregation{};

class rolling_aggregation : public virtual aggregation{};

class min_aggregation : public scan_aggregation, public rolling_aggregation{}; // min works in both rolling and scan

class lead_aggregation : public rolling_aggregation {}; // lead only works in rolling

(note the virtual inheritance is important here to avoid the diamond problem. Multiple inheritance can be tricky. See https://isocpp.org/wiki/faq/multiple-inheritance)

We'd need to modify the aggregation factories to specify which base to use:

template <typename Base = aggregation>
std::unique_ptr<Base> make_min_aggregation(){
    return std::make_unique<min_aggregation>();
}

template <typename Base = aggregation>
std::unique_ptr<Base> make_lead_aggregation(){
    return std::make_unique<lead_aggregation>();
}

Using it would then look like:

    scan( *make_min_aggregation<scan_aggregation>() );

    rolling( *make_lead_aggregation<rolling_aggregation>() );

    // scan( *make_lead_aggregation<scan_aggregation>() ); // this won't work because `lead` doesn't derive from `scan_aggregation`

Here's a working example: https://godbolt.org/z/6b8KjY

@jrhemstad jrhemstad added feature request New feature or request Needs Triage Need team to review and classify labels Jan 8, 2021
@jrhemstad jrhemstad added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 8, 2021
@karthikeyann
Copy link
Contributor

Present code offers to throw an error if the aggregation is not supported for the respective algorithm. if support is added, it starts working without change to user of the code.

The suggested solution adds an extra cost for user of libcudf APIs (python and spark), to maintain a matrix of supported operations each in code. This matrix have to be updated simultaneously whenever new support at libcudf is added.
And for dynamically typed languages, it will still have to throw an runtime error. (also needs a documentation)

Which is preferred? getting a runtime error or maintaining a matrix of supported APIs at each user.

@revans2
Copy link
Contributor

revans2 commented Feb 12, 2021

Which is preferred? getting a runtime error or maintaining a matrix of supported APIs at each user.

The lazy part of me says runtime exceptions because we have code that works now and I am too lazy to change it.
From an API standpoint I think it would make for better code quality overall, both for the java cudf API and for the end user's code. However it would require a non-trivial amount of work and it would be a breaking change to the java APIs. I am not worried about the ongoing maintenance burden. Our end users already need to know if an aggregation is going to work or not, and if it is not going to work so it is really shifting that burden to some degree from them to us (although in this case I am also the main customer of the java API so the net work saved is a little questionable).

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Feb 12, 2021

Present code offers to throw an error if the aggregation is not supported for the respective algorithm. if support is added, it starts working without change to user of the code.

There are some aggregations that will never be supported by a particular algorithm and the current hierarchy makes it seem as if they should be supported. This is confusing at best and misleading at worst.

When it's possible to do so, detecting an error at compile time is always preferable to detecting an error at runtime.

The suggested solution adds an extra cost for user of libcudf APIs (python and spark), to maintain a matrix of supported operations each in code.

That's a feature, not a bug. The fact that the current system shares a common, single set of types was never something I was happy about. Users have absolutely no clue about what aggregations are supported by a particular algorithm without checking documentation (which gets stale) or simply trying all the possible aggs and seeing what works. What's more, when new agg support is added to an algorithm, it still requires callers to update their code to know that they can pass the new agg to libcudf.

As aggregations support becomes more complex, the current system is not going to be sustainable.

@revans2
Copy link
Contributor

revans2 commented Feb 16, 2021

As aggregations support becomes more complex, the current system is not going to be sustainable.

I totally agree and the longer we wait the harder it will be to fix it.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jrhemstad
Copy link
Contributor Author

Still relevant.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue May 7, 2021
…ement. (#8052)

Partially addresses  #7106

Fundamentally, this changes the aggregation class hierarchy in the following ways:
- The base `aggregation` class becomes abstract, with the `clone()` and` finalize()` functions being pure virtual.
- Every aggregation type now has a concrete class associated with it, derived from `aggregation`.
- "Intermediate" classes such as `rolling_aggregation` are used to allow individual algorithms to only accept aggregation types that are valid for it (as opposed to enforcing this internally at runtime).

All of the rolling_window interfaces have been updated to take a `rolling_aggregation`.  Other algorithms such as groupby are not yet converted and still take generic `aggregation` objects.

Marking this as Do Not Merge for now since this is a breaking change with immediately implications for Spark.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8052
@nvdbaranec
Copy link
Contributor

Done

@vyasr
Copy link
Contributor

vyasr commented Jul 30, 2021

@nvdbaranec should this issue still be open? It's not quite clear how much of it was addressed in #8052, what's being done now in #8906, and whether there's further work after #8906 is merged (that perhaps should be in a different issue?).

rapids-bot bot pushed a commit that referenced this issue Aug 19, 2021
…e their usage. (#8906)

Followup to #8052
Partially addresses #7106

Adds the `groupby_aggregation` class and forces usage of that type when calling `groupby::aggregate()`

Adds the `groupby_scan_aggregation` class and forces usage of that type when calling `groupby::scan()`

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)

URL: #8906
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

5 participants