Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return empty result for segmented_reduce if input and offsets are both empty #17437

Merged
merged 3 commits into from
Dec 4, 2024

Conversation

davidwendt
Copy link
Contributor

Description

Changes the behavior of cudf::segmented_reduce to return an empty column if both the input and the offsets parameter are empty.
Closes #17433

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. breaking Breaking change labels Nov 25, 2024
@davidwendt davidwendt self-assigned this Nov 25, 2024
if (segmented_values.is_empty() && offsets.empty()) {
return cudf::make_empty_column(output_dtype);
}

CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the early exit is not taken here, we probably need at least two elements in the offsets for the segments to be valid.

Suggested change
CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element.");
CUDF_EXPECTS(offsets.size() > 1, "`offsets` should have at least 2 elements.");

Copy link
Contributor Author

@davidwendt davidwendt Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is a specific test for this case so I'm going to leave this line unchanged.
https://github.com/rapidsai/cudf/pull/17437/files#diff-274951d8b19a7c6bf16db78ac17d129ba021bc5135f3d3acfbb1bc814b37ee14R1046

Copy link
Contributor

@bdice bdice Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we did this? It seems like it would be safer and that existing test would still pass.

  • Return an empty result column if values OR offsets are empty (currently the conjunction is AND)
  • If values are non-empty, require at least two offsets (otherwise indexing the values is impossible)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior requires a non-empty result when the input is empty and the offsets are not empty. This appears to be from a request from Spark (reference: #10556 (comment)).

It seems like a non-empty input with more than one offset is reasonable though may not be necessary to check explicitly. I can look into this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, of course. It is possible to have empty values and non-empty offsets. Sorry, that was an oversight on my part.

Thanks for considering the non-empty input requiring >=2 offsets. It would be nice to make this stricter if we can make a good case for its correctness, but I don’t mean to block on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess. It seems like it should be an invalid input but I don’t want us to go in circles here. A test exists so let’s maintain the status quo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the correct behaviour would be to return one element, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only one offset, it cannot represent a range. Normally we have N+1 offsets to represent N segments because the i-th offset is the (inclusive) start of range i and the (exclusive) end of range i-1.

If we have 1 offset we have at most zero segments, so the test returns an empty result. My claim is that this may be invalid input, because 1 offset cannot define a range, but the user somehow got a single offset. The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.

Copy link
Contributor Author

@davidwendt davidwendt Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the correct behaviour would be to return one element, right?

I'm not sure what you are referring to. I believe the testcases cover the expected outcomes. This PR only changes the specific behavior where both the input and offsets are empty and returns an empty result instead of throwing an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.

This certainly would be a breaking change we could consider in a separate PR.

Copy link

copy-pr-bot bot commented Dec 2, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt davidwendt marked this pull request as ready for review December 3, 2024 16:08
@davidwendt davidwendt requested a review from a team as a code owner December 3, 2024 16:08
@davidwendt davidwendt requested a review from ttnghia December 4, 2024 13:26
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this 🔥

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4505c53 into rapidsai:branch-25.02 Dec 4, 2024
105 checks passed
@davidwendt davidwendt deleted the seg-red-empty branch December 4, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] segmented_reduce does not accept empty offsets
5 participants