Return empty result for segmented_reduce if input and offsets are both empty #17437

Merged 3 commits on Dec 4, 2024
6 changes: 6 additions & 0 deletions cpp/src/reductions/segmented/reductions.cpp
@@ -13,6 +13,7 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/aggregation/aggregation.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/reduction.hpp>
@@ -118,6 +119,11 @@ std::unique_ptr<column> segmented_reduce(column_view const& segmented_values,
    CUDF_FAIL(
      "Initial value is only supported for SUM, PRODUCT, MIN, MAX, ANY, and ALL aggregation types");
  }

  if (segmented_values.is_empty() && offsets.empty()) {
    return cudf::make_empty_column(output_dtype);
  }

  CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element.");
Contributor

If the early exit is not taken here, we probably need at least two elements in the offsets for the segments to be valid.

Suggested change
- CUDF_EXPECTS(offsets.size() > 0, "`offsets` should have at least 1 element.");
+ CUDF_EXPECTS(offsets.size() > 1, "`offsets` should have at least 2 elements.");

Contributor Author

@davidwendt, Dec 2, 2024

Looks like there is a specific test for this case so I'm going to leave this line unchanged.
https://github.com/rapidsai/cudf/pull/17437/files#diff-274951d8b19a7c6bf16db78ac17d129ba021bc5135f3d3acfbb1bc814b37ee14R1046

Contributor

@bdice, Dec 3, 2024

What if we did this? It seems like it would be safer and that existing test would still pass.

  • Return an empty result column if values OR offsets are empty (currently the conjunction is AND)
  • If values are non-empty, require at least two offsets (otherwise indexing the values is impossible)

Contributor Author

The current behavior requires a non-empty result when the input is empty and the offsets are not empty. This appears to come from a Spark request (reference: #10556 (comment)).

Requiring more than one offset for a non-empty input seems reasonable, though it may not be necessary to check explicitly. I can look into this one.
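
The empty-input, non-empty-offsets case can be modelled on the host. This is a minimal sketch, not libcudf code: `segmented_sum_model` is a hypothetical name, plain `std::vector`s stand in for columns, and `std::nullopt` stands in for a null row. It assumes that an empty segment reduces to a null row.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model of a segmented SUM; not libcudf code.
// std::nullopt stands in for a null output row.
std::vector<std::optional<int>> segmented_sum_model(std::vector<int> const& values,
                                                    std::vector<std::size_t> const& offsets)
{
  std::vector<std::optional<int>> result;
  // Segment i covers values[offsets[i]] up to, but not including, values[offsets[i + 1]].
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1] && offsets[i + 1] <= values.size());
    if (offsets[i] == offsets[i + 1]) {
      result.push_back(std::nullopt);  // assumption: an empty segment yields a null row
      continue;
    }
    int sum = 0;
    for (std::size_t j = offsets[i]; j < offsets[i + 1]; ++j) { sum += values[j]; }
    result.push_back(sum);
  }
  return result;
}
```

With empty `values` and offsets `{0, 0, 0}`, the model returns two null rows: the result is non-empty even though the input is empty.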

Contributor

Ah, of course. It is possible to have empty values and non-empty offsets. Sorry, that was an oversight on my part.

Thanks for considering the requirement of >=2 offsets for non-empty input. It would be nice to make this stricter if we can make a good case for its correctness, but I don’t mean to block on this.

Contributor

I guess. It seems like it should be an invalid input but I don’t want us to go in circles here. A test exists so let’s maintain the status quo.

Contributor

The correct behaviour would be to return one element, right?

Contributor

If there is only one offset, it cannot represent a range. Normally we have N+1 offsets to represent N segments because the i-th offset is the (inclusive) start of range i and the (exclusive) end of range i-1.

If we have 1 offset we have at most zero segments, so the test returns an empty result. My claim is that this may be invalid input, because 1 offset cannot define a range, but the user somehow got a single offset. The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.
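
The N+1 offsets convention described above can be sketched as follows. `segment_ranges` is a hypothetical helper written for illustration, not a libcudf function; it expands an offsets array into the half-open ranges it describes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of libcudf: expands an offsets array into the
// half-open ranges [begin, end) it describes.
std::vector<std::pair<std::size_t, std::size_t>> segment_ranges(
  std::vector<std::size_t> const& offsets)
{
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  // With zero or one offset the loop body never runs, so there are no segments.
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);  // offsets must be non-decreasing
    ranges.emplace_back(offsets[i], offsets[i + 1]);
  }
  return ranges;
}
```

Under this model, an empty offsets array and a single-offset array both describe zero segments, while offsets `{0, 2, 5}` describe the two ranges `[0, 2)` and `[2, 5)`.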

Contributor Author

@davidwendt, Dec 4, 2024

> the correct behaviour would be to return one element, right?

I'm not sure what you are referring to. I believe the testcases cover the expected outcomes. This PR only changes the specific behavior where both the input and offsets are empty and returns an empty result instead of throwing an error.
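
The order of the checks after this change can be summarised with a small host-side model. This is a sketch under stated assumptions, not the libcudf implementation: `outcome` and `check_inputs` are hypothetical names, sizes replace the column views, and `std::logic_error` stands in for the exception raised by `CUDF_EXPECTS`.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical result tags for the model below; not libcudf types.
enum class outcome { EMPTY_RESULT, REDUCE };

// Hypothetical model of the check order in this PR; not libcudf code.
outcome check_inputs(std::size_t num_values, std::size_t num_offsets)
{
  // New early exit: both inputs empty returns an empty column.
  if (num_values == 0 && num_offsets == 0) { return outcome::EMPTY_RESULT; }
  // Existing check, unchanged: any other input still needs at least one offset.
  if (num_offsets == 0) { throw std::logic_error("`offsets` should have at least 1 element."); }
  // Everything else proceeds to the reduction, including a single offset.
  return outcome::REDUCE;
}
```

Only the `(0, 0)` case changes: it previously reached the offsets check and failed, and now returns an empty result. Non-empty values with empty offsets still raise an error.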

Contributor Author

> The legal inputs under my previous proposal (which I no longer want to push for) would be zero/empty offsets for zero segments, or >=2 offsets for representing >=1 segment.

This certainly would be a breaking change we could consider in a separate PR.


  return cudf::detail::aggregation_dispatcher(
20 changes: 20 additions & 0 deletions cpp/tests/reductions/segmented_reduction_tests.cpp
@@ -1122,6 +1122,26 @@ TEST_F(SegmentedReductionTestUntyped, EmptyInputWithOffsets)
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, expect_bool);
}

TEST_F(SegmentedReductionTestUntyped, EmptyInputEmptyOffsets)
{
  auto const str_empty = cudf::test::strings_column_wrapper{};
  auto const int_empty = cudf::test::fixed_width_column_wrapper<cudf::size_type>{};
  auto result =
    cudf::segmented_reduce(str_empty,
                           cudf::column_view{int_empty},
                           *cudf::make_max_aggregation<cudf::segmented_reduce_aggregation>(),
                           cudf::data_type{cudf::type_id::STRING},
                           cudf::null_policy::EXCLUDE);
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, str_empty);

  result = cudf::segmented_reduce(int_empty,
                                  cudf::column_view{int_empty},
                                  *cudf::make_min_aggregation<cudf::segmented_reduce_aggregation>(),
                                  cudf::data_type{cudf::type_id::INT32},
                                  cudf::null_policy::INCLUDE);
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, int_empty);
}

template <typename T>
struct SegmentedReductionFixedPointTest : public cudf::test::BaseFixture {};
