Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Optimizations for groupby::scan #8522

Closed
jrhemstad opened this issue Jun 15, 2021 · 5 comments · Fixed by #9754
Closed

[FEA] Optimizations for groupby::scan #8522

jrhemstad opened this issue Jun 15, 2021 · 5 comments · Fixed by #9754
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@jrhemstad
Copy link
Contributor

The groupby object has a parameter allowing a caller to indicate that their keys (and likewise values to aggregate) are already in sorted order such that all values belonging to a particular group are already contiguous.

However, this information is not fully leveraged in the groupby::scan implementation.

Internally, the groupby::scan implementation calls get_grouped_values() to rearrange the values such that all values in the same group are contiguous

detail::sum_scan(
get_grouped_values(), helper.num_groups(stream), helper.group_labels(stream), stream, mr));

This in turn calls the sort_groupby_helper::grouped_values to perform the grouping:

sort_groupby_helper::column_ptr sort_groupby_helper::grouped_values(
column_view const& values, rmm::cuda_stream_view stream, rmm::mr::device_memory_resource* mr)
{
auto gather_map = key_sort_order(stream);
auto grouped_values_table = cudf::detail::gather(table_view({values}),
gather_map,
cudf::out_of_bounds_policy::DONT_CHECK,
cudf::detail::negative_index_policy::NOT_ALLOWED,
stream,
mr);
return std::move(grouped_values_table->release()[0]);
}

Which in turn calls key_sort_order to get the indices of the sorted order of the keys:

column_view sort_groupby_helper::key_sort_order(rmm::cuda_stream_view stream)
{
auto sliced_key_sorted_order = [stream, this]() {
return cudf::detail::slice(this->_key_sorted_order->view(), 0, this->num_keys(stream));
};
if (_key_sorted_order) { return sliced_key_sorted_order(); }
// TODO (dm): optimization. When keys are pre sorted but ignore nulls is true,
// we still want all rows with nulls in the end. Sort is costly, so
// do a copy_if(counting, sorted_order, {bitmask.is_valid(i)})
if (_keys_pre_sorted == sorted::YES) {
_key_sorted_order = make_numeric_column(
data_type(type_to_id<size_type>()), _keys.num_rows(), mask_state::UNALLOCATED, stream);
auto d_key_sorted_order = _key_sorted_order->mutable_view().data<size_type>();
thrust::sequence(rmm::exec_policy(stream),
d_key_sorted_order,
d_key_sorted_order + _key_sorted_order->size(),
0);
return sliced_key_sorted_order();
}
if (_include_null_keys == null_policy::INCLUDE || !cudf::has_nulls(_keys)) { // SQL style
_key_sorted_order = cudf::detail::stable_sorted_order(
_keys,
{},
std::vector<null_order>(_keys.num_columns(), null_order::AFTER),
stream,
rmm::mr::get_current_device_resource());
} else { // Pandas style
// Temporarily prepend the keys table with a column that indicates the
// presence of a null value within a row. This allows moving all rows that
// contain a null value to the end of the sorted order.
auto augmented_keys = table_view({table_view({keys_bitmask_column(stream)}), _keys});
_key_sorted_order = cudf::detail::stable_sorted_order(
augmented_keys,
{},
std::vector<null_order>(_keys.num_columns() + 1, null_order::AFTER),
stream,
rmm::mr::get_current_device_resource());
// All rows with one or more null values are at the end of the resulting sorted order.
}
return sliced_key_sorted_order();
}

Then the values are gathered based on the sorted key order:

auto grouped_values_table = cudf::detail::gather(table_view({values}),
gather_map,
cudf::out_of_bounds_policy::DONT_CHECK,
cudf::detail::negative_index_policy::NOT_ALLOWED,
stream,
mr);

However, the steps of materializing the key_sort_order and performing the gather are completely redundant if the user has already specified that the inputs are already sorted.

I haven't full thought through the solution, but my intuition is that sort_groupby_helper should short circuit when they keys are already sorted and just return a view to the input values.

@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jun 15, 2021
@jrhemstad
Copy link
Contributor Author

This impacts more than just groupby::scan. This optimization will improve any time grouped_values is called when the input keys are already sorted.

@harrism harrism added the Performance Performance related issue label Jun 16, 2021
@PointKernel PointKernel self-assigned this Jun 29, 2021
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@revans2
Copy link
Contributor

revans2 commented Nov 16, 2021

Not sure if this is still an issue or not, but I would love to have groupby::scan go faster for already sorted data. As this is a use case we commonly have in Spark.

@PointKernel
Copy link
Member

Oops, I completely forgot about this issue. Will work on this shortly.

@PointKernel
Copy link
Member

Still desired.

@jrhemstad jrhemstad added the 0 - Backlog In queue waiting for assignment label Nov 16, 2021
rapids-bot bot pushed a commit that referenced this issue Jan 10, 2022
Closes #8522

This PR gets rid of redundant rearranging processes in `groupby::scan` if input values are presorted. Instead of a short circuit in `sort_helper`, it adds an early exit in the scan functor to avoid materializing `sorted_values`/`grouped_values` thus reducing memory footprint. This optimization brings a 1.6x speedup for presorted scan operations.

- Baseline
```
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1388
Groupby/BasicSumScan/10000000/manual_time            8.80 ms         8.81 ms           61
Groupby/BasicSumScan/100000000/manual_time            543 ms          543 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.217 ms        0.236 ms         3319
Groupby/PreSortedSumScan/10000000/manual_time        1.45 ms         1.47 ms          479
Groupby/PreSortedSumScan/100000000/manual_time       14.0 ms         14.0 ms           47
```
- After optimization
```
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1393
Groupby/BasicSumScan/10000000/manual_time            8.81 ms         8.82 ms           60
Groupby/BasicSumScan/100000000/manual_time            546 ms          546 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.129 ms        0.148 ms         5389
Groupby/PreSortedSumScan/10000000/manual_time       0.901 ms        0.921 ms          769
Groupby/PreSortedSumScan/100000000/manual_time       8.68 ms         8.70 ms           74
```

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #9754
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants