[FEA] Optimizations for groupby::scan #8522

jrhemstad · 2021-06-15T19:15:49Z

The groupby object has a parameter allowing a caller to indicate that their keys (and likewise values to aggregate) are already in sorted order such that all values belonging to a particular group are already contiguous.

However, this information is not fully leveraged in the groupby::scan implementation.

Internally, the groupby::scan implementation calls get_grouped_values() to rearrange the values such that all values in the same group are contiguous

cudf/cpp/src/groupby/sort/scan.cpp

Lines 74 to 75 in 6728c75

    
           detail::sum_scan( 
        
             get_grouped_values(), helper.num_groups(stream), helper.group_labels(stream), stream, mr));

This in turn calls the sort_groupby_helper::grouped_values to perform the grouping:

cudf/cpp/src/groupby/sort/sort_helper.cu

Lines 289 to 302 in 6728c75

    
           sort_groupby_helper::column_ptr sort_groupby_helper::grouped_values( 
        
             column_view const& values, rmm::cuda_stream_view stream, rmm::mr::device_memory_resource* mr) 
        
           { 
        
             auto gather_map = key_sort_order(stream); 
        
             auto grouped_values_table = cudf::detail::gather(table_view({values}), 
        
                                                              gather_map, 
        
                                                              cudf::out_of_bounds_policy::DONT_CHECK, 
        
                                                              cudf::detail::negative_index_policy::NOT_ALLOWED, 
        
                                                              stream, 
        
                                                              mr); 
        
             return std::move(grouped_values_table->release()[0]); 
        
           }

Which in turn calls key_sort_order to get the indices of the sorted order of the keys:

cudf/cpp/src/groupby/sort/sort_helper.cu

Lines 107 to 157 in 6728c75

    
           column_view sort_groupby_helper::key_sort_order(rmm::cuda_stream_view stream) 
        
           { 
        
             auto sliced_key_sorted_order = [stream, this]() { 
        
               return cudf::detail::slice(this->_key_sorted_order->view(), 0, this->num_keys(stream)); 
        
             }; 
        
             if (_key_sorted_order) { return sliced_key_sorted_order(); } 
        
             // TODO (dm): optimization. When keys are pre sorted but ignore nulls is true, 
        
             //            we still want all rows with nulls in the end. Sort is costly, so 
        
             //            do a copy_if(counting, sorted_order, {bitmask.is_valid(i)}) 
        
             if (_keys_pre_sorted == sorted::YES) { 
        
               _key_sorted_order = make_numeric_column( 
        
                 data_type(type_to_id<size_type>()), _keys.num_rows(), mask_state::UNALLOCATED, stream); 
        
               auto d_key_sorted_order = _key_sorted_order->mutable_view().data<size_type>(); 
        
               thrust::sequence(rmm::exec_policy(stream), 
        
                                d_key_sorted_order, 
        
                                d_key_sorted_order + _key_sorted_order->size(), 
        
                                0); 
        
               return sliced_key_sorted_order(); 
        
             } 
        
             if (_include_null_keys == null_policy::INCLUDE || !cudf::has_nulls(_keys)) {  // SQL style 
        
               _key_sorted_order = cudf::detail::stable_sorted_order( 
        
                 _keys, 
        
                 {}, 
        
                 std::vector<null_order>(_keys.num_columns(), null_order::AFTER), 
        
                 stream, 
        
                 rmm::mr::get_current_device_resource()); 
        
             } else {  // Pandas style 
        
               // Temporarily prepend the keys table with a column that indicates the 
        
               // presence of a null value within a row. This allows moving all rows that 
        
               // contain a null value to the end of the sorted order. 
        
               auto augmented_keys = table_view({table_view({keys_bitmask_column(stream)}), _keys}); 
        
               _key_sorted_order = cudf::detail::stable_sorted_order( 
        
                 augmented_keys, 
        
                 {}, 
        
                 std::vector<null_order>(_keys.num_columns() + 1, null_order::AFTER), 
        
                 stream, 
        
                 rmm::mr::get_current_device_resource()); 
        
               // All rows with one or more null values are at the end of the resulting sorted order. 
        
             } 
        
             return sliced_key_sorted_order(); 
        
           }

Then the values are gathered based on the sorted key order:

cudf/cpp/src/groupby/sort/sort_helper.cu

Lines 294 to 299 in 6728c75

    
           auto grouped_values_table = cudf::detail::gather(table_view({values}), 
        
                                                            gather_map, 
        
                                                            cudf::out_of_bounds_policy::DONT_CHECK, 
        
                                                            cudf::detail::negative_index_policy::NOT_ALLOWED, 
        
                                                            stream, 
        
                                                            mr);

However, the steps of materializing the key_sort_order and performing the gather are completely redundant if the user has already specified that the inputs are already sorted.

I haven't full thought through the solution, but my intuition is that sort_groupby_helper should short circuit when they keys are already sorted and just return a view to the input values.

The text was updated successfully, but these errors were encountered:

jrhemstad · 2021-06-15T19:17:15Z

This impacts more than just groupby::scan. This optimization will improve any time grouped_values is called when the input keys are already sorted.

github-actions · 2021-11-15T21:03:38Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

revans2 · 2021-11-16T19:35:56Z

Not sure if this is still an issue or not, but I would love to have groupby::scan go faster for already sorted data. As this is a use case we commonly have in Spark.

PointKernel · 2021-11-16T19:38:10Z

Oops, I completely forgot about this issue. Will work on this shortly.

PointKernel · 2021-11-16T19:39:12Z

Still desired.

Closes #8522 This PR gets rid of redundant rearranging processes in `groupby::scan` if input values are presorted. Instead of a short circuit in `sort_helper`, it adds an early exit in the scan functor to avoid materializing `sorted_values`/`grouped_values` thus reducing memory footprint. This optimization brings a 1.6x speedup for presorted scan operations. - Baseline ``` ----------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------- Groupby/BasicSumScan/1000000/manual_time 0.455 ms 0.472 ms 1388 Groupby/BasicSumScan/10000000/manual_time 8.80 ms 8.81 ms 61 Groupby/BasicSumScan/100000000/manual_time 543 ms 543 ms 1 Groupby/PreSortedSumScan/1000000/manual_time 0.217 ms 0.236 ms 3319 Groupby/PreSortedSumScan/10000000/manual_time 1.45 ms 1.47 ms 479 Groupby/PreSortedSumScan/100000000/manual_time 14.0 ms 14.0 ms 47 ``` - After optimization ``` ----------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------- Groupby/BasicSumScan/1000000/manual_time 0.455 ms 0.472 ms 1393 Groupby/BasicSumScan/10000000/manual_time 8.81 ms 8.82 ms 60 Groupby/BasicSumScan/100000000/manual_time 546 ms 546 ms 1 Groupby/PreSortedSumScan/1000000/manual_time 0.129 ms 0.148 ms 5389 Groupby/PreSortedSumScan/10000000/manual_time 0.901 ms 0.921 ms 769 Groupby/PreSortedSumScan/100000000/manual_time 8.68 ms 8.70 ms 74 ``` Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) URL: #9754

jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Jun 15, 2021

harrism added the Performance Performance related issue label Jun 16, 2021

PointKernel self-assigned this Jun 29, 2021

github-actions bot added the inactive-90d label Nov 15, 2021

PointKernel removed the inactive-90d label Nov 16, 2021

jrhemstad added the 0 - Backlog In queue waiting for assignment label Nov 16, 2021

PointKernel mentioned this issue Nov 22, 2021

Optimize groupby::scan #9754

Merged

rapids-bot bot closed this as completed in #9754 Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Optimizations for groupby::scan #8522

[FEA] Optimizations for groupby::scan #8522

jrhemstad commented Jun 15, 2021

jrhemstad commented Jun 15, 2021

github-actions bot commented Nov 15, 2021

revans2 commented Nov 16, 2021

PointKernel commented Nov 16, 2021

PointKernel commented Nov 16, 2021

[FEA] Optimizations for groupby::scan #8522

[FEA] Optimizations for groupby::scan #8522

Comments

jrhemstad commented Jun 15, 2021

jrhemstad commented Jun 15, 2021

github-actions bot commented Nov 15, 2021

revans2 commented Nov 16, 2021

PointKernel commented Nov 16, 2021

PointKernel commented Nov 16, 2021