Update sort groupby to use non-atomic operation #9035

karthikeyann · 2021-08-13T19:17:30Z

This PR replaces update_target_element with reduce_by_key in the sort groupby reduce_functor (to allow decimal128 sort groupby).

Operations updated are

SUM
PRODUCT
MIN
MAX
ARGMIN
ARGMAX

Compilation time increased from 1m18s to ~~3m28s~~ 1m27s.
~~With major compilation time taking 184s for group_argmin.cu, group_argmax.cu each. (now trying to reduce this time)~~ Reduced the compile time of group_argmin.cu and group_argmax.cu to 70s each.
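
As a rough illustration of why a reduce-by-key fits a sort-based groupby, here is a host-only C++ sketch. It is not the libcudf code: `reduce_by_key_sorted` is a hypothetical helper that mimics the behaviour of a reduce-by-key over pre-sorted group labels, where equal labels are adjacent and each group's result is written once, so no atomic update of a shared output slot is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host stand-in for a reduce-by-key over sorted group labels.
// Consecutive rows with the same label are folded together with `op`.
template <typename T, typename BinaryOp>
std::pair<std::vector<int>, std::vector<T>> reduce_by_key_sorted(
  std::vector<int> const& group_labels, std::vector<T> const& values, BinaryOp op)
{
  assert(group_labels.size() == values.size());
  std::vector<int> out_keys;
  std::vector<T> out_vals;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!out_keys.empty() && out_keys.back() == group_labels[i]) {
      // Same group as the previous row: fold into the running result.
      out_vals.back() = op(out_vals.back(), values[i]);
    } else {
      // First row of a new group: start a new output slot.
      out_keys.push_back(group_labels[i]);
      out_vals.push_back(values[i]);
    }
  }
  return {out_keys, out_vals};
}
```

Passing a sum, product, min, or max functor as `op` gives the corresponding aggregation; the element type only needs to support that operator, which is presumably what makes this route usable for types without atomic support.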

codecov · 2021-08-13T21:05:42Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4d8e401).
The diff coverage is n/a.

❗ Current head 90333da differs from pull request most recent head 4d43263. Consider uploading reports for the commit 4d43263 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9035   +/-   ##
===============================================
  Coverage                ?   10.83%           
===============================================
  Files                   ?      114           
  Lines                   ?    19098           
  Branches                ?        0           
===============================================
  Hits                    ?     2070           
  Misses                  ?    17028           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d8e401...4d43263.

rgsl888prabhu

Adding test cases for dictionary column would be beneficial.

Rest looks good.

karthikeyann · 2021-08-16T12:54:17Z

rerun tests

jrhemstad · 2021-08-16T21:05:33Z

cpp/include/cudf/utilities/output_writer_iterator.cuh

+ * @tparam Iterator iterator type that acts as index of the output.
+ */
+template <typename BinaryFunction, typename Iterator>
+class output_writer_iterator


This looks like a lot of machinery and I'm not clear about what its purpose is.

I first created this to test whether I could set the element and call set_valid at the same time using a single reduce_by_key with thrust::optional. But using thrust::optional was always slow.
So I reverted to using two reduce_by_key calls: one for the element, another for the null_mask.

Using reduce_by_key with the null_mask needs a temporary bool buffer and valid_if.
To avoid this, I used this transform_output_writer_iterator.
In any case, this is purely to avoid allocating the temporary bool buffer; it doesn't affect performance much.
I will revert to using a temporary bool buffer and remove this iterator. (I thought there might be other use cases for this iterator in cudf, especially with null_mask.)

This machinery is the bare minimum to use a proxy object for assignments in thrust. The question is if the proxy object is useful. If it is, then the machinery is just overhead.

A better name, imho, would be assignment_iterator

From what I see the use case is overriding the assignment operation. In an output iterator, the assignment operator is used to take the value on the rhs and assign it to the lhs. Here, we intercept that assignment operation and call a binary operator binop(lhs, rhs) that can override the assignment operation.

This PR uses the assignment_iterator with a lambda that captures a null mask. The proxy intercepts the lhs int and a rhs bool and then invokes the lambda which calls set_valid or set_null on the captured mask. It is the output version of make_validity_iterator.

I think we should be more specific here. I would like to see a make_validity_output_iterator that is very focused on dealing with null masks of a particular column. With an interface like this, we can start to experiment with opportunistic coalescing of null mask assignments using cooperative groups. Also, this API makes it very obvious that we are doing individual bit assignments and that coalescing them by changing the calling code could prove more performant.

auto make_validity_output_iterator(mutable_column_device_view const& destination);

The implementation may well use the proxy-based iterator you've created, but there would be a very clear use case for it, and other developers will have an easy to use API for that use case.

@elstehle IIRC you've used thrust a lot... have you ever run into an instance where this sort of assignment interception would be useful?

Sorry, I don't have a concrete use case in the back of my head right now

@jrhemstad I'm still confused. Why exactly is cooperative groups not an option here? I was thinking we could try coalesced_threads(), followed by labeled_partition() and reduce(), then use thread_rank() == 0 to indicate which threads should write the output, and use atomics to make sure uncoordinated writes don't interfere with one another. I doubt this will be fast in the majority of use cases, but I am not sure why it would not work. It would be an experiment to see if we can speed up the best-case scenario, where all/most writes are sequential.

> I was under the impression cooperative groups was able to detect which threads were active

They can. Determining which threads are active is not the problem.

The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word. For example, if any two threads in a grid want to update bits i and i+1, how do you detect that scenario without some form of communication? coalesced_threads doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.

Furthermore, even if the two threads are both active at the same time in the same warp (which you could never rely on), how do you detect that t0 wants to update bit i and t1 wants to update bit i+1, when it could just as well be that t0 updates i and t1 updates i + 1042?

That's where @karthikeyann's proxy class comes into play. It knows the index and the value, and the lambda is responsible for using that index and value to assign the appropriate bit to a captured null mask. It appears we have enough information to attempt opportunistic concurrency here. I'm not saying it will be beneficial, only that it seems possible and might be worth an experiment.

> The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word.

The proxy class has the index, and we can pass this to labeled_partition.

> For example, if any two threads in a grid want to update bits i and i+1, how do you detect that scenario without some form of communication?

For the sake of the experiment, the communication would be limited to 32 threads within a warp. Communicating across warps would require shmem, and since we're in a lambda called by thrust, I'm not sure if/how that would work.

> coalesced_threads doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.

coalesced_threads is just to prevent UB/hang, labeled_partition would be responsible for determining what writes can be coalesced by using idx/32 as the label (with some offset, if necessary).
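
To make the idx/32 labelling idea concrete, here is a host-only C++ simulation, offered as a sketch rather than device code. It does not call cooperative groups; a `std::map` keyed by idx/32 plays the role of the partition, and the names `bit_write`, `word_update`, and `apply_coalesced` are hypothetical. It assumes each bit index is written at most once per "warp".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One pending validity write from one simulated thread.
struct bit_write {
  std::size_t index;
  bool valid;
};

// Bits to set and bits to clear within a single 32-bit mask word.
struct word_update {
  std::uint32_t set_bits   = 0;
  std::uint32_t clear_bits = 0;
};

// Applies up to 32 writes (one simulated warp) and returns how many word-level
// updates were issued. Writes whose indices share idx / 32 are merged first.
std::size_t apply_coalesced(std::vector<bit_write> const& warp_writes,
                            std::vector<std::uint32_t>& mask)
{
  assert(warp_writes.size() <= 32);
  std::map<std::size_t, word_update> per_word;  // label = idx / 32
  for (bit_write const& w : warp_writes) {
    std::uint32_t const bit = std::uint32_t{1} << (w.index % 32);
    word_update& u          = per_word[w.index / 32];
    if (w.valid) { u.set_bits |= bit; } else { u.clear_bits |= bit; }
  }
  for (auto const& entry : per_word) {
    // On a GPU, one thread per partition would issue this update, using
    // atomics because other warps may touch the same word.
    std::uint32_t& word = mask[entry.first];
    word                = (word | entry.second.set_bits) & ~entry.second.clear_bits;
  }
  return per_word.size();
}
```

In the best case (32 sequential indices aligned to a word) 32 bit writes collapse into one word update; in the scattered case, such as i and i + 1042, nothing is merged and the grouping is pure overhead, which matches the caveats raised in the thread.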

jrhemstad · 2021-08-16T21:06:16Z

Did you run any benchmarks before/after this?

karthikeyann · 2021-08-17T20:21:03Z

Benchmark Comparison: (Time in ms)

Benchmark                                 Time(%)    CPU(%)    Time Old    Time New    CPU Old   CPU New(ms)
------------------------------------------------------------------------------------------------------------
Groupby/PreSorted/1000000/manual_time    -0.7131    -0.6977        0.68        0.19       0.69          0.21
Groupby/PreSorted/10000000/manual_time   -0.8705    -0.8687        6.95        0.90       6.97          0.92
Groupby/PreSorted/100000000/manual_time  -0.8864    -0.8862       70.49        8.01      70.51          8.03

jrhemstad · 2021-08-18T18:48:19Z

cpp/src/groupby/sort/group_single_pass_reduction_util.cuh

+struct null_as_sentinel {
+  column_device_view const col;
+  size_type const SENTINEL;
+  __device__ size_type operator()(size_type i) const { return col.is_null(i) ? SENTINEL : i; }
+};


Can't null_replacement_iterator be used instead?

cudf/cpp/include/cudf/detail/iterator.cuh

Line 162 in dfe0a03

auto make_null_replacement_iterator(column_device_view const& column,

It can't: null_replacement_iterator returns the values of the column. Here, indices are needed.
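
The distinction between replacing values and replacing indices can be sketched on the host as follows. This is an assumption-laden illustration, not the PR's code: `null_as_sentinel_sketch` mirrors the shape of the functor quoted above, while `argmin_ignoring_nulls` and the sentinel value -1 are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Host stand-in for the device functor: a null row maps to the sentinel,
// any other row maps to its own index.
struct null_as_sentinel_sketch {
  std::vector<bool> const& null_flags;
  int sentinel;
  int operator()(int i) const { return null_flags[i] ? sentinel : i; }
};

// ARGMIN over rows [begin, end) of one group: values are compared, but the
// result is a row index. Returns the sentinel if every row is null.
int argmin_ignoring_nulls(std::vector<int> const& values,
                          std::vector<bool> const& null_flags,
                          int begin,
                          int end,
                          int sentinel = -1)
{
  assert(values.size() == null_flags.size());
  null_as_sentinel_sketch to_index{null_flags, sentinel};
  int best = sentinel;
  for (int i = begin; i < end; ++i) {
    int const idx = to_index(i);
    if (idx == sentinel) { continue; }  // null rows never win
    if (best == sentinel || values[idx] < values[best]) { best = idx; }
  }
  return best;
}
```

A value-replacing iterator would hand the reduction a substitute value for each null row, which is enough for MIN or MAX but loses the row position that ARGMIN and ARGMAX have to return.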

jrhemstad · 2021-08-18T18:50:41Z

cpp/src/groupby/sort/group_single_pass_reduction_util.cuh

+ * @tparam T Type of the underlying column. For dictionary column, type of the key column.
+ */
+template <typename T>
+struct null_replaced_value_accessor : value_accessor<T> {


Same question here. Does the existing null_replacement_iterator not work?

null_replacement_iterator doesn't support dictionary columns right now. This functor does. If null_replacement_iterator is updated to support dictionary columns too, that support will be added to every kernel using it.
Can I add dictionary support to null_replacement_iterator<T> (where T is the underlying type, not dictionary32, for a dictionary column)?
(That could be another PR; column_device_view::begin<T>() could be updated too, which would provide wide support for dictionary columns in most algorithms.
It would need a comparison of all benchmarks too.)

@davidwendt thoughts?

Can you use this?

cudf/cpp/include/cudf/dictionary/detail/iterator.cuh

Lines 110 to 112 in 8b02ca3

template <typename KeyType>
auto make_dictionary_pair_iterator(column_device_view const& dictionary_column,
                                   bool has_nulls = true)

It would be better to use the indices for any cudf operations where possible for both run-time and compile-time performance. For example, sorting in general only needs the indices.
You can use this function

cudf/cpp/include/cudf/dictionary/dictionary_column_view.hpp

Line 73 in 8b02ca3

column_view get_indices_annotated() const noexcept;

to get the indices column_view decorated with the offset, size, and validity-mask appropriately.

Hash groupby produces a base-type column as output.
If we use gather with ARGMIN or ARGMAX to compute MIN or MAX, it would create a dictionary column. (I added one more test for this and updated sort groupby to fix it.)

This sounds correct to me. Aggregates like min/max return values that already exist in the column so the output would have the same keys as the input. Whereas, sum/prod create totally new values.

Also, here is an example using the dictionary-pair-iterator along with a null-replacement transformer.

cudf/cpp/src/reductions/simple.cuh

Lines 142 to 146 in f0fa255

auto f = simple_op.template get_null_replacing_element_transformer<ResultType>();
auto p =
  cudf::dictionary::detail::make_dictionary_pair_iterator<ElementType>(*dcol, col.has_nulls());
auto it = thrust::make_transform_iterator(p, f);
return detail::reduce(it, col.size(), simple_op, stream, mr);

I'm inclined to prefer your approach here instead since it simplifies the caller to one value-accessor. The only thing that makes me nervous is that col.element<dictionary32>(i) would be included/inlined for every type and that function contains its own type-dispatcher call in it. But technically every type is potentially a dictionary key type so I think the same amount of code is generated either way. Anyway, it may be worth looking into using this null-replacement accessor in the reductions code too.

dictionary32 means a 32-bit int index, right?
Why is there another type dispatcher for col.element<dictionary32>(i) if the index type is already known?

Dictionary index types can technically be any unsigned integer type. The element<dictionary32>(i) always returns an int32 value regardless of the underlying indices type.
https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/column/column_device_view.cuh#L415-L421
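
A small host-only C++ sketch may help show why an inner dispatch is needed when the stored index width is only known at run time. It is not the code behind the link above: `index_type_id`, `indices_view_sketch`, and `dictionary_index` are hypothetical, and only three unsigned widths are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Run-time tag for the width of the stored dictionary indices (hypothetical).
enum class index_type_id { UINT8, UINT16, UINT32 };

// Type-erased view of an indices buffer (hypothetical).
struct indices_view_sketch {
  index_type_id type;
  void const* data;
};

// Reads index i at its stored width and always returns it as a 32-bit value.
// The switch is the per-element dispatch on the underlying indices type.
std::int32_t dictionary_index(indices_view_sketch const& indices, std::size_t i)
{
  switch (indices.type) {
    case index_type_id::UINT8:
      return static_cast<std::int32_t>(static_cast<std::uint8_t const*>(indices.data)[i]);
    case index_type_id::UINT16:
      return static_cast<std::int32_t>(static_cast<std::uint16_t const*>(indices.data)[i]);
    case index_type_id::UINT32:
      return static_cast<std::int32_t>(static_cast<std::uint32_t const*>(indices.data)[i]);
  }
  assert(false && "unsupported index type");
  return -1;
}
```

The caller sees a single 32-bit result type, so code written against it does not need to be instantiated per index width; the cost is the branch on every element access.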

jrhemstad · 2021-08-25T17:18:09Z

rerun tests

karthikeyann · 2021-08-26T15:40:51Z

rerun tests

codereport · 2021-08-26T19:26:17Z

In trying to confirm that this enables decimal128 sort-based group_by, I came across the same errors that are causing CI to fail:

/opt/rapids/cudf/cpp/tests/groupby/min_tests.cpp:241:85: error: invalid initialization of reference of type 'std::unique_ptr<cudf::groupby_aggregation>&&' from expression of type 'std::unique_ptr<cudf::aggregation>'
  241 |   test_single_agg(keys, vals, expect_keys, expect_vals_w, cudf::make_min_aggregation());

Will wait for the fix.
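
The shape of this error, and of a templated factory as one way around it, can be reproduced with a self-contained C++ sketch. The classes below are hypothetical stand-ins that only borrow the names from the error message; they are not the cudf types, and the factory signature is an assumption modelled on the later commit title "add groupby templated make_min_aggregation factory".

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-ins for the class hierarchy named in the error.
struct aggregation { virtual ~aggregation() = default; };
struct groupby_aggregation : aggregation {};
struct min_aggregation final : groupby_aggregation {};

// Factory templated on the base type the caller wants back.
template <typename Base = aggregation>
std::unique_ptr<Base> make_min_aggregation()
{
  return std::make_unique<min_aggregation>();
}

// A unique_ptr to the base cannot be converted to a unique_ptr to the derived
// type, which is what the compiler error above is reporting.
static_assert(!std::is_convertible<std::unique_ptr<aggregation>,
                                   std::unique_ptr<groupby_aggregation>>::value,
              "base-to-derived unique_ptr conversion is not implicit");

// Stand-in for a test helper that takes the groupby-specific type.
bool accepts_groupby_aggregation(std::unique_ptr<groupby_aggregation>&& agg)
{
  assert(agg != nullptr);
  return dynamic_cast<min_aggregation*>(agg.get()) != nullptr;
}
```

With this sketch, `accepts_groupby_aggregation(make_min_aggregation<groupby_aggregation>())` compiles, while passing the result of the default `make_min_aggregation()` does not.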

karthikeyann · 2021-08-27T03:06:19Z

@codereport The issue is fixed.

codereport · 2021-08-27T05:12:38Z

> @codereport The issue is fixed.

I have half the group_by decimal128 tests working. The other two require a bit of extra work. Will confirm tomorrow when/if they are working. Thanks 🙏

codereport

Enhanced sort-based group_by for decimal128 works with this PR :)

JohnZed · 2021-08-27T23:21:13Z

@gpucibot merge

replace update_target_element with reduce_by_key in sort groupby

a293aa1

github-actions bot added the libcudf label Aug 13, 2021

karthikeyann added the feature request and non-breaking labels Aug 13, 2021

reduce compile time of group_argmin/max using output_writer_iterator

2c1040b

karthikeyann marked this pull request as ready for review August 16, 2021 07:54

karthikeyann requested a review from a team as a code owner August 16, 2021 07:54

karthikeyann requested review from cwharris and rgsl888prabhu August 16, 2021 07:54

rgsl888prabhu reviewed Aug 16, 2021

dictionary32 comparison for argmin, argmax

729fcaf

karthikeyann requested a review from rgsl888prabhu August 16, 2021 16:53

use transform_output_operator for argmin, argmax

156ba33

jrhemstad reviewed Aug 16, 2021

karthikeyann added the 3 - Ready for Review label Aug 17, 2021

Merge branch 'branch-21.10' of github.com:rapidsai/cudf into fea-noatomic-groupbyreduce

578c37d

remove transform_output_writer_iterator

aa06c15

add missing stream, mr

9e15dc2

jrhemstad reviewed Aug 18, 2021

fix null_mask allocation

e6257c9

karthikeyann requested review from jrhemstad and davidwendt August 19, 2021 16:45

rgsl888prabhu approved these changes Aug 25, 2021

update sort groupby to create base type column for dictionary type

aea6886

karthikeyann added 2 commits August 27, 2021 08:04

Merge branch 'branch-21.10' of github.com:rapidsai/cudf into fea-noatomic-groupbyreduce

5bd4321

add groupby templated make_min_aggregation factory

4d43263

codereport added and removed the 5 - DO NOT MERGE label Aug 27, 2021

codereport approved these changes Aug 27, 2021

rapids-bot bot merged commit 1cc84c5 into rapidsai:branch-21.10 Aug 27, 2021
