
Update sort groupby to use non-atomic operation #9035

Merged

Conversation

@karthikeyann (Contributor) commented Aug 13, 2021

This PR replaces update_target_element with reduce_by_key in the sort-groupby reduce_functor, to allow decimal128 sort groupby.

The operations updated are:

  • SUM
  • PRODUCT
  • MIN
  • MAX
  • ARGMIN
  • ARGMAX

Compilation time increased from 1m18s to 1m27s (initially 3m28s before optimization). The largest contributors were group_argmin.cu and group_argmax.cu at 184s each; their compile time has since been reduced to 70s each.
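
As a rough illustration of the non-atomic approach (a minimal Thrust sketch with made-up data, not the PR's actual code), a per-group sum over pre-sorted group labels can be computed with a single reduce_by_key instead of per-element atomic updates:

#include <thrust/device_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <vector>

void sum_by_group_example()
{
  // Sort-groupby precondition: rows are ordered so that equal group labels
  // are contiguous; reduce_by_key then emits one value per group, no atomics.
  std::vector<int> h_labels{0, 0, 1, 1, 1, 2};
  std::vector<int> h_values{5, 3, 1, 4, 2, 7};
  thrust::device_vector<int> labels(h_labels.begin(), h_labels.end());
  thrust::device_vector<int> values(h_values.begin(), h_values.end());
  thrust::device_vector<int> sums(3);

  thrust::reduce_by_key(labels.begin(), labels.end(),
                        values.begin(),
                        thrust::make_discard_iterator(),  // group labels not needed
                        sums.begin());                    // sums = {8, 7, 7}
}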

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 13, 2021
@karthikeyann karthikeyann added feature request New feature or request non-breaking Non-breaking change labels Aug 13, 2021
@codecov (bot) commented Aug 13, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4d8e401).
The diff coverage is n/a.

❗ Current head 90333da differs from pull request most recent head 4d43263. Consider uploading reports for the commit 4d43263 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9035   +/-   ##
===============================================
  Coverage                ?   10.83%           
===============================================
  Files                   ?      114           
  Lines                   ?    19098           
  Branches                ?        0           
===============================================
  Hits                    ?     2070           
  Misses                  ?    17028           
  Partials                ?        0           


@karthikeyann karthikeyann marked this pull request as ready for review August 16, 2021 07:54
@karthikeyann karthikeyann requested a review from a team as a code owner August 16, 2021 07:54
@rgsl888prabhu (Contributor) left a comment:

Adding test cases for dictionary columns would be beneficial.

Rest looks good.

@karthikeyann (Contributor, Author) commented:

rerun tests

* @tparam Iterator iterator type that acts as index of the output.
*/
template <typename BinaryFunction, typename Iterator>
class output_writer_iterator
Contributor:

This looks like a lot of machinery and I'm not clear about what its purpose is.

@karthikeyann (Contributor, Author):

I created this first to test whether I could set the element and set_valid at the same time using a single reduce_by_key with thrust::optional, but thrust::optional was always slow.
So I reverted to using 2 reduce_by_key calls: one for the element, another for the null_mask.

To use reduce_by_key with the null_mask, it needs a temporary bool buffer and valid_if.
To avoid this, I used this transform_output_writer_iterator.
It exists purely to avoid allocating the temporary bool buffer, and it doesn't affect performance much.
I will revert to using the temporary bool buffer and remove this iterator. (I thought there might be other use cases for this iterator in cudf, especially with null_mask.)
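
For context, a minimal sketch of the two-pass approach described above (pure Thrust with illustrative names; in cudf the temporary bool buffer would then be turned into a bitmask, e.g. via valid_if):

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>

void group_min_with_validity(thrust::device_vector<int> const& labels,  // pre-sorted
                             thrust::device_vector<int> const& values,
                             thrust::device_vector<bool> const& row_valid,
                             thrust::device_vector<int>& out_values,
                             thrust::device_vector<bool>& out_valid)
{
  // Pass 1: per-group minimum of the element values.
  thrust::reduce_by_key(labels.begin(), labels.end(), values.begin(),
                        thrust::make_discard_iterator(), out_values.begin(),
                        thrust::equal_to<int>{}, thrust::minimum<int>{});
  // Pass 2: a group is valid if any of its rows is valid; this temporary
  // bool buffer would then be converted into the output's null mask.
  thrust::reduce_by_key(labels.begin(), labels.end(), row_valid.begin(),
                        thrust::make_discard_iterator(), out_valid.begin(),
                        thrust::equal_to<int>{}, thrust::logical_or<bool>{});
}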

Contributor:

This machinery is the bare minimum to use a proxy object for assignments in thrust. The question is if the proxy object is useful. If it is, then the machinery is just overhead.

A better name, imho, would be assignment_iterator

From what I see, the use case is overriding the assignment operation. In an output iterator, the assignment operator is used to take the value on the rhs and assign it to the lhs. Here, we intercept that assignment operation and call a binary operator binop(lhs, rhs) that can override the assignment operation.

This PR uses the assignment_iterator with a lambda that captures a null mask. The proxy intercepts the lhs int and a rhs bool and then invokes the lambda which calls set_valid or set_null on the captured mask. It is the output version of make_validity_iterator.
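
A minimal sketch of that interception pattern (hypothetical and simplified; a real Thrust-compatible output iterator also needs the iterator-traits/category plumbing omitted here):

#include <cstddef>

// Proxy returned on dereference: "*it = value" is intercepted and forwarded
// to binop(index, value) instead of writing to memory. In this PR, binop is
// a lambda that calls set_valid/set_null on a captured null mask.
template <typename BinaryFunction>
struct assignment_proxy {
  BinaryFunction binop;
  std::ptrdiff_t index;

  template <typename T>
  __host__ __device__ assignment_proxy& operator=(T const& value)
  {
    binop(index, value);
    return *this;
  }
};

// Bare-bones output iterator over the proxy; advancing it only moves the
// index that will be handed to binop.
template <typename BinaryFunction>
struct assignment_iterator {
  BinaryFunction binop;
  std::ptrdiff_t pos{0};

  __host__ __device__ assignment_proxy<BinaryFunction> operator*() const { return {binop, pos}; }
  __host__ __device__ assignment_iterator& operator++() { ++pos; return *this; }
};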

Contributor:

I think we should be more specific here. I would like to see a make_validity_output_iterator that is very focused on dealing with null masks of a particular column. With an interface like this, we can start to experiment with opportunistic coalescing of null mask assignments using cooperative groups. Also, this API makes it very obvious that we are doing individual bit assignments and that coalescing them by changing the calling code could prove more performant.

auto make_validity_output_iterator(mutable_column_device_view const& destination);
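
Usage might look roughly like this (a hypothetical sketch, since the API is only proposed here; d_result, group_labels, and row_is_valid are stand-ins):

// The second reduce_by_key writes per-group validity directly into the
// destination column's null mask, one bit per group, no bool buffer.
auto valid_out = make_validity_output_iterator(*d_result);  // proposed API
thrust::reduce_by_key(group_labels.begin(), group_labels.end(),
                      row_is_valid,                     // bool per input row
                      thrust::make_discard_iterator(),
                      valid_out,                        // writes bits, not bools
                      thrust::equal_to<cudf::size_type>{},
                      thrust::logical_or<bool>{});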

Contributor:

The implementation may well use the proxy-based iterator you've created, but there would be a very clear use case for it, and other developers would have an easy-to-use API for that use case.

Contributor:

@elstehle IIRC you've used Thrust a lot... have you ever run into an instance where this sort of assignment interception would be useful?

@elstehle replied: Sorry, I don't have a concrete use case in the back of my head right now.

Contributor:

@jrhemstad I'm still confused. Why exactly is cooperative groups not an option here? I was thinking we could try coalesced_threads(), followed by labeled_partition() and reduce(), then use thread_rank() == 0 to indicate which threads should write the output, and use atomics to make sure uncoordinated writes don't interfere with one another. I doubt this will be fast in the majority of use cases, but I am not sure why it would not work. It would be an experiment to see if we can speed up the best-case scenario, where all/most writes are sequential.

Contributor:

> I was under the impression cooperative groups was able to detect which threads were active

They can. Determining which threads are active is not the problem.

The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word. For example, if any two threads in a grid want to update bits i and i+1, how do you detect that scenario without some form of communication? coalesced_threads doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.

Furthermore, even if the two threads are both active at the same time in the same warp (which you could never rely on), how do you detect that t0 wants to update bit i and t1 wants to update bit i+1, when it could just as well be that t0 updates i and t1 updates i + 1042?
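
For reference, the safe baseline under discussion is an uncoordinated per-bit atomic update; each thread read-modify-writes a shared 32-bit word, which is exactly what a coalescing scheme would try to batch (a minimal sketch):

__device__ void set_bit_atomic(unsigned* null_mask, int i)
{
  // Two threads setting bits i and i+1 touch the same word, so a plain
  // store would lose one of the updates; atomicOr makes each update safe.
  atomicOr(null_mask + (i / 32), 1u << (i % 32));
}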

@cwharris (Contributor) commented Aug 18, 2021:

That's where @karthikeyann's proxy class comes into play. It knows the index and the value, and the lambda is responsible for using that index and value to assign the appropriate bit in a captured null mask. It appears we have enough information to attempt opportunistic concurrency here. I'm not saying it will be beneficial, only that it seems possible and might be worth an experiment.

@cwharris (Contributor) commented Aug 18, 2021:

> The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word.

The proxy class has the index, and we can pass this to labeled_partition.

> For example, if any two threads in a grid want to update bits i and i+1, how do you detect that scenario without some form of communication?

For the sake of the experiment, the communication would be limited to 32 threads within a warp. Communicating across warps would require shmem, and since we're in a lambda called by thrust, I'm not sure if/how that would work.

> coalesced_threads doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.

coalesced_threads is just to prevent UB/hangs; labeled_partition would be responsible for determining which writes can be coalesced, using idx/32 as the label (with some offset, if necessary).
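
A hypothetical sketch of that experiment (illustrative only; cg::labeled_partition and cg::reduce require CUDA 11 and compute capability 7.0+):

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Coalesce the currently-active threads, partition them by the 32-bit word
// their target bit falls into (label = idx / 32), OR-reduce the bit updates
// within each partition, and let one thread per partition issue a single
// atomicOr with the merged bits.
__device__ void set_bit_coalesced(unsigned* null_mask, int idx)
{
  auto active     = cg::coalesced_threads();
  int word        = idx / 32;  // label: which mask word this thread updates
  auto group      = cg::labeled_partition(active, word);
  unsigned merged = cg::reduce(group, 1u << (idx % 32), cg::bit_or<unsigned>());
  if (group.thread_rank() == 0) { atomicOr(null_mask + word, merged); }
}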

@jrhemstad (Contributor) commented:

Did you run any benchmarks before/after this?

@karthikeyann karthikeyann added the 3 - Ready for Review Ready for review by team label Aug 17, 2021
@karthikeyann (Contributor, Author) commented:

Benchmark Comparison (times in ms; the Time/CPU columns are relative changes, new vs. old):

Benchmark                                 Time       CPU      Time Old   Time New   CPU Old   CPU New
------------------------------------------------------------------------------------------------------
Groupby/PreSorted/1000000/manual_time    -0.7131    -0.6977       0.68       0.19      0.69      0.21
Groupby/PreSorted/10000000/manual_time   -0.8705    -0.8687       6.95       0.90      6.97      0.92
Groupby/PreSorted/100000000/manual_time  -0.8864    -0.8862      70.49       8.01     70.51      8.03

Comment on lines +99 to +103
struct null_as_sentinel {
  column_device_view const col;
  size_type const SENTINEL;
  __device__ size_type operator()(size_type i) const { return col.is_null(i) ? SENTINEL : i; }
};
Contributor:

Can't null_replacement_iterator be used instead?

auto make_null_replacement_iterator(column_device_view const& column,

@karthikeyann (Contributor, Author) commented Aug 19, 2021:

Can't. null_replacement_iterator returns the values of the column; here, indices are needed.

* @tparam T Type of the underlying column. For dictionary column, type of the key column.
*/
template <typename T>
struct null_replaced_value_accessor : value_accessor<T> {
Contributor:

Same question here. Does the existing null_replacement_iterator not work?

@karthikeyann (Contributor, Author) commented Aug 19, 2021:

null_replacement_iterator right now doesn't support dictionary columns; this functor does. If null_replacement_iterator were updated to support dictionaries too, that would add code to every kernel using it.
Can I add dictionary support to null_replacement_iterator<T> (where T is the underlying type, not dictionary32, for dictionary columns)?
(This could be another PR; column_device_view::begin<T>() could be updated too, which would provide wide support for dictionary columns in most algorithms. It would also need a full benchmark comparison.)

Contributor:

@davidwendt thoughts?

Contributor:

Can you use this?

template <typename KeyType>
auto make_dictionary_pair_iterator(column_device_view const& dictionary_column,
                                   bool has_nulls = true)

@davidwendt (Contributor) commented Aug 19, 2021:

It would be better to use the indices for any cudf operations where possible for both run-time and compile-time performance. For example, sorting in general only needs the indices.
You can use this function

column_view get_indices_annotated() const noexcept;

to get the indices column_view decorated with the offset, size, and validity-mask appropriately.
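
A sketch of the suggestion (hypothetical caller code; get_indices_annotated is the cudf API quoted above):

// Operate on the dictionary column's indices instead of its values: the
// annotated view carries the parent's offset, size, and null mask, and
// ordering by index matches ordering by key because cudf keeps dictionary
// keys sorted and unique.
cudf::dictionary_column_view dict{input};  // `input`: a dictionary column_view
cudf::column_view indices = dict.get_indices_annotated();
// ...run the sort/groupby machinery on `indices` as a plain integer column...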

@karthikeyann (Contributor, Author):

Hash groupby produces a base-type column as output. If we used gather with ARGMIN/ARGMAX for MIN/MAX, it would create a dictionary column. (Added one more test for this, and updated sort groupby to fix it.)

Contributor:

This sounds correct to me. Aggregates like min/max return values that already exist in the column, so the output would have the same keys as the input, whereas sum/prod create totally new values.

Contributor:

Also, here is an example using the dictionary-pair-iterator along with a null-replacement transformer.

auto f  = simple_op.template get_null_replacing_element_transformer<ResultType>();
auto p  = cudf::dictionary::detail::make_dictionary_pair_iterator<ElementType>(*dcol, col.has_nulls());
auto it = thrust::make_transform_iterator(p, f);
return detail::reduce(it, col.size(), simple_op, stream, mr);

I'm inclined to prefer your approach here instead since it simplifies the caller to one value-accessor. The only thing that makes me nervous is that col.element<dictionary32>(i) would be included/inlined for every type, and that function contains its own type-dispatcher call. But technically every type is potentially a dictionary key type, so I think the same amount of code is generated either way. Anyway, it may be worth looking into using this null-replacement accessor in the reductions code too.

@karthikeyann (Contributor, Author):

dictionary32 means a 32-bit int index, right?
Why is there another type dispatcher for col.element<dictionary32>(i) if the index type is already known?

Contributor:

Dictionary index types can technically be any unsigned integer type. element<dictionary32>(i) always returns an int32 value regardless of the underlying indices type.
https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/column/column_device_view.cuh#L415-L421
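
Roughly, the linked code has this shape (a paraphrased sketch, not the verbatim source; index_as_int32_fn is a hypothetical stand-in for the functor used there):

// element<dictionary32> dispatches on the actual storage type of the
// indices child column and widens the fetched value to a 32-bit index.
struct index_as_int32_fn {
  template <typename IndexType>
  __device__ int32_t operator()(cudf::column_device_view const& indices, cudf::size_type i) const
  {
    if constexpr (std::is_integral_v<IndexType>)
      return static_cast<int32_t>(indices.element<IndexType>(i));
    else
      return 0;  // unreachable: dictionary indices are always integral
  }
};
// ...return dictionary32{type_dispatcher(indices.type(), index_as_int32_fn{}, indices, i)};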

@jrhemstad (Contributor) commented:

rerun tests

1 similar comment
@karthikeyann (Contributor, Author) commented:

rerun tests

@codereport (Contributor) commented:

In trying to confirm that this enables decimal128 sort-based group_by, I came across the same errors that are causing CI to fail:

/opt/rapids/cudf/cpp/tests/groupby/min_tests.cpp:241:85: error: invalid initialization of reference of type 'std::unique_ptr<cudf::groupby_aggregation>&&' from expression of type 'std::unique_ptr<cudf::aggregation>'
  241 |   test_single_agg(keys, vals, expect_keys, expect_vals_w, cudf::make_min_aggregation());

Will wait for the fix.

@karthikeyann (Contributor, Author) commented:

@codereport The issue is fixed.

@codereport (Contributor) commented:

> @codereport The issue is fixed.

I have half the group_by decimal128 tests working. The other two require a bit of extra work. Will confirm tomorrow when/if they are working. Thanks 🙏

@codereport codereport added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Aug 27, 2021
@codereport (Contributor) left a comment:

Enhanced sort-based group_by for decimal128 works with this PR :)

@JohnZed (Contributor) commented Aug 27, 2021:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1cc84c5 into rapidsai:branch-21.10 Aug 27, 2021