
Replace dots_along_rows with rowNorm and improve coalescedReduction performance #1011

Merged
merged 11 commits into rapidsai:branch-22.12 on Nov 22, 2022

Conversation

Nyrio
Contributor

@Nyrio Nyrio commented Nov 11, 2022

dots_along_rows in ann_utils.cuh was in some cases faster than the corresponding raft primitive rowNorm, so I have improved that primitive so it can replace dots_along_rows without performance regressions. rowNorm for a row-major matrix calls coalescedReduction, which I have modified to select one of the following code paths based on the input dimensions:

  • Thin: for matrices with many small rows, one block processes multiple rows, with 2 to 32 threads collaborating on each row using a shuffle-based reduction.
  • Medium: the existing cub-based implementation with one block per row. (I have only changed the reduction algorithm to raking, which is faster provided that the workload is big enough.)
  • Thick: a two-step implementation. The first step uses multiple blocks per row and reduces into an intermediate buffer (main_op is applied but not final_op). The second step reduces the intermediate buffer with the thin kernel (this time final_op is applied but not main_op).

Other changes included in this PR:

  • To properly support shuffle-based reductions, I have added generic shuffle helpers that support arbitrary types by splitting them into chunks (based on size/alignment). This was adapted from similar helpers in CUB.
  • I have added a helper for "logical" warp reductions, i.e. over sub-warps of 2, 4, 8, 16, or 32 threads, and added support for arbitrary reduction operations in the warp reduction.
  • I have consolidated the tests with support for arbitrary types and operations, and tested operations that exercise the index argument of main_op, such as an argmax. For the coalesced reduction only, I have added test cases with raft::KeyValuePair.

@Nyrio
Contributor Author

Nyrio commented Nov 14, 2022

Note to reviewers: I am aware that the reduction currently doesn't compile with non-trivial types such as cub pairs, due to the shuffle-based reductions. I am working on a fix.

@Nyrio
Contributor Author

Nyrio commented Nov 15, 2022

I have fixed support for non-trivial types. Please take a detailed look at the last commit, in particular the changes to cuda_utils.cuh.

@tfeher
Contributor

tfeher commented Nov 16, 2022

After these changes, is the following comment still valid?

* current implementation is optimized only for bigger values of 'D'.

@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

After these changes, is the following comment still valid?

Removed.

Contributor

@tfeher tfeher left a comment


Hi Louis, it is nice to see further improvements in our prims. I see that the bulk of the changes are updates to the test cases, thanks for the thorough work!

I have just a few smaller comments on the code.

Please update the PR description:

  • mention adding general shuffle and reduction op
  • move detailed description about performance of different kernels into a separate comment.

If you have any measurements or notes on why this approach is better than cub's segmented reduction, please add a comment.

(review comments on cpp/include/raft/util/cuda_utils.cuh)
@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

@tfeher cub::DeviceSegmentedReduce is a more generic primitive, and my expectation was that it would not perform better, but I should run benchmarks against it to make sure. Segmented reduce can work with segments of arbitrary lengths and reads the start and end offsets of the segments from arrays. I haven't read the implementation, but my guess is that it does a BlockReduce per segment, in which case we have no reason to pay the price of creating and reading those offsets.

@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

Some notes on the performance of the thick vs medium kernel:

  • For the thick implementation, I considered using atomics, but for a generic reduction that requires a pre-step to initialize the output and to allocate and initialize mutexes, plus a post-step for the final op. That is altogether much costlier than the two-step approach I ended up using.
  • The prim is heavily memory-bound, so one block per SM is enough to reach near-SOL global memory bandwidth. This means the medium kernel performs better whenever the number of rows is near or greater than the number of SMs, i.e. anything more than a few dozen rows.
  • If the number of rows is small and the number of columns is up to a few thousand, we should also prefer the one-kernel approach, because the bottleneck there is kernel launch latency.

Visual demonstration of the performance of the medium vs thick implementations (y-axis is time in ms, lower is better):

(benchmark plot: 2022-11-10_thick_comp_f_i32_pool)

Contributor

@tfeher tfeher left a comment


Thanks Louis for the update, LGTM!

@cjnolet
Member

cjnolet commented Nov 17, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 17, 2022

@Nyrio I suspect the CI checks aren't being executed because of the conflicts in your branch.

@Nyrio
Contributor Author

Nyrio commented Nov 17, 2022

@cjnolet I was waiting for my local compilation and test run to succeed before pushing, but as you might expect, compiling the neighbors tests took a few hours.

@cjnolet
Member

cjnolet commented Nov 17, 2022

Wait, a few hours?!?! What type of environment / configuration are you using? How many cores are you using to compile?

@cjnolet
Member

cjnolet commented Nov 17, 2022

@gpucibot merge

@cjnolet
Member

cjnolet commented Nov 18, 2022

@gpucibot merge

@cjnolet
Member

cjnolet commented Nov 18, 2022

rerun tests

@Nyrio
Contributor Author

Nyrio commented Nov 21, 2022

@cjnolet It looks like the CI errors are unrelated to the contents of this PR.

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 21, 2022

@Nyrio yep you are right about that. @ajschmidt8 has fixed the issue so we should be able to get this in today, assuming it passes.

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 22, 2022

rerun tests

@rapids-bot rapids-bot bot merged commit a6961dc into rapidsai:branch-22.12 Nov 22, 2022
Labels: 3 - Ready for Review, CMake, cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)