
Replace dots_along_rows with rowNorm and improve coalescedReduction performance #1011

Merged
merged 11 commits into rapidsai:branch-22.12 on Nov 22, 2022

Conversation

Nyrio
Contributor

@Nyrio Nyrio commented Nov 11, 2022

dots_along_rows in ann_utils.cuh was in some cases faster than the corresponding raft primitive rowNorm, so I have improved that primitive so it can replace dots_along_rows without performance regressions. rowNorm for a row-major matrix calls coalescedReduction, which I have modified to select one of the following code paths based on the input dimensions:

  • Thin: for matrices with many small rows, one block processes multiple rows, with 2 to 32 threads collaborating on each row using a shuffle-based reduction.
  • Medium: the existing cub-based implementation with one block per row. (I have only changed the reduction algorithm to raking, which is faster provided that the workload is big enough.)
  • Thick: a two-step implementation. The first step uses multiple blocks per row and reduces into an intermediate buffer (main_op is applied but not final_op). The second step reduces the intermediate buffer with the thin kernel (this time final_op is applied but not main_op).

Other changes included in this PR:

  • To properly support shuffle-based reductions, I have added generic shuffle helpers that support arbitrary types by splitting them into chunks (based on size/alignment). This was adapted from similar helpers in CUB.
  • I have added a helper for "logical" warp reductions, i.e. over sub-warps of 2, 4, 8, 16, or 32 threads, and added support for arbitrary reduction operations in the warp reduction.
  • I have consolidated the tests with support for arbitrary types and operations, and tested operations that exercise the index argument of main_op, such as an argmax. For the coalesced reduction only, I have added test cases with raft::KeyValuePair.

@Nyrio
Contributor Author

Nyrio commented Nov 14, 2022

Note to reviewers: I am aware that the reduction currently doesn't compile with non-trivial types such as cub pairs, due to the shuffle-based reductions. I am working on a fix.

@Nyrio
Contributor Author

Nyrio commented Nov 15, 2022

I have fixed support for non-trivial types. Please take a detailed look at the last commit, in particular the changes to cuda_utils.cuh.

@tfeher
Contributor

tfeher commented Nov 16, 2022

After these changes, is the following comment still valid?

* current implementation is optimized only for bigger values of 'D'.

@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

After these changes, is the following comment still valid?

Removed.

Contributor

@tfeher tfeher left a comment


Hi Louis, it is nice to see further improvements in our prims. I see that the bulk of the changes are updates to the test cases, thanks for the thorough work!

I have just a few smaller comments on the code.

Please update the PR description:

  • mention adding general shuffle and reduction op
  • move detailed description about performance of different kernels into a separate comment.

If you have any measurements or notes on why this approach is better than cub's segmented reduction, please add a comment.

(review comments on cpp/include/raft/util/cuda_utils.cuh)
@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

@tfeher cub::DeviceSegmentedReduce is a more generic primitive, and my expectation was that it would not perform better, but I should run benchmarks against it to make sure. Segmented reduce can work with segments of arbitrary lengths and reads the start and end offsets of the segments from arrays. I haven't read the implementation, but my guess is that it does a BlockReduce per segment, in which case we have no reason to pay the price of creating and reading those offsets.

@Nyrio
Contributor Author

Nyrio commented Nov 16, 2022

Some notes on the performance of the thick vs medium kernel:

  • For the thick implementation, I considered using atomics, but for a generic reduction that requires a pre-step to initialize the output and to allocate and initialize mutexes, plus a post-step for the final op. That is altogether much costlier than the two-step approach I ended up using.
  • The prim is heavily memory-bound, so one block per SM is enough to reach near-SOL global memory bandwidth. This means the medium kernel performs better whenever the number of rows is near or greater than the number of SMs, i.e. anything more than a few dozen rows.
  • If the number of rows is small and the number of columns is up to a few thousand, we should also prefer the one-kernel approach, because the bottleneck there is kernel launch latency.

Visual demonstration of the performance of the medium vs thick implementations (y-axis is time in ms, lower is better):

(benchmark plot: 2022-11-10_thick_comp_f_i32_pool)

Contributor

@tfeher tfeher left a comment


Thanks Louis for the update, LGTM!

@cjnolet
Member

cjnolet commented Nov 17, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 17, 2022

@Nyrio I suspect the CI checks aren't being executed because of the conflicts in your branch.

@Nyrio
Contributor Author

Nyrio commented Nov 17, 2022

@cjnolet I was waiting for my local compilation and test run to succeed before pushing, but as you might expect, compiling the neighbors tests took a few hours.

@cjnolet
Member

cjnolet commented Nov 17, 2022

Wait, a few hours?!?! What type of environment / configuration are you using? How many cores are you using to compile?

@cjnolet
Member

cjnolet commented Nov 17, 2022

@gpucibot merge

@cjnolet
Member

cjnolet commented Nov 18, 2022

@gpucibot merge

@cjnolet
Member

cjnolet commented Nov 18, 2022

rerun tests

@Nyrio
Contributor Author

Nyrio commented Nov 21, 2022

@cjnolet It looks like the CI errors are unrelated to the contents of this PR.

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 21, 2022

@Nyrio yep you are right about that. @ajschmidt8 has fixed the issue so we should be able to get this in today, assuming it passes.

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 21, 2022

rerun tests

@cjnolet
Member

cjnolet commented Nov 22, 2022

rerun tests

@rapids-bot rapids-bot bot merged commit a6961dc into rapidsai:branch-22.12 Nov 22, 2022
Labels: 3 - Ready for Review, CMake, cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)