Add dispatch based on compute architecture #1295
This needs to be static so we don't run into the issue where multiple consumers of raft build with different `arch` values and we get incorrect kernel selection. For more info see: NVIDIA/cub#545
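For readers following along, here is a minimal sketch of what the static dummy-kernel query could look like. The names `dummy_runtime_kernel` and `kernel_runtime_arch` are taken from the discussion, but the bodies are assumptions and not necessarily the code in this PR. The sketch relies only on `cudaFuncGetAttributes` and the `ptxVersion` field of `cudaFuncAttributes` from the CUDA runtime.

```cuda
#include <cuda_runtime.h>

// Assumed empty kernel. `static` gives it internal linkage, so every
// translation unit queries its own copy instead of whichever copy the
// linker happened to keep from a library built with different arch flags.
static __global__ void dummy_runtime_kernel() {}

// Assumed helper: returns the PTX version (major * 10 + minor, e.g. 80)
// of the kernel image the runtime selected for the current device,
// or -1 if the query fails.
static int kernel_runtime_arch()
{
  cudaFuncAttributes attrs;
  cudaError_t err = cudaFuncGetAttributes(&attrs, dummy_runtime_kernel);
  if (err != cudaSuccess) { return -1; }
  return attrs.ptxVersion;
}
```

The trade-off is the one raised in this thread: each translation unit that includes this code carries its own empty kernel in the final binary.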
That's a good point. It looks like the dummy kernel approach requires making the kernel static to get a reliable solution, at the cost of littering the final binary with many empty kernels.
In `kernel_runtime_arch`, we are currently taking a pointer to the `dummy_runtime_kernel`. If we instead took a runtime argument that was a pointer to one of the candidate kernels that is going to be called, would that solve the problem? That is, I would remove the `dummy_runtime_kernel` and the kernel pointer would have to be provided by the user. I think it does solve the linking problem that you described above and it doesn't create spurious kernels, but I want to double check before I change the code.
Requiring a kernel pointer would work as well, since we would now be querying based on a specific kernel that was only compiled once.
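A rough sketch of the kernel-pointer variant, under stated assumptions: `kernel_sm70`, `kernel_sm80`, and `launch` are hypothetical names made up for illustration, and the signature of `kernel_runtime_arch` shown here is a guess at the shape of the idea rather than the merged code.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel variants, each tuned for a different architecture.
__global__ void kernel_sm70(float* out) { out[threadIdx.x] = 1.0f; }
__global__ void kernel_sm80(float* out) { out[threadIdx.x] = 2.0f; }

// Assumed helper: queries the PTX version (major * 10 + minor) of the image
// that the runtime selected for a kernel supplied by the caller.
// Returns -1 if the query fails.
int kernel_runtime_arch(const void* kernel_ptr)
{
  cudaFuncAttributes attrs;
  if (cudaFuncGetAttributes(&attrs, kernel_ptr) != cudaSuccess) { return -1; }
  return attrs.ptxVersion;
}

// Hypothetical dispatch: query one of the candidates, then pick a variant.
// A failed query (-1) falls through to the more conservative variant.
void launch(float* out, cudaStream_t stream)
{
  const int arch = kernel_runtime_arch(reinterpret_cast<const void*>(kernel_sm70));
  if (arch >= 80) {
    kernel_sm80<<<1, 32, 0, stream>>>(out);
  } else {
    kernel_sm70<<<1, 32, 0, stream>>>(out);
  }
}
```

Because the queried kernel is one that will actually be launched, no extra empty kernel ends up in the binary.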
Thanks a lot! I will go for that direction then.
I'm a little late to the party, but I came up with an idea for an alternative way of doing this that I like better because it avoids the empty kernel. See https://github.com/NVIDIA/cub/issues/556
Thanks for the pointer! I've been meaning to respond to this for a while, but never found the time to test my assertions.
We are currently (that is: in the PR that was merged) avoiding the empty kernel by forcing the caller to provide a pointer to one of the kernel versions. We then query the func attributes of that kernel.
The `__CUDA_ARCH_LIST__` approach looks worthwhile. However, it may break when kernels are weakly linked (e.g. templated). You describe the issue very well in #1722. I had not considered outlawing weak linking completely. Let's see how that goes!
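To make the `__CUDA_ARCH_LIST__` idea concrete, here is a simplified sketch and not the proposal from the linked issue. It assumes that the best match is the largest compiled architecture not exceeding the device's compute capability, which ignores details such as whether PTX or only SASS was embedded. The function name `best_compiled_arch` is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <initializer_list>

// Recent nvcc versions define __CUDA_ARCH_LIST__ as a comma-separated,
// ascending list of __CUDA_ARCH__ values such as 700,800. The fallback
// below only keeps the sketch compiling with toolkits that lack the macro.
#ifndef __CUDA_ARCH_LIST__
#define __CUDA_ARCH_LIST__ 0
#endif

// Hypothetical helper: returns the largest architecture this translation
// unit was compiled for that does not exceed the current device's compute
// capability (in __CUDA_ARCH__ units, e.g. 800), or -1 on failure.
int best_compiled_arch()
{
  int device = 0, major = 0, minor = 0;
  if (cudaGetDevice(&device) != cudaSuccess) { return -1; }
  if (cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) != cudaSuccess) {
    return -1;
  }
  if (cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device) != cudaSuccess) {
    return -1;
  }
  const int device_arch = major * 100 + minor * 10;

  int best = -1;
  for (int arch : {__CUDA_ARCH_LIST__}) {
    if (arch <= device_arch && arch > best) { best = arch; }
  }
  return best;
}
```

Note that the list reflects the flags of the translation unit doing the query. If the kernel that actually runs is a weakly linked copy from a translation unit built with different flags, the answer can disagree with what is executed, which is the concern raised above.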