Add dispatch based on compute architecture #1295
Closed
ahendriksen wants to merge 6 commits into rapidsai:pull-request/1142 from ahendriksen:enh-arch-dispatch
Commits
749d000 Add dispatch based on compute architecture (ahendriksen)
7262861 Fix style (ahendriksen)
1ef8520 Merge remote-tracking branch 'rapids/pull-request/1142' into enh-arch… (ahendriksen)
09a3050 Fix linker error: multiple definition.. (ahendriksen)
f8daf48 Merge remote-tracking branch 'rapids/pull-request/1142' into enh-arch… (ahendriksen)
1a6636f Implement review feedback (ahendriksen)
@@ -0,0 +1,141 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#pragma once

namespace raft::arch {

/* raft::arch provides the following facilities:
 *
 * - raft::arch::SM_XX : hardcoded compile-time constants for various compute
 *   architectures. The values raft::arch::SM_min and raft::arch::SM_future
 *   represent architectures that are always smaller and larger (respectively)
 *   than any architecture that can be encountered in practice.
 *
 * - raft::arch::SM_compute_arch : a compile-time value for the *current*
 *   compute architecture that a kernel is compiled with. It can only be used
 *   inside kernels with a template argument.
 *
 * - raft::arch::kernel_runtime_arch : a function that computes at *run-time*
 *   which version of a kernel will launch (i.e., it will return the compute
 *   architecture of the version of the kernel that will be launched by the
 *   driver).
 *
 * - raft::arch::SM_range : a compile-time value to represent a half-open
 *   interval of compute architectures. This can be used to check if the
 *   current compile-time architecture is in a specified compatibility range.
 */

// detail::SM_generic is a template to create a generic compile-time SM
// architecture constant.
namespace detail {
template <int n>
struct SM_generic {
 public:
  __host__ __device__ constexpr int value() const { return n; }
};

// A dummy kernel that is used to determine the runtime architecture.
__global__ inline void dummy_runtime_kernel() {}
}  // namespace detail

// A list of architectures that RAPIDS explicitly builds for (SM60, ..., SM90)
// and SM_min and SM_future, which allow specifying a half-open interval of
// compatible compute architectures.
using SM_min = detail::SM_generic<350>;
using SM_60 = detail::SM_generic<600>;
using SM_70 = detail::SM_generic<700>;
using SM_75 = detail::SM_generic<750>;
using SM_80 = detail::SM_generic<800>;
using SM_86 = detail::SM_generic<860>;
using SM_90 = detail::SM_generic<900>;
using SM_future = detail::SM_generic<99999>;

// This is a type that uses the __CUDA_ARCH__ macro to obtain the compile-time
// compute architecture. It can only be used where __CUDA_ARCH__ is defined,
// i.e., inside a __global__ function template.
struct SM_compute_arch {
  template <int dummy = 0>
  __device__ constexpr int value() const
  {
#ifdef __CUDA_ARCH__
    return __CUDA_ARCH__;
#else
    // This function should not be called in host code (because __CUDA_ARCH__
    // is not defined). This function is constexpr and thus can be called in
    // host code (due to the --expt-relaxed-constexpr compile flag). We would
    // like to provide an intelligible error message when this function is
    // called in host code, which we do below.
    //
    // To make sure the static_assert only fires in host code, we use a dummy
    // template parameter as described in P2593:
    // https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2593r0.html
    static_assert(dummy != 0,
                  "SM_compute_arch.value() is only callable from a __global__ function template. "
                  "A way to create a function template is by adding 'template <int dummy = 0>'.");
    return -1;
#endif
  }
};

// A runtime value for the actual compute architecture of a kernel.
//
// A single kernel can be compiled for several "virtual" compute architectures.
// When a program runs, the driver picks the version of the kernel that most
// closely matches the current hardware. This struct reflects the virtual
// compute architecture of the version of the kernel that the driver picks when
// the kernel runs.
struct SM_runtime {
  friend SM_runtime kernel_runtime_arch();

 private:
  const int _version;
  SM_runtime(int version) : _version(version) {}

 public:
  __host__ __device__ int value() const { return _version; }
};

// Computes which compute architecture of a kernel will run.
//
// Semantics are described above in the documentation of SM_runtime.
inline SM_runtime kernel_runtime_arch()
{
  auto kernel = detail::dummy_runtime_kernel;
  cudaFuncAttributes attributes;
  cudaFuncGetAttributes(&attributes, kernel);

  return SM_runtime(10 * attributes.ptxVersion);
}

// SM_range represents a range of SM architectures. It can be used to
// conditionally compile a kernel.
template <typename SM_MIN, typename SM_MAX>
struct SM_range {
 private:
  const SM_MIN _min;
  const SM_MAX _max;

 public:
  __host__ __device__ constexpr SM_range(SM_MIN min, SM_MAX max) : _min(min), _max(max) {}

  template <typename SM_t>
  __host__ __device__ constexpr bool contains(SM_t current) const
  {
    return _min.value() <= current.value() && current.value() < _max.value();
  }
};

}  // namespace raft::arch
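
A minimal sketch of how these pieces compose. The names example_kernel and dispatch_example are illustrative and not part of this diff; it assumes a .cu translation unit built with --expt-relaxed-constexpr, per the header's comments.

// Usage sketch (hypothetical). The kernel is a template so that
// SM_compute_arch::value() may legally be instantiated inside it.
template <typename CompatRange>
__global__ void example_kernel(CompatRange compat_range, int* out)
{
  // Compile-time branch: SM_compute_arch() reads __CUDA_ARCH__.
  if (compat_range.contains(raft::arch::SM_compute_arch())) {
    *out = 1;  // path that may rely on features of the newer architectures
  } else {
    *out = 0;  // generic fallback
  }
}

void dispatch_example(int* d_out)
{
  // The half-open interval [SM80, SM_future), i.e. "SM80 or newer".
  auto compat_range = raft::arch::SM_range(raft::arch::SM_80(), raft::arch::SM_future());

  // Run-time guard: only launch if the kernel version the driver will pick
  // falls inside the interval.
  if (compat_range.contains(raft::arch::kernel_runtime_arch())) {
    example_kernel<<<1, 1>>>(compat_range, d_out);
  }
}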
This needs to be static so we don't run into the issue where multiple consumers of raft build with different arch values and we get incorrect kernel selection. For more info see: NVIDIA/cub#545
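
In code, the suggested change presumably amounts to the following one-line sketch, assuming internal linkage is the mechanism meant here: each translation unit then queries its own copy of the dummy kernel, compiled with that TU's own arch flags.

// static gives the dummy kernel internal linkage, so cudaFuncGetAttributes
// in a given TU reports attributes consistent with that TU's -gencode flags.
static __global__ void dummy_runtime_kernel() {}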
That's a good point. It looks like the dummy kernel approach requires making the kernel static to get a reliable solution, at the cost of littering the final binary with many empty kernels.
In kernel_runtime_arch, we are currently taking a pointer to the dummy_runtime_kernel. If instead, we took a runtime argument that was a pointer to one of the candidate kernels that is going to be called, would that solve the problem? That is, I would remove the dummy_runtime_kernel and the kernel pointer would have to be provided by the user. I think it does solve the linking problem that you described above and it doesn't create spurious kernels, but I want to double check before I change the code.
Requiring a kernel pointer would work as well since we would now be querying based on a specific kernel that was only compiled once.
Thanks a lot! I will go for that direction then.
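
A sketch of what that direction might look like; this signature is illustrative and not necessarily the code that was eventually merged.

// Illustrative variant: query the attributes of a caller-supplied kernel
// instead of a private dummy kernel. The friend declaration in SM_runtime
// would need to change to match this signature.
inline SM_runtime kernel_runtime_arch(void* kernel)
{
  cudaFuncAttributes attributes;
  cudaFuncGetAttributes(&attributes, kernel);
  return SM_runtime(10 * attributes.ptxVersion);
}

// Call site: pass one of the kernels that is actually going to be launched,
// e.g. kernel_runtime_arch(reinterpret_cast<void*>(my_kernel<T>)).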
I'm a little late to the party, but I came up with an idea for an alternative way of doing this that I like better because it avoids the empty kernel. See https://github.com/NVIDIA/cub/issues/556
Thanks for the pointer! I've been meaning to respond to this for a while, but never found the time to test my assertions.
We are currently (that is: in the PR that was merged) avoiding the empty kernel by forcing the caller to provide a pointer to one of the kernel versions. We then query the func attributes of that kernel.
The __CUDA_ARCH_LIST__ approach looks worthwhile. However, it may break when kernels are weakly linked (e.g. templated). You describe the issue very well in #1722. I had not considered outlawing weak linking completely. Let's see how that goes!
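
For context, a rough sketch of the __CUDA_ARCH_LIST__ idea as I read it from that issue. It assumes nvcc >= 11.5, which defines the macro as the comma-separated list of architectures the translation unit is built for; the function name and details are illustrative.

#include <initializer_list>

// Rough sketch: pick the highest architecture in this TU's compile list that
// does not exceed the device's compute capability. For strongly linked
// kernels this matches the version the driver will select; as noted above,
// weakly linked (e.g. templated) kernels can still break this assumption.
inline int arch_from_compile_list()
{
  int device = 0, major = 0, minor = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
  const int cc = major * 100 + minor * 10;  // e.g. compute capability 8.6 -> 860

  int best = 0;
  for (int arch : {__CUDA_ARCH_LIST__}) {  // expands to e.g. 600,700,800
    if (arch <= cc && arch > best) { best = arch; }
  }
  return best;  // 0 if no compiled architecture is compatible
}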