Current Situation

As discussed in NVIDIA/cub#545, CUB needs to query the current device's compute capability in order to know which tuning policy to use when launching a kernel.

Currently, CUB does this by calling cudaFuncGetAttributes on an EmptyKernel<void>.

As discussed in NVIDIA/cub#545, this runs into problems due to the nuanced relationship among the linkage of kernels, their enclosing functions, and the architectures used to compile the TU. The end result is that we can end up getting a version of EmptyKernel with a different PTX version than we expect.
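For reference, the current approach looks roughly like the following sketch (simplified for illustration; this is not the exact CUB source):

```cuda
#include <cuda_runtime.h>

// Simplified sketch of the current approach: compile an empty kernel and ask
// the runtime which PTX version of that kernel would run on the current device.
template <typename T>
__global__ void EmptyKernel() {}

inline cudaError_t QueryPtxVersion(int& ptx_version)
{
    cudaFuncAttributes attrs;
    cudaError_t error = cudaFuncGetAttributes(&attrs, EmptyKernel<void>);
    if (error == cudaSuccess)
    {
        // ptxVersion encodes major * 10 + minor, e.g. 80 for a PTX 8.0 target.
        ptx_version = attrs.ptxVersion;
    }
    return error;
}
```

The linkage issues described above mean the EmptyKernel<void> symbol that cudaFuncGetAttributes resolves may come from a different TU, compiled for different architectures, than the caller expects.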
Proposal
The goal of the machinery described above is to determine which PTX version of a given kernel will be used when it is invoked.
However, there is another way for CUB to do this.
We could instead use cudaGetDeviceProperties. The resulting cudaDeviceProp structure has cudaDeviceProp::major and cudaDeviceProp::minor members that give the major and minor versions of the current device's compute capability. Alternatively, we could use cudaDeviceGetAttribute and query cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor.

In addition, we would need to cache, internal to CUB, the list of architectures used to compile a particular TU (__CUDA_ARCH_LIST__) so we can select the closest arch to the compute capability of the current device.
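The arch-selection step could look something like the following host-side sketch. The helper name select_arch and its encoding (major * 100 + minor * 10, matching __CUDA_ARCH__) are hypothetical, not actual CUB internals:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not actual CUB code): given the current device's
// compute capability (major * 100 + minor * 10, as queried with
// cudaDeviceGetAttribute) and the architectures this TU was compiled for
// (e.g. from __CUDA_ARCH_LIST__), choose the best-matching PTX target:
// the highest compiled arch that does not exceed the device's compute
// capability, falling back to the lowest compiled arch otherwise.
int select_arch(int device_cc, const std::vector<int>& compiled_archs)
{
    int lowest = compiled_archs.front();
    int best   = -1;
    for (int arch : compiled_archs)
    {
        lowest = std::min(lowest, arch);
        if (arch <= device_cc && arch > best)
        {
            best = arch;
        }
    }
    return best >= 0 ? best : lowest;
}
```

For example, a TU compiled for {600, 700, 800} running on an sm_86 device (device_cc = 860) would select 800, mirroring how the driver picks the best available PTX target.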
Additional Context
It is generally recommended to avoid cudaGetDeviceProperties, which can be quite slow, and to instead use cudaDeviceGetAttribute to query the specific attribute of interest.

The compute capability can be queried through cudaDeviceGetAttribute via the cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor values of the cudaDeviceAttr enum. Even if we had to fall back to cudaGetDeviceProperties, its cost shouldn't be a serious issue since we cache the result anyway.
Just to make it more explicit -- the main difference between these approaches is that cudaFuncGetAttributes tells us which available PTX target will be used on the current device, while cudaDeviceGetAttribute returns the SM architecture of the device. As you mention, we can combine this information with the list of target architectures (e.g. __CUDA_ARCH_LIST__ on nvcc) to figure out which PTX target will be selected.
I'm in favor of finding a better solution than the empty_kernel approach, which has been troublesome for a variety of reasons. If the approach of querying the SM arch and refining with __CUDA_ARCH_LIST__ works, it sounds good to me and is worth experimenting with for 2.1.
Leaving this comment here as a reminder for ourselves that if we are going to change the approach here, we want to account for the ChainedPolicy pruning introduced in #2154 (more details in the comment thread on #2154).