[DISCUSSION] Alternative solution for determining compute capability at runtime #898

Open
jrhemstad opened this issue Aug 19, 2022 · 2 comments
Labels
cub For all items related to CUB

Comments

@jrhemstad
Collaborator

jrhemstad commented Aug 19, 2022

Current Situation

As discussed in NVIDIA/cub#545, CUB needs to query the current device's compute capability in order to know which tuning policy to use for launching the kernel.

Currently, CUB does this by using cudaFuncGetAttributes on an EmptyKernel<void>.

As discussed in NVIDIA/cub#545, this runs into problems due to the nuanced relationship among the linkage of kernels, their enclosing function, and the architectures used to compile the TU. The end result is that we can end up getting a version of EmptyKernel with a different PTX version than we expect.

Proposal

The goal of the machinery described above is to determine which PTX version for a given kernel will be used when it is invoked.

However, there is another way for CUB to do this.

One option is cudaGetDeviceProperties. The resulting cudaDeviceProp structure has cudaDeviceProp::major and cudaDeviceProp::minor members that give the major/minor versions of the compute capability for the current device.

A cheaper option is cudaDeviceGetAttribute, querying for cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor.

In addition, we would have to cache, somewhere internal to CUB, the list of architectures used to compile a particular TU (__CUDA_ARCH_LIST__), so we can select the compiled arch closest to the compute capability of the current device.
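A minimal sketch of that selection step, assuming "closest" means the largest compiled arch that does not exceed the device's compute capability. The names select_closest_arch, to_arch, and example_archs are hypothetical and not part of CUB. In a real build the array would be initialized from __CUDA_ARCH_LIST__, and major/minor would come from the device query. The actual rule the driver uses to pick a PTX target may be more subtle than this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-TU arch list. With nvcc this could be
// initialized as `{__CUDA_ARCH_LIST__}`, which expands to comma-separated
// __CUDA_ARCH__ values such as 600,700,800.
static const int example_archs[] = {600, 700, 800};

// Combine a major/minor compute capability into the __CUDA_ARCH__ encoding,
// e.g. (8, 6) -> 860.
inline int to_arch(int major, int minor)
{
  assert(major >= 0 && minor >= 0 && minor < 10);
  return major * 100 + minor * 10;
}

// Return the largest entry of `archs` that does not exceed `device_arch`,
// or -1 if every compiled arch is newer than the device.
template <std::size_t N>
int select_closest_arch(const int (&archs)[N], int device_arch)
{
  int best = -1;
  for (std::size_t i = 0; i < N; ++i)
  {
    if (archs[i] <= device_arch && archs[i] > best)
    {
      best = archs[i];
    }
  }
  return best;
}
```

For example, select_closest_arch(example_archs, to_arch(8, 6)) picks 800, while a device older than every compiled arch yields -1, which a caller would need to treat as "no usable target".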

Additional Context

It is generally recommended to avoid using cudaGetDeviceProperties and to instead use cudaDeviceGetAttribute to query the specific attribute of interest, as cudaGetDeviceProperties can be quite slow.

The cudaDeviceAttr enum does have fields for this: cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor, so the compute capability can be queried through cudaDeviceGetAttribute.

Either way, I don't think the query cost will be a serious issue as we cache the result anyways.

@alliepiper
Collaborator

Just to make it more explicit -- the main difference between these approaches is that cudaFuncGetAttributes tells us which available PTX target will be used on the current device, while cudaDeviceGetAttribute will return the SM architecture of the device. As you mention, we can combine this information with the list of target architectures (e.g. __CUDA_ARCH_LIST__ on nvcc) to figure out which PTX target will be selected.

I'm in favor of finding a better solution than the empty_kernel approach, which has been troublesome for a variety of reasons. If the approach of querying the SM arch and refining with __CUDA_ARCH_LIST__ works, it sounds good to me and is worth experimenting with for 2.1.

@elstehle
Collaborator

Leaving this comment here as a reminder for ourselves that if we are going to change the approach here, we want to account for the ChainedPolicy pruning introduced in #2154 (more details in comment #2154 (comment)).
