Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] CudaDeviceAPI::GetAttr may check kExist when GPUs absent #16903

Merged

Conversation

Lunderberg
Copy link
Contributor

This commit resolves a bug that was introduced in
#16377. If no CUDA-capable GPUs are present, the call to cudaGetDeviceCount will return an error, which will be raised as an exception by the CUDA_CALL macro. However, checking the kExist flag is valid even if no GPUs are present.

This commit removes the use of CUDA_CALL, and instead returns false in this case.

This commit resolves a bug that was introduced in
apache#16377.  If no CUDA-capable GPUs are
present, the call to `cudaGetDeviceCount` will return an error, which will
be raised as an exception by the `CUDA_CALL` macro.  However, checking
the `kExist` flag is valid even if no GPUs are present.

This commit removes the use of `CUDA_CALL`, and instead returns false
in this case.
@tqchen tqchen merged commit 7dc0472 into apache:main Apr 18, 2024
18 checks passed
@vinx13
Copy link
Member

vinx13 commented Apr 18, 2024

I found the issue is not about cudaGetDeviceCount, cudaGetDeviceCount doesn't return error, however, the issue is it doesn't check CUDA_VISIBLE_DEVICES setting but like other APIs do. When you set CUDA_VISIBLE_DEVICES to empty, checking kExist will return true, and other attrs will checked and cause no device error. I think we should recover the original check before #16377 that uses cudaDeviceGetAttribute and clear the error code if not exists

@tqchen
Copy link
Member

tqchen commented Apr 18, 2024

@vinx13 do you mind send a followup pr on this?

@Lunderberg
Copy link
Contributor Author

@vinx13 Thank you for the info on it. I'm wondering if that's a difference in driver versions, because I'm definitely seeing different behavior in cudaGetDeviceCount based on the environment variable CUDA_VISIBLE_DEVICES. The screenshot below shows the behavior I am seeing (prior to this PR), where checking kExist returns the correct output when CUDA_VISIBLE_DEVICES is used to hide some of the GPUs, but raised an exception when CUDA_VISIBLE_DEVICES is used to hide every GPU.

image

@vinx13
Copy link
Member

vinx13 commented Apr 18, 2024

it's not because cudaGetDeviceCount, when I ran the test, I found that there are multiple calls to GetAttr when the first kExist check succeed, the first call is to check kExist and it always return true, the subsequent call fails because of setting CUDA_VISIBLE_DEVICES=''

@Lunderberg Lunderberg deleted the bugfix_return_false_for_device_exists_check branch April 18, 2024 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants