ci: GH200 test instability #1600
I need to check with @finkandreas whether CUDA MPS on GH200 is configured with multi-GPU support, because that would be desirable in our configuration: https://docs.nvidia.com/deploy/mps/#mps-on-multi-gpu-systems
@finkandreas Please ignore this question. No, it is not relevant to us. From the docs:
I was hoping that MPS could automatically dispatch the CUDA kernels across different devices, but this is not the case.
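Since MPS does not spread kernels across devices by itself, each test process would have to pick its device explicitly. A minimal sketch of one way to do that, assuming the tests run under pytest-xdist (whose worker ids look like `gw0`, `gw1`, ...) and assuming 4 GPUs per node; the helper names `device_for_worker` and `pin_worker_to_device` are hypothetical and not part of this PR:

```python
import os


def device_for_worker(worker_id: str, num_devices: int) -> int:
    """Map a worker id such as 'gw3' to a device index, round-robin.

    Ids without digits (e.g. 'master' when xdist is not distributing)
    fall back to device 0.
    """
    digits = "".join(ch for ch in worker_id if ch.isdigit())
    index = int(digits) if digits else 0
    return index % num_devices


def pin_worker_to_device(worker_id: str, num_devices: int, env=None) -> int:
    """Restrict the current process to a single GPU via CUDA_VISIBLE_DEVICES.

    This must run before any CUDA context is created in the process,
    otherwise the variable has no effect.
    """
    env = os.environ if env is None else env
    device = device_for_worker(worker_id, num_devices)
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return device


if __name__ == "__main__":
    # PYTEST_XDIST_WORKER is set by pytest-xdist in worker processes.
    worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    # num_devices=4 is an assumption for illustration only.
    print(pin_worker_to_device(worker, num_devices=4))
```

With this mapping, workers `gw0` to `gw3` land on distinct devices and `gw4` wraps around to device 0, so several processes may still share one GPU (which is where MPS would come in).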
Co-authored-by: Hannes Vogt <[email protected]>
Thanks!
The CI tests on GPU were observed to hang at random and eventually time out. To make CI stable on GH200 nodes, this PR proposes two changes:
Additional change: