ci: GH200 test instability #1600

edopao · 2024-08-02T09:40:18Z

The CI tests on GPU were observed to hang at random and eventually time out. To make the CI stable on GH200 nodes, this PR proposes two changes:

Reduce pytest parallelism by lowering the number of processes from 32 to 16 (note that the GPU tests rely on CUDA MPS)
Reduce the SLURM timeout so that hung jobs are detected early (the longest job takes 10 min)
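
For illustration only, the two changes can be mimicked locally with a small Python wrapper. This is a sketch under assumptions, not the contents of ci/cscs-ci.yml: it assumes the parallelism comes from the pytest-xdist `-n` option, and the helper names `build_pytest_command` and `run_with_timeout` are hypothetical. The real timeout is enforced by SLURM; here `subprocess.run(timeout=...)` stands in for it.

```python
import subprocess
import sys


def build_pytest_command(num_processes: int = 16) -> list[str]:
    """Build a pytest invocation; `-n` is the pytest-xdist worker-count option."""
    return [sys.executable, "-m", "pytest", "-n", str(num_processes)]


def run_with_timeout(command: list[str], timeout_s: float) -> str:
    """Run `command` and return "ok", "failed", or "timeout"."""
    try:
        result = subprocess.run(command, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so a hung test run is reported quickly instead of blocking the job.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


# Example call (the 15 minute limit is an assumed value, chosen only because
# it is somewhat above the 10 min taken by the longest job):
#   status = run_with_timeout(build_pytest_command(16), timeout_s=15 * 60)
```

A limit close to the longest expected runtime means a hang costs minutes rather than the full default allocation.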

Additional change:

Improve the hash of the Dockerfile used for caching the base image: it now also includes the build arguments
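
A minimal sketch of the idea, assuming the cache key is a digest over the Dockerfile content plus the build arguments. The function name `base_image_tag` and the argument names in the usage comment are hypothetical; the actual hashing in the CI configuration may be done differently.

```python
import hashlib


def base_image_tag(dockerfile_content: bytes, build_args: dict[str, str]) -> str:
    """Derive a cache key from the Dockerfile content and its build arguments."""
    digest = hashlib.sha256()
    digest.update(dockerfile_content)
    # Sort the arguments so the key does not depend on their order.
    for name in sorted(build_args):
        digest.update(b"\0")  # separator between entries
        digest.update(f"{name}={build_args[name]}".encode("utf-8"))
    return digest.hexdigest()[:16]


# Example call with made-up build arguments:
#   tag = base_image_tag(open("Dockerfile", "rb").read(), {"CUDA_VERSION": "12.4"})
```

With the build arguments in the key, changing an argument value produces a new tag, so a stale cached base image is not reused.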

ci/cscs-ci.yml

edopao · 2024-08-02T10:08:06Z

I need to check with @finkandreas if CUDA MPS on GH200 is configured with multi-GPU support, because that would be desirable in our configuration: https://docs.nvidia.com/deploy/mps/#mps-on-multi-gpu-systems

edopao · 2024-08-02T11:02:20Z

I need to check with @finkandreas if CUDA MPS on GH200 is configured with multi-GPU support, because that would be desirable in our configuration: https://docs.nvidia.com/deploy/mps/#mps-on-multi-gpu-systems

@finkandreas Please ignore this question; it is not relevant to us. From the docs:

When CUDA_VISIBLE_DEVICES is set before launching the control daemon, the devices will be remapped by the MPS server.

I was hoping that MPS could automatically dispatch the CUDA kernels across different devices, but this is not the case.
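
Since MPS does not spread kernels across devices by itself, each test process would have to select its device explicitly. The following is only a sketch of one possible approach and is not part of this PR. It assumes pytest-xdist, which sets the `PYTEST_XDIST_WORKER` environment variable (e.g. `gw0`, `gw1`) in each worker; the helper names and the `num_gpus` parameter are hypothetical. The indices are those visible to the client process, so if the MPS control daemon was itself started with `CUDA_VISIBLE_DEVICES`, they refer to the remapped devices.

```python
import os


def device_for_worker(worker_id: str, num_gpus: int) -> int:
    """Map a pytest-xdist worker id such as "gw5" to a GPU index, round-robin."""
    if not worker_id.startswith("gw"):
        return 0  # not running under pytest-xdist
    return int(worker_id[2:]) % num_gpus


def pin_worker_to_gpu(num_gpus: int) -> int:
    """Restrict the current process to one GPU based on its worker id."""
    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "master")
    device = device_for_worker(worker_id, num_gpus)
    # Must be set before the CUDA runtime is initialized in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    return device
```

Such a helper would typically be called at the top of a `conftest.py`, before any GPU library is imported.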

ci/cscs-ci.yml

Co-authored-by: Hannes Vogt <[email protected]>

havogt

Thanks!

Reduce pytest num processes

ebb38ec

havogt reviewed Aug 2, 2024

ci/cscs-ci.yml (outdated, resolved)

edopao added 2 commits August 2, 2024 12:08

Update config

b12ae82

Update config (1)

dcda262

edopao marked this pull request as ready for review August 2, 2024 10:53

edopao requested a review from havogt August 2, 2024 11:04

havogt reviewed Aug 2, 2024

ci/cscs-ci.yml (outdated, resolved)

Edit code comment

81d0c00

havogt reviewed Aug 2, 2024

ci/cscs-ci.yml (outdated, resolved)

Update ci/cscs-ci.yml

dc041af

Co-authored-by: Hannes Vogt <[email protected]>

edopao requested a review from havogt August 2, 2024 12:16

havogt approved these changes Aug 2, 2024

edopao merged commit bd4c48e into GridTools:main Aug 2, 2024
31 checks passed

edopao deleted the ci-gh200 branch August 2, 2024 12:35
