Controller creates machines then deletes them inexplicably #1593
2 comments · 4 replies
-
Just to clarify, the observed behavior, both reserving machines and taking them down, is a set of actions taken by Slurm, not the HPC Toolkit. Here are the events I see in the logs:
I didn't see any red flags in the compute node logs prior to the missing subscription. Without knowing more, my guess is that some of the GPU nodes are taking longer than the allotted 5 minutes to boot. Once that timeout is hit, Slurm assumes something is wrong, kills the nodes, and tries again. I am not certain this is the case, but if boot time is tied to the specific machine and a reservation is used, it is possible that the same set of machines will always take longer to boot, sending them into an infinite cycle.

I would suggest extending the timeout for resuming nodes. You can adjust the timeout using the `ResumeTimeout` setting in `slurm.conf`.
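A minimal sketch of that change, assuming a standard `slurm.conf` layout on the controller (the exact file path, and whether your HPC Toolkit blueprint manages this value for you, are deployment-specific):

```
# /etc/slurm/slurm.conf on the controller (path may vary by deployment).
# ResumeTimeout is the maximum number of seconds slurmctld waits for a
# powered-up cloud node to register before declaring the resume attempt
# failed and tearing the node down. The 5-minute behavior described
# above corresponds to a value of 300.
ResumeTimeout=600
```

After editing, `sudo scontrol reconfigure` picks up the new value, and `scontrol show config | grep -i ResumeTimeout` confirms it. If the cluster was deployed with the HPC Toolkit, prefer setting the equivalent option in the blueprint if your controller module exposes one, so the change survives a redeploy.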
-
@yaroslavvb this issue seems resolved, so I will close the discussion. Thank you for your report!
-
I'm trying to create a cluster: the HPC Toolkit reserves the machines, then for some reason takes half of them down.
How do I troubleshoot this?
Below are the logs from the controller, from GCP (showing timestamps for machine creation/deletion), and from one of the killed machines (alpha6-ultra-ghpc-0):
Cloud logs:
https://www.dropbox.com/scl/fi/2ux2y8phhjdw32qhjavrh/jul16-cloud-logs.txt?rlkey=6zwnji0ie5f8xm0ogqkdylsto&dl=0
Controller logs:
https://www.dropbox.com/scl/fi/qa2oznc8a7hlc26wywvtg/jul16-controller-logs.json?rlkey=uzwou5ukcaef9bnrrmv13ahid&dl=0
Machine 0 logs:
https://www.dropbox.com/scl/fi/x0spjk19j58mr9vom13mh/jul16-machine0.json?rlkey=6wpp5g5hp2tm6lib39cpqhmev&dl=0