Controller creates machines then deletes them inexplicably #1593
2 comments · 4 replies
-
Just to clarify, the observed behavior, both reserving machines and taking them down, is a set of actions taken by Slurm, not the HPC Toolkit. Here are the events I see in the logs:
I didn't see any red flags in the compute node logs prior to the missing subscription. Without knowing more, my guess is that some of the GPU nodes are taking longer than the allotted 5 minutes to boot. Once that timeout is hit, Slurm assumes something is wrong, kills the nodes, and tries again. I am not certain this is the case, but if boot time is tied to the specific machine and a reservation is used, it is possible that the same set of machines will always take longer to boot, sending them into an infinite cycle.

I would suggest extending the timeout for resuming nodes. You can adjust the timeout using the `ResumeTimeout` setting in `slurm.conf`.
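A minimal sketch of that change, assuming a standard `slurm.conf` layout on the controller (the exact file path, and whether your HPC Toolkit blueprint manages this value for you, are deployment-specific):

```
# /etc/slurm/slurm.conf on the controller (path may vary by deployment).
# ResumeTimeout is the maximum number of seconds slurmctld waits for a
# powered-up cloud node to register before declaring the resume attempt
# failed and tearing the node down. The 5-minute behavior described
# above corresponds to a value of 300.
ResumeTimeout=600
```

After editing, `sudo scontrol reconfigure` picks up the new value, and `scontrol show config | grep -i ResumeTimeout` confirms it. If the cluster was deployed with the HPC Toolkit, prefer setting the equivalent option in the blueprint if your controller module exposes one, so the change survives a redeploy.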
-
@yaroslavvb this issue seems resolved, so I will close the discussion. Thank you for your report!
-
I'm trying to create a cluster: the HPC Toolkit reserves the machines, then for some reason takes half of them down.
How do I troubleshoot this?
Below are the logs from the controller, from GCP (showing timestamps for machine creation/deletion), and from one of the killed machines (alpha6-ultra-ghpc-0):
Cloud logs:
https://www.dropbox.com/scl/fi/2ux2y8phhjdw32qhjavrh/jul16-cloud-logs.txt?rlkey=6zwnji0ie5f8xm0ogqkdylsto&dl=0
Controller logs:
https://www.dropbox.com/scl/fi/qa2oznc8a7hlc26wywvtg/jul16-controller-logs.json?rlkey=uzwou5ukcaef9bnrrmv13ahid&dl=0
Machine 0 logs:
https://www.dropbox.com/scl/fi/x0spjk19j58mr9vom13mh/jul16-machine0.json?rlkey=6wpp5g5hp2tm6lib39cpqhmev&dl=0