scaling issues due to prolog tagging api #34
Comments
A comment on this topic from the Slurm support team: "Considering the nature of this command, in that it needs to run in parallel with, but asynchronously from, the other prologs/epilogs, I think a SPANK plugin would fit better than a PREP plugin and avoid the need to write any non-trivial code. For instance, this is a popular plugin to use Lua with SPANK: I think slurm_spank_init_post_opt() is likely the function from which to call the tagging command."
Looking more closely, I noticed the loop in the prolog script. The prolog script runs on every compute node and at every step execution and
I think we can still keep this in the prolog: each node finds its own instance ID with curl and tags itself with a single call. For an n-node job there will then be only n calls to the tagging API. That is not as good as async tagging, but I think it is much better than what we have now.
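A minimal sketch of the self-tagging idea, written in Python rather than the shell of the actual prolog.sh, and not the code that was merged. It assumes the nodes are EC2 instances with IMDSv2 reachable, that the `aws` CLI is installed with permission to call `create-tags`, and that `SLURM_JOB_ID` is present in the prolog environment. The function names and the tag key in the usage note are hypothetical.

```python
import subprocess
import urllib.request

# EC2 instance metadata service; only reachable from inside an instance.
IMDS = "http://169.254.169.254/latest"


def get_instance_id(timeout=2.0):
    """Ask the metadata service (IMDSv2) for this node's own instance ID."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    id_req = urllib.request.Request(
        f"{IMDS}/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(id_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def build_tag_command(instance_id, tags):
    """Build a single `aws ec2 create-tags` call carrying every tag at once.

    Values containing commas would need extra quoting for the CLI shorthand
    syntax; that case is not handled here.
    """
    cmd = ["aws", "ec2", "create-tags", "--resources", instance_id, "--tags"]
    cmd += [f"Key={key},Value={value}" for key, value in tags.items()]
    return cmd


def tag_self(tags, timeout=30):
    """One metadata lookup plus one tagging call per node, with no loop."""
    subprocess.run(build_tag_command(get_instance_id(), tags),
                   check=True, timeout=timeout)
```

A prolog could then call something like `tag_self({"slurm-job-id": os.environ["SLURM_JOB_ID"]})` (after `import os`), so each node issues exactly one tagging request regardless of job size.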
I changed the prolog script to run as PrologSlurmctld, and any job larger than 30 nodes crashes. Then I tried this approach inside prolog.sh.
This works for 40 nodes; I will test with larger jobs too. I could not find a way to transport the comments yet, though.
Prolog script edits to reduce scaling issues reported in #34
We ran into a scaling issue with the tagging in the prolog script.
As I understand it, the prolog is run at every step, and when many nodes are involved the job fails with timeouts.
We need to find another place to do the tagging. I understand that the comment is job-related, but some other tags need to be applied only once, when the instances are created, whether they exist because of the min value in the configuration or are created by Slurm.
I am looking at places where this could be done.
Maybe it can be done on the head node instead, in the PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
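To illustrate what head-node tagging could look like, here is a rough sketch that groups instance IDs into batches so that one `aws ec2 create-tags` call covers many nodes. It assumes the head node already has some way to map the job's node names to EC2 instance IDs, which is not shown. The function names and the default batch size are hypothetical, and nothing here has been tested against the crash seen with PrologSlurmctld on larger jobs.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_tag_commands(instance_ids, tags, batch_size=200):
    """Build one `aws ec2 create-tags` command per batch of instance IDs.

    Job-level tags (for example the job comment) are identical for every
    node, so each batch carries the same tag list.
    """
    tag_args = [f"Key={key},Value={value}" for key, value in tags.items()]
    return [
        ["aws", "ec2", "create-tags", "--resources", *batch, "--tags", *tag_args]
        for batch in chunked(list(instance_ids), batch_size)
    ]
```

Each returned command could be run from the head node with `subprocess.run(cmd, check=True)`. A job of n nodes then needs roughly n / batch_size API calls instead of one or more per node.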