[question] What is the best way to run a job on all nodes while draining them? #9857
Comments
Normally there are two "knobs" you have: drain and eligibility. Eligibility is useful to toggle if you want to prevent scheduling of new tasks without draining the ones that are currently running. The case you have here is different: you want to run tasks on a node that isn't otherwise eligible for scheduling, and the scheduler isn't going to want to run workloads on ineligible nodes. But what you're trying to do might be possible with a little bit of cleverness, especially given your scenario. You could give jobs a
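To make the two knobs concrete, here is a minimal sketch that only builds the command lines for the `nomad node eligibility` and `nomad node drain` subcommands. The helper names and the node ID are hypothetical, and nothing is executed against a cluster.

```python
# Hypothetical helpers: they build argv lists for the Nomad CLI, nothing more.

def eligibility_cmd(node_id, eligible):
    """Toggle scheduling eligibility without touching running allocations."""
    flag = "-enable" if eligible else "-disable"
    return ["nomad", "node", "eligibility", flag, node_id]


def drain_cmd(node_id, enable):
    """Start or stop a drain, which migrates allocations off the node."""
    flag = "-enable" if enable else "-disable"
    return ["nomad", "node", "drain", flag, node_id]


if __name__ == "__main__":
    node = "example-node-id"  # placeholder, not a real node ID
    print(" ".join(eligibility_cmd(node, eligible=False)))
    print(" ".join(drain_cmd(node, enable=True)))
```

The resulting lists could be passed to `subprocess.run` on a machine where the `nomad` binary and a suitable token are available.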
There's a "system batch" scheduler type that's being worked on in #9160, likely to ship in Nomad 1.1. In the meantime you might be able to work around that with a batch job that has a count == the number of nodes, and the
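A rough sketch of the "count == the number of nodes" part of that workaround, assuming the node list has already been fetched from Nomad's `/v1/nodes` endpoint and that the job is submitted in the JSON job format. The job ID, datacenter, `raw_exec` driver, and script path are illustrative assumptions, not part of the original suggestion, and only the count is handled here.

```python
import json


def count_ready_nodes(nodes):
    """Count node stubs (as listed by /v1/nodes) whose Status is "ready"."""
    return sum(1 for n in nodes if n.get("Status") == "ready")


def build_batch_job(job_id, count, command):
    """Build a JSON-format batch job with one task group of `count` instances.

    The datacenter name and raw_exec driver are assumptions for illustration.
    """
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",
            "Datacenters": ["dc1"],
            "TaskGroups": [
                {
                    "Name": "maintenance",
                    "Count": count,
                    "Tasks": [
                        {
                            "Name": "update-deps",
                            "Driver": "raw_exec",
                            "Config": {"command": command},
                        }
                    ],
                }
            ],
        }
    }


if __name__ == "__main__":
    # Sample data shaped like a /v1/nodes response; the values are made up.
    nodes = [
        {"ID": "n1", "Status": "ready"},
        {"ID": "n2", "Status": "ready"},
        {"ID": "n3", "Status": "down"},
    ]
    job = build_batch_job("maintenance", count_ready_nodes(nodes), "/usr/local/bin/update-deps.sh")
    print(json.dumps(job, indent=2))
```

A count alone does not guarantee one instance per node, so this only covers half of the suggestion.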
Thank you, that makes sense. I’m also wondering if there is a way to do this by shifting part of the administrative responsibilities to the node itself. In that case,
If my understanding is correct, this requires that each node have rather broad management permissions on the cluster, but leaving that aside for a second, does this sound like a plausible scenario?
You can scope this down a bit by giving the administrative job a Nomad ACL token that has only
That could definitely work! You're relying on the notion that the jobs you care about will all finish, which is only going to be the case with batch workloads. But if that's the case, then you're all set and don't need to worry about draining.
Thank you very much. Once again, very insightful!
There is a job I would like to run on all machines from time to time; a simple maintenance script that will ensure the dependencies on my nodes are always up to date. To do this, I will first need to drain all nodes of all my production jobs.
While drained, I would like to run one instance of my maintenance job on every one of the nodes. When this is done, I will allow traffic back.
This question has two parts:
NOTE: I believe a system job is not what I need here. I need to be able to run it on demand, not triggered by any events like the node becoming ready (e.g. a restart).