Safely running jobs with shared data #10366
Comments
Hi @NomAnor! This is an interesting use case and, as you point out, other people have hit it. Anyone else coming across this issue, please feel free to 👍 it and/or leave your use case in the comments! As you have discovered, ideally the shared storage would prevent multiple writers, but I know that's not always the case and often out of your hands. I think what Nomad could do is add a new jobspec parameter like `…`. However, when `…` … Does that seem like it would solve your issue?
We use system-type jobs and have that problem too. When rescheduling is disabled (or for system-type jobs), I think Nomad could check the allocation status and reuse it instead of completing it and starting a new one.
…19101) This commit introduces the parameter `preventRescheduleOnLost`, which indicates that the task group can't afford to have multiple instances running at the same time. In the case of a node going down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run. If `max_client_disconnect` is also enabled and there is a reschedule policy, an error will be returned. Implements issue #10366. Co-authored-by: Dom Lavery <[email protected]> Co-authored-by: Tim Gross <[email protected]> Co-authored-by: Luiz Aoqui <[email protected]>
Hi @NomAnor! This feature is now available in Nomad 1.7. Thanks to @DominicLavery for his work on the first implementation 💟
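For reference, a minimal jobspec sketch of how the new behavior might be enabled. The snake_case spelling `prevent_reschedule_on_lost` and its group-level placement are assumptions based on the commit message above; check the Nomad 1.7 jobspec documentation for the exact name and location:

```hcl
job "legacy-app" {
  datacenters = ["dc1"]

  group "writer" {
    # Assumed parameter per the commit above: when the node running this
    # allocation is lost, mark the alloc "unknown" instead of scheduling
    # a replacement, so only one writer ever touches the shared storage.
    prevent_reschedule_on_lost = true

    task "app" {
      driver = "docker"

      config {
        # Illustrative image name.
        image = "example/legacy-app:latest"
      }
    }
  }
}
```

If the lost node later reconnects, the unknown allocation resumes rather than being replaced, matching the behavior described in the commit above.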
Proposal
Implement a configuration option to disable automatic rescheduling of lost tasks, and a way to reschedule such allocations explicitly with a command.
Use-cases
I'm looking to migrate some legacy applications from a Pacemaker cluster to Nomad. The applications use shared storage, so only one instance is allowed to run in the cluster. This is only a small 3-node internal cluster; I mostly look at Nomad as an easier way to run containers and VMs.

As far as I can tell, Nomad automatically reschedules tasks when they are marked as lost because a client is disconnected from the servers. But the servers don't know whether the task is still running, and scheduling a new instance is fatal for the shared data.
`stop_after_client_disconnect` only works if the client agent is still running, which it might not be. If automatic rescheduling can be disabled, an operator can ensure that no instance is running and then reschedule/remove the lost allocation, which would allow the cluster to start the instance somewhere else. This is a manual step and would lead to an outage for that service, but in this case that is exactly what I want, because the alternative would be data corruption. Ideally there would be a way to automatically ensure that no instance is running, like the STONITH configuration in Pacemaker, but that might be out of scope for Nomad.
In #2185 (comment) someone mentioned a similar problem.
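For context, `stop_after_client_disconnect` is set at the group level. A sketch of how it is typically used (the duration value is illustrative):

```hcl
group "writer" {
  # Ask the client agent to stop this group's allocations after it has
  # been disconnected from the servers for this long. As noted above,
  # this only helps while the client agent itself is still running.
  stop_after_client_disconnect = "5m"

  task "app" {
    driver = "docker"

    config {
      # Illustrative image name.
      image = "example/legacy-app:latest"
    }
  }
}
```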
Attempted Solutions
I tried setting `restart.attempts = 0` and `reschedule.attempts = 0`.
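Spelled out as a jobspec fragment, the attempted configuration looks roughly like this (a sketch; the `mode` and `unlimited` values are shown because, to my understanding, service jobs default to unlimited rescheduling, so `attempts = 0` alone is not enough):

```hcl
group "writer" {
  # Never restart a failed task in place on the same client.
  restart {
    attempts = 0
    mode     = "fail"
  }

  # Never reschedule the group onto another node. "unlimited" must be
  # disabled explicitly, otherwise "attempts" is ignored.
  reschedule {
    attempts  = 0
    unlimited = false
  }
}
```

Even with both set to zero, allocations on a lost node are still replaced, which is what this issue asks to make configurable.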