
Safely running jobs with shared data #10366

Closed
NomAnor opened this issue Apr 12, 2021 · 3 comments
Comments

@NomAnor
Contributor

NomAnor commented Apr 12, 2021

Proposal

Implement a configuration option to disable automatic rescheduling of lost tasks, and implement a way to
reschedule such allocations explicitly with a command.

Use-cases

I'm looking to migrate some legacy applications from a Pacemaker cluster to Nomad.
The applications use shared storage, so only one instance is allowed to run in the cluster.
This is only a small 3-node internal cluster; I mostly look at Nomad because it makes running containers and VMs easier.

As far as I can tell, Nomad automatically reschedules tasks when they are marked as lost because
a client is disconnected from the servers.

But the servers don't know whether the task is still running, and scheduling a new instance is fatal for the shared data.
stop_after_client_disconnect only works if the client agent is still running, which it might not be.
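
For reference, this is the group-level option I mean; a rough sketch with placeholder job, task, and image names:

```hcl
job "legacy-app" {
  group "app" {
    # Stop the allocation on the client after it has been disconnected
    # from the servers for this long. This only helps while the client
    # agent itself is still alive to enforce it.
    stop_after_client_disconnect = "1m"

    task "app" {
      driver = "docker"

      config {
        image = "legacy-app:latest"
      }
    }
  }
}
```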

If the automatic rescheduling could be disabled, an operator could ensure that no instance is running and then reschedule/remove the
lost allocation, which would allow the cluster to start the instance somewhere else.

This is a manual step and would lead to an outage for that service, but in this case that is exactly what I want, because the
alternative would be data corruption. Ideally there would be a way to automatically ensure that no instance is running, like the STONITH configuration in Pacemaker, but that might be out of scope for Nomad.

In #2185 (comment) someone mentioned a similar problem.

Attempted Solutions

I tried setting restart.attempts = 0 and reschedule.attempts = 0.
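
Roughly like this (simplified sketch; the lost allocations were still replaced):

```hcl
group "app" {
  restart {
    # No local restart attempts on the client.
    attempts = 0
    mode     = "fail"
  }

  reschedule {
    # No rescheduling onto another node after failures.
    attempts  = 0
    unlimited = false
  }
}
```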

@schmichael
Member

Hi @NomAnor!

This is an interesting use case, and as you point out, other people have hit it. Anyone else coming across this issue, please feel free to 👍 it and/or leave your use case in the comments!

As you have discovered, restart.attempts and reschedule.attempts do not cover allocations becoming lost. You also correctly pointed out the limitation that stop_after_client_disconnect only works if the Nomad client agent is still running.

Ideally the shared storage would prevent multiple writers, but I know that's not always the case and is often out of your hands.

I think what Nomad could do is add a new jobspec parameter like group.reschedule_on_lost = true (name TBD). It would default to true to maintain backward compatibility and to cover the more common case of wanting to reschedule allocations on down nodes.

However, when reschedule_on_lost = false, the scheduler could treat lost allocations as running and wait for an operator to intervene. The next problem is that we lack an API to unblock allocations in that state and would need to add one, perhaps nomad alloc reschedule, much like we have nomad alloc restart for local restarts.
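
To sketch the idea (every name here is a placeholder and subject to change):

```hcl
group "db" {
  # Proposed, name TBD: never reschedule this group's allocations when
  # their node is marked lost; leave them for an operator to handle.
  reschedule_on_lost = false
}
```

An operator would then confirm the old instance is really gone and run something like the proposed nomad alloc reschedule command to let the scheduler place the allocation elsewhere.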

Does that seem like it would solve your issue?

@hungdqht

We use system type jobs and have this problem too. When rescheduling is disabled (or for system type jobs), I think Nomad could check the allocation status and reuse the existing allocation instead of completing it and starting a new one.

Juanadelacuesta added a commit that referenced this issue Dec 6, 2023
…19101)

This commit introduces the parameter preventRescheduleOnLost, which indicates that the task group can't afford to have multiple instances running at the same time. If a node goes down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocations will reconnect and continue to run.

If max_client_disconnect is also enabled and there is a reschedule policy, an error will be returned.
Implements issue #10366

Co-authored-by: Dom Lavery <[email protected]>
Co-authored-by: Tim Gross <[email protected]>
Co-authored-by: Luiz Aoqui <[email protected]>
@Juanadelacuesta
Member

Juanadelacuesta commented Jan 3, 2024

Hi @NomAnor! This feature is now available in Nomad 1.7.

Thanks to @DominicLavery for his work on the first implementation 💟
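
A minimal usage sketch, assuming the group-level jobspec name matches the linked commit (please check the Nomad 1.7 docs for the exact spelling and semantics):

```hcl
group "app" {
  # Assumed HCL name for the preventRescheduleOnLost parameter from the
  # linked commit: lost allocations stay unknown and are not replaced
  # on another node until an operator intervenes.
  prevent_reschedule_on_lost = true
}
```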
