When a Nomad client is stopped, the allocations on that client host are left running. So long as the client isn't offline long enough to be considered "lost", when the client restarts it rummages around in its local state store to recreate handles to the running tasks. If a task stops while the Nomad client is stopped (whether killed by the user or simply crashing), the Nomad client still has to restore that task when it comes back up. Any failure to do so is definitely a Nomad bug.
However, we've seen operators who remove the client data directory between restarts. There are two ways we've seen this go wrong:
If the client's data directory is removed while the client is shut down, the Nomad client has no way of recreating the handles to running tasks. This also means that Nomad can't shut down or restart those tasks, which can result in stale versions of applications being left running.
If the client's data directory is removed and the task containers are removed manually, but some other resource like an un-garbage-collected mount is left behind, this can prevent Nomad from scheduling the workload.
Many operators (typically those running on public cloud infra) replace the client host entirely during client upgrades. But those who do not should, generally speaking, not remove the data dir on the client; if they do, they need to be aware of all the resources that can be leaked. We don't have good documentation warning about this or giving guidance on it.
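For reference, a minimal client agent config sketch is below. The path is illustrative only; the point is just that whatever directory `data_dir` points at holds the client's local state store and must survive restarts and upgrades.

```hcl
# Minimal Nomad client agent configuration (illustrative path only).
# The directory referenced by data_dir holds the client's local state
# store, which is what lets a restarted client re-attach to tasks that
# kept running while it was down. It must persist across client
# restarts and upgrades; wiping it as part of an upgrade procedure is
# exactly the failure mode described above.
data_dir = "/opt/nomad/data"

client {
  enabled = true
}
```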