When a Nomad client is stopped, the allocations on that client host are left running. So long as the client isn't offline long enough to be considered "lost", when the client restarts it rummages around in its local state store to recreate handles to the running tasks. If a task stops while the Nomad client is stopped (whether killed by the user or simply crashing), the Nomad client still has to restore that task when it comes back up. Any failure to do so is definitely a Nomad bug.
However, we've seen operators who remove the client data directory between restarts. There are two ways we've seen this go wrong:
If the client's data directory is removed while the client is shut down, the Nomad client has no way of recreating the handles to running tasks. This also means that Nomad can't shut down or restart those tasks, which can result in stale versions of applications being left running.
If the client's data directory is removed and the task containers are removed manually, but some other resource like an un-garbage-collected mount is left behind, this can prevent Nomad from scheduling the workload.
Many operators (typically those running on public cloud infra) replace the client host entirely during client upgrades. But those who do not should, generally speaking, not remove the data dir on the client; if they do, they need to be aware of all the resources that can be leaked. We don't have good documentation warning about this or giving guidance on it.
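For reference, a minimal client agent config sketch is below. The path is illustrative only; the point is just that whatever directory `data_dir` points at holds the client's local state store and must survive restarts and upgrades.

```hcl
# Minimal Nomad client agent configuration (illustrative path only).
# The directory referenced by data_dir holds the client's local state
# store, which is what lets a restarted client re-attach to tasks that
# kept running while it was down. It must persist across client
# restarts and upgrades; wiping it as part of an upgrade procedure is
# exactly the failure mode described above.
data_dir = "/opt/nomad/data"

client {
  enabled = true
}
```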