If a prerun hook fails when restoring alloc state, as with a client agent restart, tasks don't get fully cleaned up and may leave orphan resources like a running container and network configuration (e.g. iptables rules).
This was pointed out in #13028, where a CSI prerun hook specifically fails, but it is a more general issue with alloc runner prerun hooks.
I encountered it myself while investigating that issue. As @ygersie put it:
"The worst thing is that Nomad garbage collects the failed allocation but doesn't actually shutdown the docker container (checked the docker logs it never received the api call to stop it either), leaving a zombie container."
Reproduction steps
I made a strange hook that lets me poison an alloc on disk, so that it succeeds on the first pass but fails after a client agent stop/start.
Expected Result
All of the failed task's resources are cleaned up.
Actual Result
The task is marked as failed and dead and gets replaced, but the old container remains running.
Also, if the task uses a static port, the replacement will fail to start because the port is still held by the "failed" task.
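The gap between the expected and actual results can be illustrated with a toy model. This is only a sketch under stated assumptions, not Nomad's actual code: `allocRunner`, `task`, `runBuggy`, `runWithCleanup`, and `orphans` are hypothetical names invented for illustration, and the "container" is just a boolean flag. It assumes the failure mode is an early return from a failed prerun hook that skips killing the restored tasks.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a hypothetical stand-in for a restored task with a live container.
type task struct {
	name    string
	running bool
}

// kill stops the task's container; in this toy model it only flips a flag.
func (t *task) kill() { t.running = false }

// allocRunner is a hypothetical, heavily simplified alloc runner.
type allocRunner struct {
	prerunHooks []func() error
	tasks       []*task
}

// runBuggy models the reported behavior: a prerun failure returns early,
// so tasks restored from disk are never killed.
func (ar *allocRunner) runBuggy() error {
	for _, hook := range ar.prerunHooks {
		if err := hook(); err != nil {
			return fmt.Errorf("prerun failed: %w", err)
		}
	}
	return nil
}

// runWithCleanup models the expected behavior: on a prerun failure,
// every task is killed before the error is reported.
func (ar *allocRunner) runWithCleanup() error {
	for _, hook := range ar.prerunHooks {
		if err := hook(); err != nil {
			for _, t := range ar.tasks {
				t.kill()
			}
			return fmt.Errorf("prerun failed: %w", err)
		}
	}
	return nil
}

// orphans returns the names of tasks that are still running.
func orphans(tasks []*task) []string {
	var out []string
	for _, t := range tasks {
		if t.running {
			out = append(out, t.name)
		}
	}
	return out
}

// newRestoredRunner builds a runner whose only task is already running
// (as after a client agent restart) and whose prerun hook always fails.
func newRestoredRunner() *allocRunner {
	return &allocRunner{
		prerunHooks: []func() error{
			func() error { return errors.New("poisoned alloc") },
		},
		tasks: []*task{{name: "web", running: true}},
	}
}

func main() {
	buggy := newRestoredRunner()
	_ = buggy.runBuggy()
	fmt.Println("orphans after buggy run:", orphans(buggy.tasks))

	fixed := newRestoredRunner()
	_ = fixed.runWithCleanup()
	fmt.Println("orphans after cleanup run:", orphans(fixed.tasks))
}
```

In this model the buggy path leaves the `web` task running even though the run reported a failure, which corresponds to the orphaned container described above; the cleanup path leaves no task running.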
@suikast42 There are some conceptual parallels, in that both cases result in stuff getting left behind unexpectedly.
However, in #17079 your logs indicate that task "Killing" is starting, which is one of the things that currently does not happen under the specific circumstances that cause this issue. Conversely, one thing that does happen appropriately in this case is the task stop hooks, which include service deregistration (edit: and alloc postrun hooks for deregistering group-level services).
So good keeping watch, but these cases are definitely unrelated!