dead task in job after node fail #1558
Comments
@tantra35 Can you paste the Nomad server and client logs? Also, can you share the steps to reproduce this?
Here are the logs from the client that accepted the job from the failed node:
I don't know how to reproduce this. It happened only once, but I think we can agree it is not normal for a task to be left in the dead state without being restored, and that only a stop/run returned everything to normal operation.
@tantra35 We will look into this. The logs you have shared are from a Nomad server. Please paste the logs of the client where the task was restarted so that we can follow the chain of events.
In our test environment we run the server and the client on the same node, so the logs I posted are all that exist on the node where the job was placed.
This is reproducible by having two tasks, one that will fail its artifact download and one that starts successfully |
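A minimal sketch of a job file that should match the reproduction described above. Everything in it is an assumption made for illustration and is not taken from the reporter's job: the job, group, and task names, the `exec` driver, the `sleep` command, the resource values, and the deliberately unreachable artifact URL. It also assumes both tasks live in the same task group.

```hcl
# Hypothetical reproduction job; all names and values are illustrative.
job "artifact-repro" {
  datacenters = ["dc1"]
  type        = "service"

  group "repro" {
    # This task has no artifact and should start successfully.
    task "healthy" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }

    # This task is expected to fail its artifact download,
    # because nothing should be listening on the source address.
    task "bad-artifact" {
      driver = "exec"

      artifact {
        source = "http://127.0.0.1:1/does-not-exist.tar.gz"
      }

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```

Submitting a file like this with `nomad run` and then inspecting the allocation with `nomad alloc-status` should show whether the failing task is marked dead while its sibling keeps running.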
The artifact fetching may be retried and succeed, so don't set the task as dead. Fixes #1558
Nomad version
Nomad v0.4.1-dev
commit: 044e067
Issue
After an incident in our test infrastructure (one server failed), some jobs have a dead task that is not restarted (nomad run does not make Nomad update the state of the dead task):
then we see alloc-status:
It seems that Nomad moved the job off the dead node (we shut that node down from the Linux console with shutdown -P now) and made a wrong decision: it took the fluend task state from the dead node and does not update it on the live node.
Our job file looks like this: