Task shall not be marked as complete when it's killed by node draining? #3691
Comments
What kind of job type do you have? Batch jobs currently won't be retried, AFAIK.
@jippi The job type is raw_exec. I set the retry policy of my batch job to 2 retries in 24h. From my observation, transient task/alloc failures did recover on the same machine or even on another machine.
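For reference, a retry policy like the one described would normally be expressed with a restart stanza in the job's task group. This is only a rough sketch: the attempts and interval follow the comment above, while the delay and mode values are assumptions.

restart {
  attempts = 2      # allow 2 retries...
  interval = "24h"  # ...within a 24-hour window
  delay    = "15s"  # wait between attempts (assumed value)
  mode     = "fail" # mark the task failed once attempts are exhausted (assumed)
}

Note that this stanza only governs restarts of a failed task on the node it is already running on; replacing an allocation that was stopped by a node drain is up to the scheduler, which is where the behavior discussed in this issue comes in.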
Any updates on this?
Hi @dukeland9, sorry you hit this! This was fixed in #3717 and is included in the binary from #3698. It should only occur when there aren't enough cluster resources to immediately replace a drained allocation, so if you're able to add more capacity before draining, that should work around the issue.
For example, using Nomad 0.7.1 and a simple batch job (see the sketch after the session output below): when I drained the node the batch job was running on, it exited and was rescheduled on the other node.
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad run batch.nomad
==> Monitoring evaluation "7200a490"
Evaluation triggered by job "sleeper"
Allocation "db9dbf7a" created: node "83b23692", group "sleeper"
Allocation "db9dbf7a" status changed: "pending" -> "running"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7200a490" finished with status "complete"
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status db
ID = db9dbf7a
Eval ID = 7200a490
Name = sleeper.sleeper[0]
Node ID = 83b23692
Job ID = sleeper
Job Version = 0
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 8s ago
Modified = 8s ago
Task "sleeper" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
0/100 MHz 23 MiB/300 MiB 300 MiB 0
Task Events:
Started At = 01/10/18 19:53:55 UTC
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
01/10/18 19:53:55 UTC Started Task started by client
01/10/18 19:53:55 UTC Task Setup Building Task Directory
01/10/18 19:53:55 UTC Received Task received by client
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad node-drain -enable 83b23692
Are you sure you want to enable drain mode for node "83b23692-afca-3199-f449-b32c380f0b9f"? [y/N] y
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status sleeper
ID = sleeper
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
7f7f8ed9 b64a527b sleeper 0 run running 27s ago 27s ago
db9dbf7a 83b23692 sleeper 0 stop complete 54s ago 27s ago
Sorry for the hassle! I'm closing this since it's fixed on master, but please reopen if you find that's not the case!
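The job file used in the reproduction above is not included in the thread; the following is only a hypothetical sketch of what such a batch.nomad might look like, assuming a raw_exec sleep task and the resource values shown in the status output (the job, group, and task names match the output above).

job "sleeper" {
  datacenters = ["dc1"]
  type        = "batch"

  group "sleeper" {
    task "sleeper" {
      driver = "raw_exec"

      # Sleep long enough to observe the drain behavior.
      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 100 # MHz, matches the 0/100 MHz shown above
        memory = 300 # MiB, matches the 23 MiB/300 MiB shown above
      }
    }
  }
}

The raw_exec driver is disabled by default and has to be enabled in the client configuration, and "dc1" is only a placeholder datacenter name.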
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.7.1
Operating system and Environment details
Ubuntu 14.04 & 16.04
Issue
I'm using Nomad to run distributed batch jobs on a ~30-machine cluster.
When I drain a node from the cluster, all running allocations on that node are killed and marked complete. As a result, the tasks in those allocations are never rescheduled.
Shouldn't the correct behavior be that the allocations on the draining node are killed, marked as failed, and then rescheduled on another node?