Server-side restarts of tasks failed on clients #1461
If a task has failed on a client and it could potentially be recoverable on another, the server should replace the task group onto a new node.

Comments
Has there been any thought to putting this on the roadmap, or mitigating it some other way? Currently, if I shut down a Nomad client, say to upgrade Nomad or the OS, there is a good chance that some of its tasks will randomly fail when trying to start up on another node. The failure usually comes from a Docker race condition, but be that as it may, I would prefer not to have to resubmit my job just because one machine failed in order to get all of the tasks running again.
@a86c6f7964 It is something we are hoping to tackle in 0.6.0.
So not 0.7, 0.8?
I usually drain nodes for maintenance. This obviously doesn't work when a node fails unexpectedly, but for upgrades it has worked pretty well so far. Then again, I don't usually have multi-task task groups.
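For context, the drain workflow described above is driven by the Nomad CLI. A minimal sketch, assuming a Nomad 0.8+ cluster (older releases used the hyphenated `nomad node-drain` form instead):

```sh
# Look up the node ID of the client you want to take down.
nomad node status

# Mark the node as draining; Nomad migrates its allocations to
# other eligible clients before you stop the agent for maintenance.
nomad node drain -enable -yes <node-id>

# After maintenance, make the node schedulable again.
nomad node drain -disable -yes <node-id>
```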
Occasionally when I drain a node, some jobs won't be re-allocated and will remain in a dead state with "alloc not needed as node is tainted". It doesn't happen often, but when it does it quickly becomes a major issue (it's not easy to get visibility into these failures without building monitoring and alerting around everything). It's hard to say definitively that I'm running into this exact problem, but it certainly feels that way. Is there any update for this on the roadmap? This feels like a critical issue for me. The only relevant logs I could find, which may or may not be helpful:
My experience was even stranger: I had a job with multiple groups. After a node failed, some groups were relocated and some remained dead.
Per job, it would be nice to be able to set: if Nomad cannot see that a service task is running (unreachable), or sees that the task is gone, it should try to schedule it on another host.
Any update on this from a roadmap perspective? I've experienced node failures in AWS EC2, and this has bitten me a few times now. It has also happened during downscale events with an EC2 ASG: AWS only waits so long when terminating an instance before it forcefully kills everything. Given that a downscale first requires a drain (which can sometimes take minutes), AWS is almost always forcefully killing our Nomad clients. The desired behavior here is more or less a requirement for a scheduler; node failures are a guarantee in the cloud. Is there anything the community can do to help push this along? I'd take a stab at it myself, but it would take some time for me to get up to speed on the internals (though I'm willing to do so, if needed).
@SoMuchToGrok This is landing in Nomad 0.8. You can follow this branch for details if interested.
I hit similar issues when I upgrade my compute fleet in a serial manner, as @a86c6f7964 does. I hadn't thought of …
@preetapan the branch mentioned doesn't seem to exist anymore, and the …
@dkua the changelog mentions it as follows:

> …
@preetapan ah okay, thank you, that's great to know; I'll let my team know. I didn't notice it at first since the branch-to-follow 404s and #3981 doesn't reference this issue.
This was addressed with rescheduling in 0.8.
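For reference, the feature that resolved this is the `reschedule` stanza introduced in Nomad 0.8. A minimal sketch of a service job using it; the values are illustrative rather than recommendations, and the exact field set varies across 0.8.x releases, so check the reschedule docs for your version:

```hcl
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 2

    # Reschedule failed allocations onto other eligible nodes.
    # Values here are illustrative, not recommendations.
    reschedule {
      attempts       = 5       # give up after 5 tries ...
      interval       = "1h"    # ... within a sliding 1h window
      delay          = "30s"   # wait before the first reschedule
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = false
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.14"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```

With `unlimited = true` instead, Nomad keeps retrying on the backoff schedule rather than capping the number of attempts per interval.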