Stopped jobs are not properly stopped on all clients #2016
Sometimes, after stopping a job, it is not properly stopped on all clients: even when the alloc status is "complete", the process is still running.
Completely stopping the job without restarting doesn't solve this, and it keeps being run by the client.
I didn't experience this before 0.5.0.
Here's an example: alloc 0484fb00 is stopped, but still running on the client:
Nomad version
0.5.0
Operating system and Environment details
CoreOS stable
Comments
I've observed this as well. Can you please submit your alloc executor log and the Nomad client logs for the allocation? :)
Couldn't find anything interesting in the forwarded logs that I have...
Do you run the client in DEBUG mode? There don't need to be any errors per se, but debug logs will greatly help the team try to reverse-engineer the cause of it :)
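For reference, a minimal sketch of enabling debug logging on a Nomad agent; the -log-level flag and the log_level config key exist in Nomad, but the config path below is an assumption:

    # start the agent with debug logging (config path is an example)
    nomad agent -config /etc/nomad.d -log-level=DEBUG

    # or set it persistently in the agent configuration file (HCL):
    #   log_level = "DEBUG"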
It's not in debug mode. Rebooting the client's VM made that "stale" task disappear (maybe restart …).
Could you grab the output of …?
That alloc is gone... (I rebooted the VM). I actually did that when it happened, and it looked like a normal running allocation (nomad status showed it as "completed").
@bsphere If this happens again would you mind capturing that data? Do you have any steps to reproduce this behavior?
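If it happens again, a sketch of the data worth capturing before rebooting the VM; these are standard Nomad commands, with the IDs as placeholders:

    nomad alloc-status -verbose <alloc-id>      # allocation details and task events
    nomad logs <alloc-id> <task-name>           # task stdout captured by the executor
    nomad logs -stderr <alloc-id> <task-name>   # task stderr
    nomad node-status -verbose <node-id>        # state of the client the alloc ran on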
Same for me. I'll try to gather logs for a while. It seems that this incorrect behavior occurs after killing the Docker container of a task.
I think we've seen similar behaviour on 0.5.0 (Docker 1.12.3) when we accidentally updated a bit too many machines (don't ask ;)). It caused a rush of allocations and task reshuffling; that is what we think we saw before rolling back to 0.4.1.
Unfortunately we were unable to gather logs methodically due to the accidentally large rollout, but we will attempt again this week and let you know. EDIT: …
Something wrong just happened again; this is the alloc-status:
I stopped the job, and that alloc is still running...
Client logs (they're repeating, so I've included only part of them):
@bsphere This will help a lot, thanks! Have you found a way to reproduce this?
Seems that we have a similar problem in our test environment. After a loss of connectivity (a short downtime of a network switch) the cluster leader was lost, and on one of the servers we saw a job with an allocation in the pending state:
Our logs at the moment Nomad leadership was lost:
and from time to time the following messages:
We tried twice on the node with the problem to remove the existing container (…). And at last, as a test, we ran a drain. After the drain we see the following status:
and in the process list we see the following orphaned Nomad executor:
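For anyone else in this state, a hedged sketch of locating the leftovers before cleaning anything up; the executor shows up in the process list because Nomad re-invokes its own binary as "nomad executor", and the Docker driver names containers <task-name>-<alloc-id>:

    # leftover executor processes still running on the client
    ps -ef | grep "nomad executor" | grep -v grep

    # containers still running; compare names against live allocations
    docker ps --format "{{.Names}}"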
Bump
I believe this is happening to me as well. This was after a cluster-wide upgrade. Here was my process:
root@hive01:/home/server# nomad alloc-status 8305a110
ID = 8305a110
Eval ID = fc2072ca
Name = worker.api.worker[0]
Node ID = a396d9d7
Job ID = worker.api
Job Version = 1143
Client Status = complete
Client Description = <none>
Desired Status = stop
Desired Description = alloc not needed due to job update
Created = 17h42m ago
Modified = 30s ago
Task "worker-instance" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
1200 MHz 1.0 GiB 300 MiB 0
Task Events:
Started At = 2018-06-12T23:00:08Z
Finished At = 2018-06-12T23:08:10Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2018-06-12T16:08:10-07:00 Killed Task successfully killed
2018-06-12T16:08:05-07:00 Killing Sent interrupt. Waiting 5s before force killing
2018-06-12T16:00:08-07:00 Started Task started by client
2018-06-12T16:00:07-07:00 Task Setup Building Task Directory
2018-06-12T16:00:07-07:00 Received Task received by client
If I can provide more info, please let me know.
Replies to @orthecreedence:
Hm, I'm confused by this: Nomad does not have vote-only servers. Nomad 0.8 Enterprise supports non-voting servers: https://www.nomadproject.io/docs/agent/configuration/server.html#non_voting_server. Nomad always suggests an odd number of voting servers to ensure one side of a partition is able to form a quorum; for example, with 4 servers a 2/2 split leaves neither side with the 3 votes a quorum requires, while with 3 or 5 servers one side of any split can still reach a majority.
Restarting all clients should not have broken anything. While a rolling restart is safer, clients should have restarted and restored running allocations without issue. Can you provide more information on why they stopped working? Similar error messages to the other posters? Nomad 0.8 should fix some or all of the issues from above. Ideally we could open new issues for specific restart bugs.
This will cause all jobs to be rescheduled -- which is a fine way to recover except that the actual task processes/containers are likely still running along with their Nomad executors (sidecar processes for logging and containerization). Depending on the design of your services this might be fine. However, you will want to clean up your cluster at some point.
This is unfortunate, but likely just a useless log line. From the alloc status you posted, the allocation has been stopped, which is expected in this circumstance: all jobs are being rescheduled, which will create new allocations. Allocations can never transition from stopped back to running, so dropping the updates is fine. We're investigating why this occurred and how to improve the logging.
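On the cleanup point two replies up: a hedged sketch of what it could look like on an affected client, assuming the Docker driver and that you have confirmed a container no longer belongs to any live allocation (IDs are placeholders):

    # remove a confirmed-orphaned container
    docker rm -f <container-id>

    # then find and reap its leftover executor process
    ps -ef | grep "nomad executor" | grep -v grep
    kill <executor-pid>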
jfyi ... updating running Nomad agents causes issues for me too. More pronounced when I update and reboot than when I only update.
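For what it's worth, a sketch of a conservative rolling update, one client at a time; the systemd service name is an assumption:

    # on each client node, one at a time, after installing the new binary:
    systemctl restart nomad

    # verify the node re-registered and its allocations were restored
    nomad node-status
    nomad alloc-status <alloc-id>   # spot-check an alloc on that node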
Whoops, my bad. I thought I had configured non-voting, maybe I am losing my mind. Also, good tip on the odd servers. I'll make sure to change to 3 servers.
Probably not. I wasn't running the logs in debug mode and was pretty frantic trying to get everything running once the cluster emptied out. This isn't really my issue though, since everything is running fine now.
That's fine, but the log line keeps looping. I have dozens of allocations that, every minute, report as …. So really my issue is more along the lines of: how do I get Nomad to stop trying to update these allocs? Some of them are for jobs that have been purged, yet they keep showing up. Everything seems to be working normally, but I'm wondering if there's a way to stop these log messages from looping, or if this is a symptom of a larger problem. Thanks!
In my Nomad servers, I'm seeing this (minutely):
I wonder if this has something to do with the minutely …
@orthecreedence Likely, although completely unrelated to the original issue. The original issue should have been fixed a long time ago, and we forgot to close this issue! I'm going to close it now. Please open a new one and follow the template, providing as much logging as you can (you can attach zips and such to GitHub issues). A sample job status, alloc status, and job file for one of the allocs in the error messages you mentioned above would be useful as well. To anyone affected by the original issue: please open a new issue if something like this is still occurring. The bug should be fixed by now, or at the very least be displaying different task events and log messages.
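For anyone filing that new issue, a short sketch of gathering the data requested above; standard Nomad commands with IDs as placeholders:

    nomad status <job-id>           # sample job status
    nomad alloc-status <alloc-id>   # alloc status for one affected allocation
    nomad inspect <job-id>          # the job definition as the servers see it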
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |