nomad restarting services after lost state even with restart {attempts=0} #6212
Comments
@cbnorman This isn't currently possible, because a lost node means the server doesn't have accurate information about the status of the running task. We have been talking internally about use cases like yours, where an allocation should not be marked as failed when a node goes lost if the user has explicitly opted out of that behavior. We'll take your use case into consideration; we usually like to gather evidence and use cases before prioritizing a feature like this.
Facing similar issues in a test environment of ours. We have jobs that are allocated according to constraints on nodes which are static in nature. For example, Job A runs on nodes with the constraint "app" set to the value "A". So let's say we have two nodes for A and the job is deployed on them. After a heartbeat failure, Nomad reschedules the allocation, which lands back on the same node once it re-registers after a successful heartbeat. At that point the node still has the old allocation running, and the Nomad client also starts the newly rescheduled replacement allocation. Because of this we get an "address already in use" port conflict in our logs and our app goes into a failed state. Maybe the stop_after_client_disconnect parameter of the group stanza could have helped stop the old allocation, but since the heartbeat timeout period (a few seconds) is relatively small and the stop process takes some time, it leaves us with no chance to stop it before Nomad schedules a replacement.
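For reference, a minimal sketch of how that group-level parameter might be set; the group name, task name, and the "10m" value below are assumptions for illustration, not taken from this thread, and the parameter is only available in newer Nomad releases:

```hcl
# Fragment of a job specification (group sits inside a job stanza).
group "app" {
  # Assumed value: how long a disconnected client keeps its allocations
  # running before stopping them. It needs to outlast short network blips
  # but be short enough to avoid long-lived duplicate allocations and
  # port conflicts when a replacement is scheduled elsewhere.
  stop_after_client_disconnect = "10m"

  task "server" {
    driver = "raw_exec"
    # ...
  }
}
```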
Hi @narendrapatel!
That's a reasonable workaround if you expect to see some intermittent networking issues with the clients, but that's probably worth spending some time debugging in your environment as well.
Hi @tgross
Our network team has confirmed that the test environment will have network issues due to some throttling constraints. I have analyzed some heartbeat timeout logs for the agents and think a 90s grace extension should be a fair configuration for now. I'm also checking whether we can add some alerting around this. In addition, I'm increasing the open file limits for the Nomad servers, as I found some increase in usage there. Is there anything more I can add? Can you guide me here if you have some pointers, or if I am missing something? Also, a suggestion: could the Nomad leader ask the client to push details of all the allocations running on it before scheduling an allocation there? This could be used to avoid re-running an allocation on a client that came back after a missed heartbeat and already has the allocation running. If the running allocation does not match the latest job specification, the old allocation could be stopped and a new one scheduled if required.
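As a rough illustration, the grace extension mentioned above corresponds to the heartbeat_grace setting in the server stanza of the agent configuration; the file path and the enabled flag here are assumptions, and 90s is the value being considered in this environment:

```hcl
# Server agent configuration, e.g. /etc/nomad.d/server.hcl (path is an assumption).
server {
  enabled = true

  # Additional grace added on top of each client's heartbeat TTL before
  # the node is marked lost and its allocations are rescheduled.
  heartbeat_grace = "90s"
}
```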
Definitely take a look at the production requirements docs if you haven't already.
Depending on why this is important to you, you may want to look at
@tgross Yes, we have the given requirements met, except that this is a test environment and can have some network latency on and off.
I already checked that setting, but unfortunately the window between the heartbeat miss and re-registration is very short, and the process stop takes some time.
OK, going to close this issue in favor of the one you'll open.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug, please include the following:
Nomad version
Nomad v0.9.4 (a81aa84)
Operating system and Environment details
Debian 9
Issue
A small number of our services run in a remote datacenter utilising the raw_exec driver, connected to our Nomad cluster via a dedicated cloud connection. Most of the services are stateful and require a controlled shutdown to avoid data loss. We have therefore configured the jobs with:
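The original stanza isn't included above; based on the issue title, a minimal sketch of a group-level restart block that opts out of local restarts might look like the following (the group and task names are placeholders, and the mode value is an assumption):

```hcl
group "stateful" {
  restart {
    attempts = 0      # do not retry the task locally
    mode     = "fail" # mark the task as failed instead of delaying and retrying
  }

  task "service" {
    driver = "raw_exec"
    # ...
  }
}
```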
We have noticed that if there is a disconnect between the servers in the cloud and the clients in the datacenter, all services go into a lost state and continue to run locally, which is great. The problem is that when the clients reconnect to the servers, all the jobs are restarted.
Here is the nomad status for a test job:
Here are the logs from the client:
Is there any way to completely stop Nomad from restarting a job? As mentioned, the job functions fine while disconnected; it's only on reconnection to the servers that Nomad decides to restart it, even though the job has no restarts configured.