Dead service after Nomad cluster restart #5919
Comments
I was also able to reproduce this behavior in a small three-node cluster in AWS. Job HCL reschedule:
I assume that Nomad did not continue rescheduling after the failure (I waited 30 minutes):
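For reference, a reschedule policy that keeps retrying indefinitely would look roughly like the sketch below. This is illustrative Nomad job-spec syntax; the group name and values are assumptions, not the reporter's actual configuration:

```hcl
group "app" {
  # Illustrative reschedule policy (values are assumptions):
  # retry forever, backing off exponentially from 30s up to 1h.
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
}
```

With `unlimited = false` (or an explicit `attempts`/`interval` pair), Nomad stops rescheduling once the attempts are exhausted, which would leave the job dead as described here.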
I am attaching the Nomad config used in a small three-node cluster in AWS.
@jozef-slezak What does the output of … show?
By the way, when this happens I need to start the dead job manually (I just click the Start button in the console). I would appreciate it if Nomad simply continued retrying. The pending process displays Queued=1. I tried job stop & start, but it did not help. Restarting Nomad itself recovered it, and it is running now; it should also recover automatically. Maybe Nomad could restart the driver/plugin process. Could you at least suggest a workaround (for example, a watchdog or restarting the process via systemd) until a fix is available?
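One stopgap along the lines suggested above is to have systemd restart the Nomad agent whenever it exits abnormally. The fragment below is a sketch with assumed paths and timings, not a recommended production unit:

```ini
# /etc/systemd/system/nomad.service (fragment; paths and values are assumptions)
[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# Restart the agent automatically if it crashes or exits non-zero.
Restart=on-failure
RestartSec=5s
```

Note that this only restarts the Nomad agent process itself; it does not, by itself, re-run a job that Nomad has already marked dead.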
Could you please run the same test as @cgbaker did in #5917: repeat restarting the three-node cluster until you reproduce this behavior (a dead job after restart).
I wasn't able to reproduce this with separate servers and clients. I did see some trouble that may be resolved by #5906; I will try to repro with that build.
@cgbaker, this issue is hard to reproduce. I believe it is related to "Error was unrecoverable" (see the screenshots above).
Today I was able to reproduce this buggy behavior on a single node (just by running sudo systemctl stop nomad and sudo systemctl start nomad). Please check the evals below.
Update: forget this, I had the wrong Nomad binary. I am still curious how you have the Nomad systemd service's KillMode configured.
@jozef-slezak: I attempted to repro this on a single node (client+server) with the latest Nomad.
Thanks for your cooperation. I will check the KillMode tomorrow. We are using the raw_exec driver instead of the Docker driver. This issue is about a dead (never even started) process, not an interrupted one. I should probably submit a test script and try with the latest Nomad.
We have KillMode=control-group. @cgbaker, what is your opinion? Please compare …
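For context on the comparison being discussed: the two relevant systemd settings differ in which processes are killed when the unit stops. The fragment below sketches standard systemd semantics; treating KillMode=process as the alternative under discussion is my assumption:

```ini
[Service]
# control-group (the default): on stop, systemd kills every process in the
# unit's cgroup, including raw_exec task processes spawned by the Nomad agent.
KillMode=control-group

# process: systemd kills only the main nomad process, so child task
# processes can survive an agent restart.
#KillMode=process
```

This matters for raw_exec tasks in particular, since they run as children of the agent rather than inside a container runtime.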
I will do more testing with either |
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep it open and take another look. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
We are restarting a Nomad cluster (3 Nomad servers and tens of Nomad clients, all physical machines). Same test scenario as #5917 and #5908, but a different bug report.
Please check if this is related to #5669.
Nomad version
0.9.3
Operating system and Environment details
CentOS Linux
Issue
Service is not running after Nomad cluster restart.
Reproduction steps
Restart all nomad servers and clients (sudo reboot).
Check the Nomad console: sometimes services that were running before the restart are shown as DEAD.