Errors after several node-drains and reboots #2328
Comments
@tantra35 I tried reproducing this on 0.5.4 and 0.5.5-rc1. Can you provide particular steps to reproduce?
All jobs on that node were tied to it, with
We performed the following sequence of steps:
then again we ran
and a third time we ran
and after that Nomad began failing with the messages above.
Can you give your client config and the job you were running?
nomad.json
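For reference, a minimal Nomad 0.5-era client configuration in JSON looks roughly like the sketch below; the paths and server address are placeholders, not the reporter's actual nomad.json:

```json
{
  "data_dir": "/var/lib/nomad",
  "bind_addr": "0.0.0.0",
  "client": {
    "enabled": true,
    "servers": ["10.0.0.1:4647"]
  }
}
```

The `data_dir` here is what holds the `alloc` and `client` state directories discussed later in this thread.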
some of our jobs:
Hi, I am afraid I am still facing this issue with 0.5.5-rc2 on a single machine.
Here are the journald logs:
@hmalphettes Can you paste the full logs so we can see the crash? So your reproduction steps were just to:
Also, was this a fresh install or an in-place update of the client? Was there existing data in the data-dir?
What was the status of the "cba86622-6a4e-315d-d292-19897fc6d354" alloc and the "marcus" task? Can you provide any logs related to them from the run prior to the error?
@dadgar, @schmichael my apologies for the fuzzy report. I certainly don't want to create some uncertainty and I am convinced 0.5.5-rc2 is fixing a number of those cases.
According to the logs, the alloc "cba86622-6a4e-315d-d292-19897fc6d354" was for the marcus task, and it failed. Then it was immediately GC-ed. The logs say Nomad was asked to shut down, and I cannot remember manually doing that; I was focused on making my code run and not looking much at Nomad. If the description of the scenario contains too much uncertainty, please disregard it and I'll provide something more reproducible next time. Here are the logs up to when Nomad stopped and failed to restart:
@hmalphettes Thanks for the extra details! No worries about remembering the exact steps. It would be useful in the future, but Nomad should never corrupt its state files. I'm adding some code now to try to defend against that a little better.
We're still unable to reproduce this bug, so if you can, we'd appreciate as much of the following as possible:
@schmichael, I figured out what was going on: I am wrapping my nomad calls in a bash function. At this point it behaved more like a chaos monkey that stresses how well Nomad can shut down and then get immediately restarted by systemd; Nomad in turn immediately restarts the jobs. The whole thing is fully automated on EC2 with some user-data. The fact that it still managed to start a cluster with 2 machines and run a bunch of server jobs on them successfully is more a testament to how well Nomad works than anything else. Anyway, if we wanted to reproduce this: Here is the systemd unit:
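A systemd unit along the lines described above, i.e. one that restarts the agent immediately on exit, might look like this sketch (paths and settings are assumptions, not the reporter's actual unit file):

```ini
[Unit]
Description=Nomad agent
After=network.target

[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# Restart immediately whenever the agent exits -- this is what makes the
# setup behave like a chaos monkey for shutdown/startup cycles.
Restart=always
RestartSec=0
# SIGINT asks the Nomad agent to shut down gracefully.
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target
```

With `Restart=always` and `RestartSec=0`, systemd gives the agent essentially no time between a shutdown and the next startup, which stresses the state-file handling discussed in this thread.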
As far as my use case is concerned, I am fine with nomad-0.5.5. Many thanks for your attention!
Aha! Very helpful information, thanks!
I'm unable to start the nomad agent on a node, and we are also seeing the
The directories are created when the agent is started, and if I manually
The logged stop looks like this:
Is this normal for an agent shutdown, with the RPC error etc.? It looks a bit odd... After what appears to be this failed shutdown, the server is left with running Docker containers, and the alloc dirs are full of old allocation directories from long-ago dead/completed/lost tasks. Any ideas?
Closing this ticket as it's quite old and the code in question is radically different in 0.9 vs 0.5/0.6. Please open a new ticket if you're running into similar issues in a recent version of Nomad.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.5.4
We performed node drains and reboots several times, and after that we got the following errors:
After that, Nomad would not launch any more, failing with the following errors:
This error happened again and again, with no chance of getting Nomad to launch normally. So we stopped the nomad agent, then deleted the alloc dir and the client dir, and finally started Nomad again; after that it worked perfectly.
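The recovery steps above can be sketched as follows. The data-dir path is a stand-in (a scratch directory here, so the sketch can be run safely), and the actual agent stop/start commands are left as comments since they depend on how Nomad is supervised:

```shell
# Simulate the recovery against a scratch copy of the Nomad data dir
# (the real path would be whatever data_dir is set to, e.g. /var/lib/nomad).
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/alloc" "$DATA_DIR/client"

# 1. Stop the agent first, e.g.: systemctl stop nomad
# 2. Delete the client state so the agent starts fresh:
rm -rf "$DATA_DIR/alloc" "$DATA_DIR/client"
# 3. Start the agent again, e.g.: systemctl start nomad

ls -A "$DATA_DIR"
```

Note that deleting these directories throws away all local client state, so any allocations the node was still tracking are lost; it is a last resort when the state files are corrupted, not routine maintenance.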