-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079
Comments
Hi @ktf, can't say I've seen anything like this. Were the failures coming from tasks being launched for the first time? Or were these tasks that Nomad was trying to reattach to? What task driver were these tasks using? |
Thank you for your reply. No, the tasks have been there since a while and they have a constrain of the given machine. They are stateless long running services. They are using the exec driver. Could it be that some internal nomad state was not fully committed and that the scheduling of the raw_exec batch job cleaned up something? Does the error message mean that the executable /usr/bin/nomad is not found? |
@ktf, was the https://github.com/hashicorp/nomad/blob/v1.3.3/client/logmon/plugin.go#L19 |
Yes, I suspect this correlates with the move from 1.3.2 to 1.3.3, which was done via yum update in the background. I will pin the versions for the future, but right now I am still unable to launch jobs on many of the machines (but not all). |
for the record... I also tried adding a symlink to nomad;62edd7f7 and it still does not work. |
also, the hash is different on every machine which does not work. |
Can we get the exact kernel you're using via |
Linux XXX.cern.ch 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
I am not 100% sure, but apparently: does not keep into account that readlink does not need to return a 0 terminated string. No? |
I think the readlink call is okay - the buffer being passed in to store the result is created via Go's https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/os/file_unix.go;l=370 I did however come across https://superuser.com/questions/1714705/bash-processes-with-deleted-link-targets-in-proc-pid-exe which describes exactly the same format you're seeing - |
Ah, ok, sorry my go-fu is limited. :-) Thank you for the pointer, I will also inquire with our VM experts, since one of the commonalities seems to be the fact it happens in both cases when running on an hypervisor. That said, do you think you could simply strip whatever is right of |
I suspect this: https://www.reddit.com/r/linux/comments/apmptq/comment/egcc313/ is also relevant. Apparently there is some container related vulnerability with /proc/self/exe and containers, which might be somehow worked around by RH. |
Yeah I think that's likely where this is headed, it would just be so much more comforting if we could at least track down where the suffix is originating 😬 One question that came up in an internal discussion - you mentioned rebooting the machine didn't fix it - do you mean you continued to see the exact same issue - or was there some other error condition? |
Exact same condition after a reboot (i.e. issuing "reboot" on a root shell). Notice this is a VM on OpenStack (I think the actual backend is QEMU), I do not know if that might be the reason. I have not tried an hard reboot from the OpenStack console. I am pretty much convinced this is SELinux coming into play. What I can do on my side is to:
and see if there is any improvements. I will try in a bit. |
This PR causes the logmon task runner to acquire the binary of the Nomad executable in an 'init' block, so as to almost certainly get the name while the nomad file still exists. This is an attempt at fixing the case where a deleted Nomad file (e.g. during upgrade) may be getting renamed with a mysterious suffix first. If this doesn't work, as a last resort we can literally just trim the mystery string. Fixes: #14079
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.3.3 (428b2cd)
Operating system and Environment details
Centos 7
Issue
Today a given machine in our cluster was failing to run some services and the only message which was there is:
"logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory"
restarting the machine did not help. Then suddenly after I pushed a raw_exec batch job for test to the same machine, things got back into business. Any idea of what might have been going on?
Reproduction steps
No idea.
Expected Result
Having tasks running on that machine, as usual.
Actual Result
Tasks were not running.
The text was updated successfully, but these errors were encountered: