Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to start tasks "logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory" #14079

Closed
ktf opened this issue Aug 10, 2022 · 15 comments · Fixed by #14297
Closed

Comments

@ktf
Copy link

ktf commented Aug 10, 2022

Nomad version

Nomad v1.3.3 (428b2cd)

Operating system and Environment details

Centos 7

Issue

Today a given machine in our cluster was failing to run some services and the only message which was there is:

"logmon: fork/exec /usr/bin/nomad;62edd7f7: no such file or directory"

restarting the machine did not help. Then suddenly after I pushed a raw_exec batch job for test to the same machine, things got back into business. Any idea of what might have been going on?

Reproduction steps

No idea.

Expected Result

Having tasks running on that machine, as usual.

Actual Result

Tasks were not running.

@shoenig
Copy link
Member

shoenig commented Aug 10, 2022

Hi @ktf, can't say I've seen anything like this.

Were the failures coming from tasks being launched for the first time? Or were these tasks that Nomad was trying to reattach to? What task driver were these tasks using?

@ktf
Copy link
Author

ktf commented Aug 10, 2022

Thank you for your reply. No, the tasks have been there since a while and they have a constrain of the given machine. They are stateless long running services. They are using the exec driver. Could it be that some internal nomad state was not fully committed and that the scheduling of the raw_exec batch job cleaned up something? Does the error message mean that the executable /usr/bin/nomad is not found?

@shoenig
Copy link
Member

shoenig commented Aug 12, 2022

@ktf, was the nomad executable itself modified / replaced / upgraded while the tasks were running? The error message suggests os.Executable returned the literal string: /usr/bin/nomad;62edd7f7 (i.e. including the odd ;62edd7f7 suffix). I'm not aware of anything in Nomad code that would do that 🤔

https://github.com/hashicorp/nomad/blob/v1.3.3/client/logmon/plugin.go#L19

@ktf
Copy link
Author

ktf commented Aug 12, 2022

Yes, I suspect this correlates with the move from 1.3.2 to 1.3.3, which was done via yum update in the background. I will pin the versions for the future, but right now I am still unable to launch jobs on many of the machines (but not all).

@ktf
Copy link
Author

ktf commented Aug 12, 2022

for the record... I also tried adding a symlink to nomad;62edd7f7 and it still does not work.

@ktf
Copy link
Author

ktf commented Aug 12, 2022

also, the hash is different on every machine which does not work.

@shoenig
Copy link
Member

shoenig commented Aug 12, 2022

Can we get the exact kernel you're using via uname -a ?

@ktf
Copy link
Author

ktf commented Aug 12, 2022

Linux XXX.cern.ch 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@shoenig shoenig self-assigned this Aug 12, 2022
@ktf
Copy link
Author

ktf commented Aug 13, 2022

I am not 100% sure, but apparently:

https://github.com/golang/go/blob/946b4baaf6521d521928500b2b57429c149854e7/src/syscall/syscall_linux.go#L262

does not keep into account that readlink does not need to return a 0 terminated string. No?
Am I wrong?

@shoenig
Copy link
Member

shoenig commented Aug 15, 2022

I think the readlink call is okay - the buffer being passed in to store the result is created via Go's make which automatically zero-fills the values

https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/os/file_unix.go;l=370

I did however come across https://superuser.com/questions/1714705/bash-processes-with-deleted-link-targets-in-proc-pid-exe which describes exactly the same format you're seeing - /usr/bin/bash;61f1808a (deleted) (in our case, the (deleted) is being removed by the os.Executable implementation). They are also using CentOS 7 / a 3.10 kernel.

@ktf
Copy link
Author

ktf commented Aug 15, 2022

Ah, ok, sorry my go-fu is limited. :-)

Thank you for the pointer, I will also inquire with our VM experts, since one of the commonalities seems to be the fact it happens in both cases when running on an hypervisor.

That said, do you think you could simply strip whatever is right of ; on your side (just like you remove (deleted)?

@ktf
Copy link
Author

ktf commented Aug 15, 2022

I suspect this:

https://www.reddit.com/r/linux/comments/apmptq/comment/egcc313/

is also relevant. Apparently there is some container related vulnerability with /proc/self/exe and containers, which might be somehow worked around by RH.

@shoenig
Copy link
Member

shoenig commented Aug 15, 2022

That said, do you think you could simply strip whatever is right of ; on your side (just like you remove (deleted)?

Yeah I think that's likely where this is headed, it would just be so much more comforting if we could at least track down where the suffix is originating 😬

One question that came up in an internal discussion - you mentioned rebooting the machine didn't fix it - do you mean you continued to see the exact same issue - or was there some other error condition?

@ktf
Copy link
Author

ktf commented Aug 16, 2022

One question that came up in an internal discussion - you mentioned rebooting the machine didn't fix it - do you mean you continued to see the exact same issue - or was there some other error condition?

Exact same condition after a reboot (i.e. issuing "reboot" on a root shell). Notice this is a VM on OpenStack (I think the actual backend is QEMU), I do not know if that might be the reason. I have not tried an hard reboot from the OpenStack console.

I am pretty much convinced this is SELinux coming into play. What I can do on my side is to:

  • Hard reboot the VM
  • Hard reboot the VM disabling SELinux

and see if there is any improvements. I will try in a bit.

shoenig added a commit that referenced this issue Aug 24, 2022
This PR causes the logmon task runner to acquire the binary of the
Nomad executable in an 'init' block, so as to almost certainly get
the name while the nomad file still exists.

This is an attempt at fixing the case where a deleted Nomad file
(e.g. during upgrade) may be getting renamed with a mysterious
suffix first.

If this doesn't work, as a last resort we can literally just trim
the mystery string.

Fixes: #14079
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging a pull request may close this issue.

2 participants