Nomad 0.9: raw_exec task processes leaking from allocations stopped after agent restart #5593
Fixes #5593. The executor seems to die unexpectedly after the Nomad agent dies or is restarted; the crash occurs at the first log message emitted after the agent dies. To ease debugging we forward executor log messages to executor.log as well as to stderr. `go-plugin` sets up plugins with stderr pointing to a pipe that is read by the plugin client, the Nomad agent in our case [1]. When the Nomad agent dies, the pipe is closed, and any subsequent executor log write fails with ErrClosedPipe and a SIGPIPE signal. The SIGPIPE results in the executor process dying. I considered adding a handler to ignore SIGPIPE, but the hclog library currently panics when a logging write operation fails [2]. Thus we opt to revert to the v0.8 behavior of writing logs exclusively to executor.log while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
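For readers digging into the mechanism, here is a minimal, self-contained sketch of the failure mode described above. It is not Nomad's code and the names are illustrative: a child process whose stderr is a pipe is killed by SIGPIPE on its first write once the only reader of that pipe has gone away, which is what happens to the executor once the agent dies.

```go
// sigpipe_demo.go — illustrative only; assumes Linux/macOS pipe semantics.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Child: behaves like the executor, logging to stderr. Its stderr is a
		// pipe whose read end has already been closed, so this write raises
		// SIGPIPE and, because it targets fd 2, terminates the Go process
		// under the default signal disposition.
		fmt.Fprintln(os.Stderr, "first log line after the agent died")
		fmt.Println("never reached")
		return
	}

	// Parent: hand the child a pipe as its stderr (roughly what go-plugin does
	// for plugins), then close the read end instead of reading from it,
	// simulating the agent dying while the executor keeps running.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	r.Close() // nobody will ever read the child's stderr

	cmd := exec.Command(os.Args[0], "child")
	cmd.Stderr = w
	cmd.Stdout = os.Stdout
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	w.Close() // drop the parent's copy of the write end

	// Prints something like: child exit: signal: broken pipe
	fmt.Println("child exit:", cmd.Wait())
}
```

As the description above notes, suppressing SIGPIPE would only shift the problem: the failed write would surface as an error that hclog currently escalates to a panic, hence the revert to writing executor logs only to executor.log.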
@cheeseprocedure - We just released 0.9.1-rc which resolves this issue, available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. Would be great if you can try it out and let us know if it fixed what you saw.
Thanks @preetapan - I've been unable to reproduce these failures on 0.9.1-rc1!
Nomad version
Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)
Operating system and Environment details
Issue
tl;dr: processes launched by `raw_exec` tasks are orphaned if the Nomad agent restarts and a job stop/update then stops allocations on that host that were already live before the agent restart.
We use Nomad to run a simple guest VM supervisor process via `raw_exec`. After deploying Nomad 0.9 to a limited set of datacenters, a monitor started firing intermittently, indicating that an instance of our supervisor was running without a Nomad executor parent process. After some digging, we found this was happening to allocations started prior to a restart of the local Nomad agent. A while back, we tweaked our configuration management tooling as a workaround for #4413 so Nomad agents running in client mode would simply restart (regardless of whether TLS key material had been updated). Because of this behaviour, we often have allocations running that are older than the current Nomad agent process.
Allocations impacted by this issue have an event log much like this:
Despite Nomad reporting `Task successfully killed`, the process launched by the `raw_exec` driver is still running at this point.

When Nomad reports `Sent interrupt`, the allocation's process does receive the appropriate signal; our tooling kicks off graceful shutdown/eventual forced termination of our guest VMs, so the orphaned process doesn't stick around for long. (This is why our orphaned-process monitor was flapping.)

While submitting this issue, I realized the same failure mode occurs after restarting the Nomad agent and calling `nomad stop [job_name]` to stop that job's allocations on the agent's host. That's the case documented below.

Reproduction steps
Configs/resources
`/tmp/nomad_config.json`:
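(The attached config isn't reproduced above. As a rough stand-in, a minimal agent configuration that enables the `raw_exec` driver might look like the following; the values are illustrative, not the reporter's actual file.)

```json
{
  "log_level": "DEBUG",
  "data_dir": "/tmp/nomad-data",
  "server": {
    "enabled": true,
    "bootstrap_expect": 1
  },
  "client": {
    "enabled": true,
    "options": {
      "driver.raw_exec.enable": "1"
    }
  }
}
```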
`/tmp/testjob.sh`:
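(The attached script is also not shown. Given the `received interrupt but ignoring it` log line referenced in the steps below, a plausible stand-in is a script that traps SIGINT and keeps appending to `/tmp/testjob.log`.)

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the attached script: ignore SIGINT so the task
# survives a "Sent interrupt", and keep appending to the log so a leaked,
# still-running process is easy to spot.
LOG=/tmp/testjob.log
trap 'echo "$(date) received interrupt but ignoring it" >> "$LOG"' INT

while true; do
  echo "$(date) testjob pid $$ still running" >> "$LOG"
  sleep 1
done
```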
Procedure
1. `nomad agent -config /tmp/nomad_config.json`
2. `nomad run /tmp/testjob.hcl`
3. `tail -f /tmp/testjob.log` to confirm the allocation is live and generating output
4. Restart the Nomad agent
5. `nomad stop testjob`
6. `/tmp/testjob.log` should immediately include a `received interrupt but ignoring it` line from the current allocation, and `nomad status [alloc_id]` should include a `Sent interrupt` event.
7. A `Task successfully killed` event follows an RPC error with `all SubConns are in TransientFailure`, and `/tmp/testjob.log` should continue showing output from the still-running bash script (now with a parent PID of 1; see the check sketched below) launched by the supposedly-complete allocation:

2019-04-22T14:08:28-07:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin954529704: connect: connection refused\""
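A quick way to verify the leak after the allocation is reported complete (a convenience check, not part of the original report; the `pgrep` pattern assumes the job script named above):

```sh
# If the process leaked, the bash script is still running and has been
# re-parented to PID 1.
PID="$(pgrep -f /tmp/testjob.sh)"
ps -o pid,ppid,command -p "$PID"
```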
Job file (if appropriate)
`/tmp/testjob.hcl`:
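(The job file contents aren't shown above either. A minimal `raw_exec` job matching the commands in the reproduction steps would look roughly like this; the datacenter and group/task names are illustrative.)

```hcl
job "testjob" {
  datacenters = ["dc1"]
  type        = "service"

  group "testjob" {
    task "testjob" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["/tmp/testjob.sh"]
      }
    }
  }
}
```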
Nomad Client logs (if appropriate)
Nomad agent logs after a Nomad client restart and `nomad stop testjob`: