Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix crash when executor parent nomad process dies #5598

Merged
merged 1 commit into from
Apr 23, 2019

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Apr 23, 2019

Fixes #5593

Executor seems to die unexpectedly after nomad agent dies or is
restarted. The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr. go-plugin sets up plugins with Stderr pointing to
a pipe being read by plugin client, the nomad agent in our case[1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor logs fail with ErrClosedPipe and SIGPIPE signal. SIGPIPE
results into executor process dying.

I considered adding a handler to ignore SIGPIPE, but hc-log library
currently panics when logging write operation fails[2]

This we opt to revert to v0.8 behavior of exclusively writing logs to
executor.log, while we investigate alternative options.

Logmon and dockerlogger have the same issue, but I punted on addressing them for another PR - the severity is less there, considering they are very light on logging:

  • logmon primarily logs at startup time and at termination/close, and virtually never in between
  • dockerlog logs on transient errors, but the effect is losing logs rather than containers being leaked.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323

@notnoop notnoop requested a review from nickethier April 23, 2019 13:46
Fixes #5593

Executor seems to die unexpectedly after nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe being read by plugin client, the nomad agent in our case[1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor logs fail with ErrClosedPipe and SIGPIPE signal.  SIGPIPE
results into executor process dying.

I considered adding a handler to ignore SIGPIPE, but hc-log library
currently panics when logging write operation fails[2]

This we opt to revert to v0.8 behavior of exclusively writing logs to
executor.log, while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
@notnoop notnoop force-pushed the b-dont-forward-logs branch from 7027b02 to c07c0c8 Compare April 23, 2019 13:53
@notnoop notnoop merged commit c76130c into master Apr 23, 2019
@notnoop notnoop deleted the b-dont-forward-logs branch April 23, 2019 14:15
notnoop pushed a commit that referenced this pull request Apr 23, 2019
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad 0.9: raw_exec task processes leaking from allocations stopped after agent restart
2 participants