Backport of alloc exec: fix panics after stream close into release/1.7.x #19951

Merged


hc-github-team-nomad-core

Backport

This PR is auto-generated from #19932 to be assessed for backporting due to the inclusion of the label backport/1.7.x.

The below text is copied from the body of the original PR.


In #19172 we added a check on websocket errors to see if they were one of several benign "close" messages. This change inadvertently assumed that other messages used for close would not implement HTTPCodedError. When errors like the following are received:

msgpack decode error [pos 0]: io: read/write on closed pipe

they are sent from the inner loop as though they were a "real" error, but the channel is already being closed with a "close" message.

This allowed many more attempts to pass through a previously undiscovered race condition in the two goroutines that stream RPC responses to the websocket. When the input stream returns an error for any reason (for example, the command we're executing has exited), it will unblock the "outer" goroutine and cause a write to the websocket. If we're concurrently writing the "close error" discussed above, this results in a panic from the websocket library.

This changeset includes two fixes:

  • Catch "closed pipe" error correctly so that we're not sending unnecessary error messages.
  • Move all writes to the websocket into the same response streaming goroutine. The main handler goroutine will block on a results channel, and the response streaming goroutine will send on that channel with the final error when it's done so it can be reported to the user.

Fixes: #19506


In addition to the new unit test and associated websocket test infrastructure, I did some soak testing:

$ for i in {1..200}; do nomad alloc exec 9ee04c03 echo -n "."; done
........................................................................................................................................................................................................%



@tgross tgross merged commit eb799e6 into release/1.7.x Feb 12, 2024
19 of 20 checks passed
@tgross tgross deleted the backport/alloc-exec-closed/privately-trusty-locust branch February 12, 2024 15:02