panic serving: concurrent write to websocket connection #19506
Comments
Hi @Kamilcuk! I've been able to repro this with a lot of attempts but don't have an obvious root cause. I know we touched the Alloc Exec API a good bit to support Actions in Nomad 1.7.0, so presumably there's something we broke there. We'll investigate and report back. |
Hi, I prepared the following reproducing script called
Execute I am consistently getting |
I can see the same sometimes (1.7.3 running on Ubuntu 22.04). Three nomad servers federated to another three and one client attached to each of those clusters (just a testbed). |
I can see the same using v1.5.11 running in Debian 11. I have this on my journalctl
|
This error specifically when using |
Hey folks, sorry about the delay on this. On the surface it looks like this was introduced in #19172 which shipped in Nomad 1.7.0 (with backports to 1.6.4 and 1.5.11), and there's definitely a bug in that PR, which I'll explain below. But that bug unfortunately doesn't explain the panic. The relevant blocks of code are
In #19172, we added a check if the error returned from decoding from the websocket was one of several benign "close errors". The trouble is that this check incorrectly assumed that errors other than those with valid websocket message error codes were of type
But that's not the cause of the panic! When I hit this error:
while running a build with Go's data race detection on, I see the following data race reported:
Which means these two writes are happening at the same time:
So that's puzzling. I've got a fairly straightforward fix for the error handling bug. What I'm going to try to do next is move the |
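For readers following along: the panic text in the issue title appears to match the message the gorilla/websocket library raises when two goroutines write to one connection at the same time, since that library allows only one concurrent writer. The sketch below is not Nomad's code and not the fix discussed in this thread. It is a minimal illustration, under that assumption, of one common way to serialize writes with a mutex. The names `safeConn`, `WriteFrame`, and `writeConcurrently` are hypothetical, and a `bytes.Buffer` stands in for the websocket connection.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// safeConn is a hypothetical wrapper that serializes writes to an
// underlying connection that does not tolerate concurrent writers.
type safeConn struct {
	mu sync.Mutex
	w  io.Writer
}

// WriteFrame writes one whole frame while holding the lock, so frames
// from different goroutines can never interleave.
func (c *safeConn) WriteFrame(p []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, err := c.w.Write(p)
	return err
}

// writeConcurrently starts n goroutines that each write one frame and
// returns everything that reached the underlying writer.
func writeConcurrently(n int, frame []byte) ([]byte, error) {
	var buf bytes.Buffer // stand-in for the websocket connection
	conn := &safeConn{w: &buf}

	var wg sync.WaitGroup
	errs := make(chan error, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := conn.WriteFrame(frame); err != nil {
				errs <- err
			}
		}()
	}
	wg.Wait()
	close(errs)

	// Report the first write error, if any.
	for err := range errs {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := writeConcurrently(100, []byte("frame\n"))
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println(len(out), "bytes written")
}
```

Running this with `go run -race` should report no data race. Removing the lock from `WriteFrame` makes the shared buffer an unsynchronized write target, which is the class of problem the race detector output above points at.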
Draft PR with the fix is here: #19932 but I'm working up a test for it before I mark that ready for review. |
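The error handling bug described earlier in the thread (a check that assumed a particular type for any error lacking a valid close code) is the kind of mistake a type-safe check avoids. The sketch below is not the code from #19932. It is a minimal illustration, using only the standard library, of classifying errors with `errors.As` so that unexpected error types are reported as "not benign" rather than being misclassified. The `closeError` type and `isBenignClose` function are hypothetical stand-ins. The close codes 1000 and 1001 come from RFC 6455.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// closeError is a hypothetical stand-in for a websocket close error
// that carries a status code.
type closeError struct{ Code int }

func (e *closeError) Error() string { return fmt.Sprintf("close %d", e.Code) }

// Close codes defined by RFC 6455.
const (
	closeNormal    = 1000 // normal closure
	closeGoingAway = 1001 // endpoint is going away
)

// isBenignClose reports whether err is (or wraps) a close error with an
// expected code. errors.As is used instead of an unchecked type
// assertion, so any other error type simply yields false.
func isBenignClose(err error) bool {
	var ce *closeError
	if !errors.As(err, &ce) {
		return false
	}
	return ce.Code == closeNormal || ce.Code == closeGoingAway
}

func main() {
	errs := []error{
		&closeError{Code: closeNormal},
		fmt.Errorf("decode: %w", &closeError{Code: closeGoingAway}),
		&closeError{Code: 1011}, // internal error: not benign
		io.ErrUnexpectedEOF,     // not a close error at all
	}
	for _, err := range errs {
		fmt.Printf("%v benign=%t\n", err, isBenignClose(err))
	}
}
```

The design choice here is to treat "unknown type" as the safe default: only errors positively identified as expected closes are swallowed, and everything else is left for the caller to log or return.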
#19932 has been merged and will ship in the next version of Nomad 1.7.x, with backports to 1.6.x and 1.5.x |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Operating system and Environment details
Archlinux.
Issue
I am executing a lot of nomad job exec API commands to test some stuff, and Nomad logs some panics:
When not running a -dev instance, the Nomad process sometimes terminates (!!!):
Reproduction steps
Run nomad 1.7.2 agent -dev.
Execute a lot of:
Expected Result
There should be no exceptions in logs.
Actual Result
There are exceptions in logs and occasionally Nomad process terminates.
Reproducible on 1.7.0, 1.7.1, 1.7.2.
Not reproducible on 1.6.3.
Job file (if appropriate)
Nomad logs
https://pastebin.com/5vhAPCB1
https://pastebin.com/ydJLgzHp