Alloc Exec spontaneously disconnects after some time #10579
Comments
I'm having the same issue on Nomad 1.0.2. Each session lasts between 2 and 15 minutes before the disconnect, and every time I get a "failed to exec into task: disconnected without receiving the exit code" error on disconnect.
Also experiencing this intermittently for the past several months on various versions of Nomad, currently 1.0.4.
Ha, I also noticed this and thought my haproxy configuration was at fault. But it seems that Nomad alone also has a problem -- knowing that would have saved me a few hours of debugging :D
Hi folks! Looks like that error is bubbling up from I'm going to dig into this with my colleagues @isabeldepapel and @jsosulska tomorrow. |
Fantastic news, it's great to know that this isn't some heisenbug as it originally appeared to be.
Ok, so we spent a little time pairing on this and there's a few things we've come up with:
I've opened #10638 to make the problem a bit more debuggable. I ran |
Any updates on the issue? I can't tell from #10638.
I'm afraid we still need to dig into this issue more. #10657 ensures that the CLI reports the last few events (e.g. final output, exit code); it addresses the case where a completed command results in an "unexpected EOF" error. We aren't certain what causes the unexpected disconnects here. #10657 may highlight the cause more, though the issue might be a bug in the RPC forwarding logic, or the 10s heartbeat period may be too long in some environments. I haven't fully reproduced it, but I'll test it more this week. For context, I have a WIP branch that logs all exec messages and events in https://github.com/hashicorp/nomad/compare/b-debug-exec-2 . Hopefully that will highlight the issue, but I haven't caught it yet.
So I finally have a fix for this in #10710! Thank you @the-maldridge for reporting this issue. It's very unfortunate that we did not notice it internally. Without your GitHub reports, this issue would have kept haunting more users unnecessarily!
Track usage of incoming streams on a connection. Connections without reference counts get marked as unused and reaped in a periodic job. This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated unexpectedly. Previously, when a client's heartbeats switched between servers, the pool connection reaper would eventually identify the connection as unused and close it even if it had active exec/fs sessions. Fixes #10579
Clever fix, thanks for taking a look!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
Observed with Ubuntu and Void Linux clients, observed with macOS clients, observed with Alpine, Void, and ResinStack servers. All x64, tested with diversity in network path, underlying provider, and Nomad versions at both ends.
Issue
When using the `alloc exec` command, the shell connects as expected and interactive commands work great. After some time the session spontaneously disconnects. Sometimes it disconnects with an error such as `failed to exec into task: disconnected without receiving the exit code`. Sometimes it disconnects with no error printed at all.
Reproduction steps
`-i -t` options. After some time you will be disconnected.
Expected Result
I expect that unless an external and intentional factor is at play, the session should remain connected. Examples of external factors include but are not limited to: network path disruption, token expiry, user explicitly disconnecting the session.
Actual Result
The interactive exec session is disconnected without warning and with no discernible cause.
So far I have not found anything in the logs and the job being exec'd into seems to have no effect on how long the session survives. I think this may be some context internal to Nomad not being renewed, but I have not traced the execution yet.