-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pending allocations stuck in pending state after adoption by a new deployment #23305
Comments
Hi @lattwood and thanks for raising this issue with the detail included. I'll add this to our backlog to investigate further. In the meantime, if you discover any other information or concrete reproduction, please let us know. |
More log spelunking- after this shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs also starts occuring too.
|
grabbed a goroutine stack dump and found a clue. the same node is blocked here, and was for 1489 minutes, which ends up being just after the "error performing RPC to server" stuff in the above comment: https://github.com/hashicorp/yamux/blob/6c5a7317d6e3b6e7f85db696d6c83ed353e7cb4c/stream.go#L145-L153 Maybe timeouts aren't getting set on retries? idk. edit: the deserialization methods for a jobspec are also evident in the stack trace.
|
@jrasell I found the bug, it was in yamux. PR: hashicorp/yamux#127 |
I'm suspecting this bug is caused by the whole pending allocations |
Hi @lattwood, amazing work and thanks so much for digging into this and raising a PR. I'll poke around internally and see who I need to get a reviewers on the PR. |
@jrasell good stuff. i'm likely going to be cutting a 1.7.7-dev or 1.7.8 build with the yamux change and rolling it out on our side today, once i change the return to a goto. |
Just added some more bug/correctness fixes to the PR. |
https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork I threw a build up there with the yamux fix. |
Confirmed that hashicorp/yamux#127 fixes the issue. We'll be running a forked Nomad build until there's a release of Nomad with the yamux fix in it. edit: idk if I'm out of line for saying this, but the yamux fix probably should be backported everywhere it's in use 😅 |
Did you have TLS configured for Nomad's RPC? |
@schmichael yup. (our nomad agents are distributed around the world) |
Specifically to debug hashicorp/nomad#23305 but tests should probably run against multiple net.Conn implementations as yamux is sensitive to net.Conn behaviors such as concurrent Read|Write|Close and what errors are returned.
Just saw another issue opened on yamux this week that could be responsible for what we're seeing here as well- hashicorp/yamux#133 |
Specifically to debug hashicorp/nomad#23305 but tests should probably run against multiple net.Conn implementations as yamux is sensitive to net.Conn behaviors such as concurrent Read|Write|Close and what errors are returned.
hmm, i don't think so? heh.
…On Thu, Sep 5, 2024 at 3:05 PM Michael Schurter ***@***.***> wrote:
Closed #23305 <#23305> as
completed via hashicorp/yamux#127
<hashicorp/yamux#127>.
—
Reply to this email directly, view it on GitHub
<#23305 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPJNKA4TPGTSASFDO6LNHDZVCMPDAVCNFSM6AAAAABJFAB6ASVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJUGE2TCMRTGI2DKMY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
🤦 Sorry. Evidently Github treats "Should fix #..." as meaning "Absolutely does fix #..." Reopening. |
* test against tls.Conns, not pipes Specifically to debug hashicorp/nomad#23305 but tests should probably run against multiple net.Conn implementations as yamux is sensitive to net.Conn behaviors such as concurrent Read|Write|Close and what errors are returned. * locally closed streams can still read Effectively reverts 912e296 Reasons to revert to locally closed streams being readable: 1. Matches libp2p's yamux fork: https://github.com/libp2p/go-yamux/blob/master/stream.go#L95-L96 2. Both yamux and SPDY make it clear that a locally closed stream cannot send more data. 3. SPDY explicitly supports unidirectional streams where one peer is closed (readonly) from the beginning: https://www.chromium.org/spdy/spdy-protocol/spdy-protocol-draft3-1/#46-unidirectional-streams * test: fix timing when using tcp TestSession_PartialReadWindowUpdate asserts that large sends can cause window updates. When using io.Pipes the server recieves the window update nearly synchronously during the client's Read call. Using tcp sockets adds just enough latency to reliably lose the race for observing the window update on the server. Added a sleep to ensure the server has time to read and apply the window update from the client. I also added a number of comments and non-functional cleanups. The time.Sleep() is the only functional change. * test: expand connection types tested Expanded tests to use TCP and TLS. Sorry for the huge diff, but I think this makes yamux's test suite much more realistic. There are quite a few new tools: - testConnPipe is the original io.Pipe based testing tool. Some tests still require it due to the ability to easily pause the data flow. - testConnTCP and testConnTLS create TCP and TLS connections for yamux to use. This introduces more realistic timing issues, buffering, and subtle differences in error messages. - testConnTypes is a helper to run subtests against *all* the above connection types as well as ensuring reversing the client/server sockets doesn't impact yamux (it didn't!). I didn't convert every test to it since it adds some time and noise to test runs. I also tried to formalize (client, server) as a pattern. There was a mix of orderings. Those roles are rarely meaningful to yamux, but meaningful names makes debugging easier than numbering variables.
In the context of this- #23848 would it be worthwhile for us to try dropping in the libp2p yamux fork |
The main point of this dependency upgrade is to pull in the fixes in hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has been observed to improve the issue in #23305 but does not seem sufficient to fix it entirely. Since touching yamux is a rare and scary event, I do **not** intend to backport this. If we discover the improvements are stable and significant enough, or if further fixes land in yamux, backporting can be done at that time.
The main point of this dependency upgrade is to pull in the fixes in hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has been observed to improve the issue in #23305 but does not seem sufficient to fix it entirely. Since touching yamux is a rare and scary event, I do **not** intend to backport this. If we discover the improvements are stable and significant enough, or if further fixes land in yamux, backporting can be done at that time.
I think the hung Read is a symptom, not the cause. I think the cause is a timed out write that the keepalive interval on the underlying session. |
Nomad version
Operating system and Environment details
All Linux
Issue
nomad job scale JOB TG COUNT
when it changes.nomad job scale
command immediately afterwards.Reproduction steps
Expected Result
Actual Result
Screenshots of Absurdity
In this screenshot, all the allocs are part of the same job, but different task groups. The colours correspond to the actual task groups. Note the varying versions, and some having client and some not.
Additionally, you can see the
Modified
time is the same for ones staying up to date, but isn't changing on other ones- you can also see theCreated
times are all over the place.In this screenshot, you can see we have a pending allocation that has been rescheduled, and that rescheduled allocation is marked pending as well. And neither of the allocations have been assigned to a client as far as the Nomad WebUI is concerned.
Reporter's speculation
Maybe it has something to do w/ how the allocations are being adopted by the rapid deployments? This definitely reeks of race condition.
Nomad logs
Useless, unfortunately, due to this bug: #22431
The text was updated successfully, but these errors were encountered: