Constantly increasing number of sockets inside nomad server or client in CLOSE_WAIT state when working with nomad http api #4604
The same happens on client nodes, except the logs are slightly different, because on a client the RPC pool is used (we don't allow debug logs on client nodes, so the info about the request timeout is not present),
and if we make many HTTP requests to the nomad API with timeouts we see many sockets like the following,
and in our case this looks like this:
This happens again and again; after we attach the delve debugger we find this (tons of goroutines):
The stack trace for one of them is:
For now we have added a 10 second timeout to RPC calls to prevent dead, stuck connections.
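For illustration only (this is not the actual patch we applied, and the helper name is hypothetical): one common way to bound an RPC made over a raw net.Conn is to set a deadline on the connection before the call, roughly like this.

```go
package rpcutil

import (
	"net"
	"time"
)

// rpcWithTimeout bounds a blocking RPC made over conn by setting a deadline
// before the call and clearing it afterwards, so a dead peer cannot hold the
// calling goroutine forever. callRPC stands in for whatever actually issues
// the RPC over conn.
func rpcWithTimeout(conn net.Conn, callRPC func() error) error {
	if err := conn.SetDeadline(time.Now().Add(10 * time.Second)); err != nil {
		return err
	}
	// Clear the deadline so a pooled connection can be reused afterwards.
	defer conn.SetDeadline(time.Time{})
	return callRPC()
}
```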
This happened again, but now we have more info about it. On the remote region to which the request is proxied, on one of the nomad servers, we attach a debugger to the working process of that server and see a huge count (as expected) of goroutines (output is reduced),
with the following backtrace on each:
So when we apply timeouts for RPC, we remove only part of the problem (the huge count of goroutines disappears only on the server that originated the request; each minute we collect stats from all allocations across all regions that we use). We monitor allocations with the following Python script (this is provided only as info, to better understand what we are doing specifically):

Also I must say that this happens only when we query allocations from a particular client, not from all of them. But when we query allocations for that client from other nomad servers in the same region, we don't see any issues.
Noting this looks closely related to #6620 so I've assigned this to myself.
After some time we patched yamux a little bit (hashicorp/yamux#79) and finally we think that the problem is not in yamux itself (yes, it has minor bugs, but they can't lead to this issue). Debugging on the server nodes with this issue points us at https://github.com/hashicorp/nomad/blob/master/helper/pool/pool.go#L277-L281
If the buffered channel used there fills up, we can get the behavior described here (its buffer is only 4). After 1-2 weeks I will write here whether our changes lead to a positive result.
@tantra35 Thank you for digging into it and following up with the report. I mistakenly thought this was addressed by the leadership flapping issue! Sorry! I'll dig into this one a bit more now. As for the channel buffer being 4 - we can increase the limit. Though, I would have expected a small buffer to be sufficient here - the loop should dequeue and start goroutines very fast, hopefully faster than the server issues RPC calls. Also, I may have noticed a question about
Also, I suspect the loop is buggy - it currently goes in a very tight loop when

```go
case session, ok := <-conns:
	if !ok {
		return
	}
	go c.listenConn(session)
```
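To make the shape of that loop concrete, here is a simplified, self-contained model of the pattern being discussed (hypothetical names, not Nomad's actual code): accepted sessions are handed to the listener over a small buffered channel, and the loop is expected to drain it quickly by spawning a goroutine per session.

```go
package listener

import "net"

// handleConn is a hypothetical stand-in for c.listenConn(session).
func handleConn(c net.Conn) {
	defer c.Close()
	// ... serve the RPCs arriving on this session ...
}

// listen models the dequeue loop: as long as it only has to receive from the
// channel and start a goroutine, even a small buffer (4, as mentioned above)
// should rarely fill up.
func listen(conns <-chan net.Conn, shutdownCh <-chan struct{}) {
	for {
		select {
		case session, ok := <-conns:
			if !ok {
				return
			}
			go handleConn(session)
		case <-shutdownCh:
			return
		}
	}
}
```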
@notnoop Increasing the buffer doesn't help (so our conclusion was wrong; in any case a warning should be placed there),

and can you clarify the following code:

```go
// Write the RpcNomad byte to set the mode
if _, err := stream.Write([]byte{byte(pool.RpcNomad)}); err != nil {
	stream.Close()
	return err
}
```

This implies that communication happens in that mode, but here the multiplex byte is written instead:

```go
// Write the multiplex byte to set the mode
if _, err := conn.Write([]byte{byte(RpcMultiplex)}); err != nil {
	conn.Close()
	return nil, err
}
```

How does this actually work in that case?
That's a very good question, and it does look suspicious - when working on
So I dug into this and it might be a red herring. In the Nomad client, the RPC handler expects every new yamux session to start with RpcNomad or RpcStreaming. The conns/streams in there is a bit confusing. I have added a more explicit test in a421a84 and confirmed that the test is passing: https://circleci.com/gh/hashicorp/nomad/46421 . One thing I'm noticing is that our handling for RPC errors misses closing the connection in some failures. I wonder if we ought to port the equivalent of #7045 here. Can you see if there is a correlation between [1]?

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L218-L230
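To make the mode-byte layering from the question above concrete, here is a simplified sketch (hypothetical constants and helper names, not Nomad's actual code): the multiplex byte is written once on the raw TCP connection, yamux then multiplexes streams over that connection, and each individual stream carries its own leading mode byte such as RpcNomad.

```go
package rpcdemo

import (
	"net"

	"github.com/hashicorp/yamux"
)

// Placeholder mode bytes - not Nomad's real constant values.
const (
	rpcMultiplex byte = 3
	rpcNomad     byte = 1
)

// dialMultiplexed shows the layering: one mode byte for the raw connection,
// then one mode byte per yamux stream opened on top of it.
func dialMultiplexed(addr string) (net.Conn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	// Mode byte for the raw connection: "everything after this is yamux".
	if _, err := conn.Write([]byte{rpcMultiplex}); err != nil {
		conn.Close()
		return nil, err
	}

	session, err := yamux.Client(conn, yamux.DefaultConfig())
	if err != nil {
		conn.Close()
		return nil, err
	}

	// Mode byte for an individual yamux stream: "this stream is a Nomad RPC".
	stream, err := session.Open()
	if err != nil {
		session.Close()
		return nil, err
	}
	if _, err := stream.Write([]byte{rpcNomad}); err != nil {
		stream.Close()
		session.Close()
		return nil, err
	}
	return stream, nil
}
```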
Hm, I'm a bit confused, and perhaps missing something, but I thought that
So the real stream handling happens in [2], and the passing test (to tell the truth, I don't fully understand what it is doing) looks confusing to me. In any case I won't argue with you, and I must dig further (it is only unclear where). And right now we aren't collecting

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L236-L330
It's pretty confusing indeed, and I could be wrong, but I love the additional data points and hypotheses you are raising. I think clients are special and we should document it. Clients use

I'm suspicious of the handling when the underlying connection goes bad. I'll dig into that path further.

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L49-L73
What about logs? "error performing RPC", "error performing streaming RPC", "streaming RPC error", and "RPC error" are the significant log messages.
But this hang, as I wrote earlier, correlates with a leader switch, and it seems that the hanging connections appear on the previous leader. For example, today this happened again; as you can see the leader switched from
If we debug this server, we discover that all the hanging connections are present on
We have the following backtrace of the hanging goroutine:
OK, let's examine frame 13:
As you can see,
OK, let's examine the session inside the stream and look at the connection info:
As you can see, the connection is between servers.
Also we found the following: when we replace https://github.com/hashicorp/nomad/blob/v0.10.4/helper/pool/pool.go#L271-L277 with the following lines of code (exactly equivalent, but without the if):
we constantly see a warn message in the logs nearly every 2 minutes. And this looks like nonsense, but when the code looks like this:
there are no warning messages in the logs, so I conclude that
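For context on why the two variants can behave differently, here is the generic Go pattern in question (hypothetical names only; the actual pool.go lines are not reproduced above): a send inside a select with a default branch never blocks, and the default branch (where a warning can be logged) fires whenever the receiver has not yet drained the channel, while a plain send simply blocks until the receiver is ready.

```go
package pooldemo

import (
	"log"
	"net"
)

// handOff contrasts the two send variants discussed above (hypothetical helper).
func handOff(streams chan net.Conn, s net.Conn, nonBlocking bool) {
	if nonBlocking {
		// Non-blocking send: if the listener has not drained the buffer yet,
		// the default branch runs and the stream is dropped with a warning.
		select {
		case streams <- s:
		default:
			log.Printf("[WARN] listener buffer full, dropping stream")
		}
		return
	}
	// Plain send: blocks until the listener receives; it never warns, but it
	// can also block forever if nothing is receiving on the other end.
	streams <- s
}
```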
Interesting. I find it very odd that
Also, thanks for the
We have fully separate nomad servers and clients, so the servers function only as servers.
OK - I suspect that's the bug - if
Looking at https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L373-L383, I suspect the following events took place:
This race should happen rarely, but I suspect that, considering the frequency of your leadership elections and the frequency of alloc stat calls, it's more likely for you to hit the problem. Does that seem plausible to you?
Sounds plausible, but it doesn't explain why
And about
Servers don't call Accept on the yamux session - only nomad clients do, IIUC. Nomad servers only accept the raw connection - so that would make sense. I will still do a bit of digging today and post my findings as well as a way to reproduce my steps.
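To illustrate the asymmetry described here (a simplified sketch with hypothetical names, not Nomad's actual code): a peer that multiplexes with yamux must run an accept loop to service incoming streams; if the remote side opens a stream toward a peer that never accepts, the opener's RPC never gets a response and its goroutine can block indefinitely.

```go
package acceptdemo

import "github.com/hashicorp/yamux"

// handleStream is a hypothetical per-stream handler.
func handleStream(s *yamux.Stream) {
	defer s.Close()
	// ... read the stream's mode byte and serve the RPC ...
}

// acceptLoop is the kind of loop only the accepting side runs: without it,
// streams opened by the remote peer are never serviced.
func acceptLoop(session *yamux.Session) {
	for {
		stream, err := session.AcceptStream()
		if err != nil {
			return // session closed or broken
		}
		go handleStream(stream)
	}
}
```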
@notnoop It seems that this is a full victory: for the last 12 days we haven't seen any problems with stat calls (no timeouts or anything else) and no leaked goroutines. Thanks
Nomad version
Nomad v0.8.4 (dbee1d7)
Issue
We have a script that periodically calls the nomad API (we request a nomad server, not a client) to gather some statistics about launched jobs. Our nomad setup is a federation between 4 regions (in nomad terminology). In this script we call the nomad API and, if the response takes too long (timeout), close the connection.
The code below demonstrates what we do (the requests Python lib is used):
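The original Python snippet is not reproduced here; as a rough stand-in, the same pattern (poll an endpoint with a client-side timeout and give up on the connection when it expires) looks roughly like this in Go - the host, port, endpoint and timeout value are illustrative assumptions, not the script's actual values:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Client-side timeout: if the server does not answer in time, the request
	// is aborted and the connection is abandoned from the caller's side only.
	client := &http.Client{Timeout: 10 * time.Second}

	// Hypothetical endpoint; the real script walks allocations across all
	// federated regions once per minute.
	resp, err := client.Get("http://nomad-server.example:4646/v1/allocations")
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```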
So when a timeout happens, we get an ever-increasing number of sockets inside the nomad process in the CLOSE_WAIT state, which live forever until the nomad server is restarted.
After some investigation, and after turning on DEBUG logs, we found the following in the logs.
As you can see, only after nomad begins to stop do the hung HTTP API connections become alive, and we see huge API request times - more than 10 minutes.
It seems that

```go
func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args interface{}, reply interface{}) error {
```

in helper/pool/pool.go hangs forever when an RPC is made (in particular when it is proxied to another region).