
Constantly increased sockets inside nomad server or client in CLOSE_WAIT state when working with nomad http api #4604

Closed
tantra35 opened this issue Aug 20, 2018 · 24 comments · Fixed by #7370


@tantra35
Contributor

tantra35 commented Aug 20, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)

Issue

We have a script that periodically calls the Nomad API (we query a Nomad server, not a client) to gather statistics about launched jobs. Our Nomad setup is a federation of 4 regions (in Nomad terminology). The script calls the Nomad API and, if the response takes too long (timeout), closes the connection.

The code below demonstrates what we do (using the Python requests library):

  l_allocStatusParams = {"region": _region, "node_id": _allocation['NodeID']}
  l_allocStatus = _session.get(NOMAD_ADDR + "/v1/client/allocation/" + _allocation["ID"] + "/stats", params=l_allocStatusParams, headers={"X-Nomad-Token": NOMAD_TOKEN}, timeout=2)

So when a timeout happens, sockets accumulate inside the Nomad process in CLOSE_WAIT state and live forever, until the Nomad server is restarted.
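For reference, here is a minimal Go sketch of the same client behavior (the address, token, region and IDs are placeholders; this is an illustration of the reproduction, not part of our tooling):

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// A short client-side timeout aborts the HTTP request and closes the
	// client's end of the TCP connection; the server stays blocked in the
	// forwarded RPC and its socket is left in CLOSE_WAIT.
	client := &http.Client{Timeout: 2 * time.Second}

	url := fmt.Sprintf("%s/v1/client/allocation/%s/stats?region=%s&node_id=%s",
		os.Getenv("NOMAD_ADDR"), "alloc-id", "aav01", "node-id")
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("X-Nomad-Token", os.Getenv("NOMAD_TOKEN"))

	resp, err := client.Do(req)
	if err != nil {
		// Timeout path: the client gives up here; the server never notices,
		// because it is still blocked reading from the proxied RPC stream.
		fmt.Println("request timed out or failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}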

After some investigation and turning on DEBUG logs, we found the following in the logs:

Aug 20 17:38:02 vol-cl-control-01 systemd[1]: Stopping Nomad agent...
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]: ==> Caught signal: interrupt
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.478993 [DEBUG] http: Shutting down http server
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.479026 [INFO] agent: requesting shutdown
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.479030 [INFO] nomad: shutting down server
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02 [WARN] serf: Shutdown without a Leave
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513798 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513807 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513819 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513931 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (22m59.270636301s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513940 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (10m1.296302508s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513989 [DEBUG] http: Request GET /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (9m54.212359036s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513993 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514006 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514011 [ERR] http: Request /v1/client/allocation/2388d8b4-2564-7dc0-42d4-42a5315378c9/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.513990 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514104 [DEBUG] http: Request GET /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a1/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (27m54.332350804s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514124 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (23m58.527153171s)
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514131 [ERR] http: Request /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514008 [ERR] http: Request /v1/client/allocation/2388d8b4-2564-7dc0-42d4-42a5315378c9/stats?region=aav01&node_id
=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 20 17:38:02 vol-cl-control-01 nomad.sh[3984234]:     2018/08/20 17:38:02.514150 [DEBUG] http: Request GET /v1/client/allocation/ae396939-77b9-6a78-e435-c1b5c3efc25b/stats?region=aav01&n
ode_id=8c6b0236-e923-1925-a963-5f53c414d403 (25m58.826122779s)

As you can see, only after Nomad begins to stop do the hung HTTP API connections come back to life, and we see huge API request times - more than 10 minutes.

It seems that func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args interface{}, reply interface{}) error in helper/pool/pool.go hangs forever when the RPC is made (in particular when it is proxied to another region).

@tantra35 tantra35 changed the title Constantly increased sockets in CLOSE_WAIT state when working with nomad http api Constantly increased sockets inside nomad server or client in CLOSE_WAIT state when working with nomad http api Aug 21, 2018
@tantra35
Contributor Author

tantra35 commented Aug 21, 2018

The same happens on client nodes, except the logs are slightly different, because the client uses the RPC pool (on client nodes we don't enable debug logs, so info about request durations is not present):

Aug 21 13:35:14 vol-h-docker-04 systemd[1]: Stopping Nomad agent...
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]: ==> Caught signal: interrupt
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]:     2018/08/21 13:35:14.448612 [ERR] nomad: "ClientAllocations.Stats" RPC failed to server 172.16.9.87:4647:
 rpc error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]:     2018/08/21 13:35:14.448626 [ERR] http: Request /v1/client/allocation/a22564b0-bd6c-b303-23f9-75efbbfbe5a
1/stats?region=aav01&node_id=8c6b0236-e923-1925-a963-5f53c414d403, error: rpc error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]:     2018/08/21 13:35:14.448856 [ERR] nomad: "Node.GetClientAllocs" RPC failed to server 172.16.9.35:4647: rp
c error: EOF
Aug 21 13:35:14 vol-h-docker-04 nomad.sh[32669]:     2018/08/21 13:35:14.448873 [ERR] nomad: "ClientAllocations.Stats" RPC failed to server 172.16.9.89:4647:
 rpc error: EOF

and if we make many HTTP requests to the Nomad API that time out, we see many sockets like the following (lsof -p <nomad pid>):

nomad   23374 root   72u     IPv4         2999079758      0t0        TCP 127.0.0.1:4646->127.0.0.1:39826 (CLOSE_WAIT)

and in our case it looks like this:

[graph: "Tcp Sockets Allocated" metric growing over time]

The "Tcp Sockets Allocated" in the graph above is mostly sockets in CLOSE_WAIT state. Obviously these sockets also eat file handles.

@tantra35
Contributor Author

tantra35 commented Sep 2, 2018

This happens again and again. After we attached the delve debugger, we found this (tons of goroutines):

  Goroutine 68321 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive (0xbb5aae)
  Goroutine 68335 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream (0xbb5028)
  Goroutine 68386 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 68484 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 68503 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 68837 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew (0xecb2ba)
  Goroutine 68882 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 69415 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 69552 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 69949 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 70141 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 70523 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 70570 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 70693 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71081 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71232 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71352 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71384 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71650 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream (0xbb5028)
  Goroutine 71651 - User: /usr/local/go/src/runtime/netpoll.go:173 internal/poll.runtime_pollWait (0x4297c7)
  Goroutine 71652 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:392 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).send (0xbb66a0)
  Goroutine 71653 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive (0xbb5aae)
  Goroutine 71654 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew (0xecb2ba)
  Goroutine 71661 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71786 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 71835 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 72247 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 72777 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 73032 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 73059 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 73884 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)
  Goroutine 74056 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read (0xbb84d1)

The stack trace for one of them is:

 (dlv) bt  30
 0  0x000000000042ee2a in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:292
 1  0x000000000043eb40 in runtime.selectgo
    at /usr/local/go/src/runtime/select.go:392
 2  0x0000000000bb84d1 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:133
 3  0x00000000004f93be in bufio.(*Reader).fill
    at /usr/local/go/src/bufio/bufio.go:100
 4  0x00000000004f9c49 in bufio.(*Reader).ReadByte
    at /usr/local/go/src/bufio/bufio.go:242
 5  0x00000000007ab4da in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
 6  0x00000000007bb425 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
 7  0x00000000007ae6b9 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
 8  0x00000000007ae63d in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
 9  0x0000000000d478a7 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
10  0x0000000000d47511 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
11  0x0000000000d46b2f in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
12  0x0000000000d4bcbb in github.com/hashicorp/nomad/helper/pool.(*ConnPool).RPC
    at /opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:422
13  0x0000000000f9a612 in github.com/hashicorp/nomad/nomad.(*Server).forwardRegion
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:465
14  0x0000000000f99b7b in github.com/hashicorp/nomad/nomad.(*Server).forward
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:361
15  0x0000000000f438fa in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:129
16  0x0000000000459fcb in runtime.call64
    at /usr/local/go/src/runtime/asm_amd64.s:574
17  0x00000000004c0e79 in reflect.Value.call
    at /usr/local/go/src/reflect/value.go:447
18  0x00000000004c03f4 in reflect.Value.Call
    at /usr/local/go/src/reflect/value.go:308
19  0x00000000007a728e in net/rpc.(*service).call
    at /usr/local/go/src/net/rpc/server.go:384
20  0x00000000007a8155 in net/rpc.(*Server).ServeRequest
    at /usr/local/go/src/net/rpc/server.go:504
21  0x0000000000fa6ae4 in github.com/hashicorp/nomad/nomad.(*Server).RPC
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1394
22  0x00000000014d52ef in github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocStats
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:213
23  0x00000000014d426c in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:94
24  0x000000000150e6d8 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:168
25  0x000000000150b0ac in github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:283
26  0x00000000007067b4 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1947
27  0x00000000014ab47e in github.com/hashicorp/nomad/vendor/github.com/rs/cors.(*Cors).Handler.func1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/rs/cors/cors.go:200
28  0x00000000007067b4 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1947
29  0x0000000000708520 in net/http.(*ServeMux).ServeHTTP
    at /usr/local/go/src/net/http/server.go:2337
30  0x00000000014a8769 in github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler/gzip.go:269

For now, we have added a 10-second timeout to RPC calls to prevent dead stuck connections:

sc.stream.SetDeadline(time.Now().Add(10 * time.Second))

in the func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args interface{}, reply interface{}) error method. But this is a very coarse workaround.
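Roughly, the workaround sits like this inside that method (a sketch only, not the real pool.go code; sc.stream comes from the line above and msgpackrpc.CallWithCodec from the stack trace, while sc.codec is assumed):

// Sketch of the workaround. sc.stream is the per-call yamux stream (a
// net.Conn); giving it a deadline bounds the otherwise unbounded Read inside
// CallWithCodec, so a dead remote can no longer pin the goroutine and its
// CLOSE_WAIT socket forever.
sc.stream.SetDeadline(time.Now().Add(10 * time.Second))

if err := msgpackrpc.CallWithCodec(sc.codec, method, args, reply); err != nil {
	return err
}

// Clear the deadline again so an otherwise healthy stream is not killed
// later by the leftover deadline.
sc.stream.SetDeadline(time.Time{})
return nil

The obvious downside is that any legitimately slow RPC is killed by the same fixed deadline, hence "very coarse".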

@tantra35
Contributor Author

tantra35 commented Sep 28, 2018

@dadgar @preetapan

This happened again, but now we have more info about it.

In the remote region the request is proxied to, one of the Nomad servers (ip-172-20-4-194) has a growing goroutine count, and that Nomad server can't handle requests itself, so for example nomad node status 0d87a811 hangs on it.

[graph: goroutine count growth on ip-172-20-4-194]

We then attached a debugger to the running process of that server and saw (as expected) a huge count of goroutines (output is reduced):

Goroutine 111246 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111257 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111314 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111315 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111321 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111383 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111384 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111389 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111395 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111995 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 111999 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)
 Goroutine 112000 - User: /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159 github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream (0xbdcb4f)

with the following backtrace on each:

 0  0x000000000042ee2a in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:292
1  0x000000000043eb40 in runtime.selectgo
   at /usr/local/go/src/runtime/select.go:392
2  0x0000000000bdcb4f in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).OpenStream
   at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:159
3  0x0000000000bdc9eb in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).Open
   at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:142
4  0x0000000000f4d19d in github.com/hashicorp/nomad/nomad.NodeRpc
   at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_rpc.go:220
5  0x0000000000f48128 in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
   at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:174
6  0x0000000000459fcb in runtime.call64
   at /usr/local/go/src/runtime/asm_amd64.s:574
7  0x00000000004c0e09 in reflect.Value.call
   at /usr/local/go/src/reflect/value.go:447
8  0x00000000004c0384 in reflect.Value.Call
   at /usr/local/go/src/reflect/value.go:308
9  0x00000000007a6f2e in net/rpc.(*service).call
   at /usr/local/go/src/net/rpc/server.go:384
10  0x00000000007a7df5 in net/rpc.(*Server).ServeRequest
   at /usr/local/go/src/net/rpc/server.go:504
11  0x0000000000f9cf13 in github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn
   at /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:241

So when we applied timeouts to the RPC, we removed only part of the problem (the huge goroutine count was removed from the server that originates the request; every minute we collect stats from all allocations across all regions that we use).

We monitor allocations with the following Python script (provided only as info, to better understand what we are doing specifically):

def getAlocationResourceUsage(_session, _region, _stats, _allocation):
  l_allocStatusParams = {"region": _region, "node_id": _allocation['NodeID']}
  l_allocStatus = _session.get(NOMAD_ADDR + "/v1/client/allocation/" + _allocation["ID"] + "/stats", params=l_allocStatusParams, headers={"X-Nomad-Token": NOMAD_TOKEN}, timeout=1)
  l_resource = l_allocStatus.json()["ResourceUsage"]

  _stats.timing("mem.Cache", l_resource["MemoryStats"]["Cache"])
  _stats.timing("mem.RSS", l_resource["MemoryStats"]["RSS"])
  _stats.timing("mem.MaxUsage", l_resource["MemoryStats"]["MaxUsage"])

  _stats.timing("cpu.Percent", l_resource["CpuStats"]["Percent"])
  _stats.timing("cpu.ThrottledPeriods", l_resource["CpuStats"]["ThrottledPeriods"])
  _stats.timing("cpu.TotalTicks", l_resource["CpuStats"]["TotalTicks"])

@tantra35
Contributor Author

tantra35 commented Sep 28, 2018

Also, I must say that this happens only when we query allocations from one particular client, not all of them. But when we query allocations for that client from other Nomad servers in the same region, we don't see any issues.

@tgross
Member

tgross commented Nov 5, 2019

Noting this looks closely related to #6620 so I've assigned this to myself.

@tgross tgross added this to the near-term milestone Nov 15, 2019
@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@notnoop notnoop self-assigned this Jan 13, 2020
@tantra35
Contributor Author

tantra35 commented Mar 5, 2020

@notnoop @tgross

After some time we patched yamux a little bit (hashicorp/yamux#79), and we now think the problem is not in yamux itself (yes, it has minor bugs, but they can't lead to this issue).

Debugging on the server nodes with this issue showed us that the yamux keepalive loop works perfectly (so there is no reason to think some dataflow degradation happened), but we also saw that, despite this, creating a yamux stream doesn't go well: the synCh channel (https://github.com/hashicorp/yamux/blob/master/session.go#L102) constantly grows until it is completely full. So I concluded that the other side doesn't call Accept (https://github.com/hashicorp/yamux/blob/master/session.go#L201), and I think the reason is the following lines of code:

https://github.com/hashicorp/nomad/blob/master/helper/pool/pool.go#L277-L281

		if p.connListener != nil {
			select {
			case p.connListener <- c.session:
			default:
			}
		}

If p.connListener is full, the new session will never be reported to the listener in https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L242-L262


// rpcConnListener is a long lived function that listens for new connections
// being made on the connection pool and starts an RPC listener for each
// connection.
func (c *Client) rpcConnListener() {
	// Make a channel for new connections.
	conns := make(chan *yamux.Session, 4) // <<<<<<<<<<<<<<Buffer here is too small
	c.connPool.SetConnListener(conns)

	for {
		select {
		case <-c.shutdownCh:
			return
		case session, ok := <-conns:
			if !ok {
				continue
			}

			go c.listenConn(session)
		}
	}
}

and we can get the behavior described here (because the buffer of the conns channel is very easy to overfill). For now we have increased this buffer 10x (to 40) and continue to watch (in our case the issue happens about once a week). Additionally, we added a little hack to the keepalive loop in the yamux library to check for overflow of the synCh channel, and if this happens we simply close the parent Session object: https://github.com/hashicorp/yamux/pull/79/files#diff-f2fccfb96064097b73fbe8eeae3703eeR389-R407
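To illustrate the suspected drop in isolation (a standalone demo, not Nomad code): a non-blocking send into a small buffered channel silently discards values once the buffer is full and the consumer falls behind, which is exactly what the select/default above would do to new sessions:

package main

import "fmt"

func main() {
	// Same shape as the pool.go snippet above: small buffer, non-blocking send.
	conns := make(chan int, 4)

	for i := 0; i < 10; i++ {
		select {
		case conns <- i:
			fmt.Println("queued session", i)
		default:
			// With no consumer keeping up, sessions 4..9 are silently lost;
			// in Nomad terms, rpcConnListener never learns about them and
			// nobody ever calls Accept on those yamux sessions.
			fmt.Println("dropped session", i)
		}
	}
}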

After 1-2 weeks I will write back here on whether our changes led to a positive result.

@notnoop
Contributor

notnoop commented Mar 16, 2020

@tantra35 Thank you for digging into it and following up with the report. I mistakenly thought this was addressed by the leadership flapping issue! Sorry! I'll dig into this one a bit more now.

As for the channel buffer being 4 - we can increase the limit. Though I would have expected a small buffer to be sufficient here - the loop should dequeue and start goroutines very fast, hopefully faster than the server issues RPC calls.

Also, I may have noticed a question about RpcMultiplexV2 - we are switching to RpcMultiplexV2 in 0.11.0 for server-to-server RPC communication. We may have missed a spot in the switch - let us know if you see something odd.

@notnoop
Contributor

notnoop commented Mar 16, 2020

Also, I suspect the loop is buggy - it currently goes into a very tight loop when conns gets closed - it should return instead:

		case session, ok := <-conns:
			if !ok {
				return
			}

			go c.listenConn(session)
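A standalone illustration of why (not Nomad code): a receive from a closed channel never blocks, so continuing on !ok turns the for/select into a busy spin, while returning exits the listener cleanly:

package main

import "fmt"

func main() {
	conns := make(chan int)
	close(conns)

	for i := 0; i < 3; i++ {
		session, ok := <-conns // returns immediately: zero value, ok == false
		fmt.Println(session, ok)
		if !ok {
			// With `continue` this loop would spin at full CPU until shutdown;
			// `return` ends it cleanly, as suggested above.
			return
		}
	}
}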

@tantra35
Contributor Author

tantra35 commented Mar 16, 2020

@notnoop Increasing the buffer doesn't help (so our conclusion was wrong; in any case, a warning should be logged where the p.connListener send drops sessions), but we found that the real hang happens between servers, not in client-server communication:




+------------------Allocations.Stats-----------------------+        +------------------------+
|                   | Proxied RPC  |                       |        |                        |
|     server1       |+------------>|       server2         |        |        server3         |
|                   |   ^          |                       |        |                        |
+-------------------+   |          +-----------------------+        +------------------------+
                        |                        |
                        |                        |
                        |                        |
                        |                        |
                        |                        |
                        |                        |
                        |                Allocations.Stats
                        |                       RPC
                        |                        |
                        |                        |
                        +                        |         +----------------------+
                    here hung                    |         |                      |
                                                 |         |                      |
                                                 +-------->|   nomad client node  |
                                                           |                      |
                                                           +----------------------+

Also, can you clarify the following code in the NodeRpc method (https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L226-L230)?

	// Write the RpcNomad byte to set the mode
	if _, err := stream.Write([]byte{byte(pool.RpcNomad)}); err != nil {
		stream.Close()
		return err
	}

This implies that communication happens as in RpcMultiplexV2, but in versions before 7144efe the communication protocol used by ConnPool is RpcMultiplex (https://github.com/hashicorp/nomad/blob/v0.10.4/helper/pool/pool.go#L334-L338):

	// Write the multiplex byte to set the mode
	if _, err := conn.Write([]byte{byte(RpcMultiplex)}); err != nil {
		conn.Close()
		return nil, err
	}

How does this actually work in that case?

@notnoop
Contributor

notnoop commented Mar 16, 2020

That's a very good question, and it does look suspicious - when working on RpcMultiplexV2, I also witnessed a hang when I didn't get the initial bytes correct. I'll test and update you today!

@notnoop
Contributor

notnoop commented Mar 16, 2020

So I dug in, and this might be a red herring. NodeRpc [1] opens a new session to the Nomad client over an existing connection. No connection pooling or helper/pool is involved.

In the Nomad client, the RPC handler [2] expects every new yamux session to start with RpcNomad or RpcStreaming. The conns/streams handling in there is a bit confusing.

I have added a more explicit test in a421a84 and confirmed that the test passes: https://circleci.com/gh/hashicorp/nomad/46421

One thing I'm noticing is that our handling of RPC errors misses closing the connection in some failure cases. I wonder if we ought to port the equivalent of #7045 here.
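Roughly, that would look like this inside the pooled RPC call (a sketch under assumptions only - the conn variable and the clearConn/releaseConn helpers are guesses at the pool's shape, not confirmed API):

err := msgpackrpc.CallWithCodec(sc.codec, method, args, reply)
if err != nil {
	// On an RPC error, evict the connection instead of handing it back to
	// the pool, so a wedged or half-closed session cannot be reused by the
	// next caller and hang there as well.
	p.clearConn(conn)
	return err
}

// Healthy call: return the connection for reuse.
p.releaseConn(conn)
return nil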

Can you see if there is a correlation between the client.rpc.request_error or client.streaming_rpc.request_error metrics and the problem?

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L218-L230
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L245-L309

@tantra35
Contributor Author

tantra35 commented Mar 16, 2020

Hmm, I'm a bit confused, and perhaps missing something, but I thought that session.Open() opens a new Stream inside the yamux TCP connection (Session) managed by helper/pool, and that this session is handled in [1] on the other side. For now this looks like yamux inside yamux, which is madness, looks like overhead, and is absolutely unclear to me.

So the real stream handling happens in [2], and the passing test (truth be told, I don't fully understand what it does) looks confusing to me. In any case I won't argue with you, and I must dig further (it's just unclear where).

And we don't currently collect the client.rpc... metrics, so we can't trace correlations.

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L236-L330
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L251, https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/rpc.go#L335-L374

@notnoop
Contributor

notnoop commented Mar 16, 2020

It's indeed pretty confusing, and I could be wrong, but I love the additional data points and hypotheses you are raising.

I think clients are special and we should document it. Clients use helper/pool to start a connection to a server, to heartbeat with RpcNomad [1]. The server stores the underlying connection when the client connects [2], then the server uses yamux's bidirectional session-creation support to open a session to the client [3], which is handled by the rpcConnListener-referenced functions [4]. This server->client path doesn't go through the pool's magic-byte handling the same way as server<->server. In the server->client path, the server opens sessions manually and always needs to write the RpcNomad or RpcStreaming byte, without writing a RpcMultiplex or RpcMultiplexV2 byte.
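Condensed, that server->client path looks roughly like this (pieced together from the NodeRpc snippet quoted above and the stack traces; the codec constructor name is an assumption):

// Sketch of the server->client call, not a verbatim copy of NodeRpc. The
// server reuses the client's stored yamux session and opens a new stream
// back to the client over it.
stream, err := session.Open()
if err != nil {
	return err
}

// One mode byte per stream tells the client-side handler what follows:
// RpcNomad for a plain RPC, RpcStreaming for a streaming one. No
// RpcMultiplex/RpcMultiplexV2 byte is written on this path.
if _, err := stream.Write([]byte{byte(pool.RpcNomad)}); err != nil {
	stream.Close()
	return err
}

// Run a single msgpack RPC over that stream and close it when done.
defer stream.Close()
return msgpackrpc.CallWithCodec(pool.NewClientCodec(stream), method, args, reply)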

I'm suspicious of the handling when the underlying connection goes bad. I'll dig into that path further.

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L49-L73
[2] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L379
[3] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/client_rpc.go#L30-L53
[4] https://github.com/hashicorp/nomad/blob/v0.10.4/client/rpc.go#L245-L262

@notnoop
Contributor

notnoop commented Mar 16, 2020

And we don't currently collect the client.rpc... metrics, so we can't trace correlations

What about logs? "error performing RPC", "error performing streaming RPC", "streaming RPC error", and "RPC error" are the significant log messages.

@tantra35
Contributor Author

tantra35 commented Mar 16, 2020

Nothing interesting in the logs:

[screenshot: log output]

But this hang, as I wrote earlier, correlates with a leader switch, and it seems the hung connections appear on the previous leader. For example, today it happened again:

[screenshot: leader election graph]

As you can see, the leader switched from 172.27.3.86 to 172.27.0.212, and after that, hung connections in CLOSE_WAIT state started appearing on server node 172.27.3.86 (and we request allocation stats only from server nodes):

root@ip-172-27-3-86:/home/ruslan# ss -nt | grep 'CLOSE-WAIT'
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:48214
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:47316
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:44368
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:35578
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:55764
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:38524
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:43874
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:45620
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:49618
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:34646
CLOSE-WAIT 0      0       ::ffff:172.27.3.86:4646               ::ffff:172.27.0.133:47366

If we debug this server, we discover that all the hangs are in NodeRpc, and it looks like Accept is not being called on the yamux session on the other side, because we see synCh growing inside yamux.

@tantra35
Contributor Author

tantra35 commented Mar 16, 2020

We have the following backtrace for a hanging goroutine:

[235 goroutines]
(dlv) goroutine 7622384
Switched from 0 to 7622384 (thread 11251)
(dlv) bt 30
 0  0x000000000043304b in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:303
 1  0x0000000000443316 in runtime.selectgo
    at /usr/local/go/src/runtime/select.go:313
 2  0x0000000000856877 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).read
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:141
 3  0x00000000008565ce in github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Stream).Read
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/stream.go:88
 4  0x000000000051fcef in bufio.(*Reader).fill
    at /usr/local/go/src/bufio/bufio.go:100
 5  0x0000000000520589 in bufio.(*Reader).ReadByte
    at /usr/local/go/src/bufio/bufio.go:242
 6  0x0000000000ae7b0a in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
 7  0x0000000000af8ef8 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
 8  0x0000000000aeb3bf in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
 9  0x0000000000aeb32d in github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
10  0x0000000001291065 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
11  0x0000000001290cd1 in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
12  0x00000000012902fa in github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
13  0x0000000001463aa5 in github.com/hashicorp/nomad/nomad.NodeRpc
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_rpc.go:231
14  0x000000000145d8ad in github.com/hashicorp/nomad/nomad.(*ClientAllocations).Stats
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/client_alloc_endpoint.go:272
15  0x000000000045f9eb in runtime.call64
    at /usr/local/go/src/runtime/asm_amd64.s:523
16  0x00000000004c8224 in reflect.Value.call
    at /usr/local/go/src/reflect/value.go:447
17  0x00000000004c7cb4 in reflect.Value.Call
    at /usr/local/go/src/reflect/value.go:308
18  0x000000000087c6be in net/rpc.(*service).call
    at /usr/local/go/src/net/rpc/server.go:384
19  0x000000000087d5c2 in net/rpc.(*Server).ServeRequest
    at /usr/local/go/src/net/rpc/server.go:505
20  0x00000000014d1269 in github.com/hashicorp/nomad/nomad.(*Server).RPC
    at /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1426
21  0x00000000019d405b in github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocStats
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:355
22  0x00000000019d1fb0 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:147
23  0x0000000001a11f68 in github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:177
24  0x0000000001a0e28c in github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1
    at /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:298
25  0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1964
26  0x00000000019b6672 in github.com/hashicorp/nomad/vendor/github.com/rs/cors.(*Cors).Handler.func1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/rs/cors/cors.go:200
27  0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1964
28  0x000000000075adc7 in net/http.(*ServeMux).ServeHTTP
    at /usr/local/go/src/net/http/server.go:2361
29  0x00000000019b3950 in github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1
    at /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/NYTimes/gziphandler/gzip.go:269
30  0x0000000000759014 in net/http.HandlerFunc.ServeHTTP
    at /usr/local/go/src/net/http/server.go:1964

OK, let's examine frame 13, the NodeRpc function:

(dlv) frame 13
(dlv) print stream.session.synCh
chan struct {} {
        qcount: 69,
        dataqsiz: 256,
        buf: *[256]struct struct {} [
.....

As you can see, synCh has 69 items, and this number only grows.

OK, let's examine the session inside the stream and look at the connection info:

(dlv) print stream.session.conn.fd
*net.netFD {
        pfd: internal/poll.FD {
                fdmu: (*"internal/poll.fdMutex")(0xc000dbe580),
                Sysfd: 29,
                pd: (*"internal/poll.pollDesc")(0xc000dbe598),
                iovecs: *[]syscall.Iovec nil,
                csema: 0,
                isBlocking: 0,
                IsStream: true,
                ZeroReadIsEOF: true,
                isFile: false,},
        family: 10,
        sotype: 1,
        isConnected: false,
        net: "tcp",
        laddr: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 16, cap: 16, [0,0,0,0,0,0,0,0,0,0,255,255,172,27,3,86],
                Port: 4647,
                Zone: "",},
        raddr: net.Addr(*net.TCPAddr) *{
                IP: net.IP len: 16, cap: 16, [0,0,0,0,0,0,0,0,0,0,255,255,172,27,0,212],
                Port: 37268,
                Zone: "",},}

As you can see, the connection is between servers 172.27.0.212 and 172.27.3.86.

@tantra35
Contributor Author

tantra35 commented Mar 17, 2020

Also, we found the following. When we replace https://github.com/hashicorp/nomad/blob/v0.10.4/helper/pool/pool.go#L271-L277

with the following lines of code (exactly equivalent, but without the if):

		select {
		case p.connListener <- c.session:
		default:
			p.logger.Printf("[WARN] Skip connListener")
		}

we constantly see the warning message in the logs, nearly every 2 minutes:

[screenshot: repeated "Skip connListener" warnings in the log output]

And this looks like nonsense, but when the code looks like this:

		select {
		case p.connListener <- c.session:
		default:
                        if p.connListener != nil {
			    p.logger.Printf("[WARN] Skip connListener")
                        }
		}

there are no warning messages in the logs. So I conclude that p.connListener is sometimes nil, and this is confirmed by the fact that the suspended connections look as if Accept is not called on the yamux Session on the other side.
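This matches how select treats a nil channel (a standalone demo, not Nomad code): a send on a nil channel is never ready, so the default branch fires every time - the same observable effect as a full channel:

package main

import "fmt"

func main() {
	var connListener chan int // nil: SetConnListener was never called

	for i := 0; i < 3; i++ {
		select {
		case connListener <- i:
			fmt.Println("delivered", i)
		default:
			// A send on a nil channel can never proceed, so with a default
			// branch we always land here, exactly as if the channel were full.
			fmt.Println("skipped", i)
		}
	}
}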

@notnoop
Contributor

notnoop commented Mar 17, 2020

Interesting. I find it very odd that NodeRPC is used to connect to another server! If a server doesn't have a direct connection to that client, it should go through the findNodeConnAndForward path. Do you happen to have Nomad servers that also run as Nomad clients?

Also, thanks for the p.connListener tip - I'll investigate this too.

@tantra35
Contributor Author

We have fully separate Nomad servers and clients, so the servers function only as servers.

@notnoop
Contributor

notnoop commented Mar 17, 2020

OK - I suspect that's the bug: if NodeRPC is used to contact a server, the server indeed isn't set up to handle that bidirectionally and will hang.

Looking at https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/node_endpoint.go#L373-L383, I suspect the following events took place:

  1. server S1 is the leader
  2. client C1 connects and heartbeats to server S2
  3. server S2 forwards heartbeat RPC call to S1, leader
  4. server S1 is deposed and is succeeded by leader S3
  5. server S1 forwards the RPC call to S3 which applies the heartbeat
  6. server S1 marks the S2<->S1 connection as belonging to C1's NodeID
  7. if server S1 gets an allocation stats request for C1, it will use NodeRPC to query S2 and hang forever

This race should happen rarely, but I suspect that, given the frequency of your leadership elections and the frequency of alloc stats calls, you are more likely to hit the problem.

Does that seem plausible to you?

@tantra35
Contributor Author

tantra35 commented Mar 17, 2020

Sounds plausible, but it doesn't explain why synCh grows inside the yamux session; if I understand correctly, that is only possible when the other side of the yamux Session doesn't call Accept.

And about p.connListener: I think that warning happens on the server side (https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L309), and this is normal due to https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L413

@notnoop
Contributor

notnoop commented Mar 18, 2020

it doesn't explain why synCh grows inside the yamux session; if I understand correctly, that is only possible when the other side of the yamux Session doesn't call Accept

Servers don't call Accept on the yamux session - only Nomad clients do, IIUC. Nomad servers only accept the raw connection - so that would make sense.

I will still do a bit of digging today and post findings as well as a way to reproduce my steps.

@tantra35
Contributor Author

@notnoop It seems that's a full victory: for the last 12 days we haven't seen any problems with stats calls (no timeouts or anything else), and no leaked goroutines. Thanks!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 23, 2022