server: avoid deadlocking call to (*grpc.Server).Stop #31692

tbg · 2018-10-22T14:13:27Z

This can deadlock if we don't also shut down the listener, which happens
only when we stop the stopper. At the same time, triggering the stopper
already also shuts down the grpc server.

Fixes #31690.

Release note: None

This can deadlock if we don't also shut down the listener, which happens only when we stop the stopper. At the same time, triggering the stopper already also shuts down the grpc server. Fixes cockroachdb#31690. Release note: None

tbg · 2018-10-22T15:27:31Z

Hmm, still hitting a deadlock with this. A worse one even, because now it's stuck with both ports closed & it doesn't react to SIGABRT.

tbg · 2018-10-22T16:52:37Z

@knz something that's really weird here is that in this situation, the node definitely will not shut down cleanly (I see that from the logs), but I can see ./cockroach quit --insecure return immediately.
I assume this is by design since that waits for the connection to drop, but I'm seeing that this doesn't work very well in conjunction with this deadlock.

I'm going to have to look into the root cause once more, but in the meantime for discussion here's a second commit that just sends a signal to the own process which then activates the "regular" shutdown machinery. I'm curious what you think of that approach. (That one does not get stuck, I'm not sure why).

Release note: None

tbg · 2018-10-22T17:32:53Z

The start code would also get stuck if it weren't for the "hard shutdown". I'm seeing the same deadlock where the grpc server doesn't shut down because a call to Serve is waiting on cMux.
Currently trying to close some more listeners on shutdown.

tbg · 2018-10-22T17:34:15Z

cMux is stuck in pg:

goroutine 3951 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0xc160000, 0x72, 0xc0094afb08)
        /usr/local/Cellar/go/1.11.1/libexec/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00d219718, 0x72, 0xffffffffffffff00, 0x6def7e0, 0x8471c30)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc00d219718, 0xc00bdc9d00, 0x4, 0x4)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc00d219700, 0xc00bdc9d98, 0x4, 0x4, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_unix.go:169 +0x1d6
net.(*netFD).Read(0xc00d219700, 0xc00bdc9d98, 0x4, 0x4, 0x40680d8, 0xc00e14a820, 0xc00bafa000)
        /usr/local/Cellar/go/1.11.1/libexec/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00e361be0, 0xc00bdc9d98, 0x4, 0x4, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11.1/libexec/src/net/net.go:177 +0x68
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*bufferedReader).Read(0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0x203002, 0x0, 0x0)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/buffer.go:42 +0x103
io.ReadAtLeast(0x6debc20, 0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0x4, 0x8, 0x20, 0xc00bdc9d80)
        /usr/local/Cellar/go/1.11.1/libexec/src/io/io.go:310 +0x88
io.ReadFull(0x6debc20, 0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0xa0, 0x6635260, 0xc000c49db0)
        /usr/local/Cellar/go/1.11.1/libexec/src/io/io.go:329 +0x58
github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase.(*ReadBuffer).ReadUntypedMsg(0xc00bdc9d80, 0x6debc20, 0xc00cf8a578, 0xc000c49ee0, 0xc00c08a5c0, 0x8a57180)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase/encoding.go:100 +0x63
github.com/cockroachdb/cockroach/pkg/sql/pgwire.Match(0x6debc20, 0xc00cf8a578, 0xa1d6fc8)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:201 +0x58
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func1(0x6debc20, 0xc00cf8a578, 0xc00e361be0)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1113 +0x35
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*cMux).serve(0xc000875e80, 0x6e3ad00, 0xc00e361be0, 0xc0005a0900, 0xc0002d0000)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:143 +0x1f4
created by github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*cMux).Serve
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:133 +0x15e

tbg · 2018-10-22T17:36:43Z

I'm confused how that happens. If a start a single node cluster, open a SQL prompt and even open a txn (unbuffered), ./cockroach quit will still manage a clean shutdown.

knz · 2018-10-22T17:37:04Z

(Your idea to have the process send itself SIGINT is sound -- it's an age-old trick -- and I have no concern with it.)

tbg · 2018-10-22T18:22:00Z

The stuck goroutine is eventually reading from pgL (created in server/server.go) and I'm sure we're closing that. I'm also not sure what this roachtest is doing that I'm not. Ugh. Maybe I'll see something pop up tomorrow.

cockroach-teamcity · 2018-10-23T18:48:53Z

This change is

knz · 2020-05-08T16:17:37Z

Superseded by #37668

server: avoid deadlocking call to (*grpc.Server).Stop

9aaf4e2

This can deadlock if we don't also shut down the listener, which happens only when we stop the stopper. At the same time, triggering the stopper already also shuts down the grpc server. Fixes cockroachdb#31690. Release note: None

tbg requested a review from a team October 22, 2018 14:13

server: send signal in ./cockroach quit instead of ad-hoc shutdown

4c5d96a

Release note: None

tbg mentioned this pull request Nov 12, 2018

server: stopper.Stop called twice #32148

Closed

tbg mentioned this pull request May 21, 2019

roachtest: version/mixed/nodes=5 failed #37425

Closed

tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019

knz closed this May 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: avoid deadlocking call to (*grpc.Server).Stop #31692

server: avoid deadlocking call to (*grpc.Server).Stop #31692

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

knz commented Oct 22, 2018

tbg commented Oct 22, 2018

cockroach-teamcity commented Oct 23, 2018

knz commented May 8, 2020

server: avoid deadlocking call to (*grpc.Server).Stop #31692

server: avoid deadlocking call to (*grpc.Server).Stop #31692

Conversation

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

tbg commented Oct 22, 2018

knz commented Oct 22, 2018

tbg commented Oct 22, 2018

cockroach-teamcity commented Oct 23, 2018

knz commented May 8, 2020