Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: avoid deadlocking call to (*grpc.Server).Stop #31692

Closed
wants to merge 2 commits into from

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 22, 2018

This can deadlock if we don't also shut down the listener, which happens
only when we stop the stopper. At the same time, triggering the stopper
already also shuts down the grpc server.

Fixes #31690.

Release note: None

This can deadlock if we don't also shut down the listener, which happens
only when we stop the stopper. At the same time, triggering the stopper
already also shuts down the grpc server.

Fixes cockroachdb#31690.

Release note: None
@tbg tbg requested a review from a team October 22, 2018 14:13
@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

Hmm, still hitting a deadlock with this. A worse one even, because now it's stuck with both ports closed & it doesn't react to SIGABRT.

@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

@knz something that's really weird here is that in this situation, the node definitely will not shut down cleanly (I see that from the logs), but I can see ./cockroach quit --insecure return immediately.
I assume this is by design since that waits for the connection to drop, but I'm seeing that this doesn't work very well in conjunction with this deadlock.

I'm going to have to look into the root cause once more, but in the meantime for discussion here's a second commit that just sends a signal to the own process which then activates the "regular" shutdown machinery. I'm curious what you think of that approach. (That one does not get stuck, I'm not sure why).

@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

The start code would also get stuck if it weren't for the "hard shutdown". I'm seeing the same deadlock where the grpc server doesn't shut down because a call to Serve is waiting on cMux.
Currently trying to close some more listeners on shutdown.

@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

cMux is stuck in pg:

goroutine 3951 [IO wait, 10 minutes]:
internal/poll.runtime_pollWait(0xc160000, 0x72, 0xc0094afb08)
        /usr/local/Cellar/go/1.11.1/libexec/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00d219718, 0x72, 0xffffffffffffff00, 0x6def7e0, 0x8471c30)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc00d219718, 0xc00bdc9d00, 0x4, 0x4)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc00d219700, 0xc00bdc9d98, 0x4, 0x4, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11.1/libexec/src/internal/poll/fd_unix.go:169 +0x1d6
net.(*netFD).Read(0xc00d219700, 0xc00bdc9d98, 0x4, 0x4, 0x40680d8, 0xc00e14a820, 0xc00bafa000)
        /usr/local/Cellar/go/1.11.1/libexec/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00e361be0, 0xc00bdc9d98, 0x4, 0x4, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.11.1/libexec/src/net/net.go:177 +0x68
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*bufferedReader).Read(0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0x203002, 0x0, 0x0)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/buffer.go:42 +0x103
io.ReadAtLeast(0x6debc20, 0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0x4, 0x8, 0x20, 0xc00bdc9d80)
        /usr/local/Cellar/go/1.11.1/libexec/src/io/io.go:310 +0x88
io.ReadFull(0x6debc20, 0xc00cf8a578, 0xc00bdc9d98, 0x4, 0x4, 0xa0, 0x6635260, 0xc000c49db0)
        /usr/local/Cellar/go/1.11.1/libexec/src/io/io.go:329 +0x58
github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase.(*ReadBuffer).ReadUntypedMsg(0xc00bdc9d80, 0x6debc20, 0xc00cf8a578, 0xc000c49ee0, 0xc00c08a5c0, 0x8a57180)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase/encoding.go:100 +0x63
github.com/cockroachdb/cockroach/pkg/sql/pgwire.Match(0x6debc20, 0xc00cf8a578, 0xa1d6fc8)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:201 +0x58
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func1(0x6debc20, 0xc00cf8a578, 0xc00e361be0)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1113 +0x35
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*cMux).serve(0xc000875e80, 0x6e3ad00, 0xc00e361be0, 0xc0005a0900, 0xc0002d0000)
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:143 +0x1f4
created by github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*cMux).Serve
        /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:133 +0x15e

@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

I'm confused how that happens. If a start a single node cluster, open a SQL prompt and even open a txn (unbuffered), ./cockroach quit will still manage a clean shutdown.

@knz
Copy link
Contributor

knz commented Oct 22, 2018

(Your idea to have the process send itself SIGINT is sound -- it's an age-old trick -- and I have no concern with it.)

@tbg
Copy link
Member Author

tbg commented Oct 22, 2018

The stuck goroutine is eventually reading from pgL (created in server/server.go) and I'm sure we're closing that. I'm also not sure what this roachtest is doing that I'm not. Ugh. Maybe I'll see something pop up tomorrow.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
@knz
Copy link
Contributor

knz commented May 8, 2020

Superseded by #37668

@knz knz closed this May 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli: stuck in grpc.Server.Close() on quit
3 participants