Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: Demote Primary stuck #14760

Closed
GuptaManan100 opened this issue Dec 12, 2023 · 3 comments · Fixed by #14939 or #15339
Closed

Bug Report: Demote Primary stuck #14760

GuptaManan100 opened this issue Dec 12, 2023 · 3 comments · Fixed by #14939 or #15339

Comments

@GuptaManan100
Copy link
Member

Overview of the Issue

It has been noticed that DemotePrimary is still getting stuck waiting for requests to be empty.

vttablets are stuck trying to send a message back to vtgates -

             google.golang.org/grpc/internal/transport.(*writeQuota).get (inline)
             google.golang.org/grpc/internal/transport.(*http2Server).Write
             google.golang.org/grpc.(*serverStream).SendMsg
             vitess.io/vitess/go/vt/proto/queryservice.(*queryStreamExecuteServer).Send
             vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).StreamExecute.func1
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*StreamConsolidator).Consolidate.func3
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream.func2.1
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).execStreamSQL.func1
             vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).Stream.func1
             vitess.io/vitess/go/vt/dbconnpool.(*DBConnection).ExecuteStreamFetch
             vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).streamOnce
             vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).Stream
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).execStreamSQL
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream.func2
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*StreamConsolidator).Consolidate
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).streamExecute.func1
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).execRequest
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).streamExecute
             vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).StreamExecute
             vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).StreamExecute
             vitess.io/vitess/go/vt/proto/queryservice._Query_StreamExecute_Handler

and the same goes for vtgates, they are stuck trying to send a message back to the clients -

             google.golang.org/grpc/internal/transport.(*writeQuota).get (inline)
             google.golang.org/grpc/internal/transport.(*http2Server).Write
             google.golang.org/grpc.(*serverStream).SendMsg
             vitess.io/vitess/go/vt/proto/vtgateservice.(*vitessStreamExecuteServer).Send
             vitess.io/vitess/go/vt/vtgate/grpcvtgateservice.(*VTGate).StreamExecute.func1
             vitess.io/vitess/go/vt/vtgate.(*VTGate).StreamExecute.func1
             vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1.1
             vitess.io/vitess/go/vt/vtgate.(*streaminResultReceiver).storeResultStats
             vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1.2
             vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards.func1
             vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1.1
             vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).StreamExecute
             vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1
             vitess.io/vitess/go/vt/vtgate.(*TabletGateway).withRetry
             vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute
             vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti.func1
             vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction.func1
             vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction
             vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti
             vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecuteMulti
             vitess.io/vitess/go/vt/vtgate.(*vcursorImpl).StreamExecuteMulti
             vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards
             vitess.io/vitess/go/vt/vtgate/engine.(*Route).TryStreamExecute
             vitess.io/vitess/go/vt/vtgate.(*vcursorImpl).StreamExecutePrimitive
             vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1
             vitess.io/vitess/go/vt/vtgate.(*Executor).newExecute
             vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute
             vitess.io/vitess/go/vt/vtgate.(*VTGate).StreamExecute

Since the client is using the grpc protocol and sending a stream execute call that has no timeout, we can't control the client and make it receive more packets.

This however blocks DemotePrimary from going through, since a go-routine is stuck running execRequest

Reproduction Steps

Run stream execute grpc call and don't read the messages
Run PRS

Binary Version

main and all other versions

Operating System and Environment details

-

Log Fragments

No response

@GuptaManan100
Copy link
Member Author

#14939 might not be the correct bug fix. We still merged it, cause its a good change regardless, but we'll keep the issue open until the follow up on #14939 (review) has completed

@GuptaManan100
Copy link
Member Author

I've been looking at the code to follow up on the comment from @vmg #14939 (review) and I believe I have found where we are breaking the contract for gRPC!

Take a look at this specific go routine -

vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction(scatter_conn.go:668)
vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti(scatter_conn.go:360)
vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecuteMulti(executor.go:1406)
vitess.io/vitess/go/vt/vtgate.(*vcursorImpl).StreamExecuteMulti(vcursor_impl.go:595)
vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards(route.go:306)
vitess.io/vitess/go/vt/vtgate/engine.(*Route).TryStreamExecute(route.go:266)
vitess.io/vitess/go/vt/vtgate.(*vcursorImpl).StreamExecutePrimitive(vcursor_impl.go:506)
vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1(executor.go:325)
vitess.io/vitess/go/vt/vtgate.(*Executor).newExecute(plan_execute.go:154)
vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute(executor.go:359)
vitess.io/vitess/go/vt/vtgate.(*VTGate).StreamExecute(vtgate.go:515)
vitess.io/vitess/go/vt/vtgate/grpcvtgateservice.(*VTGate).StreamExecute(server.go:190)
vitess.io/vitess/go/vt/proto/vtgateservice._Vitess_StreamExecute_Handler(vtgateservice_grpc.pb.go:285)

If we see the specific comment in https://github.com/vitessio/vitess/blob/main/go/vt/vtgate/grpcvtgateservice/server.go#L191-L192 we see that we expect the vtgate to not break the contract of calling the callback in multiple go routines. But this is being broken in the code in https://github.com/vitessio/vitess/blob/main/go/vt/vtgate/scatter_conn.go#L666-L669. This code eventually leads to calling SendMsg from multiple separate go routines, which isn't safe.

@GuptaManan100
Copy link
Member Author

I misunderstood the docs and as it turns out that #14760 (comment) isn't really an issue. I thought that when the doc said but it is not safe to call SendMsg on the same stream in different goroutines., it was literal and all the SendMsg calls should be from the same single go-routine, but they only need to be serialized and that is already guaranteed in the executor as Harshit pointed out in #15266 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment