Bug Report: VTTablet & VTGate routing traffic to hung MySQL server #11884
Comments
I think this is where the initial "healthy" state comes from: vitess/go/vt/vttablet/tabletserver/health_streamer.go Lines 212 to 213 in ece6501
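For illustration, here's a minimal self-contained sketch (simplified stand-ins, not the actual Vitess code; `healthStreamer` and `shr` are hypothetical) of how a streamer that replays its cached state to every reconnecting client can keep handing out a stale `serving: true`:

```go
package main

import (
	"fmt"
	"sync"
)

// shr is a simplified stand-in for querypb.StreamHealthResponse.
type shr struct {
	Serving bool
}

// healthStreamer is a simplified stand-in for the vttablet health streamer:
// it caches the last known state and replays it to each new subscriber.
type healthStreamer struct {
	mu    sync.Mutex
	state shr // last value written by the state manager
}

// Stream sends the cached state first, before any fresh updates. If the
// state manager is stuck (e.g. blocked on a hung MySQL call) and never
// writes a new value, a reconnecting vtgate only ever sees this snapshot.
func (hs *healthStreamer) Stream(send func(shr)) {
	hs.mu.Lock()
	cached := hs.state
	hs.mu.Unlock()
	send(cached) // replay of the (possibly stale) cached state
	// ... then block waiting for fresh updates that may never come ...
}

func main() {
	hs := &healthStreamer{state: shr{Serving: true}} // stale "healthy" state
	hs.Stream(func(s shr) { fmt.Printf("client saw serving=%v\n", s.Serving) })
}
```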
👋 all, we are seeing the same issue in production. Happy to collaborate on a fix!
@latentflip / @arthurschreiber I did some testing on:
And I tested coredump behaviour by sending a
During this time I gathered goroutine pprof profiles once a second. In essentially every 1-second sample, up until the time the tablet became
TL;DR: I think your proposal to add timeouts should resolve this. I will make a PR to try this in the next few days 👍
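For reference, a rough sketch of one way to gather those goroutine profiles once a second, assuming the standard net/http/pprof endpoints that vttablet exposes; the address and port here are placeholders, not a documented default:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hypothetical vttablet debug address; adjust for your deployment.
	const url = "http://127.0.0.1:15100/debug/pprof/goroutine?debug=2"
	// Capture one text-format goroutine dump per second for a minute.
	for i := 0; i < 60; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetch failed:", err)
		} else {
			f, ferr := os.Create(fmt.Sprintf("goroutines-%03d.txt", i))
			if ferr == nil {
				io.Copy(f, resp.Body)
				f.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```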
On second thought, there are two ways this could be resolved:
Thoughts appreciated!
Vitess has query killer logic which can be reused here. It is an external goroutine that watches the conn and checks whether it is still executing when the timeout fires. This query timeout is available on
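As a sketch of that pattern (the `conn` type and `Kill` method here are hypothetical stand-ins, not the actual Vitess query killer):

```go
package main

import (
	"fmt"
	"time"
)

// conn is a simplified stand-in for a pooled MySQL connection whose
// in-flight query can be killed out-of-band (e.g. via KILL QUERY issued
// on a separate connection, which is how MySQL query killers work).
type conn struct{ id int64 }

func (c *conn) Kill(reason string) { fmt.Printf("KILL QUERY %d (%s)\n", c.id, reason) }

// executeWithKiller runs exec and, if it is still running when the
// timeout fires, kills the connection from an external goroutine so the
// blocked call can return instead of hanging forever.
func executeWithKiller(c *conn, timeout time.Duration, exec func() error) error {
	done := make(chan error, 1)
	go func() { done <- exec() }()

	select {
	case err := <-done:
		return err // finished before the deadline
	case <-time.After(timeout):
		c.Kill("query timeout exceeded")
		return <-done // wait for exec to observe the kill and return
	}
}

func main() {
	c := &conn{id: 42}
	err := executeWithKiller(c, 100*time.Millisecond, func() error {
		time.Sleep(time.Second) // simulate a query hung on a dead server
		return fmt.Errorf("connection killed")
	})
	fmt.Println("result:", err)
}
```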
Overview of the Issue
We had an issue today where a tablet's state would transition from serving true => false, but then it would immediately transition back from false => true.
Cause
I believe @dm-2 and I have tracked this down to being caused by a few things:

1. A context.TODO() is passed to getPoolReconnect:
vitess/go/vt/mysqlctl/replication.go
Line 326 in a36de2c
vitess/go/vt/mysqlctl/query.go
Lines 34 to 40 in a36de2c
2. conn.ExecuteFetch will then block indefinitely if the MySQL host is in a bad state after the crash while it's doing a core-dump.
3. The vtgate's tablet_health_check code will detect that it's not seen an updated serving status come in from the vttablet for its timeout (in our case 1 minute), and so will transition the vtgate's own state to serving false, calling closeConnection to transition the tablet to serving false here:
vitess/go/vt/discovery/tablet_health_check.go
Lines 334 to 336 in 7fafc94
4. When the vtgate reconnects to the vttablet's health stream, the health_streamer immediately replays its cached healthy state:
vitess/go/vt/vttablet/tabletserver/health_streamer.go
Lines 211 to 212 in a36de2c
so the vtgate transitions the tablet back to serving true and starts routing traffic to it again.
5. The Replication() call from (1) holds the sm.mu.Lock() the whole time, which causes other code that relies on taking the lock to be blocked indefinitely, like /debug/vars, which cannot get the current IsServing state because of the lock (see the sketch after this list):
vitess/go/vt/vttablet/tabletserver/state_manager.go
Lines 678 to 683 in 7fafc94
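To illustrate point 5, a minimal self-contained sketch (the types here are simplified stand-ins, not the real state manager) of how holding the mutex across a blocked MySQL call starves every other lock taker:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stateManager is a simplified stand-in for vttablet's state manager.
type stateManager struct {
	mu      sync.Mutex
	serving bool
}

// checkReplication mimics holding sm.mu across a MySQL call. If the call
// blocks (hung server, no context deadline), the lock is never released.
func (sm *stateManager) checkReplication(query func()) {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	query() // blocks forever against a hung MySQL host
}

// IsServing is what debug handlers call; it also needs the lock, so it
// hangs behind the stuck replication check.
func (sm *stateManager) IsServing() bool {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	return sm.serving
}

func main() {
	sm := &stateManager{serving: true}
	go sm.checkReplication(func() { select {} }) // simulate a hung ExecuteFetch
	time.Sleep(10 * time.Millisecond)            // let the goroutine grab the lock

	done := make(chan bool)
	go func() { done <- sm.IsServing() }()
	select {
	case s := <-done:
		fmt.Println("serving:", s)
	case <-time.After(200 * time.Millisecond):
		fmt.Println("IsServing blocked: lock held by the hung replication check")
	}
}
```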
Possible fixes:
It seems like the connection setup and query execution to mysql in getPoolReconnect need a timeout to prevent them blocking indefinitely, and the vttablet's state should be set to not serving if it cannot fetch replication state from mysql. I think that would resolve all the issues here?
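A rough sketch of the shape such a fix could take, with a hypothetical `fetchReplicationStatus` wrapper and timeout value standing in for the real call path (the actual getPoolReconnect/ExecuteFetch signatures in Vitess differ):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchReplicationStatus sketches the proposed fix: instead of the
// context.TODO() currently threaded into getPoolReconnect, derive a
// context with a deadline so a hung MySQL host cannot block us forever.
func fetchReplicationStatus(ctx context.Context, exec func(ctx context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second) // bounded wait
	defer cancel()

	type result struct {
		status string
		err    error
	}
	ch := make(chan result, 1)
	go func() {
		s, err := exec(ctx) // stands in for conn.ExecuteFetch on the pool conn
		ch <- result{s, err}
	}()

	select {
	case r := <-ch:
		return r.status, r.err
	case <-ctx.Done():
		// On timeout the caller should also mark the tablet not serving,
		// rather than letting the health streamer keep reporting healthy.
		return "", errors.New("replication status check timed out: " + ctx.Err().Error())
	}
}

func main() {
	_, err := fetchReplicationStatus(context.Background(), func(ctx context.Context) (string, error) {
		select {} // simulate a query hung on a core-dumping MySQL host
	})
	fmt.Println(err)
}
```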
Reproduction Steps
I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if we crashed mysql in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's doing a core dump).
I think the analysis above covers it though.
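One hedged way to approximate that non-graceful crash, assuming core dumps are enabled on the host so mysqld hangs while writing the core rather than cleanly closing its connections (this is an untested sketch, not a verified reproduction):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// Sends SIGSEGV to a mysqld pid so it crashes and (with core dumps
// enabled) stalls while writing the core, leaving client connections
// hanging instead of cleanly terminated. Usage: crash <mysqld-pid>
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: crash <mysqld-pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid pid:", err)
		os.Exit(1)
	}
	if err := syscall.Kill(pid, syscall.SIGSEGV); err != nil {
		fmt.Fprintln(os.Stderr, "kill failed:", err)
		os.Exit(1)
	}
}
```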
Binary Version
Server version: 5.7.32-vitess Version: 14.0.1 (Git revision 631084ae79181ba816ba2d98bee07c16d8b2f7b4 branch 'master') built on Mon Nov 21 16:30:24 UTC 2022 by root@bcaa51ae028b using go1.18.4 linux/amd64
Operating System and Environment details
Log Fragments
We then see the exact same pattern repeated one minute later, and every minute, until the core-dump completes and the host restarts.