Bug Report: show vitess_tablets outputs wrong tablets #14588
Comments
Maybe we should improve the error checking here? vitess/go/vt/discovery/tablet_health_check.go Lines 296 to 303 in 61a0f02
Right now, the error check matches only when the same host:port has been taken over by another tablet. In our case, however, the host:port is empty after the tablet is deleted.
Maybe this minor change would work?
diff --git a/go/vt/discovery/tablet_health_check.go b/go/vt/discovery/tablet_health_check.go
index 24496155e7..a11b54325b 100644
--- a/go/vt/discovery/tablet_health_check.go
+++ b/go/vt/discovery/tablet_health_check.go
@@ -293,7 +293,7 @@ func (thc *tabletHealthCheck) checkConn(hc *HealthCheckImpl) {
     // cluster, the new tablet record will be fetched from the topology server and re-added to
     // the healthcheck cache again via the topology watcher.
     // WARNING: Under no other circumstances should we be deleting the tablet here.
-    if strings.Contains(err.Error(), "health stats mismatch") {
+    if strings.Contains(err.Error(), "health stats mismatch") || strings.Contains(err.Error(), "context canceled") {
         log.Warningf("deleting tablet %v from healthcheck due to health stats mismatch", thc.Tablet)
         hc.deleteTablet(thc.Tablet)
         return
When you say
It should re-add it with the new target, i.e. as a REPLICA. If you can see in a debugger that we are deleting and adding it back as PRIMARY, then that is something that should be fixed.
It will probably solve it for your test case, but it breaks other cases badly. It has already been tried and the results were not pretty. It was reverted as part of a larger PR #9237, which is worth reading if you want to understand prior attempts to fix the kind of thing you are seeing. Clearly, the fixes so far are incomplete and there are still conditions under which we are left with zombie records in vtgate's healthcheck, but unfortunately it's not going to be a simple fix. If you are able to observe the sequence of events in a debugger and share a summary here, that might help develop a fix.
Overview of the Issue
After deploying a VitessCluster in a k8s environment, delete the primary vttablet of a shard, then execute show vitess_tablets. After the vttablet container is restarted, its IP address has changed. HealthCheckImpl.ReplaceTablet(old, new *topodata.Tablet) will delete the old tablet from hc.healthData, but will then re-add it to hc.healthData.
vitess/go/vt/discovery/healthcheck.go
Line 526 in 40f314c
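For readers unfamiliar with the delete-then-add sequence described above, here is a toy model of it. This is not the Vitess implementation: the `tablet` and `cache` types, the alias key, and the sample alias and IP addresses are all hypothetical. It shows only the end state one would expect after a replace, namely a single record per alias carrying the new host.

```go
package main

import "fmt"

// tablet is a hypothetical, simplified stand-in for a topology tablet record.
type tablet struct {
	Alias string
	Host  string
}

// cache is a toy health cache keyed by tablet alias (an assumption made
// for this sketch, not a statement about how Vitess keys its data).
type cache struct {
	healthData map[string]tablet
}

func (c *cache) remove(t tablet) { delete(c.healthData, t.Alias) }
func (c *cache) add(t tablet)    { c.healthData[t.Alias] = t }

// replace drops the old record first and then adds the new one.
func (c *cache) replace(old, next tablet) {
	c.remove(old)
	c.add(next)
}

func main() {
	c := &cache{healthData: map[string]tablet{}}
	old := tablet{Alias: "zone1-100", Host: "10.0.0.1"}
	c.add(old)

	// The restarted container comes back with a different IP address.
	c.replace(old, tablet{Alias: "zone1-100", Host: "10.0.0.2"})

	fmt.Println(len(c.healthData), c.healthData["zone1-100"].Host) // 1 10.0.0.2
}
```

The report above describes a different outcome in practice: the old record shows up again in the output of show vitess_tablets.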
Reproduction Steps
see above
Binary Version
Version: 16.0.1 (Git revision 3550fc17830589a7f6c4f8f00b59275077dc40cf branch 'HEAD') built on Wed Nov 22 11:24:41 UTC 2023 by vitess@buildkitsandbox using go1.20.2 linux/amd64
Operating System and Environment details
Linux bootstrap 5.15.67-6.cl9.x86_64 #1 SMP Wed Mar 8 06:32:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Log Fragments