Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: show vitess_tablets outputs wrong tablets #14588

Open
dk-lockdown opened this issue Nov 23, 2023 · 5 comments
Open

Bug Report: show vitess_tablets outputs wrong tablets #14588

dk-lockdown opened this issue Nov 23, 2023 · 5 comments

Comments

@dk-lockdown
Copy link

Overview of the Issue

After deploying vitesscluster in the k8s environment, delete the primary vttablet of a shard, then execute show vitess_tablets.
image
After the vttablet container is restarted, the IP address has changed, HealthCheckImpl.ReplaceTablet(old, new *topodata.Tablet) will delete old tablet from hc.healthData. But, will re-add it to hc.healthData.

hc.healthData[targetKey][tabletAlias] = th

Reproduction Steps

see above

Binary Version

Version: 16.0.1 (Git revision 3550fc17830589a7f6c4f8f00b59275077dc40cf branch 'HEAD') built on Wed Nov 22 11:24:41 UTC 2023 by vitess@buildkitsandbox using go1.20.2 linux/amd64

Operating System and Environment details

Linux bootstrap 5.15.67-6.cl9.x86_64 #1 SMP Wed Mar 8 06:32:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

none
@dk-lockdown dk-lockdown added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Nov 23, 2023
@shlomi-noach shlomi-noach removed the Needs Triage This issue needs to be correctly labelled and triaged label Nov 26, 2023
@FirePing32
Copy link
Contributor

Maybe we should improve the error checking here?

if strings.Contains(err.Error(), "health stats mismatch") {
log.Warningf("deleting tablet %v from healthcheck due to health stats mismatch", thc.Tablet)
hc.deleteTablet(thc.Tablet)
return
}
// trivialUpdate = false because this is an error
// up = false because we did not get a healthy response
hc.updateHealth(thc.SimpleCopy(), thc.Target, false, false)

RIght now, the error message matches only if the same host:port is taken by another tablet. But the host:port is empty in our case post tablet deletion.

@dk-lockdown
Copy link
Author

HealthCheckImpl.ReplaceTablet(old, new *topodata.Tablet) has deleted tablet from hc.healthData. And the stream err is Code: CANCELED vttablet: rpc error: code = Canceled desc = context canceled.

W1121 12:32:53.649596       1 tablet_health_check.go:346] tablet alias:{cell:"zone1" uid:1555561643} hostname:"21.202.242.206" port_map:{key:"grpc" value:15999} port_map:{key:"vt" value:15000} keyspace:"ywkha" shard:"c0-" key_range:{start:"\xc0"} type:PRIMARY db_name_override:"vt_ywkha" mysql_hostname:"21.202.242.206" mysql_port:3306 primary_term_start_time:{seconds:1700472990 nanoseconds:728423584} db_server_version:"8.0.34" default_conn_collation:255 healthcheck stream error: Code: CANCELED
vttablet: rpc error: code = Canceled desc = context canceled

@FirePing32
Copy link
Contributor

Maybe this minor change would work?

diff --git a/go/vt/discovery/tablet_health_check.go b/go/vt/discovery/tablet_health_check.go
index 24496155e7..a11b54325b 100644
--- a/go/vt/discovery/tablet_health_check.go
+++ b/go/vt/discovery/tablet_health_check.go
@@ -293,7 +293,7 @@ func (thc *tabletHealthCheck) checkConn(hc *HealthCheckImpl) {
                        // cluster, the new tablet record will be fetched from the topology server and re-added to
                        // the healthcheck cache again via the topology watcher.
                        // WARNING: Under no other circumstances should we be deleting the tablet here.
-                       if strings.Contains(err.Error(), "health stats mismatch") {
+                       if strings.Contains(err.Error(), "health stats mismatch") || strings.Contains(err.Error(), "context canceled") {
                                log.Warningf("deleting tablet %v from healthcheck due to health stats mismatch", thc.Tablet)
                                hc.deleteTablet(thc.Tablet)
                                return

cc @deepthi @harshit-gangal

@deepthi
Copy link
Member

deepthi commented Dec 6, 2023

When you say

But, will re-add it to hc.healthData.

It should re-add it with the new target i.e. as a REPLICA. If you can see in a debugger that we are deleting and adding it back as PRIMARY, then that is something that should be fixed.

@deepthi
Copy link
Member

deepthi commented Dec 6, 2023

Maybe this minor change would work?

It will probably solve it for your test case, but breaks other cases badly. It's already been tried and the results were not pretty. It was reverted as part of a larger PR #9237, which is worth reading if you want to understand prior attempts to fix the kind of thing you are seeing. Clearly, the fixes so far are incomplete and there are still conditions under which we are being left with zombie records in vtgate's healthcheck, but unfortunately it's not going to be a simple fix.

If you are able to observe the sequence of events in a debugger and share a summary here, that might help develop a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants