Flakes: Use new healthy shard check in vreplication e2e tests #12502
Conversation
This is needed because checking that there is a primary tablet for the shard in vtgate's healthcheck is no longer a reliable indicator that the shard has a healthy serving primary: a new primary must first initialize its sidecar database and wait for that to replicate via semi-sync before it becomes serving and can perform normal functions. This delay could cause test flakiness if a test required a healthy shard before continuing. Signed-off-by: Matt Lord <[email protected]>
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
They looked like this:

WARNING: DATA RACE
Write at 0x000005bf9b60 by goroutine 27141:
  github.com/spf13/pflag.newUint64Value()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:9 +0x5a
  github.com/spf13/pflag.(*FlagSet).Uint64Var()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:45 +0x55
  vitess.io/vitess/go/vt/log.RegisterFlags()
      /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64
  vitess.io/vitess/go/vt/servenv.GetFlagSetFor()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183
  vitess.io/vitess/go/vt/servenv.ParseFlags()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49
  ...
Previous read at 0x000005bf9b60 by goroutine 27136:
  github.com/golang/glog.(*syncBuffer).Write()
  ...

And they most often occurred in the wrangler unit tests, which makes sense because they create a lot of loggers. Signed-off-by: Matt Lord <[email protected]>
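The race reported above is a write to a package-level flag variable from one goroutine while another goroutine reads it. A common way to avoid this class of race is to make the registration happen exactly once. The sketch below is self-contained and hypothetical (the `registry` type and field are invented for illustration; this is not Vitess code, and the actual attempted fix was later reverted):

```go
package main

import (
	"fmt"
	"sync"
)

// registry simulates a process-wide flag registry like pflag's.
// Registering the same variable from many goroutines (as happens when
// many test loggers each trigger flag registration) is a data race;
// guarding the write with sync.Once serializes it.
type registry struct {
	once    sync.Once
	maxSize uint64
}

func (r *registry) registerFlags() {
	r.once.Do(func() {
		r.maxSize = 1875 // the write now happens exactly once
	})
}

func main() {
	r := &registry{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.registerFlags() // safe under -race: only one goroutine writes
		}()
	}
	wg.Wait()
	fmt.Println(r.maxSize)
}
```

Running such a program with `go run -race` reports no race, whereas an unguarded write from each goroutine would.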
This reverts commit 51992b8. Signed-off-by: Matt Lord <[email protected]>
d61154a to a85c129
Signed-off-by: Matt Lord <[email protected]>
…althy Signed-off-by: Matt Lord <[email protected]>
Looks like a good change. Do we have a sense of how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds.
Yeah, 1-3 seconds seems normal locally. But that is plenty of time for races. And sometimes we encounter very interesting timings on Actions runners. 🥲
I was unable to backport this Pull Request to the following branches:
…io#12502)
* Use new healthy shard check in vreplication e2e tests
* Try to address unit test race flakes around log size
* Revert "Try to address unit test race flakes around log size" (reverts commit 51992b8)
* Use external cluster vtctld in TestMigrate
* Use subshell vs command output interpolation
* Ignore any config files in mysql alias
Signed-off-by: Matt Lord <[email protected]>
#12740)
* Use new healthy shard check in vreplication e2e tests
* Try to address unit test race flakes around log size
* Revert "Try to address unit test race flakes around log size" (reverts commit 51992b8)
* Use external cluster vtctld in TestMigrate
* Use subshell vs command output interpolation
* Ignore any config files in mysql alias
Signed-off-by: Matt Lord <[email protected]>
Co-authored-by: Matt Lord <[email protected]>
Description
I have noticed that, e.g., the vreplication_basic CI workflow has become less reliable since the SidecarDB init work was merged. The failures were always in the TestVreplicationCopyThrottling test. The reason is that we now need a more reliable check for a healthy shard. Checking that there is a primary tablet for the shard in vtgate's healthcheck (which is what vtgate.WaitForStatusOfTabletInShard() does) is no longer a reliable indicator that the shard has a healthy serving primary, because after a primary is elected it must initialize its sidecar database and wait for those DDLs to replicate via semi-sync replication before it becomes serving and can perform normal functions. This delay could cause test flakiness if a test required a healthy shard before continuing, as was the case in TestVreplicationCopyThrottling.
This PR migrates all of the places in the vreplication endtoend tests that waited for a healthy shard via:
vtgate.WaitForStatusOfTabletInShard(fmt.Sprintf("%s.%s.primary", keyspace, shard), 1)
to instead use the new WaitForHealthyShard cluster package helper:
Related Issue(s)
Checklist