Flakes: Use new healthy shard check in vreplication e2e tests #12502

mattlord · 2023-02-27T22:41:34Z

Description

I have noticed that the vreplication_basic CI workflow, for example, has become less reliable since the SidecarDB init work was merged. The failures I saw were always in the TestVreplicationCopyThrottling test. The cause is that we now need a more reliable check for a healthy shard.

This is needed because checking that there's a primary tablet for the shard in vtgate's healthcheck, which is what vtgate.WaitForStatusOfTabletInShard() does, is no longer a reliable indicator that the shard has a healthy serving primary. After a primary is elected, it now needs to initialize its sidecar database and wait for those DDLs to replicate via semi-sync replication before it becomes serving and can proceed to perform normal functions. This delay could cause test flakiness if a test required a healthy shard before continuing, as was the case in TestVreplicationCopyThrottling.

This PR migrates every place in the vreplication endtoend tests that waited for a healthy shard using:
vtgate.WaitForStatusOfTabletInShard(fmt.Sprintf("%s.%s.primary", keyspace, shard), 1)
to instead use the new WaitForHealthyShard cluster package helper:

err := cluster.WaitForHealthyShard(vc.VtctldClient, keyspace, shard)
require.NoError(t, err)

Note: After these changes, the vreplication_basic CI workflow, for example, has passed every time: https://github.com/vitessio/vitess/actions/runs/4297831822

Related Issue(s)

Follow-up to: vttablet sidecar schema:use schemadiff to reach desired schema on tablet init replacing the withDDL-based approach #11520

Checklist

"Backport to:" labels have been added if this change should be back-ported
Tests were added or are not required
Did the new or modified tests pass consistently locally and on the CI
Documentation was added or is not required

This is needed because checking that there's a primary tablet for the shard in vtgate's healthcheck is no longer a reliable indicator that the shard has a healthy serving primary, because now a primary needs to initialize its sidecar database and wait for that to replicate via semi-sync before it becomes serving and can proceed to perform normal functions. So this delay could cause test flakiness if you required a healthy shard before continuing with the test.
Signed-off-by: Matt Lord <[email protected]>

vitess-bot · 2023-02-27T22:41:37Z

They looked like this:
WARNING: DATA RACE
Write at 0x000005bf9b60 by goroutine 27141:
  github.com/spf13/pflag.newUint64Value()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:9 +0x5a
  github.com/spf13/pflag.(*FlagSet).Uint64Var()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:45 +0x55
  vitess.io/vitess/go/vt/log.RegisterFlags()
      /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64
  vitess.io/vitess/go/vt/servenv.GetFlagSetFor()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183
  vitess.io/vitess/go/vt/servenv.ParseFlags()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49
...
Previous read at 0x000005bf9b60 by goroutine 27136:
  1744 github.com/golang/glog.(*syncBuffer).Write()
...
And they most often occurred in the wrangler unit tests, which makes sense because it creates a lot of loggers.
Signed-off-by: Matt Lord <[email protected]>

This reverts commit 51992b8. Signed-off-by: Matt Lord <[email protected]>

Signed-off-by: Matt Lord <[email protected]>

…althy Signed-off-by: Matt Lord <[email protected]>

deepthi · 2023-03-01T01:05:17Z

Looks like a good change. Do we have a sense for how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds.

mattlord · 2023-03-01T08:07:53Z

Looks like a good change. Do we have a sense for how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds.

Yeah, 1-3 seconds seems normal locally. But that is plenty of time for races. And sometimes we encounter very interesting timings on Actions runners. 🥲

vitess-bot · 2023-03-01T08:09:15Z

I was unable to backport this Pull Request to the following branches: release-16.0.

vitess-bot added the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Feb 27, 2023

mattlord added the Type: Internal Cleanup, Component: VReplication, Type: Testing, and Backport to: release-16.0 labels and removed the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Feb 27, 2023

mattlord requested review from rohit-nayak-ps and shlomi-noach February 27, 2023 22:51

mattlord requested a review from ajm188 February 27, 2023 23:54

Revert "Try to address unit test race flakes around log size"

a85c129

This reverts commit 51992b8. Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the flakes_vrepl_shard_healthy branch from d61154a to a85c129 Compare February 28, 2023 01:30

mattlord removed the request for review from ajm188 February 28, 2023 01:31

mattlord marked this pull request as ready for review February 28, 2023 01:31

mattlord requested a review from deepthi as a code owner February 28, 2023 01:31

mattlord added 2 commits February 27, 2023 23:12

Use external cluster vtctld in TestMigrate

337b47a

Signed-off-by: Matt Lord <[email protected]>

Use subshell vs command output interpolation

7cecadb

Signed-off-by: Matt Lord <[email protected]>

mattlord requested review from GuptaManan100, frouioui and harshit-gangal as code owners February 28, 2023 05:40

mattlord added 2 commits February 28, 2023 16:56

Ignore any config files in mysql alias

187aa99

Signed-off-by: Matt Lord <[email protected]>

Merge remote-tracking branch 'origin/main' into flakes_vrepl_shard_he…

3aec817

…althy Signed-off-by: Matt Lord <[email protected]>

deepthi approved these changes Mar 1, 2023

frouioui approved these changes Mar 1, 2023

mattlord merged commit 96c3dca into vitessio:main Mar 1, 2023

mattlord deleted the flakes_vrepl_shard_healthy branch March 1, 2023 08:09

GuptaManan100 mentioned this pull request Mar 28, 2023

[release-16.0] Flakes: Use new healthy shard check in vreplication e2e tests (#12502) #12740

Merged

hmaurer mentioned this pull request Mar 21, 2024

oops #15542

Closed
