Flakes: Use new healthy shard check in vreplication e2e tests #12502

Merged
mattlord merged 7 commits into vitessio:main from flakes_vrepl_shard_healthy
Mar 1, 2023

Contributor

@mattlord mattlord commented Feb 27, 2023

Description

I have noticed that some CI workflows, vreplication_basic for example, have become less reliable since the SidecarDB init work was merged. The failures I saw were always in the TestVreplicationCopyThrottling test. The cause is that we now need a more reliable check for a healthy shard.

This is needed because checking that there's a primary tablet for the shard in vtgate's healthcheck (which is what vtgate.WaitForStatusOfTabletInShard() does) is no longer a reliable indicator that the shard has a healthy serving primary. After a primary is elected, it now needs to initialize its sidecar database and wait for those DDLs to replicate via semi-sync replication before it becomes serving and can proceed to perform normal functions. This delay could cause test flakiness if a test required a healthy shard before continuing, as was the case in TestVreplicationCopyThrottling.

This PR migrates all of the places in the vreplication endtoend tests where we were using this to wait for a healthy shard:
vtgate.WaitForStatusOfTabletInShard(fmt.Sprintf("%s.%s.primary", keyspace, shard), 1)
to instead use the new WaitForHealthyShard cluster package helper:

err := cluster.WaitForHealthyShard(vc.VtctldClient, keyspace, shard)
require.NoError(t, err)

Note: After these changes, the vreplication_basic CI workflow, for example, has passed every time: https://github.com/vitessio/vitess/actions/runs/4297831822

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Use new healthy shard check in vreplication e2e tests

This is needed because checking that there's a primary tablet for
the shard in vtgate's healthcheck is no longer a reliable indicator
that the shard has a healthy serving primary, because now a
primary needs to initialize its sidecar database and wait for
that to replicate via semi-sync before it becomes serving and can
proceed to perform normal functions. So this delay could cause
test flakiness if you required a healthy shard before continuing
with the test.

Signed-off-by: Matt Lord <[email protected]>
@vitess-bot vitess-bot bot added the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Feb 27, 2023
@vitess-bot
Contributor

vitess-bot bot commented Feb 27, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test to explain what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

Try to address unit test race flakes around log size

They looked like this:
WARNING: DATA RACE
Write at 0x000005bf9b60 by goroutine 27141:
  github.com/spf13/pflag.newUint64Value()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:9 +0x5a
  github.com/spf13/pflag.(*FlagSet).Uint64Var()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:45 +0x55
  vitess.io/vitess/go/vt/log.RegisterFlags()
      /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64
  vitess.io/vitess/go/vt/servenv.GetFlagSetFor()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183
  vitess.io/vitess/go/vt/servenv.ParseFlags()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49
...
Previous read at 0x000005bf9b60 by goroutine 27136:
  github.com/golang/glog.(*syncBuffer).Write()
...

And they most often occurred in the wrangler unit tests, which makes sense
because it creates a lot of loggers.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord requested a review from ajm188 February 27, 2023 23:54
@mattlord mattlord force-pushed the flakes_vrepl_shard_healthy branch from d61154a to a85c129 Compare February 28, 2023 01:30
@mattlord mattlord removed the request for review from ajm188 February 28, 2023 01:31
@mattlord mattlord marked this pull request as ready for review February 28, 2023 01:31
@mattlord mattlord requested a review from deepthi as a code owner February 28, 2023 01:31
@deepthi
Member

deepthi commented Mar 1, 2023

Looks like a good change. Do we have a sense for how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds.

@mattlord
Contributor Author

mattlord commented Mar 1, 2023

> Looks like a good change. Do we have a sense for how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds.

Yeah, 1-3 seconds seems normal locally. But that is plenty of time for races. And sometimes we encounter very interesting timings on Actions runners. 🥲

@mattlord mattlord merged commit 96c3dca into vitessio:main Mar 1, 2023
@mattlord mattlord deleted the flakes_vrepl_shard_healthy branch March 1, 2023 08:09
@vitess-bot
Contributor

vitess-bot bot commented Mar 1, 2023

I was unable to backport this Pull Request to the following branches: release-16.0.

GuptaManan100 pushed a commit to planetscale/vitess that referenced this pull request Mar 28, 2023
…io#12502)

* Use new healthy shard check in vreplication e2e tests

This is needed because checking that there's a primary tablet for
the shard in vtgate's healthcheck is no longer a reliable indicator
that the shard has a healthy serving primary, because now a
primary needs to initialize its sidecar database and wait for
that to replicate via semi-sync before it becomes serving and can
proceed to perform normal functions. So this delay could cause
test flakiness if you required a healthy shard before continuing
with the test.

Signed-off-by: Matt Lord <[email protected]>

* Try to address unit test race flakes around log size

They looked like this:
WARNING: DATA RACE
Write at 0x000005bf9b60 by goroutine 27141:
  github.com/spf13/pflag.newUint64Value()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:9 +0x5a
  github.com/spf13/pflag.(*FlagSet).Uint64Var()
      /home/runner/go/pkg/mod/github.com/spf13/[email protected]/uint64.go:45 +0x55
  vitess.io/vitess/go/vt/log.RegisterFlags()
      /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64
  vitess.io/vitess/go/vt/servenv.GetFlagSetFor()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183
  vitess.io/vitess/go/vt/servenv.ParseFlags()
      /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49
...
Previous read at 0x000005bf9b60 by goroutine 27136:
  github.com/golang/glog.(*syncBuffer).Write()
...

And they most often occurred in the wrangler unit tests, which makes sense
because it creates a lot of loggers.

Signed-off-by: Matt Lord <[email protected]>

* Revert "Try to address unit test race flakes around log size"

This reverts commit 51992b8.

Signed-off-by: Matt Lord <[email protected]>

* Use external cluster vtctld in TestMigrate

Signed-off-by: Matt Lord <[email protected]>

* Use subshell vs command output interpolation

Signed-off-by: Matt Lord <[email protected]>

* Ignore any config files in mysql alias

Signed-off-by: Matt Lord <[email protected]>

---------

Signed-off-by: Matt Lord <[email protected]>
frouioui pushed a commit that referenced this pull request Mar 28, 2023
#12740)

Signed-off-by: Matt Lord <[email protected]>
Co-authored-by: Matt Lord <[email protected]>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
3 participants