Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examples, Flakes: Wait for Shard's VReplication Engine to Open #12560

Merged
merged 2 commits into from
Mar 7, 2023

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Mar 6, 2023

Description

The local examples CI workflow is still flaky — failing ~ 35% of the time — and it always fails here (or in the same way in later VReplication steps):

+ for shard in "customer/0"
+ true
+ command mysql --no-defaults -h 127.0.0.1 -P 15306 customer/0 -e 'show tables'
+ break
+ ./202_move_tables.sh
E0306 18:54:21.968491   13432 main.go:96] E0306 18:54:21.967743 traffic_switcher.go:208] buildTrafficSwitcher failed: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000200 error: vreplication engine is closed: vreplication engine is closed
E0306 18:54:21.969504   13432 main.go:105] remote error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000200 error: vreplication engine is closed: vreplication engine is closed
MoveTables Error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000200 error: vreplication engine is closed: vreplication engine is closed

You can see an example here (see any of the previous 4 failed runs): https://github.com/vitessio/vitess/actions/runs/4346133011

The CI test specifically waits for the shard to be able to successfully serve a query through the vtgate, but for some reason that doesn't always work as expected in the CI; I have not been able to repeat it locally — so perhaps a bash version difference or some other missing context variable OR perhaps the vtgate is able to serve queries but the VReplication engine takes much longer to open (GitHub runners being very slow at times is a common factor in our tests). Given that users can also run into this when running the steps as a sequence: ./101...; ./201...; ./202...; ./203... I added a VReplication engine check to the wait_for_healthy_shard library function.

With the changes in this PR, the workflow ran successfully 10 times in a row in the CI with each workflow run containing 3 subtests: 1 each for consul, etcd, and k8s topos, so the base test really passed 30 times in a row: https://github.com/vitessio/vitess/actions/runs/4347322200

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

@vitess-bot
Copy link
Contributor

vitess-bot bot commented Mar 6, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@mattlord mattlord force-pushed the flakes_local_example_vrepl branch from 0b43617 to 2584204 Compare March 6, 2023 19:43
@mattlord mattlord force-pushed the flakes_local_example_vrepl branch from 2584204 to 9a312eb Compare March 6, 2023 19:46
@mattlord mattlord removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 6, 2023
@mattlord mattlord marked this pull request as ready for review March 6, 2023 23:24
@mattlord mattlord requested a review from deepthi as a code owner March 6, 2023 23:24
Signed-off-by: Matt Lord <[email protected]>
Comment on lines +85 to +87
if vtctlclient --server=localhost:15999 Workflow -- "${keyspace}" listall &>/dev/null; then
break
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listall here exits with 1 or another value than 0 if the list is not valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by "the list is not valid", but:

$ vtctlclient Workflow -- commerce listall; echo $?
Workflow Error: rpc error: code = Unknown desc = no primary found for shard commerce/0
1

$ vtctlclient Workflow -- commerce listall; echo $?
Workflow Error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000100 error: vreplication engine is closed: vreplication engine is closed
1

$ vtctlclient Workflow -- foobar listall; echo $?
Workflow Error: rpc error: code = Unknown desc = node doesn't exist: /vitess/global/keyspaces/foobar/shards/
1

$ vtctlclient Workflow -- commerce listall; echo $?
No workflows found in keyspace commerce
0

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mattlord mattlord merged commit 2719170 into vitessio:main Mar 7, 2023
@mattlord mattlord deleted the flakes_local_example_vrepl branch March 7, 2023 22:26
mattlord added a commit to planetscale/vitess that referenced this pull request Mar 9, 2023
…sio#12560)

* Examples: wait for shard's vreplication engine to open

Signed-off-by: Matt Lord <[email protected]>

* Minor comment changes

Signed-off-by: Matt Lord <[email protected]>

---------

Signed-off-by: Matt Lord <[email protected]>
frouioui pushed a commit that referenced this pull request Mar 9, 2023
… (#12581)

* Examples: wait for shard's vreplication engine to open



* Minor comment changes



---------

Signed-off-by: Matt Lord <[email protected]>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants