Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flakes: Address Common Unit Test Races #12546

Merged
merged 4 commits into from
Mar 6, 2023

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Mar 3, 2023

Description

The unit race workflow was failing pretty regularly in the CI (~20% of the time), and the two failures that I would regularly see were:

  1. testVDiffEnv related races — most often in the wrangler unit tests, as they have the widest use
    • These were simply NOT concurrency-safe
  2. glog.MaxSize related races — most often in the wrangler unit tests as they create a LOT of loggers, primarily ConsoleLoggers and MemoryLoggers
    • The way we modified the glog.MaxSize global exported variable in our --log_rotate_max_size flag handling was not concurrency-safe

I was able to quickly repeat these failures locally as well using the method below. After the changes in this PR, I was able to run it successfully 100 times in a row:

cnt=1
while true; do
    echo -n "Attempt# ${cnt} :: "
    go test -timeout 30s -parallel 6 -count 1 -race vitess.io/vitess/go/vt/wrangler 2>>/tmp/results
    if [[ $? -ne 0 ]]; then
         echo "Failed on attempt ${cnt}"
         break
    fi
    let cnt+=1
    sleep 1
done
Attempt# 1 :: ok  	vitess.io/vitess/go/vt/wrangler	9.320s
...
Attempt# 100 :: ok  	vitess.io/vitess/go/vt/wrangler	9.503s

It was failing in the CI about 1 in 4-5 attempts. In this PR it passed 19 out of 20 times, 18 times in a row — which is at least a big improvement.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Signed-off-by: Matt Lord <[email protected]>
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 3, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Mar 3, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@mattlord mattlord added Component: Build/CI Type: Testing Flakes and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Mar 3, 2023
@mattlord mattlord force-pushed the flakes_unit_test_race branch from a662305 to 89e142b Compare March 3, 2023 15:18
@mattlord mattlord force-pushed the flakes_unit_test_race branch from 89e142b to 8a63f38 Compare March 3, 2023 15:39
@mattlord mattlord marked this pull request as ready for review March 5, 2023 01:05
go/vt/log/log.go Outdated Show resolved Hide resolved
go/vt/log/log.go Outdated
@@ -78,5 +81,32 @@ var (
// calls this function, or call this function directly before parsing
// command-line arguments.
func RegisterFlags(fs *pflag.FlagSet) {
fs.Uint64Var(&glog.MaxSize, "log_rotate_max_size", glog.MaxSize, "size in bytes at which logs are rotated (glog.MaxSize)")
flagVal := logRotateMaxSize{
val: "1887436800", // glog.MaxSize value (which is not concurrency safe)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying the value here, can't we use glob.MaxSize itself here? Maybe with an atomic load then I guess?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, lol, come on GitHub PR UI #12546 (comment)

so, uhh, yeah I agree with @dbussink here 😂

Copy link
Contributor Author

@mattlord mattlord Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the first thing I tried and it helped but not as much (failed every 20-30 times locally IIRC). I could go back to that though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving it a go in the CI now (will do 20 test runs again): 0c5be06

IIRC, the race moved to pflag.Parse/TryParse around flag.TrickGlog but let's see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news is that we don't seem to lose much if any reliability in the CI with that change: https://github.com/vitessio/vitess/actions/runs/4336855297/attempts/19

So I think we all feel better about it now. 🙂 Thanks!

Copy link
Contributor

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one small (potential) improvement, take it or leave it! and thanks!

@@ -94,6 +80,17 @@ func newTestVDiffEnv(sourceShards, targetShards []string, query string, position
}
env.wr = wrangler.NewTestWrangler(env.cmdlog, env.topoServ, env.tmc)

dialerName := fmt.Sprintf("VDiffTest-%d", rand.Intn(1000000000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we update this to take a *testing.T (or better, testing.TB), you can parameterize this to fmt.Sprintf("VDiffTest-%s", t.Name()) and avoid any chance of collision entirely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I did that here (kept the rand int part for too): 2d7e509

Thanks!

go/vt/wrangler/wrangler_env_test.go Outdated Show resolved Hide resolved
@mattlord mattlord merged commit ef28c13 into vitessio:main Mar 6, 2023
@mattlord mattlord deleted the flakes_unit_test_race branch March 6, 2023 18:14
rohit-nayak-ps pushed a commit to planetscale/vitess that referenced this pull request Mar 28, 2023
* Deflake unit race tests

Signed-off-by: Matt Lord <[email protected]>

* Try to address glog.MaxSize race

Signed-off-by: Matt Lord <[email protected]>

* Test w/o using literal copy of glog.MaxSize value in CI

Signed-off-by: Matt Lord <[email protected]>

* Make the dialer names truly unique

Signed-off-by: Matt Lord <[email protected]>

---------

Signed-off-by: Matt Lord <[email protected]>
frouioui pushed a commit that referenced this pull request Mar 29, 2023
* Flakes: Address Common Unit Test Races (#12546)

* Deflake unit race tests

Signed-off-by: Matt Lord <[email protected]>

* Try to address glog.MaxSize race

Signed-off-by: Matt Lord <[email protected]>

* Test w/o using literal copy of glog.MaxSize value in CI

Signed-off-by: Matt Lord <[email protected]>

* Make the dialer names truly unique

Signed-off-by: Matt Lord <[email protected]>

---------

Signed-off-by: Matt Lord <[email protected]>

* Use atomic.Bool for fakesqldb behavior flags (#12603)

Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Co-authored-by: Matt Lord <[email protected]>

---------

Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Max Englander <[email protected]>
Co-authored-by: Matt Lord <[email protected]>
Co-authored-by: Max Englander <[email protected]>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants