
storage: deflake TestNodeLivenessStatusMap #36952

Merged
merged 1 commit into cockroachdb:master from 20190419-statusmap on Apr 24, 2019

Conversation

@knz knz (Contributor) commented Apr 19, 2019

Fixes #35675.

Prior to this patch, this test would fail stressrace after a few
dozen iterations.

With this patch, stressrace succeeds thousands of iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in testData, the test still fails properly.

Release note: None

@knz knz requested review from tbg and a team April 19, 2019 10:26
@cockroach-teamcity (Member)

This change is Reviewable

@knz knz (Contributor, Author) commented Apr 19, 2019

I do not have a suitable explanation as to why this change fixes it.

One hypothesis: before my change, the test checked all 4 statuses concurrently with separate SucceedsSoon loops. This also means that there were 4 goroutines mashing on the cluster setting, presumably in a racy fashion.
With the change, there is just one SucceedsSoon loop and the cluster setting is refreshed in an orderly fashion.
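For illustration only, here is a rough sketch of the two shapes being compared. The checkStatus helper and the exact placement of the setting override are assumptions for the sketch, not the actual test code; only testutils.SucceedsSoon, t.Run/t.Parallel and the TimeUntilStoreDead override appear in the real test.

// Before (sketch): four parallel subtests, each re-applying the cluster
// setting, so four goroutines race on it.
for _, test := range testData {
	t.Run(test.expectedStatus.String(), func(t *testing.T) {
		t.Parallel()
		testutils.SucceedsSoon(t, func() error {
			storage.TimeUntilStoreDead.Override(&firstServer.ClusterSettings().SV, storage.TestTimeUntilStoreDead)
			return checkStatus(test.nodeID, test.expectedStatus) // hypothetical helper
		})
	})
}

// After (sketch): a single SucceedsSoon loop; the setting is applied once
// up front and all four statuses are checked in order on every retry.
testutils.SucceedsSoon(t, func() error {
	for _, test := range testData {
		if err := checkStatus(test.nodeID, test.expectedStatus); err != nil {
			return err
		}
	}
	return nil
})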

@tbg tbg (Member) left a comment

Thanks for looking at this! Possible explanations inline.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 872 at r1 (raw file):

		t.Run(test.expectedStatus.String(), func(t *testing.T) {
			nodeID, expectedStatus := test.nodeID, test.expectedStatus
			t.Parallel()

We're not taking a loop-local copy of test. If nodeID and expectedStatus end up taken from different tests, things would be messed up. Unfortunately, that doesn't seem to be the case. Also, I can't reproduce this test flake at all, neither on my laptop nor on my gceworker.

But the test flake we see is always the LIVE test failing because the node is DEAD. TestTimeUntilStoreDead is 5ms, so under heavy stress that test is basically expected to fail; running more stuff in t.Parallel() can only hurt.

Mind updating your new code so that it passes the time until dead that it wants in the test case? deadNodeID and removedNodeID would pass TestTimeUntilStoreDead, the others would pass something high (5s or the default, if we can get that).
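A rough sketch of what this suggestion could look like. The timeUntilStoreDead field, the liveNodeID/decommissioningNodeID names, and the surrounding scaffolding are illustrative assumptions, not the actual patch; deadNodeID, removedNodeID, storage.TestTimeUntilStoreDead and the setting override come from the discussion above.

testData := []struct {
	nodeID             roachpb.NodeID
	expectedStatus     storagepb.NodeLivenessStatus
	timeUntilStoreDead time.Duration
}{
	{liveNodeID, storagepb.NodeLivenessStatus_LIVE, 5 * time.Second},
	{deadNodeID, storagepb.NodeLivenessStatus_DEAD, storage.TestTimeUntilStoreDead},
	{decommissioningNodeID, storagepb.NodeLivenessStatus_DECOMMISSIONING, 5 * time.Second},
	{removedNodeID, storagepb.NodeLivenessStatus_DECOMMISSIONED, storage.TestTimeUntilStoreDead},
}

for _, test := range testData {
	test := test // loop-local copy, per the comment above
	t.Run(test.expectedStatus.String(), func(t *testing.T) {
		storage.TimeUntilStoreDead.Override(&firstServer.ClusterSettings().SV, test.timeUntilStoreDead)
		testutils.SucceedsSoon(t, func() error {
			// check that test.nodeID reports test.expectedStatus
			return nil
		})
	})
}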


pkg/storage/node_liveness_test.go, line 886 at r1 (raw file):

		var errBuf bytes.Buffer

		for _, test := range testData {

I'd still do t.Run here, but not t.Parallel().

@knz knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/node_liveness_test.go, line 872 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We're not taking a loop-local copy of test. If nodeID and expectedStatus end up taken from different tests, things would be messed up. Unfortunately, that doesn't seem to be the case. Also, I can't reproduce this test flake at all, neither on my laptop nor on my gceworker.

But the test flake we see is always the LIVE test failing because the node is DEAD. TestTimeUntilStoreDead is 5ms, so under heavy stress that test is basically expected to fail; running more stuff in t.Parallel() can only hurt.

Mind updating your new code so that it passes the time until dead that it wants in the test case? deadNodeID and removedNodeID would pass TestTimeUntilStoreDead, the others would pass something high (5s or the default, if we can get that).

So you're saying that given enough load and a small enough TestTimeUntilStoreDead, any live node is going to appear dead sometimes?

In that case what you say below dooms this test to uselessness: if the node that's removed by the test is incorrectly not shut down, it could still appear as dead (although it's incorrectly still live) and then the test would pass with a false negative.

What is this test really testing?


pkg/storage/node_liveness_test.go, line 886 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd still do t.Run here, but not t.Parallel().

You can't t.Run inside SucceedsSoon. That's not how the test harness works.

@tbg tbg (Member) left a comment

:lgtm: mod the remaining comments.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/storage/node_liveness_test.go, line 872 at r1 (raw file):

then the test would pass with a false negative

Yes, but very rarely. This doesn't exactly let your heart jump with joy, but it will catch a regression here as fast as a test that doesn't have this false negative. Any timing-based test has this sort of problem. Arguably this test is working at a level that's too high. It should set up a self-contained nodeLiveness, set its state accordingly, and test the resulting status map.

What is this test really testing?

Heh. You know that you wrote this test, right? It's testing something useful, but not in a very useful way. We can't really blame the test on that though because there's just not a chance that you can set up a NodeLiveness manually -- it's just not modular at all.

I still think you should increase the time until store dead for the live and decommissioning case to avoid further flakes.


pkg/storage/node_liveness_test.go, line 886 at r1 (raw file):

Previously, knz (kena) wrote…

You can't t.Run inside SucceedsSoon. That's not how the test harness works.

I was thinking you'd pull the SucceedsSoon in so that you arrive at the usual test structure again:

for _, test := range testData {
  t.Run(test.expectedStatus.String(), func(t *testing.T) {
    testutils.SucceedsSoon(t, func() error { /* actual test */ })
  })
}

@knz knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

					dur = storage.TestTimeUntilStoreDead
				}
				storage.TimeUntilStoreDead.Override(&firstServer.ClusterSettings().SV, dur)

See, that's where I don't follow where you are going with this -- changing TimeUntilStoreDead in each of the sub-tests seems incorrect to me.

The situation I'm seeing (and that I am presuming you're not seeing) is that all the node states have been set at the beginning, before all the sub-tests were started. So before the first sub-test even starts, we have 1 live node and 3 nodes that are in various states of shutting down.

Now at that point your code here expands to:

  1. set TimeUntilStoreDead to 5 minutes
  2. check that node 1 is live
  3. set TimeUntilStoreDead to 15 ms
  4. check that node 2 is dead
  5. set TimeUntilStoreDead to 5 minutes
  6. check that node 3 is decommissioning
  7. set TimeUntilStoreDead to 15 ms
  8. check that node 4 is decommissioned

I think that's incorrect, because the time change at step 3 is likely going to make node 3 be marked as dead and make step 6 fail.

My understanding is that each change to TimeUntilStoreDead in this test structure will impact all the nodes in the test cluster at once, not just the one whose status is being checked at that step.

I think there are two ways forward here:

A. either re-arrange the 4 test cases so that the tests on the dead and decommissioned nodes appear last, after the two nodes expected to remain live (sketched below), or
B. have each sub-test do the work of changing the state of the node from live to the desired state, before running SucceedsSoon

The drawback with A is that we will lose one feature of the current code (mine), which is its ability to assert that node 1 which is meant to remain live throughout actually does remain live. If we re-organize the code like you did or like solution A, the node can be detected as live by the first sub-test, then its liveness record be screwed up by an update to the liveness of another node, and the remaining sub-tests won't catch that.

The drawback with B is that it loses the ability of the test to check how the liveness code tests simultaneous changes. Maybe that's not so bad? Or maybe this needs to be split into two tests.
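Purely as an illustration of option A (the node ID names are placeholders, not the actual test's variables), the reordering would amount to:

testData := []struct {
	nodeID         roachpb.NodeID
	expectedStatus storagepb.NodeLivenessStatus
}{
	// Nodes expected to remain live are checked first...
	{liveNodeID, storagepb.NodeLivenessStatus_LIVE},
	{decommissioningNodeID, storagepb.NodeLivenessStatus_DECOMMISSIONING},
	// ...and only then the cases that need the short TimeUntilStoreDead.
	{deadNodeID, storagepb.NodeLivenessStatus_DEAD},
	{removedNodeID, storagepb.NodeLivenessStatus_DECOMMISSIONED},
}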

@knz knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

Previously, knz (kena) wrote… (quoting the full comment above)

Also, I'd like to point out that you have still not explained to me why my initial PR (the commit without your fix on top) is wrong and why it needs to be changed.

@tbg tbg (Member) left a comment

I looked some more and found something interesting. Guess what this test outputs:

func TestFoo(t *testing.T) {
	for i := 0; i < 3; i++ {
		i := i
		log.Warningf(context.Background(), "running %d", i)
		t.Run("", func(t *testing.T) {
			t.Parallel()
			time.Sleep(time.Second)
			log.Warningf(context.Background(), "done %d", i)
		})
	}
	log.Warningf(context.Background(), "main test finished")
}

You (at least I) would expect "main test finished" to be the last message in the output, but in fact it's printed right after going through the subtests:

I190424 08:31:41.703784 1 rand.go:87  Random seed: 1463201139793967084
=== RUN   TestFoo
W190424 08:31:41.704131 37 storage/node_liveness_test.go:778  running 0
--- PASS: TestFoo (0.00s)
=== RUN   TestFoo/#00
=== PAUSE TestFoo/#00
W190424 08:31:41.704162 37 storage/node_liveness_test.go:778  running 1
=== CONT  TestFoo/#00
    --- PASS: TestFoo/#00 (1.00s)
=== RUN   TestFoo/#01
=== PAUSE TestFoo/#01
W190424 08:31:41.704181 37 storage/node_liveness_test.go:778  running 2
=== CONT  TestFoo/#01
    --- PASS: TestFoo/#01 (1.00s)
=== RUN   TestFoo/#02
=== PAUSE TestFoo/#02
W190424 08:31:41.704205 37 storage/node_liveness_test.go:785  main test finished
=== CONT  TestFoo/#02
W190424 08:31:42.705271 39 storage/node_liveness_test.go:782  done 1
W190424 08:31:42.705293 38 storage/node_liveness_test.go:782  done 0
W190424 08:31:42.705298 40 storage/node_liveness_test.go:782  done 2
    --- PASS: TestFoo/#02 (1.00s)

This means that the test in this PR tears down the main stopper before the subtests have necessarily finished. I'm also seeing this in the logs (there's some unexplained stopper output after we've explicitly stopped the nodes we were planning to stop, and before the test fails 45s later)

This is good, because that means that all this PR needs to do is to remove the t.Parallel(), and we should also be able to kill this hacky wannabe fix:

if expectedStatus == storagepb.NodeLivenessStatus_DECOMMISSIONING && st == storagepb.NodeLivenessStatus_DECOMMISSIONED {
	// Server somehow shut down super-fast. Tolerating the mismatch.
	return nil
}
WDYT? If you agree, mind rewriting the history in this PR and adding a commit message?

Taking a step back, we should also basically never use t.Parallel(). We do in a few places, wouldn't be surprised if not all of those are safe (at the very least, they steal time from leaktest.AfterTest(t)()):

pkg/rpc/context_test.go:			t.Parallel()
pkg/sql/distsqlrun/tablereader.go:const ParallelScanResultThreshold = 10000
pkg/sql/logictest/logic.go:							t.Parallel()
pkg/sql/physical_props_test.go:				t.Parallel()
pkg/sql/physical_props_test.go:						t.Parallel()
pkg/sql/physical_props_test.go:			t.Parallel()
pkg/sql/physical_props_test.go:			t.Parallel()
pkg/sql/physical_props_test.go:			t.Parallel()
pkg/sql/physical_props_test.go:			t.Parallel()
pkg/storage/node_liveness_test.go:			t.Parallel()
pkg/storage/node_liveness_test.go:			t.Parallel()
pkg/ts/rollup.go:				result.variance = computeParallelVariance(
pkg/ts/rollup.go:					datapoint.variance = computeParallelVariance(
pkg/ts/rollup.go:// computeParallelVariance computes the combined variance of two previously
pkg/ts/rollup.go:func computeParallelVariance(left, right parallelVarianceArgs) float64 {
pkg/util/fast_int_map_test.go:			t.Parallel()
pkg/util/fast_int_set_test.go:			t.Parallel()

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):
TimeUntilStoreDead matters in the call to GetLivenessStatusMap:

now := nl.clock.PhysicalTime()
livenesses := nl.GetLivenesses()
threshold := TimeUntilStoreDead.Get(&nl.st.SV)
maxOffset := nl.clock.MaxOffset()

Either I'm being really thick or we're talking past each other. The subtests don't run in parallel, they run sequentially (note absence of a call to t.Parallel()). So in your example, the call to GetLivenessStatusMap for n3 (step 6) will use 5 minutes and things will be well.

But I just took another look and now I'm not so sure of my theory about why it was flaky in the first place. I assumed that the node wasn't able to mark itself as live (or seemed dead) due to the low timeout, but that's not true. The time until store dead only matters after the liveness record expires, and that's after ~5s (which is also why this test takes >5s), and only if it doesn't get renewed (which it ought to). So my earlier claim that a short enough dead timeout implies that live nodes sometimes seem dead is false.

The running boolean and everything attached to it should go away again, and we're back to not really understanding why the test flaked in the first place. I'm looking at the logs right now and there's no sign of failed liveness heartbeats. I think we might be seeing something silly about the parallel subtests (and the missing test closure), but I don't have a good theory what it might be because the output seems to indicate that we're testing the right things. Will look some more.
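As a simplified, illustrative sketch of the point above (own toy types and names, not CockroachDB's actual implementation): the dead threshold is only consulted once a liveness record has already expired, which normally takes several seconds of missed heartbeats.

type status int

const (
	live status = iota
	unavailable
	dead
)

// livenessStatus is a toy version of the classification described above.
func livenessStatus(expiration, now time.Time, timeUntilStoreDead time.Duration) status {
	if now.Before(expiration) {
		return live // record still valid; heartbeats keep renewing it
	}
	if now.Sub(expiration) > timeUntilStoreDead {
		return dead // expired long enough ago to count as dead
	}
	return unavailable // expired, but not yet past the dead threshold
}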

Also, I'd like to point out that you have still not explained to me why my initial PR (the commit without your fix on top) is wrong and why it needs to be changed.

You've deflaked the test but you didn't have an explanation of why you deflaked it. The theory I had (which is now proven incorrect, meaning parts of the change are useless) implied that you had merely happened to fix it in the setup you were testing, but that the real flake was still there. Additionally, I reintroduced the subtests and moved the retry loop back into them, which you hadn't done (due to what we're discussing above) and which makes the test cleaner.

As I've mentioned before, this test is operating at the wrong level. The real test needs a refactor of node liveness so that it can be mocked out to get rid of all the timing and KV dependencies, and then this can be a true unit test that doesn't start servers and plays fast and loose with settings. I'd welcome that refactor, though there's a chance that it's hairier than I make it out to be.

@knz knz (Contributor, Author) left a comment

I am familiar with the behavior of t.Parallel() (I had to reverse-engineer it to understand why it was behaving so differently, back when I was implementing github.com/knz/cppbench); however, I never connected the dots and realized that this behavior is invalid when the main test is responsible for setting up a test cluster. Well spotted. You are right that we need to audit this. In fact, even with leaktest.AfterTest the behavior is still incorrect if the main test sets up a test cluster. In other places (e.g. the fast int tests, SQL physical properties, logic tests) I think it is safe (but we can still audit). I will look at the other tests after we are done with this PR.

As to simply removing t.Parallel() to de-flake: I have tested it and it seems to work. This also explains why my change in the initial commit of this PR also fixes the flake (it also removes t.Parallel()).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

The subtests don't run in parallel, they run sequentially (note absence of a call to t.Parallel()).

Yes, I am aware of this; this is why my comment provides a sequential list of steps (1 to 6).

So in your example, the call to GetLivenessStatusMap for n3 (step 6) will use 5 minutes and things will be well.

So here I sense you seem to know something I don't. My assumption, correct me if I'm wrong, is that there is a gossip/liveness goroutine running in the background while the test is running, in charge of updating liveness records, and that the TimeUntilStoreDead cluster setting is used both by GetLivenessStatusMap and by that goroutine.

So my earlier comment "the time change at step 3 is likely going to make node 3 be marked as dead" refers to the change to the cluster variable being picked up by that concurrent goroutine, not by the call to GetLivenessStatusMap performed by the test.

If you know / can guarantee that gossip updates records without ever looking at TimeUntilStoreDead then I'll have learned something, thank you for it, and then I'll retract my review comment and add a copious explanatory comment in the code.

Additionally, I reintroduced the subtests and moved the retry loop back into them

The drawback of having SucceedsSoon in the inner loop is that we're paying the price of the convergence for each of the 4 cases (up to a 45-second timeout), whereas my change in the original PR commit would make the test pay that price just once.

@tbg tbg (Member) left a comment

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

So here I sense you seem to know something I don't.

I think we clarified this offline already, but for the record, that's not how it works: it's only used by GetLivenessStatusMap (probably somewhere else too, but not where it matters for this test). The liveness frequency and durations aren't changed by that; they're on the order of seconds, in this test too.

The drawback of having SucceedsSoon in the inner loop

No, and maybe this was also based on the previous misunderstanding. Entering the SucceedsSoon has no bearing on when it succeeds, so we don't have to enter it artificially early in this test. (And even if it cost us a tiny bit of time, negligible in comparison to the total time taken by the test, I'll personally always take the clearer test.)

@knz knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):
Two points:

Entering the SucceedsSoon has no bearing on when it succeeds,

In the (unlikely) case where the test is going to fail (due to a regression elsewhere), having SucceedsSoon in each of the sub-tests will make the entire test take 4 x 45 = 180 seconds to fail. If SucceedsSoon surrounds the loop, the test would fail in at most 45 seconds. What do you think?

Then on to the other point, you haven't commented on this:

The drawback with [checking the statuses serially] is that we will lose one feature of the current code (mine), which is its ability to assert that node 1 which is meant to remain live throughout actually does remain live. If we re-organize the code like you did [...], the node can be detected as live by the first sub-test, then its liveness record be screwed up by an update to the liveness of another node, and the remaining sub-tests won't catch that.

The original code (before this PR) was kinda testing this already (but also unreliably/wrongly), because the 4 SucceedsSoon loops were running concurrently. The original commit in this PR preserves the status assertion throughout the entire test (the test verifies that when the last node is in the expected status, all nodes have remained in the expected status). Is this not important to test?

@tbg tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

What do you think?

I don't think that matters if as a result the test is less clear to read.

Then on to the other point, you haven't commented on this:

Yeah, that's an argument I can warm up to more than the other one generally, but I'm going to return to the point I've made a few times: this code is supposed to unit test the LiveStatusMap() code; it's not supposed to test that... running nodes heartbeat their liveness record? That is tested elsewhere. It's always tempting to test more, but this test is already exercising 100% of the stack just to create a map from a number of liveness records.

This isn't to say that I'll be in the way of changing it back. I think it's the wrong decision, but decisions like that are admittedly subjective. Having identified the root cause of the flakiness (t.Parallel()) is what matters to me; do what you think is best with the rest.

@knz knz force-pushed the 20190419-statusmap branch from 98e676a to 1587930 on April 24, 2019 13:16
@knz knz (Contributor, Author) left a comment

I have reworked the history and the commit message to reflect the outcome of our discussion.
Thank you again for all your help.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/storage/node_liveness_test.go, line 872 at r1 (raw file):

You know that you wrote this test, right?

For the record, I wrote this test at a time when I had no idea what I was doing. I wrote this test but really the design was yours :)

Anyway all is good now.


pkg/storage/node_liveness_test.go, line 898 at r2 (raw file):

this code is supposed to unit test the LiveStatusMap() code, it's not supposed to test that.. running nodes heartbeat their liveness record

ah, I didn't catch that the link between the liveness records and heartbeats was already tested elsewhere. In that case my objection is removed.

@knz knz force-pushed the 20190419-statusmap branch from 1587930 to e01b5bb on April 24, 2019 13:19
@knz knz (Contributor, Author) commented Apr 24, 2019

bors r=tbg

@tbg tbg (Member) left a comment

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)


pkg/storage/node_liveness_test.go, line 862 at r3 (raw file):

		expectedStatus storagepb.NodeLivenessStatus

		// This is a bit of a hack: we want to run with a low TimeUntilStoreDead

I thought we'd agreed that running and everything associated with it would go away, because I was wrong about the dead timeout influencing whether a live node would seem dead. We should always run with the 5ms timeout.

@knz knz (Contributor, Author) commented Apr 24, 2019

oh I misunderstood

bors r-

@craig craig bot (Contributor) commented Apr 24, 2019

Canceled

@knz knz force-pushed the 20190419-statusmap branch from e01b5bb to 901c5e8 on April 24, 2019 13:26
@knz knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/storage/node_liveness_test.go, line 862 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I thought we'd agreed that running and everything associated with it would go away, because I was wrong about the dead timeout influencing whether a live node would seem dead. We should always run with the 5ms timeout.

Done.

@tbg tbg (Member) left a comment

If this is flaky now, I'll be so sad. Thanks for fixing!

Prior to this patch, this test would fail `stressrace` after a few
dozen iterations. The root cause of this was the invalid call to
`t.Parallel()`, which this patch removes.

Additionally, this patch adapts TimeUntilStoreDead for each test case
to avoid flakes, and removes a previous hack obviated by this
simplification.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@knz knz force-pushed the 20190419-statusmap branch from 901c5e8 to af6a6e8 on April 24, 2019 13:27
@knz knz (Contributor, Author) commented Apr 24, 2019

bors r=tbg

craig bot pushed a commit that referenced this pull request Apr 24, 2019
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail `stressrace` after a few
dozen iterations.

With this patch, `stressrace` succeeds thousands of iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot (Contributor) commented Apr 24, 2019

Build succeeded

@craig craig bot merged commit af6a6e8 into cockroachdb:master Apr 24, 2019
@knz knz deleted the 20190419-statusmap branch April 24, 2019 14:03
Successfully merging this pull request may close these issues.

storage: TestNodeLivenessStatusMap failed under stress