
fix flaky TestReplication_FederationStates test due to race conditions #7612

Conversation

**pierresouchay** (Contributor)

When comparing FederationState between local and remote states, the update logic compared the lastIndex, but this index on the remote side could be lower than the max index already retrieved and stored from the remote. To be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it).

This fix does not address the root cause (the ModifiedIndex being set to 0 in the remote DC), but it is a useful workaround: it restores convergence even when the remote server is not behaving as it should. Without it, that condition breaks the sync of updates and the remote object is never synced.

To be sure we don't forget anything, I also added a test over `isSame()` to verify the objects are identical.

This was the reason the test `TestReplication_FederationStates()` failed randomly. It is probably also the cause of some integration tests failing randomly.
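To illustrate the failure mode described above, here is a minimal, self-contained sketch (hypothetical simplified types, not Consul's actual structs or replication code): when the remote `ModifyIndex` is stuck at 0, an index-only comparison skips the update forever, while a content comparison in the spirit of `isSame()` still detects the difference.

```go
package main

import "fmt"

// FedState is a hypothetical, simplified stand-in for the real
// federation state struct.
type FedState struct {
	Datacenter  string
	Data        string
	ModifyIndex uint64
}

// isSame mirrors the idea behind the comparison added in this PR:
// decide equality on the actual payload, not on raft indexes alone.
func isSame(local, remote *FedState) bool {
	return local.Datacenter == remote.Datacenter && local.Data == remote.Data
}

// needsUpdateByIndex is the fragile pre-fix logic: if the remote
// ModifyIndex is stuck at 0 (lower than the index already stored
// locally), the update is skipped forever even though the data differs.
func needsUpdateByIndex(local, remote *FedState) bool {
	return remote.ModifyIndex > local.ModifyIndex
}

// needsUpdate is the workaround: fall back to comparing content.
func needsUpdate(local, remote *FedState) bool {
	return !isSame(local, remote)
}

func main() {
	local := &FedState{Datacenter: "alt-dc1", Data: "old", ModifyIndex: 7}
	remote := &FedState{Datacenter: "alt-dc1", Data: "new", ModifyIndex: 0} // broken index

	fmt.Println(needsUpdateByIndex(local, remote)) // false: sync never happens
	fmt.Println(needsUpdate(local, remote))        // true: content diff catches it
}
```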

With this fix, the following command, which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

Might also fix #7575

@pierresouchay force-pushed the fix_incorrect_sync_of_state_in_DiffRemoteAndLocalState branch from fc1cb8f to 1512d59 on April 8, 2020 11:30
@pierresouchay force-pushed the fix_incorrect_sync_of_state_in_DiffRemoteAndLocalState branch from 1512d59 to 437dee2 on April 8, 2020 11:51
**rboyer** (Member) commented Apr 8, 2020

Let me dig into this. The general logic around that part of the replication is actually present in the replication paths for several things already, so if it's a bug in one of them, it is likely either a bug in all of them or an underlying problem with how federation states were implemented.

**pierresouchay** (Contributor, Author) commented Apr 8, 2020

@rboyer Yes, that's weird. I did not see anything special in the code, from the Tx methods to the sync code, but it definitely occurs.

To highlight it:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

Put a warning (or even a panic) wherever any of the remote entries has a ModifiedIndex of 0: every time the test fails (~1/10 times on my machine; with the patch, I was able to run the test 300+ times without any error), it triggers, and it does not when the test passes.

I also tried to change the number of updates in the test from 50 to 3 and to 255; it fails with around the same frequency.

**rboyer** (Member) commented Apr 8, 2020

Ah, I think I figured out what is actually happening. Because this takes the short-circuit RPC path (via inmemCodec), the original structs.FederationState instances passed in to populate the state machine are actually the ones that get persisted in memdb. When things get committed, their ModifyIndex & CreateIndex are assigned in place. This is the data race.

This means the slice of federation states captured at the start of the test actually holds pointers into memdb state, rather than copies. Since there's a data race, all bets are off as to what actually happens during the rest of the test.
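The aliasing described above can be sketched in a few lines (a simplified illustration, not Consul's actual inmemCodec or state store): an in-process "RPC" that skips serialization retains the caller's pointer directly, so a later in-place commit mutates the struct the test still holds.

```go
package main

import "fmt"

// FedState is a simplified stand-in for the persisted struct.
type FedState struct {
	Datacenter  string
	ModifyIndex uint64
}

// store mimics the state machine behind the short-circuit RPC path.
type store struct{ states []*FedState }

// apply mimics the short-circuit path: there is no encode/decode step,
// so the argument pointer itself is retained by the state store, and
// the commit assigns the raft index in place.
func (s *store) apply(arg *FedState) {
	s.states = append(s.states, arg) // shared reference, not a copy
	arg.ModifyIndex = 42             // mutates the caller's struct too
}

func main() {
	db := &store{}
	mine := &FedState{Datacenter: "alt-dc1"}
	db.apply(mine)

	// With a real network codec, `mine` would be a private, untouched
	// copy; here it is the very instance stored in the "database".
	fmt.Println(mine.ModifyIndex)     // 42
	fmt.Println(mine == db.states[0]) // true: one shared instance
}
```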

Try this patch and see if you can make it fail anymore:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..e8d486a58 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -10,6 +10,7 @@ import (
 	"github.com/hashicorp/consul/api"
 	"github.com/hashicorp/consul/sdk/testutil/retry"
 	"github.com/hashicorp/consul/testrpc"
+	"github.com/mitchellh/copystructure"
 	"github.com/stretchr/testify/require"
 )
 
@@ -42,6 +43,12 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc1")
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
+	duplicate := func(t *testing.T, s *structs.FederationState) *structs.FederationState {
+		s2, err := copystructure.Copy(s)
+		require.NoError(t, err)
+		return s2.(*structs.FederationState)
+	}
+
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
 	var fedStates []*structs.FederationState
 	for i := 0; i < 50; i++ {
@@ -67,7 +74,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStates = append(fedStates, duplicate(t, arg.State))
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -130,7 +137,7 @@ func TestReplication_FederationStates(t *testing.T) {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State:      duplicate(t, fedState),
 		}
 
 		out := false
```

**pierresouchay** (Contributor, Author)
dammit, I thought about this as well during my investigations, but did not see how there could be a race condition here.... I am testing this

**rboyer** (Member) commented Apr 8, 2020

inmemCodec is sneaky because it causes an unexpected data relationship during the test. The more common problem with mutating the contents of memdb is reading something back out of memdb and forgetting it's a shared reference that you should not mutate.
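That second, more common pitfall (mutating a shared reference read back out of memdb) has a simple remedy: take a shallow copy first and mutate only the copy. A minimal sketch, with hypothetical simplified types:

```go
package main

import "fmt"

// RaftIndex and FedState are simplified stand-ins for the real structs.
type RaftIndex struct{ CreateIndex, ModifyIndex uint64 }

type FedState struct {
	Datacenter string
	RaftIndex  RaftIndex
}

// zeroedCopy returns a shallow copy with the raft fields cleared,
// leaving the (shared) original untouched.
func zeroedCopy(s *FedState) FedState {
	cp := *s
	cp.RaftIndex = RaftIndex{}
	return cp
}

func main() {
	// Pretend this pointer was read back out of memdb: it is shared
	// state that must not be mutated.
	fromDB := &FedState{
		Datacenter: "alt-dc1",
		RaftIndex:  RaftIndex{CreateIndex: 1, ModifyIndex: 9},
	}

	// Wrong: `fromDB.RaftIndex = RaftIndex{}` would corrupt the
	// "database" copy. Right: mutate only a local copy.
	cp := zeroedCopy(fromDB)

	fmt.Println(fromDB.RaftIndex.ModifyIndex) // 9: stored copy untouched
	fmt.Println(cp.RaftIndex.ModifyIndex)     // 0: only the copy is zeroed
}
```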

**pierresouchay** (Contributor, Author)

@rboyer No, still failing with your patch :( I am using the latest patch on the branch (so exactly your patch; I reverted mine), and it fails the same way:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
ok  	github.com/hashicorp/consul/agent/consul	5.510s
1 ok  	github.com/hashicorp/consul/agent/consul	5.232s
2 ok  	github.com/hashicorp/consul/agent/consul	5.468s
3 ok  	github.com/hashicorp/consul/agent/consul	4.990s
4 ok  	github.com/hashicorp/consul/agent/consul	5.096s
5 [INFO] freeport: blockSize 1500 too big for system limit 256. Adjusting...
[INFO] freeport: detected ephemeral port range of [49152, 65535]
--- FAIL: TestReplication_FederationStates (11.14s)
    writer.go:29: 2020-04-08T18:44:12.433+0200 [INFO]  TestReplication_FederationStates-node-1.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:89bfdb87-2be9-d8c8-02db-7af1a185b991 Address:127.0.0.1:12278}]"
```

**rboyer** (Member) commented Apr 8, 2020

Weird, I didn't get any failures with that patch. Just to rule out data races entirely, here's a variant that removes even more theoretical race conditions:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..79f9eb8bb 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -43,7 +43,7 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
-	var fedStates []*structs.FederationState
+	var fedStateDCs []string
 	for i := 0; i < 50; i++ {
 		dc := fmt.Sprintf("alt-dc%d", i+1)
 		ip1 := fmt.Sprintf("1.2.3.%d", i+1)
@@ -67,7 +67,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStateDCs = append(fedStateDCs, dc)
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -126,11 +126,13 @@ func TestReplication_FederationStates(t *testing.T) {
 		checkSame(r)
 	})
 
-	for _, fedState := range fedStates {
+	for _, fedStateDC := range fedStateDCs {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State: &structs.FederationState{
+				Datacenter: fedStateDC,
+			},
 		}
 
 		out := false
```
I am also running a faster, modified version of your scriptlet:

```
i=0; while go test -timeout 30s ./agent/consul -count=1 -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```

The main difference is exploiting that any `-count` value forces go to bypass the test cache, so you don't have to clear it explicitly.

**rboyer** (Member) commented Apr 8, 2020

🤦‍♂️

> The more common problem with mutating the contents of memdb is reading something back out of memdb and forgetting it's a shared reference that you should not mutate.

```go
	checkSame := func(t *retry.R) error {
		_, remote, err := s1.fsm.State().FederationStateList(nil)
		require.NoError(t, err)
		_, local, err := s2.fsm.State().FederationStateList(nil)
		require.NoError(t, err)

		require.Len(t, local, len(remote))
		for i, _ := range remote {
			// zero out the raft data for future comparisons
			remote[i].RaftIndex = structs.RaftIndex{} // <--- LIKE THIS
			local[i].RaftIndex = structs.RaftIndex{}  // <--- AND THIS
			require.Equal(t, remote[i], local[i])
		}
		return nil
	}

	// Wait for the replica to converge.
	retry.Run(t, func(r *retry.R) {
		checkSame(r)
	})
```

**rboyer** (Member) commented Apr 8, 2020

So after heeding my own advice from upthread I think I fixed both memdb corruptions in the test:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..614682b90 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -43,7 +43,7 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
-	var fedStates []*structs.FederationState
+	var fedStateDCs []string
 	for i := 0; i < 50; i++ {
 		dc := fmt.Sprintf("alt-dc%d", i+1)
 		ip1 := fmt.Sprintf("1.2.3.%d", i+1)
@@ -67,7 +67,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStateDCs = append(fedStateDCs, dc)
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -78,10 +78,14 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		require.Len(t, local, len(remote))
 		for i, _ := range remote {
+			// Make lightweight copies so we can zero out the raft fields
+			// without mutating the copies in memdb.
+			remoteCopy := *remote[i]
+			localCopy := *local[i]
 			// zero out the raft data for future comparisons
-			remote[i].RaftIndex = structs.RaftIndex{}
-			local[i].RaftIndex = structs.RaftIndex{}
-			require.Equal(t, remote[i], local[i])
+			remoteCopy.RaftIndex = structs.RaftIndex{}
+			localCopy.RaftIndex = structs.RaftIndex{}
+			require.Equal(t, remoteCopy, localCopy)
 		}
 		return nil
 	}
@@ -126,11 +130,13 @@ func TestReplication_FederationStates(t *testing.T) {
 		checkSame(r)
 	})
 
-	for _, fedState := range fedStates {
+	for _, fedStateDC := range fedStateDCs {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State: &structs.FederationState{
+				Datacenter: fedStateDC,
+			},
 		}
 
 		out := false
```

**rboyer** (Member) commented Apr 8, 2020

The reason this isn't broken in the tests for ACL tokens, roles, and policies is that in all of those cases the tests don't do the deep equality checking that the federation state test does (they instead check canary fields like Hash).

**pierresouchay** (Contributor, Author)

CS is hard

**pierresouchay** (Contributor, Author)

@rboyer The test is currently ongoing (80+ runs without failure now) and seems OK.
Thank you for the help. I had been bitten by this a year ago already; I should have looked more closely.
I was convinced it was not a unit-test problem because it looked similar to random failures in some integration tests :(

**pierresouchay** (Contributor, Author)

103 runs without failure, seems fixed.

@rboyer rboyer changed the title [BUGFIX] DiffRemoteAndLocalState incorrectly sync changes with remote fix flaky TestReplication_FederationStates test due to race conditions Apr 9, 2020
**rboyer** (Member) left a comment


LGTM
