
fix flaky TestReplication_FederationStates test due to race conditions #7612

Conversation

**pierresouchay** (Contributor)

When comparing FederationState between local and remote states, the update logic compared the lastIndex, but this index on the remote side could be lower than the max index already retrieved and stored from the remote. To be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it).

This fix does not address the root cause (the ModifiedIndex being set to 0 in the remote DC), but it is a useful workaround: it restores convergence even when the remote server is not behaving as it should. Without it, that condition breaks the sync of updates and the remote object is never synced.

To be sure we don't forget anything, I also added a test over `isSame()` to verify the objects are identical.

This was the reason the test `TestReplication_FederationStates()` failed randomly. It is probably also the cause of some integration tests failing randomly.
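To illustrate the failure mode described above, here is a minimal, self-contained sketch (hypothetical simplified types, not Consul's actual structs or replication code): when the remote `ModifyIndex` is stuck at 0, an index-only comparison skips the update forever, while a content comparison in the spirit of `isSame()` still detects the difference.

```go
package main

import "fmt"

// FedState is a hypothetical, simplified stand-in for the real
// federation state struct.
type FedState struct {
	Datacenter  string
	Data        string
	ModifyIndex uint64
}

// isSame mirrors the idea behind the comparison added in this PR:
// decide equality on the actual payload, not on raft indexes alone.
func isSame(local, remote *FedState) bool {
	return local.Datacenter == remote.Datacenter && local.Data == remote.Data
}

// needsUpdateByIndex is the fragile pre-fix logic: if the remote
// ModifyIndex is stuck at 0 (lower than the index already stored
// locally), the update is skipped forever even though the data differs.
func needsUpdateByIndex(local, remote *FedState) bool {
	return remote.ModifyIndex > local.ModifyIndex
}

// needsUpdate is the workaround: fall back to comparing content.
func needsUpdate(local, remote *FedState) bool {
	return !isSame(local, remote)
}

func main() {
	local := &FedState{Datacenter: "alt-dc1", Data: "old", ModifyIndex: 7}
	remote := &FedState{Datacenter: "alt-dc1", Data: "new", ModifyIndex: 0} // broken index

	fmt.Println(needsUpdateByIndex(local, remote)) // false: sync never happens
	fmt.Println(needsUpdate(local, remote))        // true: content diff catches it
}
```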

With this fix, the following command, which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

Might also fix #7575

@pierresouchay force-pushed the fix_incorrect_sync_of_state_in_DiffRemoteAndLocalState branch from fc1cb8f to 1512d59 on April 8, 2020 11:30
@pierresouchay force-pushed the fix_incorrect_sync_of_state_in_DiffRemoteAndLocalState branch from 1512d59 to 437dee2 on April 8, 2020 11:51
**rboyer** (Member) commented Apr 8, 2020

Let me dig into this. The general logic around that part of the replication is actually present in the replication paths for several things already, so if it's a bug in one of them, it is likely either a bug in all of them or an underlying problem with how federation states were implemented.

**pierresouchay** (Contributor, Author) commented Apr 8, 2020

@rboyer Yes, that's weird. I did not see anything special in the code, from the Tx methods to the sync code, but it definitely occurs.

To highlight it:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

Put a warning (or even a panic) wherever any of the remote entries has a ModifiedIndex of 0: every time the test fails (~1/10 times on my machine; with the patch, I was able to run the test 300+ times without any error), it triggers, and it does not when the test passes.

I also tried to change the number of updates in the test from 50 to 3 and to 255; it fails with around the same frequency.

**rboyer** (Member) commented Apr 8, 2020

Ah, I think I figured out what is actually happening. Because this takes the short-circuit RPC path (via inmemCodec), the original structs.FederationState instances passed in to populate the state machine are actually the ones that get persisted in memdb. When things get committed, their ModifyIndex & CreateIndex are assigned in place. This is the data race.

This means the slice of federation states captured at the start of the test actually holds pointers into memdb state, rather than copies. Since there's a data race, all bets are off as to what actually happens during the rest of the test.
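The aliasing described above can be sketched in a few lines (a simplified illustration, not Consul's actual inmemCodec or state store): an in-process "RPC" that skips serialization retains the caller's pointer directly, so a later in-place commit mutates the struct the test still holds.

```go
package main

import "fmt"

// FedState is a simplified stand-in for the persisted struct.
type FedState struct {
	Datacenter  string
	ModifyIndex uint64
}

// store mimics the state machine behind the short-circuit RPC path.
type store struct{ states []*FedState }

// apply mimics the short-circuit path: there is no encode/decode step,
// so the argument pointer itself is retained by the state store, and
// the commit assigns the raft index in place.
func (s *store) apply(arg *FedState) {
	s.states = append(s.states, arg) // shared reference, not a copy
	arg.ModifyIndex = 42             // mutates the caller's struct too
}

func main() {
	db := &store{}
	mine := &FedState{Datacenter: "alt-dc1"}
	db.apply(mine)

	// With a real network codec, `mine` would be a private, untouched
	// copy; here it is the very instance stored in the "database".
	fmt.Println(mine.ModifyIndex)     // 42
	fmt.Println(mine == db.states[0]) // true: one shared instance
}
```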

Try this patch and see if you can make it fail anymore:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..e8d486a58 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -10,6 +10,7 @@ import (
 	"github.com/hashicorp/consul/api"
 	"github.com/hashicorp/consul/sdk/testutil/retry"
 	"github.com/hashicorp/consul/testrpc"
+	"github.com/mitchellh/copystructure"
 	"github.com/stretchr/testify/require"
 )
 
@@ -42,6 +43,12 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc1")
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
+	duplicate := func(t *testing.T, s *structs.FederationState) *structs.FederationState {
+		s2, err := copystructure.Copy(s)
+		require.NoError(t, err)
+		return s2.(*structs.FederationState)
+	}
+
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
 	var fedStates []*structs.FederationState
 	for i := 0; i < 50; i++ {
@@ -67,7 +74,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStates = append(fedStates, duplicate(t, arg.State))
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -130,7 +137,7 @@ func TestReplication_FederationStates(t *testing.T) {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State:      duplicate(t, fedState),
 		}
 
 		out := false
```

**pierresouchay** (Contributor, Author)
dammit, I thought about this as well during my investigations, but did not see how there could be a race condition here.... I am testing this

**rboyer** (Member) commented Apr 8, 2020

inmemCodec is sneaky because it causes an unexpected data relationship during the test. The more common problem with mutating the contents of memdb is reading something back out of memdb and forgetting it's a shared reference that you should not mutate.
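That second, more common pitfall (mutating a shared reference read back out of memdb) has a simple remedy: take a shallow copy first and mutate only the copy. A minimal sketch, with hypothetical simplified types:

```go
package main

import "fmt"

// RaftIndex and FedState are simplified stand-ins for the real structs.
type RaftIndex struct{ CreateIndex, ModifyIndex uint64 }

type FedState struct {
	Datacenter string
	RaftIndex  RaftIndex
}

// zeroedCopy returns a shallow copy with the raft fields cleared,
// leaving the (shared) original untouched.
func zeroedCopy(s *FedState) FedState {
	cp := *s
	cp.RaftIndex = RaftIndex{}
	return cp
}

func main() {
	// Pretend this pointer was read back out of memdb: it is shared
	// state that must not be mutated.
	fromDB := &FedState{
		Datacenter: "alt-dc1",
		RaftIndex:  RaftIndex{CreateIndex: 1, ModifyIndex: 9},
	}

	// Wrong: `fromDB.RaftIndex = RaftIndex{}` would corrupt the
	// "database" copy. Right: mutate only a local copy.
	cp := zeroedCopy(fromDB)

	fmt.Println(fromDB.RaftIndex.ModifyIndex) // 9: stored copy untouched
	fmt.Println(cp.RaftIndex.ModifyIndex)     // 0: only the copy is zeroed
}
```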

**pierresouchay** (Contributor, Author)

@rboyer No, still failing with your patch :( I am using the latest patch on the branch (so exactly your patch; I reverted mine), and it fails the same way:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
ok  	github.com/hashicorp/consul/agent/consul	5.510s
1 ok  	github.com/hashicorp/consul/agent/consul	5.232s
2 ok  	github.com/hashicorp/consul/agent/consul	5.468s
3 ok  	github.com/hashicorp/consul/agent/consul	4.990s
4 ok  	github.com/hashicorp/consul/agent/consul	5.096s
5 [INFO] freeport: blockSize 1500 too big for system limit 256. Adjusting...
[INFO] freeport: detected ephemeral port range of [49152, 65535]
--- FAIL: TestReplication_FederationStates (11.14s)
    writer.go:29: 2020-04-08T18:44:12.433+0200 [INFO]  TestReplication_FederationStates-node-1.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:89bfdb87-2be9-d8c8-02db-7af1a185b991 Address:127.0.0.1:12278}]"
```

**rboyer** (Member) commented Apr 8, 2020

Weird, I didn't get any failures with that patch. Just to rule out data races entirely, here's a variant that removes even more theoretical race conditions:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..79f9eb8bb 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -43,7 +43,7 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
-	var fedStates []*structs.FederationState
+	var fedStateDCs []string
 	for i := 0; i < 50; i++ {
 		dc := fmt.Sprintf("alt-dc%d", i+1)
 		ip1 := fmt.Sprintf("1.2.3.%d", i+1)
@@ -67,7 +67,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStateDCs = append(fedStateDCs, dc)
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -126,11 +126,13 @@ func TestReplication_FederationStates(t *testing.T) {
 		checkSame(r)
 	})
 
-	for _, fedState := range fedStates {
+	for _, fedStateDC := range fedStateDCs {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State: &structs.FederationState{
+				Datacenter: fedStateDC,
+			},
 		}
 
 		out := false
```
I am also running a faster, modified version of your scriptlet:

```
i=0; while go test -timeout 30s ./agent/consul -count=1 -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```

The main difference is exploiting that any `-count` value forces go to bypass the test cache, so you don't have to clear it explicitly.

**rboyer** (Member) commented Apr 8, 2020

🤦‍♂️

> The more common problem with mutating the contents of memdb is reading something back out of memdb and forgetting it's a shared reference that you should not mutate.

```go
	checkSame := func(t *retry.R) error {
		_, remote, err := s1.fsm.State().FederationStateList(nil)
		require.NoError(t, err)
		_, local, err := s2.fsm.State().FederationStateList(nil)
		require.NoError(t, err)

		require.Len(t, local, len(remote))
		for i, _ := range remote {
			// zero out the raft data for future comparisons
			remote[i].RaftIndex = structs.RaftIndex{} // <--- LIKE THIS
			local[i].RaftIndex = structs.RaftIndex{}  // <--- AND THIS
			require.Equal(t, remote[i], local[i])
		}
		return nil
	}

	// Wait for the replica to converge.
	retry.Run(t, func(r *retry.R) {
		checkSame(r)
	})
```

**rboyer** (Member) commented Apr 8, 2020

So after heeding my own advice from upthread I think I fixed both memdb corruptions in the test:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..614682b90 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -43,7 +43,7 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
 
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
-	var fedStates []*structs.FederationState
+	var fedStateDCs []string
 	for i := 0; i < 50; i++ {
 		dc := fmt.Sprintf("alt-dc%d", i+1)
 		ip1 := fmt.Sprintf("1.2.3.%d", i+1)
@@ -67,7 +67,7 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStateDCs = append(fedStateDCs, dc)
 	}
 
 	checkSame := func(t *retry.R) error {
@@ -78,10 +78,14 @@ func TestReplication_FederationStates(t *testing.T) {
 
 		require.Len(t, local, len(remote))
 		for i, _ := range remote {
+			// Make lightweight copies so we can zero out the raft fields
+			// without mutating the copies in memdb.
+			remoteCopy := *remote[i]
+			localCopy := *local[i]
 			// zero out the raft data for future comparisons
-			remote[i].RaftIndex = structs.RaftIndex{}
-			local[i].RaftIndex = structs.RaftIndex{}
-			require.Equal(t, remote[i], local[i])
+			remoteCopy.RaftIndex = structs.RaftIndex{}
+			localCopy.RaftIndex = structs.RaftIndex{}
+			require.Equal(t, remoteCopy, localCopy)
 		}
 		return nil
 	}
@@ -126,11 +130,13 @@ func TestReplication_FederationStates(t *testing.T) {
 		checkSame(r)
 	})
 
-	for _, fedState := range fedStates {
+	for _, fedStateDC := range fedStateDCs {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State: &structs.FederationState{
+				Datacenter: fedStateDC,
+			},
 		}
 
 		out := false
```

**rboyer** (Member) commented Apr 8, 2020

The reason this isn't broken in the tests for ACL tokens, roles, and policies is that in all of those cases the tests don't do the deep equality checking that the federation state test does (they instead check canary fields like Hash).

**pierresouchay** (Contributor, Author)

CS is hard

**pierresouchay** (Contributor, Author)

@rboyer The test is currently ongoing (80+ runs without failure now) and seems OK.
Thank you for the help. I had been bitten by this a year ago already; I should have looked more closely.
I was convinced it was not a unit-test problem because it looked similar to random failures in some integration tests :(

**pierresouchay** (Contributor, Author)

103 runs without failure, seems fixed.

@rboyer rboyer changed the title [BUGFIX] DiffRemoteAndLocalState incorrectly sync changes with remote fix flaky TestReplication_FederationStates test due to race conditions Apr 9, 2020
**rboyer** (Member) left a comment


LGTM
