fix flaky TestReplication_FederationStates test due to race conditions #7612
Conversation
Force-pushed from fc1cb8f to 1512d59.
When comparing a FederationState between the local and remote states, the update logic compared the lastIndex, but this index on the remote side could be lower than the max index retrieved and stored from the remote. To be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it).

This fix does not address the initial root cause (the ModifiedIndex being set to 0 in the remote DC), but it is a useful workaround, as it restores convergence when the remote server is not behaving as it should. Without it, syncing of updates would break and the remote object would never be synced.

To be sure we don't miss anything, I also added a test over `isSame()` to verify the objects are identical. This was the reason the `TestReplication_FederationStates()` test failed randomly, and it is probably also the cause of some of the integration tests failing randomly.

With this fix, the command:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

that used to break on my machine in fewer than 20 runs is now running 150+ times without any issue.

Might also fix hashicorp#7575
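The convergence problem described above can be illustrated with a minimal sketch. All names here (`Entry`, `needsUpdate`, `isSame`) are hypothetical stand-ins, not Consul's actual replication code: an updater that trusts indexes alone can wrongly skip a write when the remote reports a stale index, whereas also comparing content guarantees convergence.

```go
package main

import "fmt"

// Entry is a stand-in for a replicated object such as a FederationState.
// ModifyIndex mirrors the index at which the object was last written.
type Entry struct {
	Key         string
	Payload     string
	ModifyIndex uint64
}

// isSame reports whether two entries carry identical content.
func isSame(a, b *Entry) bool {
	return a.Key == b.Key && a.Payload == b.Payload
}

// needsUpdate sketches the replication decision. An index-only check is
// fragile: if the remote reports a stale ModifyIndex, it can wrongly
// skip the write forever. Falling back to a content comparison (in the
// spirit of this PR's isSame() workaround) restores convergence.
func needsUpdate(local, remote *Entry) bool {
	if remote == nil {
		return true // missing remotely: must replicate
	}
	if local.ModifyIndex > remote.ModifyIndex {
		return true // remote is plainly behind
	}
	// Even if the indexes claim we are up to date, re-sync when the
	// contents differ.
	return !isSame(local, remote)
}

func main() {
	local := &Entry{Key: "dc2", Payload: "v2", ModifyIndex: 7}
	// Misbehaving remote: stale content, but the index looks current.
	remote := &Entry{Key: "dc2", Payload: "v1", ModifyIndex: 7}
	fmt.Println(needsUpdate(local, remote)) // true
}
```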
Force-pushed from 1512d59 to 437dee2.
Let me dig into this. The general logic around that part of the replication is actually present in replication paths for several things already, so if it's a bug in one of them, it is likely either a bug in all of them or an underlying problem with how federation states were implemented.
@rboyer Yes, that's weird. I did not see anything special in the code, from the Tx methods to the sync code, but it definitely occurs. To highlight it:

Put a warning (or even a panic) wherever any of the remote entries has a ModifiedIndex of 0. Every time the test fails (~1/10 times on my machine; with the patch, I was able to run the test 300+ times without any error), the warning fires, and it does not fire when the test works. I also tried changing the number of updates in the test from 50 to 3 and to 255; it fails at around the same frequency.
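A minimal sketch of the kind of debugging guard described above. The `Entry` type and `checkRemote` helper are hypothetical, not Consul code: flag any remotely fetched object whose ModifyIndex is 0, which should never happen for an object that has actually been committed.

```go
package main

import (
	"fmt"
	"log"
)

// Entry stands in for a remotely fetched FederationState.
type Entry struct {
	Key         string
	ModifyIndex uint64
}

// checkRemote logs a warning for every remote entry whose ModifyIndex
// is 0 and returns the keys of the suspicious entries, so a test can
// fail loudly (or panic) the moment the anomaly appears.
func checkRemote(entries []Entry) []string {
	var suspicious []string
	for _, e := range entries {
		if e.ModifyIndex == 0 {
			log.Printf("WARN: remote entry %q has ModifyIndex=0", e.Key)
			suspicious = append(suspicious, e.Key)
		}
	}
	return suspicious
}

func main() {
	entries := []Entry{
		{Key: "dc2", ModifyIndex: 5},
		{Key: "dc3", ModifyIndex: 0},
	}
	fmt.Println(checkRemote(entries)) // [dc3]
}
```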
Ah, I think I figured out what is actually happening. Because this is taking the short-circuit RPC path, the slice of federation states captured at the start of the test are actually all pointers into memdb state, rather than copies. Since there's a data race, all bets are off as to what actually happens during the rest of the test.

Try this patch and see if you can make it fail anymore:

```diff
diff --git agent/consul/federation_state_replication_test.go agent/consul/federation_state_replication_test.go
index ac047e0db..e8d486a58 100644
--- agent/consul/federation_state_replication_test.go
+++ agent/consul/federation_state_replication_test.go
@@ -10,6 +10,7 @@ import (
 	"github.com/hashicorp/consul/api"
 	"github.com/hashicorp/consul/sdk/testutil/retry"
 	"github.com/hashicorp/consul/testrpc"
+	"github.com/mitchellh/copystructure"
 	"github.com/stretchr/testify/require"
 )
@@ -42,6 +43,12 @@ func TestReplication_FederationStates(t *testing.T) {
 	testrpc.WaitForLeader(t, s1.RPC, "dc1")
 	testrpc.WaitForLeader(t, s1.RPC, "dc2")
+	duplicate := func(t *testing.T, s *structs.FederationState) *structs.FederationState {
+		s2, err := copystructure.Copy(s)
+		require.NoError(t, err)
+		return s2.(*structs.FederationState)
+	}
+
 	// Create some new federation states (weird because we're having dc1 update it for the other 50)
 	var fedStates []*structs.FederationState
 	for i := 0; i < 50; i++ {
@@ -67,7 +74,7 @@ func TestReplication_FederationStates(t *testing.T) {
 		out := false
 		require.NoError(t, s1.RPC("FederationState.Apply", &arg, &out))
-		fedStates = append(fedStates, arg.State)
+		fedStates = append(fedStates, duplicate(t, arg.State))
 	}
 	checkSame := func(t *retry.R) error {
@@ -130,7 +137,7 @@ func TestReplication_FederationStates(t *testing.T) {
 		arg := structs.FederationStateRequest{
 			Datacenter: "dc1",
 			Op:         structs.FederationStateDelete,
-			State:      fedState,
+			State:      duplicate(t, fedState),
 		}
 		out := false
```
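The underlying test bug (holding pointers into the server's in-memory state instead of copies) can be reproduced in miniature. This sketch uses hypothetical names (`State`, `store`, `copyState`, a hand-rolled copy standing in for `copystructure.Copy`), not Consul's actual types:

```go
package main

import "fmt"

// State is a stand-in for structs.FederationState.
type State struct {
	Datacenter string
	Meta       map[string]string
}

// store mimics memdb in this respect: Apply keeps the pointer it is
// given, so later internal mutations are visible through every alias
// of that pointer.
type store struct{ objs []*State }

func (s *store) Apply(st *State) { s.objs = append(s.objs, st) }

// copyState is a hand-rolled deep copy standing in for copystructure.Copy.
func copyState(st *State) *State {
	dup := &State{Datacenter: st.Datacenter, Meta: map[string]string{}}
	for k, v := range st.Meta {
		dup.Meta[k] = v
	}
	return dup
}

func main() {
	orig := &State{Datacenter: "dc2", Meta: map[string]string{"v": "1"}}

	db := &store{}
	db.Apply(orig) // the store now aliases orig

	aliased := orig             // what the buggy test kept
	snapshot := copyState(orig) // what the fixed test keeps

	// The "server" later mutates its stored object in place.
	db.objs[0].Meta["v"] = "2"

	fmt.Println(aliased.Meta["v"], snapshot.Meta["v"]) // 2 1
}
```

The aliased pointer silently observes the server-side mutation, while the deep copy preserves what the test thought it had stored, which is exactly why the test's deep-equality check was flaky.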
Dammit, I thought about this as well during my investigations, but did not see how there could be a race condition here... I am testing this.
Revert "…ith remote." (this reverts commit 437dee2)
@rboyer No, still failing with your patch :( I am using the last patch from the branch (so, exactly your patch; I reverted mine), and it fails the same:
Weird, I didn't get any failures with that patch. Just to rule out data races entirely, here's a variant that removes even more theoretical race conditions:

I am also running a faster, modified version of your scriptlet:

The main difference is exploiting that…
🤦♂️
So after heeding my own advice from upthread I think I fixed both memdb corruptions in the test:
The reason this isn't broken in the tests for ACL tokens, roles, and policies is that none of those tests do the deep equality checking that the federation state test does (they instead test a few canary fields).
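The distinction above can be sketched in a few lines. The names (`State`, `canarySame`, `deepSame`) are hypothetical, not the actual test helpers: a canary-field check can pass even when payloads have diverged, so only a deep-equality test trips over the aliasing bug.

```go
package main

import (
	"fmt"
	"reflect"
)

// State stands in for a replicated object under test.
type State struct {
	Datacenter  string
	Meta        map[string]string
	ModifyIndex uint64
}

// canarySame mimics the style of the ACL replication tests: compare a
// couple of indicative fields and assume the rest followed along.
func canarySame(a, b *State) bool {
	return a.Datacenter == b.Datacenter && a.ModifyIndex == b.ModifyIndex
}

// deepSame is the stricter, field-by-field check the federation state
// test performs.
func deepSame(a, b *State) bool { return reflect.DeepEqual(a, b) }

func main() {
	a := &State{Datacenter: "dc2", Meta: map[string]string{"v": "1"}, ModifyIndex: 7}
	b := &State{Datacenter: "dc2", Meta: map[string]string{"v": "2"}, ModifyIndex: 7}

	// The canary check passes even though the payloads diverged.
	fmt.Println(canarySame(a, b), deepSame(a, b)) // true false
}
```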
CS is hard.
@rboyer The test is currently ongoing (80+ runs without failure now) and seems OK.
103 runs without failure; seems fixed.
LGTM