Fix flapping of mesh gateway connect-service watches #7575
Conversation
👏 Great work! Debugging flaky tests can be frustrating, but it surprisingly often does turn up actual bugs with enough perseverance. Thanks for digging into this one!
Also, should we backport this to 1.7.x? It seems like a simple change to backport, and a nasty bug in gateways that could bite people in real deployments too.
When comparing FederationState between the local and remote states, the update was comparing the lastIndex, but this index on the remote side could be lower than the max index retrieved and stored from the remote; to be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it). This fix does not address the initial root cause (the ModifiedIndex being set to 0 in the remote DC), but it is a useful workaround: it restores convergence when the remote server is not behaving as it should. Without it, sync of updates would break and the remote object would never be synced. To be sure we don't forget anything, I also added a test over `isSame()` to verify the objects are identical.

This was the reason the test `TestReplication_FederationStates()` failed randomly, and it is probably also why some of the integration tests fail randomly.

With this fix, the command:

```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```

which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix hashicorp#7575
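A minimal sketch of the comparison idea described in that commit, under assumptions: the `FederationState` struct here carries only illustrative fields, and `isSame` is an analogue of the `isSame()` check mentioned above, not Consul's actual replication code.

```go
package main

import (
	"fmt"
	"reflect"
)

// FederationState is a simplified stand-in for the real struct; only the
// fields needed to illustrate the comparison are shown here.
type FederationState struct {
	Datacenter   string
	MeshGateways []string
	ModifyIndex  uint64 // reported as stuck at 0 on the remote side in some cases
}

// isSame is an illustrative analogue of the isSame() check mentioned above:
// it compares the replicated content while ignoring Raft indexes.
func isSame(local, remote *FederationState) bool {
	return local.Datacenter == remote.Datacenter &&
		reflect.DeepEqual(local.MeshGateways, remote.MeshGateways)
}

func main() {
	local := &FederationState{Datacenter: "dc2", MeshGateways: []string{"gateway-1"}, ModifyIndex: 42}
	remote := &FederationState{Datacenter: "dc2", MeshGateways: []string{"gateway-1"}, ModifyIndex: 0}

	// An index-only comparison (42 != 0) would keep re-pushing the same
	// object forever; comparing the content shows no update is needed.
	fmt.Println("indexes differ:", local.ModifyIndex != remote.ModifyIndex) // true
	fmt.Println("content same:  ", isSame(local, remote))                   // true
}
```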
(#7612) The test had two racy bugs related to memdb references.

The first was when we initially populated data and retained the FederationState objects in a slice. Due to how the `inmemCodec` works, these were actually the identical objects passed into memdb.

The second was that the `checkSame` assertion function was reading from memdb and setting the RaftIndexes to zeros to aid in equality checks. This was mutating the contents of memdb, which is a no-no.

With this fix, the command:

```
i=0; while /usr/local/bin/go test -count=1 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```

which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix #7575
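A minimal, self-contained sketch of the second bug, under assumptions: a plain map stands in for memdb, the types are simplified, and `checkSameBuggy`/`checkSameFixed` are illustrative names rather than the real test helpers. The point is the copy-before-mutate pattern: objects shared with the store must be copied before their indexes are zeroed.

```go
package main

import "fmt"

// RaftIndex and FederationState are simplified stand-ins for the real types.
type RaftIndex struct {
	CreateIndex uint64
	ModifyIndex uint64
}

type FederationState struct {
	Datacenter string
	RaftIndex
}

// store stands in for memdb: readers receive pointers to the stored
// objects, so mutating a returned value corrupts the shared state.
var store = map[string]*FederationState{}

// checkSameBuggy mirrors the racy assertion helper: it zeroes the Raft
// indexes on the object it read to ease equality checks, but because the
// pointer is shared with the store this silently rewrites stored data.
func checkSameBuggy(dc string) *FederationState {
	got := store[dc]
	got.RaftIndex = RaftIndex{} // BUG: mutates the shared object
	return got
}

// checkSameFixed copies the object first, so the store is left untouched.
func checkSameFixed(dc string) *FederationState {
	got := *store[dc] // a shallow copy is enough for these flat fields
	got.RaftIndex = RaftIndex{}
	return &got
}

func main() {
	store["dc2"] = &FederationState{Datacenter: "dc2", RaftIndex: RaftIndex{CreateIndex: 5, ModifyIndex: 9}}

	_ = checkSameFixed("dc2")
	fmt.Println("after fixed check:", store["dc2"].ModifyIndex) // still 9

	_ = checkSameBuggy("dc2")
	fmt.Println("after buggy check:", store["dc2"].ModifyIndex) // now 0
}
```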
Fixes #7503, or at least one specific cause of those flakes.
What is the Problem?
While investigating flaky tests, I found the root cause to be a bug in our mesh gateway watch logic.
The bug happens like so:
1. A Mesh Gateway sets up an initial watch for all of the services in its datacenter, so that it can proxy traffic to them.
2. Upon receiving an update event for this watch, the `proxycfg` package iterates over the entire list of services and sets up a `cachetype.HealthServicesName` watch for any service that is not being watched already. During this loop, we also add the serviceID to a map so that we can cancel any unneeded watches; however, we only added the serviceID to the map if we did not already have a watch. This is the bug.
3. After iterating over the services, we iterate over the map of currently watched services and cancel any that are not in the map we constructed in step 2. This causes watches for services that are still present to be canceled, because of the bug in step 2 (see the sketch after the example below).
An example of the issue:

1. The gateway receives an update containing a list of services and generates health-service watches for them.
2. A subsequent update arrives with the same services plus a new `dashboard` service.
3. This new list of services will currently generate a new health-services watch for the `dashboard` service, and cancel the watches for the other services.

This can manifest itself as delayed endpoint updates for services, where a service health check might have updated some time in the past but Envoy continues to think the service is healthy/unhealthy. Or, in pathological cases, it can manifest as the service never even being added as an Envoy cluster in the first place, if no other service-list updates come in.
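Below is a minimal, self-contained sketch of the watch-maintenance loop described in steps 2 and 3 above. The names are assumptions for illustration only (`gateway`, `handleUpdate`, and the `api`/`web`/`dashboard` services are not the actual `proxycfg` identifiers), but the loop reproduces the flapping behavior and marks where the one-line fix belongs.

```go
package main

import "fmt"

// gateway is an illustrative stand-in for the proxycfg state: watched maps
// a service name to the cancel function for its health watch.
type gateway struct {
	watched map[string]func()
}

// handleUpdate mirrors steps 2 and 3 above and reproduces the bug.
func (g *gateway) handleUpdate(services []string) {
	keep := make(map[string]struct{})

	// Step 2: start a watch for any service not already watched.
	for _, svc := range services {
		if _, ok := g.watched[svc]; ok {
			// Already watched: the fix is to record svc in keep here too;
			// the buggy code skipped it, so step 3 cancels the watch.
			continue
		}
		svc := svc
		g.watched[svc] = func() { fmt.Println("cancelled watch for", svc) }
		keep[svc] = struct{}{}
	}

	// Step 3: cancel any watch whose service was not recorded in keep.
	for svc, cancel := range g.watched {
		if _, ok := keep[svc]; !ok {
			cancel()
			delete(g.watched, svc)
		}
	}
}

func main() {
	g := &gateway{watched: map[string]func(){}}
	g.handleUpdate([]string{"api", "web"})
	// Same services plus a new "dashboard": the still-needed api and web
	// watches are wrongly cancelled because step 2 never recorded them.
	g.handleUpdate([]string{"api", "web", "dashboard"})
}
```

Running this prints cancellation messages for the still-needed services on the second update, which is exactly the flapping described above; recording every service from the latest list in the keep map removes it.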
Verification
While investigating, I ran the `gateways-local` integration test in a loop to get it to fail consistently. After applying the fix, I was able to run the same loop to completion without any failures.