
Fix flapping of mesh gateway connect-service watches #7575

Merged 1 commit into master on Apr 2, 2020

Conversation

@crhino (Contributor) commented Apr 2, 2020

Fixes #7503, or at least one specific cause of the flakes.

What is the Problem?

While investigating the flaky tests, I found the root cause to be a bug in our mesh gateway watch logic.

The bug happens like so:

  1. A Mesh Gateway sets up an initial watch for all of the services in its datacenter, so that it can proxy traffic to them.

  2. Upon receiving an update event for this watch, the proxycfg package iterates over the entire list of services and sets up a cachetype.HealthServicesName watch for any service that is not already being watched. During this loop, we also add each serviceID to a map so that we can later cancel any unneeded watches; however, we only added the serviceID to the map if we did not already have a watch for it. This is the bug.

  3. After iterating over the services, we then iterate over the map of currently watched services and cancel any watch whose service is not in the map we constructed in step 2. Because of the bug in step 2, this cancels watches for services that are still available.

An example of the issue:

  1. We have a mesh gateway which has received a list of services, like so:
web
api
consul
mesh-gateway
db
redis
mysql

and generates health-service watches for them.

  2. Before these health-service watches return, another service-list watch fires and returns a new list of services, like so:
web
api
consul
mesh-gateway
db
redis
mysql
dashboard

This new list of services will currently generate a new health-services watch for the dashboard service, and cancel the watches for the other services.

This can manifest as delayed endpoint updates for services, where a service health check changed some time in the past but Envoy continues to believe the service is healthy/unhealthy. In pathological cases, if no other service-list updates come in, it can manifest as the service never even being added as an Envoy cluster in the first place.
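The reconciliation in steps 2–3 above can be sketched as follows. This is a simplified, hypothetical Go version, not Consul's actual proxycfg code; `reconcileWatches` and its variable names are illustrative. The fix is the placement of the "seen" bookkeeping: every service in the new list must be marked as seen, whether or not a watch already exists.

```go
package main

import "fmt"

// reconcileWatches starts watches for newly seen services and cancels watches
// for services no longer present. The buggy version only recorded a service
// as "seen" when starting a NEW watch, so existing watches were cancelled.
func reconcileWatches(watched map[string]bool, services []string) (started, cancelled []string) {
	seen := make(map[string]bool)
	for _, svc := range services {
		if !watched[svc] {
			watched[svc] = true
			started = append(started, svc)
			// Bug: the original code only marked svc as seen inside this branch.
		}
		seen[svc] = true // Fix: always mark the service as seen.
	}
	// Cancel watches for services that disappeared from the list.
	for svc := range watched {
		if !seen[svc] {
			delete(watched, svc)
			cancelled = append(cancelled, svc)
		}
	}
	return started, cancelled
}

func main() {
	watched := map[string]bool{"web": true, "api": true, "db": true}
	started, cancelled := reconcileWatches(watched, []string{"web", "api", "db", "dashboard"})
	// With the fix, only dashboard gets a new watch and nothing is cancelled.
	fmt.Println(started, cancelled)
}
```

With the buggy placement, the second call in the example above would have cancelled the watches for web, api, and db even though they are still in the list.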

Verification

While investigating, I ran this for loop to get the gateways-local integration test to fail consistently:

```
for i in $(seq 1 30); do FILTER_TESTS="gateways-local" ENVOY_VERSIONS="1.13.1" make test-envoy-integ > test.out; if [ $? -ne 0 ]; then break; fi; echo; done
```

After applying the fix, I was able to run this to completion without any failures.

@crhino crhino requested a review from a team April 2, 2020 14:56
@mkeeler (Member) left a comment

:shipit:

@banks (Member) left a comment


👏 Great work! Debugging flaky tests can be frustrating, but with enough perseverance it surprisingly often does turn up actual bugs. Thanks for digging into this one!

@banks (Member) commented Apr 2, 2020

Also, should we backport this to 1.7.x? It seems like a simple change to backport, and a nasty bug in gateways that could bite people in real deployments too.

@crhino crhino merged commit 584f90b into master Apr 2, 2020
@crhino crhino deleted the bug/mesh-gateway-watch-flapping branch April 2, 2020 15:12
pierresouchay added a commit to pierresouchay/consul that referenced this pull request Apr 8, 2020
When comparing FederationState between the local and remote states, the update was comparing the lastIndex, but this index on the remote side could be lower than the max index retrieved and stored from the remote. To be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it).

This fix does not address the initial root cause (the ModifyIndex being set to 0 in the remote DC), but it is a useful workaround, as it fixes convergence when the remote server is not behaving as it should. In that case, sync of updates would otherwise break and the remote object would never be synced.

To be sure we don't forget anything, I also added a test over `isSame()` verifying that the objects are identical.

This was the reason for the test `TestReplication_FederationStates()` to fail randomly. This is also probably the cause for some of the integration tests to fail randomly as well.
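The workaround described above can be sketched like this. It is a hypothetical simplification, not Consul's actual replication code; `needsSync` and its parameters are illustrative names. The idea: since the remote ModifyIndex can be stuck at zero, the index comparison alone is not enough, so fall back to comparing the objects themselves (e.g. via `isSame()`).

```go
package main

import "fmt"

// needsSync decides whether a remote object should be re-synced locally.
// localIdx/remoteIdx stand in for the stored and retrieved ModifyIndex;
// isSame stands in for a deep comparison of the two objects.
func needsSync(localIdx, remoteIdx uint64, isSame bool) bool {
	if remoteIdx > localIdx {
		return true // the remote index clearly advanced
	}
	// Workaround: the remote index did not advance (it may be stuck at 0),
	// so sync whenever the contents actually differ.
	return !isSame
}

func main() {
	// Remote index stuck at 0, but the object changed: sync must still happen.
	fmt.Println(needsSync(9, 0, false))
	// Identical objects and no index advance: nothing to do.
	fmt.Println(needsSync(9, 0, true))
}
```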

With this fix, the command:
```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```
which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix hashicorp#7575
rboyer pushed a commit that referenced this pull request Apr 9, 2020
#7612)

The test had two racy bugs related to memdb references.

The first was when we initially populated data and retained the FederationState objects in a slice. Due to how the `inmemCodec` works, these were actually the identical objects passed into memdb.

The second was that the `checkSame` assertion function was reading from memdb and setting the RaftIndexes to zeros to aid in equality checks. This mutated the contents of memdb, which is a no-no.
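The fix for that second bug can be sketched as follows. The types here are simplified stand-ins, not Consul's actual structs: instead of zeroing the indexes on the object returned from memdb, work on a copy so the memdb-resident original is never mutated.

```go
package main

import "fmt"

// RaftIndex mimics the index metadata carried on state-store objects.
type RaftIndex struct{ CreateIndex, ModifyIndex uint64 }

// FederationState is a simplified stand-in for the replicated object.
type FederationState struct {
	Datacenter string
	RaftIndex
}

// normalizeCopy returns a copy with the indexes zeroed (to aid equality
// checks), leaving the original, potentially shared with memdb, untouched.
func normalizeCopy(fs *FederationState) *FederationState {
	dup := *fs // a shallow copy suffices for this sketch
	dup.RaftIndex = RaftIndex{}
	return &dup
}

func main() {
	stored := &FederationState{Datacenter: "dc2", RaftIndex: RaftIndex{CreateIndex: 5, ModifyIndex: 9}}
	norm := normalizeCopy(stored)
	// The copy is normalized; the stored object keeps its real indexes.
	fmt.Println(norm.ModifyIndex, stored.ModifyIndex)
}
```

Mutating the object in place, as the racy test did, would have silently rewritten the indexes inside memdb for every other reader.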

With this fix, the command:
```
i=0; while /usr/local/bin/go test -count=1 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```
which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix #7575
Successfully merging this pull request may close these issues.

flake: envoy integration test gateways-local