
Fix flapping of mesh gateway connect-service watches #7575

Merged 1 commit into master on Apr 2, 2020

Conversation

@crhino (Contributor) commented Apr 2, 2020

Fixes #7503, or at least one specific cause of the flakes.

What is the Problem?

While investigating the flaky tests, I found the root cause to be a bug in our mesh gateway watch logic.

The bug happens like so:

  1. A Mesh Gateway sets up an initial watch for all of the services in its datacenter, so that it can proxy traffic to them.

  2. Upon receiving an update event for this watch, the proxycfg package iterates over the entire list of services and sets up a cachetype.HealthServicesName watch for any service that is not already being watched. During this loop, we also add each serviceID to a map so that we can later cancel any unneeded watches; however, we only added the serviceID to the map if we did not already have a watch for it. This is the bug.

  3. After iterating over the services, we then iterate over the map of currently watched services and cancel any watch whose service is not in the map we constructed in step 2. Because of the bug in step 2, this cancels watches for services that are still available.

An example of the issue:

  1. We have a mesh gateway which has received a list of services, like so:
web
api
consul
mesh-gateway
db
redis
mysql

and generates health-service watches for them.

  2. Before these health-service watches return, another service-list watch fires and returns a new list of services, like so:
web
api
consul
mesh-gateway
db
redis
mysql
dashboard

This new list of services will currently generate a new health-services watch for the dashboard service, and cancel the watches for the other services.

This can manifest as delayed endpoint updates for services, where a service health check changed some time in the past but Envoy continues to believe the service is healthy/unhealthy. In pathological cases, if no other service-list updates come in, it can manifest as the service never even being added as an Envoy cluster in the first place.
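The reconciliation in steps 2–3 above can be sketched as follows. This is a simplified, hypothetical Go version, not Consul's actual proxycfg code; `reconcileWatches` and its variable names are illustrative. The fix is the placement of the "seen" bookkeeping: every service in the new list must be marked as seen, whether or not a watch already exists.

```go
package main

import "fmt"

// reconcileWatches starts watches for newly seen services and cancels watches
// for services no longer present. The buggy version only recorded a service
// as "seen" when starting a NEW watch, so existing watches were cancelled.
func reconcileWatches(watched map[string]bool, services []string) (started, cancelled []string) {
	seen := make(map[string]bool)
	for _, svc := range services {
		if !watched[svc] {
			watched[svc] = true
			started = append(started, svc)
			// Bug: the original code only marked svc as seen inside this branch.
		}
		seen[svc] = true // Fix: always mark the service as seen.
	}
	// Cancel watches for services that disappeared from the list.
	for svc := range watched {
		if !seen[svc] {
			delete(watched, svc)
			cancelled = append(cancelled, svc)
		}
	}
	return started, cancelled
}

func main() {
	watched := map[string]bool{"web": true, "api": true, "db": true}
	started, cancelled := reconcileWatches(watched, []string{"web", "api", "db", "dashboard"})
	// With the fix, only dashboard gets a new watch and nothing is cancelled.
	fmt.Println(started, cancelled)
}
```

With the buggy placement, the second call in the example above would have cancelled the watches for web, api, and db even though they are still in the list.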

Verification

While investigating, I ran this for loop to get the gateways-local integration test to fail consistently:

```
for i in $(seq 1 30); do FILTER_TESTS="gateways-local" ENVOY_VERSIONS="1.13.1" make test-envoy-integ > test.out; if [ $? -ne 0 ]; then break; fi; echo; done
```

After applying the fix, I was able to run this to completion without any failures.

@crhino crhino requested a review from a team April 2, 2020 14:56
@mkeeler (Member) left a comment

:shipit:

@banks (Member) left a comment


👏 Great work! Debugging flaky tests can be frustrating, but with enough perseverance it surprisingly often does turn up actual bugs. Thanks for digging into this one!

@banks (Member) commented Apr 2, 2020

Also, should we backport this to 1.7.x? It seems like a simple change to backport, and a nasty bug in gateways that could bite people in real deployments too.

@crhino crhino merged commit 584f90b into master Apr 2, 2020
@crhino crhino deleted the bug/mesh-gateway-watch-flapping branch April 2, 2020 15:12
pierresouchay added a commit to pierresouchay/consul that referenced this pull request Apr 8, 2020
When comparing FederationState between the local and remote states, the update was comparing the lastIndex, but this index on the remote side could be lower than the max index retrieved and stored from the remote. To be exact, I found it could stay at zero forever in some cases (probably a race condition, but I was not able to pinpoint it).

This fix does not address the initial root cause (the ModifyIndex being set to 0 in the remote DC), but it is a useful workaround, as it fixes convergence when the remote server is not behaving as it should. In that case, sync of updates would otherwise break and the remote object would never be synced.

To be sure we don't forget anything, I also added a test over `isSame()` verifying that the objects are identical.

This was the reason for the test `TestReplication_FederationStates()` to fail randomly. This is also probably the cause for some of the integration tests to fail randomly as well.
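The workaround described above can be sketched like this. It is a hypothetical simplification, not Consul's actual replication code; `needsSync` and its parameters are illustrative names. The idea: since the remote ModifyIndex can be stuck at zero, the index comparison alone is not enough, so fall back to comparing the objects themselves (e.g. via `isSame()`).

```go
package main

import "fmt"

// needsSync decides whether a remote object should be re-synced locally.
// localIdx/remoteIdx stand in for the stored and retrieved ModifyIndex;
// isSame stands in for a deep comparison of the two objects.
func needsSync(localIdx, remoteIdx uint64, isSame bool) bool {
	if remoteIdx > localIdx {
		return true // the remote index clearly advanced
	}
	// Workaround: the remote index did not advance (it may be stuck at 0),
	// so sync whenever the contents actually differ.
	return !isSame
}

func main() {
	// Remote index stuck at 0, but the object changed: sync must still happen.
	fmt.Println(needsSync(9, 0, false))
	// Identical objects and no index advance: nothing to do.
	fmt.Println(needsSync(9, 0, true))
}
```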

With this fix, the command:
```
i=0; while /usr/local/bin/go test -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do go clean -testcache; i=$((i + 1)); printf "$i "; done
```
which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix hashicorp#7575
rboyer pushed a commit that referenced this pull request Apr 9, 2020
#7612)

The test had two racy bugs related to memdb references.

The first was when we initially populated data and retained the FederationState objects in a slice. Due to how the `inmemCodec` works, these were actually the identical objects passed into memdb.

The second was that the `checkSame` assertion function was reading from memdb and setting the RaftIndexes to zeros to aid in equality checks. This mutated the contents of memdb, which is a no-no.
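The fix for that second bug can be sketched as follows. The types here are simplified stand-ins, not Consul's actual structs: instead of zeroing the indexes on the object returned from memdb, work on a copy so the memdb-resident original is never mutated.

```go
package main

import "fmt"

// RaftIndex mimics the index metadata carried on state-store objects.
type RaftIndex struct{ CreateIndex, ModifyIndex uint64 }

// FederationState is a simplified stand-in for the replicated object.
type FederationState struct {
	Datacenter string
	RaftIndex
}

// normalizeCopy returns a copy with the indexes zeroed (to aid equality
// checks), leaving the original, potentially shared with memdb, untouched.
func normalizeCopy(fs *FederationState) *FederationState {
	dup := *fs // a shallow copy suffices for this sketch
	dup.RaftIndex = RaftIndex{}
	return &dup
}

func main() {
	stored := &FederationState{Datacenter: "dc2", RaftIndex: RaftIndex{CreateIndex: 5, ModifyIndex: 9}}
	norm := normalizeCopy(stored)
	// The copy is normalized; the stored object keeps its real indexes.
	fmt.Println(norm.ModifyIndex, stored.ModifyIndex)
}
```

Mutating the object in place, as the racy test did, would have silently rewritten the indexes inside memdb for every other reader.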

With this fix, the command:
```
i=0; while /usr/local/bin/go test -count=1 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^(TestReplication_FederationStates)$'; do i=$((i + 1)); printf "$i "; done
```
which used to break on my machine in fewer than 20 runs, now runs 150+ times without any issue.

Might also fix #7575
Successfully merging this pull request may close these issues.

flake: envoy integration test gateways-local