Fix issue where consul esm can generate spurious updates #68

Merged
2 commits merged into hashicorp:master from to_upstream on Jun 23, 2020

Conversation

mikeyyuen (Contributor)

ESM can generate many repeated updates while it waits for local agents or server replicas to catch up. In the case of a service that's getting a lot of updates in a short period of time, this can delay updating any health checks for a while (we've observed metrics showing this to be on the order of 5s for some Consul cluster configurations).

This change makes sure to do a read from the leader before making a CAS update to the check value.

It additionally adds extra logging to help clarify where some errors are generated.
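To make the pattern concrete, here is a minimal sketch of a consistent read followed by a CAS-guarded check update via the Consul Go API's txn endpoint. This is illustrative rather than the exact ESM code: the node name and status transition are hypothetical and error handling is trimmed. The key detail is QueryOptions{RequireConsistent: true}, which forces the read to be answered by the leader, so the ModifyIndex used in the CAS reflects the latest committed state.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Consistent read: answered by the leader after it verifies its
	// leadership with a quorum, so the ModifyIndex is not a stale
	// follower copy. "external-node-1" is a hypothetical node name.
	checks, _, err := client.Health().Node("external-node-1",
		&api.QueryOptions{RequireConsistent: true})
	if err != nil || len(checks) == 0 {
		log.Fatal("no checks found: ", err)
	}

	// Update the first check's status, guarded by a CAS on the
	// ModifyIndex we just read from the leader.
	check := *checks[0]
	check.Status = api.HealthPassing

	ops := api.TxnOps{
		&api.TxnOp{
			Check: &api.CheckTxnOp{
				Verb:  api.CheckCAS,
				Check: check,
			},
		},
	}
	ok, resp, _, err := client.Txn().Txn(ops, nil)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		// The CAS failed: the check was modified since our read, so
		// the caller should retry with a fresh consistent read.
		fmt.Println("CAS conflict:", resp.Errors)
		return
	}
	fmt.Println("check updated")
}
```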

mikeyyuen (Contributor, Author)

@lornasong We found this performance issue last week; this may be of interest. It reduced our convergence time for updates from 5s to < 1s in a lot of cases. It should be a small tweak this time!

@lornasong lornasong added the bug label Jun 22, 2020
lornasong (Member) left a comment


@mikeyyuen - thanks for this change! This sounds great and would be really helpful for other practitioners too.

I wanted to double-check to make sure I understand this. For the API that you updated to consistent-mode, when it was in default mode, was the issue that the checks returned did not include the check in question or that it did return the check in question but with different attributes?

We don’t often use consistent-mode so I was thinking it might be helpful down the line to document this deliberate decision. I put up a suggestion for a comment - feel free to modify though and write something better.

Thanks!

check.go (review comment, resolved)
Co-authored-by: lornasong <[email protected]>
mikeyyuen (Contributor, Author)

I wanted to double-check to make sure I understand this. For the API that you updated to consistent-mode, when it was in default mode, was the issue that the checks returned did not include the check in question or that it did return the check in question but with different attributes?

@lornasong It could be both. I think the majority we saw were that the output and status for checks would be old compared to the leader's copy, i.e. unset, or older versions.

The comment suggestion is a good idea, I've merged it! TY.
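For reference (not part of the diff): Consul's HTTP API exposes three read-consistency modes, which map onto QueryOptions in the Go client. The descriptions in the comments below follow the Consul consistency-mode documentation; the PR's confirmed change is switching the read in question to consistent mode.

```go
package main

import "github.com/hashicorp/consul/api"

var (
	// Default mode: reads are served by the leader, but a deposed
	// leader can briefly answer with stale data during a transition.
	defaultOpts = &api.QueryOptions{}

	// Stale mode: any server may answer from its local state; lowest
	// latency, but Status/Output/ModifyIndex can lag the leader.
	staleOpts = &api.QueryOptions{AllowStale: true}

	// Consistent mode: the leader verifies its leadership with a
	// quorum before answering, so the reply reflects the latest
	// committed state. This is the mode the PR switches the read to.
	consistentOpts = &api.QueryOptions{RequireConsistent: true}
)
```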

lornasong (Member) left a comment


@mikeyyuen thanks for the clarification and including the comment!

@lornasong lornasong merged commit 00a2294 into hashicorp:master Jun 23, 2020
edevil (Contributor) commented Jul 24, 2020

@mikeyyuen Did you ever encounter the problem where followers had a different ModifyIndex for a check, and so the update only affected the leader and the followers never updated the status of the check?

@mikeyyuen mikeyyuen deleted the to_upstream branch July 24, 2020 09:55
mikeyyuen (Contributor, Author)

@edevil In the ESM cluster or in the Consul cluster? We didn't see that issue in ESM. However, yes, there is definitely some bug in Consul where it fails to update the followers; we think this changed around version 1.4 and is related to the txn endpoints and the caching added around that version (i.e. non-txn updates are not affected).

We have never been able to trace it fully enough to create a bug report / PR / issue. It seems intermittent, and if the follower becomes the leader it does seem to pick up the right result again. Does that sound familiar?

edevil (Contributor) commented Jul 24, 2020

We see that happening in Consul. The same check resource has different ModifyIndex values on the leader and on some/all followers (even though they are in sync on the global raft index). ESM reads the check consistently and sends the txn with the CAS value equal to what's on the leader, so the check is updated on the leader but not on the followers that disagree on the ModifyIndex value.

We only encountered this issue a few days ago, so we haven't yet tried to rebuild the followers or make one of them the leader. hashicorp/consul#8377

Caching could be related, but since we have already restarted the nodes it should have been cleared.
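For anyone trying to confirm this divergence, a rough way to observe it is to query each Consul server's HTTP API directly with a stale read, so each server answers from its own state, then compare the ModifyIndex values for the same check. A sketch (the server addresses and node name are hypothetical):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical addresses: point at each Consul server's HTTP API
	// directly so the query is not forwarded through a local agent.
	servers := []string{"10.0.0.1:8500", "10.0.0.2:8500", "10.0.0.3:8500"}

	for _, addr := range servers {
		client, err := api.NewClient(&api.Config{Address: addr})
		if err != nil {
			log.Fatal(err)
		}
		// AllowStale lets the contacted server answer from its own
		// state instead of forwarding to the leader, exposing any
		// per-node divergence in ModifyIndex.
		checks, _, err := client.Health().Node("external-node-1",
			&api.QueryOptions{AllowStale: true})
		if err != nil {
			log.Printf("%s: %v", addr, err)
			continue
		}
		for _, c := range checks {
			fmt.Printf("%s: check=%s status=%s ModifyIndex=%d\n",
				addr, c.CheckID, c.Status, c.ModifyIndex)
		}
	}
}
```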

mikeyyuen (Contributor, Author)

Interesting, I've subscribed to that issue. I would definitely like to help try and fix the bug; we spent quite a bit of time trying to find it but have so far not been able to reproduce it in a test environment. Feel free to @ me if you'd like a second opinion on anything, and I can also throw in our (anecdotal, sadly) evidence of the issue.

@lornasong lornasong added this to the 0.4.0 milestone Jul 24, 2020