Fix issue where consul esm can generate spurious updates #68

Merged
2 commits merged into hashicorp:master from to_upstream on Jun 23, 2020

Conversation

mikeyyuen (Contributor)

ESM can generate many repeated updates while it waits for local agents or server replicas to catch up. In the case of a service that's getting a lot of updates in a short period of time, this can delay updating any health checks for a while (we've observed metrics showing this to be on the order of 5s for some Consul cluster configurations).

This change makes sure to do a read from the leader before making a CAS update to the check value.

It additionally adds extra logging to help clarify where some errors are generated.
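To make the pattern concrete, here is a minimal sketch of a consistent read followed by a CAS-guarded check update via the Consul Go API's txn endpoint. This is illustrative rather than the exact ESM code: the node name and status transition are hypothetical and error handling is trimmed. The key detail is QueryOptions{RequireConsistent: true}, which forces the read to be answered by the leader, so the ModifyIndex used in the CAS reflects the latest committed state.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Consistent read: answered by the leader after it verifies its
	// leadership with a quorum, so the ModifyIndex is not a stale
	// follower copy. "external-node-1" is a hypothetical node name.
	checks, _, err := client.Health().Node("external-node-1",
		&api.QueryOptions{RequireConsistent: true})
	if err != nil || len(checks) == 0 {
		log.Fatal("no checks found: ", err)
	}

	// Update the first check's status, guarded by a CAS on the
	// ModifyIndex we just read from the leader.
	check := *checks[0]
	check.Status = api.HealthPassing

	ops := api.TxnOps{
		&api.TxnOp{
			Check: &api.CheckTxnOp{
				Verb:  api.CheckCAS,
				Check: check,
			},
		},
	}
	ok, resp, _, err := client.Txn().Txn(ops, nil)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		// The CAS failed: the check was modified since our read, so
		// the caller should retry with a fresh consistent read.
		fmt.Println("CAS conflict:", resp.Errors)
		return
	}
	fmt.Println("check updated")
}
```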

mikeyyuen (Contributor, Author)

@lornasong We found this performance issue last week; this may be of interest. It reduced our convergence time for updates from 5s to < 1s in a lot of cases. It should be a small tweak this time!

@lornasong lornasong added the bug label Jun 22, 2020
lornasong (Member) left a comment


@mikeyyuen - thanks for this change! This sounds great and would be really helpful for other practitioners too.

I wanted to double-check to make sure I understand this. For the API that you updated to consistent-mode, when it was in default mode, was the issue that the checks returned did not include the check in question or that it did return the check in question but with different attributes?

We don’t often use consistent-mode so I was thinking it might be helpful down the line to document this deliberate decision. I put up a suggestion for a comment - feel free to modify though and write something better.

Thanks!

check.go (review comment, resolved)
Co-authored-by: lornasong <[email protected]>
mikeyyuen (Contributor, Author)

I wanted to double-check to make sure I understand this. For the API that you updated to consistent-mode, when it was in default mode, was the issue that the checks returned did not include the check in question or that it did return the check in question but with different attributes?

@lornasong It could be both. I think the majority we saw were that the output and status for checks would be old compared to the leader's copy, i.e. unset, or older versions.

The comment suggestion is a good idea, I've merged it! TY.
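For reference (not part of the diff): Consul's HTTP API exposes three read-consistency modes, which map onto QueryOptions in the Go client. The descriptions in the comments below follow the Consul consistency-mode documentation; the PR's confirmed change is switching the read in question to consistent mode.

```go
package main

import "github.com/hashicorp/consul/api"

var (
	// Default mode: reads are served by the leader, but a deposed
	// leader can briefly answer with stale data during a transition.
	defaultOpts = &api.QueryOptions{}

	// Stale mode: any server may answer from its local state; lowest
	// latency, but Status/Output/ModifyIndex can lag the leader.
	staleOpts = &api.QueryOptions{AllowStale: true}

	// Consistent mode: the leader verifies its leadership with a
	// quorum before answering, so the reply reflects the latest
	// committed state. This is the mode the PR switches the read to.
	consistentOpts = &api.QueryOptions{RequireConsistent: true}
)
```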

lornasong (Member) left a comment


@mikeyyuen thanks for the clarification and including the comment!

@lornasong lornasong merged commit 00a2294 into hashicorp:master Jun 23, 2020
edevil (Contributor) commented Jul 24, 2020

@mikeyyuen Did you ever encounter the problem where followers had a different ModifyIndex for a check, and so the update only affected the leader and the followers never updated the status of the check?

@mikeyyuen mikeyyuen deleted the to_upstream branch July 24, 2020 09:55
mikeyyuen (Contributor, Author)

@edevil In the ESM cluster or in the Consul cluster? We didn't see that issue in ESM. However, yes, there is definitely some bug in Consul where it fails to update the followers; we think this changed around version 1.4 and is related to the txn endpoints and the caching added around that version (i.e. non-txn updates are not affected).

We have never been able to trace it fully enough to create a bug report / PR / issue. It seems intermittent, and if the follower becomes the leader it does seem to pick up the right result again. Does that sound familiar?

edevil (Contributor) commented Jul 24, 2020

We see that happening in Consul. The same check resource has different ModifyIndex values on the leader and on some/all followers (even though they are in sync on the global raft index). ESM reads the check consistently and sends the txn with the CAS value equal to what's on the leader, so the check is updated on the leader but not on the followers that disagree on the ModifyIndex value.

We only encountered this issue a few days ago, so we haven't yet tried to rebuild the followers or make one of them the leader. hashicorp/consul#8377

Caching could be related, but since we have already restarted the nodes it should have been cleared.
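For anyone trying to confirm this divergence, a rough way to observe it is to query each Consul server's HTTP API directly with a stale read, so each server answers from its own state, then compare the ModifyIndex values for the same check. A sketch (the server addresses and node name are hypothetical):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical addresses: point at each Consul server's HTTP API
	// directly so the query is not forwarded through a local agent.
	servers := []string{"10.0.0.1:8500", "10.0.0.2:8500", "10.0.0.3:8500"}

	for _, addr := range servers {
		client, err := api.NewClient(&api.Config{Address: addr})
		if err != nil {
			log.Fatal(err)
		}
		// AllowStale lets the contacted server answer from its own
		// state instead of forwarding to the leader, exposing any
		// per-node divergence in ModifyIndex.
		checks, _, err := client.Health().Node("external-node-1",
			&api.QueryOptions{AllowStale: true})
		if err != nil {
			log.Printf("%s: %v", addr, err)
			continue
		}
		for _, c := range checks {
			fmt.Printf("%s: check=%s status=%s ModifyIndex=%d\n",
				addr, c.CheckID, c.Status, c.ModifyIndex)
		}
	}
}
```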

mikeyyuen (Contributor, Author)

Interesting, I've subscribed to that issue. I would definitely like to help try and fix the bug; we spent quite a bit of time trying to find it but have so far not been able to reproduce it in a test environment. Feel free to @ me if you'd like a second opinion on anything, and I can also throw in our (anecdotal, sadly) evidence of the issue.

@lornasong lornasong added this to the 0.4.0 milestone Jul 24, 2020