Fix issue where consul esm can generate spurious updates #68
Conversation
ESM can generate many repeated updates while it waits for local agents or server replicas to catch up. For a service that is receiving a lot of updates in a short period of time, this can delay updating any health checks for a while (we've observed metrics showing this to be on the order of 5s for some Consul cluster configurations). This change makes sure to do a read from the leader before making a CAS update to the check value.
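For illustration, here is a minimal sketch of the pattern the change describes: a consistent-mode (leader) read followed by a compare-and-set, so the CAS index is not based on stale follower data. It uses the hashicorp/consul/api Go client with the KV store standing in for wherever the check value is kept; the function and key names are hypothetical and this is not the actual ESM code.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// updateCheckValue sketches the pattern described above: read the current
// value in consistent mode (forcing the read through the leader) so the CAS
// index is current, then attempt the compare-and-swap.
func updateCheckValue(client *api.Client, key string, newValue []byte) error {
	kv := client.KV()

	// Consistent mode guarantees the read reflects the leader's state, at the
	// cost of an extra round trip. A default (possibly stale) read can return
	// an old ModifyIndex and cause repeated CAS retries and spurious updates.
	pair, _, err := kv.Get(key, &api.QueryOptions{RequireConsistent: true})
	if err != nil {
		return fmt.Errorf("consistent read failed: %w", err)
	}

	update := &api.KVPair{Key: key, Value: newValue}
	if pair != nil {
		// CAS only succeeds if the key has not been modified since this index.
		update.ModifyIndex = pair.ModifyIndex
	}

	ok, _, err := kv.CAS(update, nil)
	if err != nil {
		return fmt.Errorf("CAS write failed: %w", err)
	}
	if !ok {
		return fmt.Errorf("CAS conflict for %q: value changed since read", key)
	}
	return nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	if err := updateCheckValue(client, "esm/checks/example", []byte("passing")); err != nil {
		log.Println(err)
	}
}
```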
@lornasong We found this performance issue last week; this may be of interest. It reduced our convergence time for updates from 5s to < 1s in a lot of cases. It should be a small tweak this time!
@mikeyyuen - thanks for this change! This sounds great and would be really helpful for other practitioners too.
I wanted to double-check to make sure I understand this. For the API that you updated to consistent-mode, when it was in default mode, was the issue that the checks returned did not include the check in question or that it did return the check in question but with different attributes?
We don’t often use consistent-mode so I was thinking it might be helpful down the line to document this deliberate decision. I put up a suggestion for a comment - feel free to modify though and write something better.
Thanks!
Co-authored-by: lornasong <[email protected]>
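As a purely hypothetical illustration of the kind of documenting comment and consistent-mode option being discussed (the actual suggestion lives in the review diff and is not reproduced here), assuming the hashicorp/consul/api client:

```go
package main

import "github.com/hashicorp/consul/api"

// buildCheckQueryOptions is a hypothetical helper showing the kind of
// documenting comment discussed above; it is not the suggestion from the
// review itself.
func buildCheckQueryOptions() *api.QueryOptions {
	return &api.QueryOptions{
		// Deliberate choice: use consistent mode so the read is served by the
		// leader. A default-mode read can come from a follower that has not
		// yet applied the latest check update, which makes ESM issue repeated,
		// spurious updates while it waits for replication to catch up.
		RequireConsistent: true,
	}
}

func main() {
	_ = buildCheckQueryOptions()
}
```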
@lornasong It could be both. I think the majority we saw were that the output and status for checks would be old compared to the leader's copy, i.e. unset, or older versions. The comment suggestion is a good idea, I've merged it! TY.
@mikeyyuen thanks for the clarification and including the comment!
@mikeyyuen Did you ever encounter the problem where followers had a different ModifyIndex for a check, and so the update only affected the leader and the followers never updated the status of the check?
@edevil in the ESM cluster or in the Consul cluster? We didn't see that issue in ESM; however, yes, there is 100% some bug in Consul where it fails to update the followers. We think this changed around version 1.4 and is related to the txn endpoints and the caching added around that version (i.e. non-txn updates are not affected). We have never been able to trace it fully enough to create a bug report / PR / issue. It seems intermittent, and if the follower becomes the leader it does seem to pick up the right result again. Does that sound familiar?
We see that happening in Consul. The same check resource has a different ModifyIndex on different nodes. We just encountered this issue a few days ago, so we haven't yet tried to rebuild the followers or make one of them the leader. hashicorp/consul#8377 Caching could be related, but since we have already restarted the nodes it should have been cleared.
Interesting, I've subscribed to that issue. I would definitely like to help try and fix the bug; we spent quite a bit of time trying to find it but have so far not been able to reproduce it in a test environment. Feel free to @ me if you'd like a second opinion on stuff, and I can also throw in our (anecdotal, sadly) evidence of the issue.
ESM can generate many repeated updates while it waits for local agents or server replicas to
catch up. For a service that is receiving a lot of updates in a short period of time,
this can delay updating any health checks for a while (we've observed metrics showing this to be
on the order of 5s for some Consul cluster configurations).
This change makes sure to do a read from the leader before making a CAS update to the check value.
Additionally adds extra logging to help clarify where some errors are generated.