-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consul registry performance improvements #928
Conversation
Two changes: 1. Limit the initial list of Health Checks to those with the configured Tag Prefix 2. Reduce the amount of loops done in passingServices
Hi @nathanejohnson and @KTruesdellENA, just wanted to check in and see if you have any feedback. |
@ddreier this looks awesome! I'd like to get it merged if I can get the access. |
Would definitely be nice to get some tests added, and I agree, this looks great. |
I will have a crack at adding some tests next week if only to get this one merged. |
I know ddreier is fairly busy, but we've been running 30 or so instances of Fabio with this code with significantly improved performance 😄 We've been running this way for months now in production. |
Awesome, that's great to hear. I did another look through the code and additional tests may not be required. However, it would be good to update the documentation to reflect these changes. |
Unfortunately I'm not able to commit additional time right now. As @Shackelford-Arden mentioned, we've been running Fabio with these changes in production for over a year at this point. Our 30 or so Fabio instances handle hundreds of millions of public and internal requests per day. The anecdote isn't the same as tests in code, for sure. But I hope it helps with confidence. |
@ddreier thank you, and yes, it definitely helps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test coverage on passing.go is 100% and service.go is 0% (both unchanged from master), I've been running this code on my systems for at least 7 months without issue so I'm giving it a "LGTM".
Hi, it's me again trying to improve route update performance for our absurd number of routes. 😁
There's two major changes here:
The first is adding "filtering" so that only Consul Health Checks with the applicable Service Tag Prefix (and serf/maintenance checks) are considered during the rest of the process. In our Production environments this brings down the number of checks that Fabio has to look at from ~40k down to ~12k.
The second change is simplifying the$O(n^5)$ process to $O(n^2)$ . Each of the "helper" functions (
passingServices
function from ancountChecks
isAgentCritical
isNodeInMaintenance
isServiceInMaintenance
) did their own loop over each of the Consul Health Checks which added up in time very quickly.Both of these changes drastically reduced the amount of time spent processing responses from Consul. In a test using real Consul data but not serving traffic we saw a reduction from around 90 seconds to less than 10 seconds. We've been testing Fabio with these changes in a non-production environment for about a week now, and have not detected any issues.
(The increase in
makeConfig
shown on the graph is due to how I measured it. It includes time waiting to write on the channel back to the main Fabio goroutine. This wasn't an issue before because processing the Consul data took longer than the time needed to build the actual route table (~10s) but the scales have now flipped. Building the Route Table takes longer than processing the data from Consul.)I also had profiling running during my tests. Here's stock Fabio:
After adding the Check "filtering":
After simplifying
passingServices
:I'm very interested in any feedback about these changes and will do what I can to help satisfy any concerns. Thanks!