DeferCheck is leaking, causing high CPU usage and raft index churn. #18429
Comments
Another (kinda related) issue -- on that particular node, most of the health check outputs are stable (Consul Connect sidecar TCP HC & alias HC); only a few are changing. However, I noticed the number of registered DeferChecks is almost the same as the total number of health checks that exist on that node. Given the condition:

```go
// Do nothing if update is idempotent
if c.Check.Status == status && c.Check.Output == output {
	return
}
```

something does not add up... So with the following change (debug logging added to the existing block in `agent/local/state.go`):

```diff
 if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
+	oldOutput := c.Check.Output
 	c.Check.Output = output
 	if c.DeferCheck == nil {
+		l.logger.Debug("REGISTER defercheck", "check", id, "output1", oldOutput, "output2", output)
 		d := l.config.CheckUpdateInterval
 		// ...
```

I noticed the following behavior:
The check output keeps flipping back and forth. This happens reliably after restarting Consul with existing mesh services registered in a Consul Connect enabled Nomad cluster setup. I suspect this is due to the ordering of the checks registered through Consul vs through Nomad.

Consul (TCP first, then Alias):

```go
return []*structs.CheckType{
	{
		Name:     "Connect Sidecar Listening",
		TCP:      ipaddr.FormatAddressPort(checkAddress, port),
		Interval: 10 * time.Second,
	},
	{
		Name:         "Connect Sidecar Aliasing " + serviceID,
		AliasService: serviceID,
	},
}
```

vs Nomad (Alias first, then TCP):

```go
checks := api.AgentServiceChecks{{
	Name:         "Connect Sidecar Aliasing " + serviceID,
	AliasService: serviceID,
}}
if !css.DisableDefaultTCPCheck {
	checks = append(checks, &api.AgentServiceCheck{
		Name:     "Connect Sidecar Listening",
		TCP:      net.JoinHostPort(cMapping.HostIP, strconv.Itoa(cMapping.Value)),
		Interval: "10s",
	})
}
```

which causes the weird flipping. Either unifying the ordering of the checks between the two projects, or sorting the checks before registering them (which I think would be the better option, as sketched below), should help resolve this issue, though I haven't tested it.
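This is not code from either project -- just a minimal sketch of the sorting idea, assuming the checks arrive as `api.AgentServiceChecks` and that `Name` is a good enough sort key; the package and function names are made up:

```go
package checkorder // hypothetical helper, not part of Consul or Nomad

import (
	"sort"

	"github.com/hashicorp/consul/api"
)

// sortChecks orders sidecar checks deterministically by Name, so the
// definition Consul builds and the one Nomad registers compare equal
// regardless of the order each project constructs the slice in.
func sortChecks(checks api.AgentServiceChecks) {
	sort.SliceStable(checks, func(i, j int) bool {
		return checks[i].Name < checks[j].Name
	})
}
```

Sorting at registration time would keep the fix in one place rather than requiring both projects to agree on a hard-coded ordering.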
@fredwangwang, thanks for reporting this and for the detailed analysis. We will look into this issue.
@fredwangwang, I think the proposed solution makes sense. Will you be able to make a PR? In addition, I'd like to know more about your use case. If I understand correctly, to …
Hi @huikang
No, too lazy to file a proper PR with tests. Think I've done my fair share hehe.
And I wondered the same. We are using a pretty standard Nomad setup that uses the Nomad and Consul integration. Almost all of our services are registered using the service block in Nomad jobs. So there could be a bug in Nomad or Consul; that's for you guys to figure out :) TBH I am less interested in that, as frequent service/check registration, though annoying, should be idempotent and must NOT cause goroutine leaks and system instability.
@fredwangwang, no problem. I will create a PR based on your investigation. Thanks.
Thanks for the PR @huikang! There is another related issue that needs some attention -- not as severe as the leaking, but it's something :) #18429 (comment)
Overview of the Issue
With many services registered with changing health check output, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, CPU starts to spike up gradually again after `check_update_interval` (24h) / 2, eventually reaching about the same high CPU usage as before the restart. The raft index is also changing at a rate of ~5000 per minute, causing issues with health check blocking queries as well.
Causes
After investigation, I noticed `DeferCheck` is leaking. With the following patch applied in `agent/local/state.go`:

Patch

I saw the following messages in the log:

On the particular node where the log was captured, there were only ~50 health checks registered. Yet more than 800 DeferChecks had been queued up after running for ~5 min, and the count keeps increasing, eventually putting CPU pressure on the servers when the DeferChecks get executed.
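The mismatch (~50 checks vs 800+ queued DeferChecks) suggests the reference to the pending timer gets dropped when a check's local state is replaced, so the `DeferCheck == nil` guard never sees it. The following is not Consul's code -- just a minimal, self-contained Go sketch of that general failure mode, with all names made up:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// checkState is a hypothetical, simplified stand-in for the agent's local
// check state; the real types in agent/local/state.go are more involved.
type checkState struct {
	output     string
	deferTimer *time.Timer // pending deferred sync, nil if none tracked
}

var pendingTimers atomic.Int64

// scheduleDeferredSync mimics the "if c.DeferCheck == nil" branch: it only
// schedules a new deferred sync when none is tracked on *this* state value.
func scheduleDeferredSync(c *checkState, interval time.Duration) {
	if c.deferTimer == nil {
		pendingTimers.Add(1)
		c.deferTimer = time.AfterFunc(interval, func() {
			pendingTimers.Add(-1) // the "sync to servers" would happen here
		})
	}
}

func main() {
	const interval = 30 * time.Second
	state := &checkState{}

	for i := 0; i < 5; i++ {
		scheduleDeferredSync(state, interval)

		// Re-registering the check replaces the state wholesale and drops
		// the reference to the pending timer -- the next update sees
		// deferTimer == nil and schedules yet another one.
		state = &checkState{output: state.output}
	}

	// Prints 5 even though there is only one check.
	fmt.Println("pending deferred syncs:", pendingTimers.Load())
}
```

In the real agent, each orphaned timer presumably still fires a deferred sync later, which would line up with the raft index churn and server CPU pressure described above.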
Resolution
Adding the above lines to `setCheckStateLocked` helped to resolve the issue. Verified by running the patched agent for some period of time: the number of registered DeferChecks is stable, coming down from an ever-increasing number to about the same number of DeferChecks as there are health checks on the node.

Too lazy to file a proper PR with tests...
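The actual fix lines live in the collapsed patch above, so they are not visible here. As a rough, assumption-heavy illustration only (reusing the hypothetical types from the sketch in the Causes section), the kind of guard that stops the accumulation looks like this:

```go
// Hypothetical continuation of the earlier sketch: when the tracked state
// for an existing check is replaced, make sure a pending deferred sync is
// not silently forgotten -- either carry it over to the new state or stop it.
func setCheckState(states map[string]*checkState, id string, next *checkState) {
	if prev, ok := states[id]; ok && prev.deferTimer != nil {
		// Carrying the timer over keeps the original defer window; calling
		// prev.deferTimer.Stop() instead would drop the pending sync.
		next.deferTimer = prev.deferTimer
	}
	states[id] = next
}
```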
Consul info for both Client and Server
Consul v1.15.4
Operating system and Environment details
Linux