DeferCheck is leaking, causing high CPU usage and raft index churn. #18429

fredwangwang opened this issue Aug 10, 2023 · 7 comments

fredwangwang commented Aug 10, 2023

Overview of the Issue

With many services registered whose health check output keeps changing, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, the CPU starts to ramp up again roughly check_update_interval (24h) / 2 after the restart, eventually reaching about the same high level as before.

[screenshot: Consul server CPU usage]

The raft index is also changing at a rate of ~5000 per minute, causing issues with healthcheck blocking queries as well.


Causes

After investigating, we noticed that DeferCheck is leaking. With the following debug patch applied to agent/local/state.go:

Patch
--- state.go	2023-08-10 15:18:29.734073200 +0800
+++ state.go.bk-evidence	2023-08-10 15:20:22.963909000 +0800
@@ -26,6 +26,8 @@
 	"github.com/mitchellh/copystructure"
 )
 
+var deferCheckCount atomic.Int64
+
 var StateCounters = []prometheus.CounterDefinition{
 	{
 		Name: []string{"acl", "blocked", "service", "registration"},
@@ -231,6 +233,13 @@
 		agentEnterpriseMeta: *structs.NodeEnterpriseMetaInPartition(c.Partition),
 	}
 	l.SetDiscardCheckOutput(c.DiscardCheckOutput)
+
+	go func() {
+		for {
+			time.Sleep(time.Second * 10)
+			l.logger.Info("COUNT deferCheck", "count", deferCheckCount.Load())
+		}
+	}()
 	return l
 }
 
@@ -710,12 +719,16 @@
 	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
 		c.Check.Output = output
 		if c.DeferCheck == nil {
+			deferCheckCount.Add(1)
 			d := l.config.CheckUpdateInterval
 			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
 			c.DeferCheck = time.AfterFunc(intv, func() {
 				l.Lock()
 				defer l.Unlock()
 
+				deferCheckCount.Add(-1)
+				cCreated := c
+
 				c := l.checks[id]
 				if c == nil {
 					return
@@ -724,6 +737,9 @@
 				if c.Deleted {
 					return
 				}
+
+				l.logger.Info("UpdateCheck deferfunc triggered", "check", c.Check.CheckID, "ref", cCreated)
+
 				c.InSync = false
 				l.TriggerSyncChanges()
 			})
@@ -834,6 +850,10 @@
 	existing := l.checks[id]
 	if existing != nil {
 		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			l.logger.Error("FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!", "check", id)
+		}
 	}
 
 	l.checks[id] = c
@@ -1276,6 +1296,7 @@
 			if c.DeferCheck != nil {
 				c.DeferCheck.Stop()
 				c.DeferCheck = nil
+				deferCheckCount.Add(-1)
 			}
 			err = l.syncCheck(id)
 		default:
@@ -1392,6 +1413,7 @@
 	c := l.checks[id]
 	if c != nil && c.DeferCheck != nil {
 		c.DeferCheck.Stop()
+		deferCheckCount.Add(-1)
 	}
 	delete(l.checks, id)
 }

we saw the following messages in the log:

2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:1
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:2
2023-08-10T03:14:48.962Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=_nomad-check-caf1abcff4b2365c91dd87054358af72e34816b3
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:1
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:2
2023-08-10T03:14:49.043Z [INFO]  agent: COUNT deferCheck: count=828

On the particular node where the log was captured, there were only ~50 health checks registered. Yet more than 800 DeferChecks had been queued up after running for ~5 min, and the number keeps increasing, eventually putting CPU pressure on the servers once the DeferChecks fire.
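
For illustration, here is a minimal standalone Go sketch of the leak pattern described above (made-up names, not Consul code): a map entry holding a pending *time.Timer is overwritten on re-registration by a fresh entry whose timer field is nil, so the old timer is neither stopped nor reused, and another one gets scheduled on the next update.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type checkState struct {
	deferTimer *time.Timer // analogous to CheckState.DeferCheck
}

func main() {
	var pending atomic.Int64
	checks := map[string]*checkState{}

	register := func(id string) {
		// Re-registration always installs a brand-new state with a nil timer,
		// mirroring setCheckStateLocked before the fix.
		checks[id] = &checkState{}
	}

	update := func(id string) {
		c := checks[id]
		if c.deferTimer == nil {
			pending.Add(1)
			c.deferTimer = time.AfterFunc(time.Hour, func() { pending.Add(-1) })
		}
	}

	for i := 0; i < 5; i++ {
		register("check-1") // overwrites the entry, dropping the reference to the pending timer
		update("check-1")   // schedules yet another timer for the "same" check
	}
	fmt.Println("pending defer timers:", pending.Load()) // prints 5, not 1
}

Each register/update round adds one more pending timer, which matches the shape of the count=828 growth seen above.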

Resolution

func (l *State) setCheckStateLocked(c *CheckState) {
	id := c.Check.CompoundCheckID()
	existing := l.checks[id]
	if existing != nil {
		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			c.DeferCheck = existing.DeferCheck
+		}
	}

	l.checks[id] = c

	// If this is a check for an aliased service, then notify the waiters.
	l.notifyIfAliased(c.Check.CompoundServiceID())

	l.TriggerSyncChanges()
}

Adding the above lines to setCheckStateLocked resolved the issue.
Verified by running the patched agent for some time: the number of registered DeferChecks is now stable, dropping from an ever-increasing number to roughly the same number as the health checks on the node:

2023-08-10T05:45:02.171Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:12.172Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:22.173Z [INFO]  agent: COUNT deferCheck: count=51
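
In terms of the minimal sketch above (assumed names, not the actual Consul change), the fix amounts to carrying any pending timer over to the new state on re-registration, so at most one deferred update exists per check:

func register(checks map[string]*checkState, id string) {
	fresh := &checkState{}
	if existing := checks[id]; existing != nil && existing.deferTimer != nil {
		// Reuse the pending timer instead of dropping the only reference to it.
		fresh.deferTimer = existing.deferTimer
	}
	checks[id] = fresh
}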

Too lazy to file a proper PR with tests...

Consul info for both Client and Server

Consul v1.15.4

Operating system and Environment details

Linux

fredwangwang commented Aug 10, 2023

another (kinda related) issue --

On that particular node, most of the health check outputs are stable (Consul Connect sidecar TCP HC & alias HC); only a few are changing. However, the number of registered DeferChecks is almost the same as the total number of health checks that exist on that particular node.

Given the condition:

	// Do nothing if update is idempotent
	if c.Check.Status == status && c.Check.Output == output {
		return
	}

Something does not add up...

So, with the following change:

...
	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
+		oldOutput := c.Check.Output
		c.Check.Output = output
		if c.DeferCheck == nil {
+			l.logger.Debug("REGISTER defercheck", "check", id, "output1", oldOutput, "output2", output)

			d := l.config.CheckUpdateInterval

I noticed the following behavior:

2023-08-10T05:36:07.624Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-d63f2c10-c8ce-7b8a-d1d5-d83955c30fb4-group-search-hub-api-ldd-69-group-search-hub-api-ldd-69-9090-sidecar-proxy:2 output1="TCP connect 10.138.99.93:23885: Success" output2="All checks passing."
2023-08-10T05:36:07.624Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-d63f2c10-c8ce-7b8a-d1d5-d83955c30fb4-group-search-hub-api-ldd-69-group-search-hub-api-ldd-69-9090-sidecar-proxy:1 output1="TCP connect 10.138.99.93:23885: Success" output2="All checks passing."
2023-08-10T05:36:07.715Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-efc18a49-8c74-844b-24e6-e17888403b58-group-device-log-api-ldd-1414-group-device-log-api-ldd-1414-10000-sidecar-proxy:1 output1="All checks passing." output2="TCP connect 10.138.99.93:20417: Success"

The check output is flipping between TCP connect ......: Success and All checks passing. It seems like the check ordering gets mixed up.

This happens reliably after restarting Consul with existing mesh services registered, in a Consul Connect-enabled Nomad cluster setup.

I suspect this is due to the ordering of the checks registered through Consul vs. through Nomad:

Consul (TCP first then Alias):

return []*structs.CheckType{
		{
			Name:     "Connect Sidecar Listening",
			TCP:      ipaddr.FormatAddressPort(checkAddress, port),
			Interval: 10 * time.Second,
		},
		{
			Name:         "Connect Sidecar Aliasing " + serviceID,
			AliasService: serviceID,
		},
	}

vs Nomad (Alias first then TCP)

	checks := api.AgentServiceChecks{{
		Name:         "Connect Sidecar Aliasing " + serviceID,
		AliasService: serviceID,
	}}
	if !css.DisableDefaultTCPCheck {
		checks = append(checks, &api.AgentServiceCheck{
			Name:     "Connect Sidecar Listening",
			TCP:      net.JoinHostPort(cMapping.HostIP, strconv.Itoa(cMapping.Value)),
			Interval: "10s",
		})
	}

Which causes the weird flipping.

Either unifying the ordering of the checks between the two projects, or sorting the checks before registering (which I think would be the better option; see the sketch below), should help resolve this issue, though I haven't tested it.
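
As a rough sketch of the sort-before-registering idea (untested; sortSidecarChecks is a hypothetical helper built on the public consul/api types, and sorting by Name is just one possible canonical order):

package main

import (
	"fmt"
	"sort"

	"github.com/hashicorp/consul/api"
)

// sortSidecarChecks orders sidecar checks by name so the registration order does
// not depend on whether Consul or Nomad built the slice.
func sortSidecarChecks(checks api.AgentServiceChecks) {
	sort.Slice(checks, func(i, j int) bool {
		return checks[i].Name < checks[j].Name
	})
}

func main() {
	// Nomad order: alias check first, TCP check second.
	checks := api.AgentServiceChecks{
		{Name: "Connect Sidecar Aliasing web", AliasService: "web"},
		{Name: "Connect Sidecar Listening", TCP: "10.0.0.1:20000", Interval: "10s"},
	}
	sortSidecarChecks(checks)
	for _, c := range checks {
		fmt.Println(c.Name) // same order regardless of which project built the slice
	}
}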


huikang commented Aug 15, 2023

@fredwangwang, thanks for reporting this and for the detailed analysis. We will look into this issue.

huikang self-assigned this Aug 15, 2023

huikang commented Aug 22, 2023

@fredwangwang , I think the proposed solution makes sense. Will you be able to make a PR?

In addition, I'd like to know more about your use case. If I understand correctly, to trigger this leak you seem to keep adding state.addCheckLocked(...) through the service register API, leading to the leak of the DeferCheck from the previous check with the same ID. If so, I am wondering why the check is being added repeatedly. Please correct me if I am wrong. Thanks.

@fredwangwang

Hi @huikang

Will you be able to make a PR?

No, too lazy to file a proper PR with tests. Think I've done my fair share hehe.

you seem to keep adding state.addCheckLocked(...) through the service register

and I wondered the same. We are using a pretty standard setup with Nomad that uses the Nomad and Consul integration. Almost all of our services are registered using the service block in Nomad jobs.

So there could be a bug in Nomad or Consul; that's for you guys to figure out :) TBH I am less interested in that, as frequent service/check registration, though annoying, should be idempotent and must NOT cause goroutine leaks and system instability.


huikang commented Aug 22, 2023

@fredwangwang, no problem. I will create a PR based on your investigation. Thanks.

@fredwangwang

thanks for the PR @huikang!

there is another related issue that needs some attention -- not as severe as the leaking, but it's something :) #18429 (comment)

@suikast42

Same here
