DeferCheck is leaking, causing high CPU usage and raft index churn. #18429

fredwangwang opened this issue Aug 10, 2023 · 7 comments

fredwangwang commented Aug 10, 2023

Overview of the Issue

With many services registered whose health check output keeps changing, we noticed very high CPU usage on the Consul servers. After restarting the Consul cluster, the CPU starts to ramp up again roughly check_update_interval (24h) / 2 after the restart, eventually reaching about the same high level as before.

[screenshot: Consul server CPU usage]

The raft index is also changing at a rate of ~5000 per minute, causing issues with healthcheck blocking queries as well.


Causes

After investigating, we noticed that DeferCheck is leaking. With the following debug patch applied to agent/local/state.go:

Patch
--- state.go	2023-08-10 15:18:29.734073200 +0800
+++ state.go.bk-evidence	2023-08-10 15:20:22.963909000 +0800
@@ -26,6 +26,8 @@
 	"github.com/mitchellh/copystructure"
 )
 
+var deferCheckCount atomic.Int64
+
 var StateCounters = []prometheus.CounterDefinition{
 	{
 		Name: []string{"acl", "blocked", "service", "registration"},
@@ -231,6 +233,13 @@
 		agentEnterpriseMeta: *structs.NodeEnterpriseMetaInPartition(c.Partition),
 	}
 	l.SetDiscardCheckOutput(c.DiscardCheckOutput)
+
+	go func() {
+		for {
+			time.Sleep(time.Second * 10)
+			l.logger.Info("COUNT deferCheck", "count", deferCheckCount.Load())
+		}
+	}()
 	return l
 }
 
@@ -710,12 +719,16 @@
 	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
 		c.Check.Output = output
 		if c.DeferCheck == nil {
+			deferCheckCount.Add(1)
 			d := l.config.CheckUpdateInterval
 			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
 			c.DeferCheck = time.AfterFunc(intv, func() {
 				l.Lock()
 				defer l.Unlock()
 
+				deferCheckCount.Add(-1)
+				cCreated := c
+
 				c := l.checks[id]
 				if c == nil {
 					return
@@ -724,6 +737,9 @@
 				if c.Deleted {
 					return
 				}
+
+				l.logger.Info("UpdateCheck deferfunc triggered", "check", c.Check.CheckID, "ref", cCreated)
+
 				c.InSync = false
 				l.TriggerSyncChanges()
 			})
@@ -834,6 +850,10 @@
 	existing := l.checks[id]
 	if existing != nil {
 		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			l.logger.Error("FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!", "check", id)
+		}
 	}
 
 	l.checks[id] = c
@@ -1276,6 +1296,7 @@
 			if c.DeferCheck != nil {
 				c.DeferCheck.Stop()
 				c.DeferCheck = nil
+				deferCheckCount.Add(-1)
 			}
 			err = l.syncCheck(id)
 		default:
@@ -1392,6 +1413,7 @@
 	c := l.checks[id]
 	if c != nil && c.DeferCheck != nil {
 		c.DeferCheck.Stop()
+		deferCheckCount.Add(-1)
 	}
 	delete(l.checks, id)
 }

we saw the following messages in the log:

2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:1
2023-08-10T03:14:48.911Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-68721c9c-cc5b-205b-bf53-9e937ca0f75c-group-search-hub-api-ldd-1651-group-search-hub-api-ldd-1651-9090-sidecar-proxy:2
2023-08-10T03:14:48.962Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=_nomad-check-caf1abcff4b2365c91dd87054358af72e34816b3
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:1
2023-08-10T03:14:48.976Z [ERROR] agent: FATAL!!! overwriting existing check with defered check to nil, deferCheck leaked!!!: check=service:_nomad-task-397dad29-b7d0-456e-651a-d6757fc9c763-group-device-sampling-api-ldd-875-group-device-sampling-api-ldd-875-10000-sidecar-proxy:2
2023-08-10T03:14:49.043Z [INFO]  agent: COUNT deferCheck: count=828

On the particular node where the log was captured, there were only ~50 health checks registered. Yet more than 800 DeferChecks had been queued up after running for ~5 min, and the number keeps increasing, eventually putting CPU pressure on the servers once the DeferChecks fire.
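
For illustration, here is a minimal standalone Go sketch of the leak pattern described above (made-up names, not Consul code): a map entry holding a pending *time.Timer is overwritten on re-registration by a fresh entry whose timer field is nil, so the old timer is neither stopped nor reused, and another one gets scheduled on the next update.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type checkState struct {
	deferTimer *time.Timer // analogous to CheckState.DeferCheck
}

func main() {
	var pending atomic.Int64
	checks := map[string]*checkState{}

	register := func(id string) {
		// Re-registration always installs a brand-new state with a nil timer,
		// mirroring setCheckStateLocked before the fix.
		checks[id] = &checkState{}
	}

	update := func(id string) {
		c := checks[id]
		if c.deferTimer == nil {
			pending.Add(1)
			c.deferTimer = time.AfterFunc(time.Hour, func() { pending.Add(-1) })
		}
	}

	for i := 0; i < 5; i++ {
		register("check-1") // overwrites the entry, dropping the reference to the pending timer
		update("check-1")   // schedules yet another timer for the "same" check
	}
	fmt.Println("pending defer timers:", pending.Load()) // prints 5, not 1
}

Each register/update round adds one more pending timer, which matches the shape of the count=828 growth seen above.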

Resolution

func (l *State) setCheckStateLocked(c *CheckState) {
	id := c.Check.CompoundCheckID()
	existing := l.checks[id]
	if existing != nil {
		c.InSync = c.Check.IsSame(existing.Check)
+
+		if existing.DeferCheck != nil && c.DeferCheck == nil {
+			c.DeferCheck = existing.DeferCheck
+		}
	}

	l.checks[id] = c

	// If this is a check for an aliased service, then notify the waiters.
	l.notifyIfAliased(c.Check.CompoundServiceID())

	l.TriggerSyncChanges()
}

Adding the above lines to setCheckStateLocked resolved the issue.
Verified by running the patched agent for some time: the number of registered DeferChecks is now stable, dropping from an ever-increasing number to roughly the same number as the health checks on the node:

2023-08-10T05:45:02.171Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:12.172Z [INFO]  agent: COUNT deferCheck: count=51
2023-08-10T05:45:22.173Z [INFO]  agent: COUNT deferCheck: count=51
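
In terms of the minimal sketch above (assumed names, not the actual Consul change), the fix amounts to carrying any pending timer over to the new state on re-registration, so at most one deferred update exists per check:

func register(checks map[string]*checkState, id string) {
	fresh := &checkState{}
	if existing := checks[id]; existing != nil && existing.deferTimer != nil {
		// Reuse the pending timer instead of dropping the only reference to it.
		fresh.deferTimer = existing.deferTimer
	}
	checks[id] = fresh
}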

Too lazy to file a proper PR with tests...

Consul info for both Client and Server

Consul v1.15.4

Operating system and Environment details

Linux

fredwangwang commented Aug 10, 2023

another (kinda related) issue --

On that particular node, most of the health check outputs are stable (Consul Connect sidecar TCP HC & alias HC); only a few are changing. However, the number of registered DeferChecks is almost the same as the total number of health checks that exist on that particular node.

Given the condition:

	// Do nothing if update is idempotent
	if c.Check.Status == status && c.Check.Output == output {
		return
	}

Something does not add up...

So, with the following change:

...
	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
+		oldOutput := c.Check.Output
		c.Check.Output = output
		if c.DeferCheck == nil {
+			l.logger.Debug("REGISTER defercheck", "check", id, "output1", oldOutput, "output2", output)

			d := l.config.CheckUpdateInterval

I noticed the following behavior:

2023-08-10T05:36:07.624Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-d63f2c10-c8ce-7b8a-d1d5-d83955c30fb4-group-search-hub-api-ldd-69-group-search-hub-api-ldd-69-9090-sidecar-proxy:2 output1="TCP connect 10.138.99.93:23885: Success" output2="All checks passing."
2023-08-10T05:36:07.624Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-d63f2c10-c8ce-7b8a-d1d5-d83955c30fb4-group-search-hub-api-ldd-69-group-search-hub-api-ldd-69-9090-sidecar-proxy:1 output1="TCP connect 10.138.99.93:23885: Success" output2="All checks passing."
2023-08-10T05:36:07.715Z [DEBUG] agent: REGISTER defercheck: check=service:_nomad-task-efc18a49-8c74-844b-24e6-e17888403b58-group-device-log-api-ldd-1414-group-device-log-api-ldd-1414-10000-sidecar-proxy:1 output1="All checks passing." output2="TCP connect 10.138.99.93:20417: Success"

The check output is flipping between TCP connect ......: Success and All checks passing. It seems like the check ordering gets mixed up.

This happens reliably after restarting Consul with existing mesh services registered, in a Consul Connect-enabled Nomad cluster setup.

I suspect this is due to the ordering of the checks registered through Consul vs. through Nomad:

Consul (TCP first then Alias):

return []*structs.CheckType{
		{
			Name:     "Connect Sidecar Listening",
			TCP:      ipaddr.FormatAddressPort(checkAddress, port),
			Interval: 10 * time.Second,
		},
		{
			Name:         "Connect Sidecar Aliasing " + serviceID,
			AliasService: serviceID,
		},
	}

vs Nomad (Alias first then TCP)

	checks := api.AgentServiceChecks{{
		Name:         "Connect Sidecar Aliasing " + serviceID,
		AliasService: serviceID,
	}}
	if !css.DisableDefaultTCPCheck {
		checks = append(checks, &api.AgentServiceCheck{
			Name:     "Connect Sidecar Listening",
			TCP:      net.JoinHostPort(cMapping.HostIP, strconv.Itoa(cMapping.Value)),
			Interval: "10s",
		})
	}

Which causes the weird flipping.

Either unifying the ordering of the checks between the two projects, or sorting the checks before registering (which I think would be the better option; see the sketch below), should help resolve this issue, though I haven't tested it.
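
As a rough sketch of the sort-before-registering idea (untested; sortSidecarChecks is a hypothetical helper built on the public consul/api types, and sorting by Name is just one possible canonical order):

package main

import (
	"fmt"
	"sort"

	"github.com/hashicorp/consul/api"
)

// sortSidecarChecks orders sidecar checks by name so the registration order does
// not depend on whether Consul or Nomad built the slice.
func sortSidecarChecks(checks api.AgentServiceChecks) {
	sort.Slice(checks, func(i, j int) bool {
		return checks[i].Name < checks[j].Name
	})
}

func main() {
	// Nomad order: alias check first, TCP check second.
	checks := api.AgentServiceChecks{
		{Name: "Connect Sidecar Aliasing web", AliasService: "web"},
		{Name: "Connect Sidecar Listening", TCP: "10.0.0.1:20000", Interval: "10s"},
	}
	sortSidecarChecks(checks)
	for _, c := range checks {
		fmt.Println(c.Name) // same order regardless of which project built the slice
	}
}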


huikang commented Aug 15, 2023

@fredwangwang, thanks for reporting this and for the detailed analysis. We will look into this issue.

huikang self-assigned this Aug 15, 2023

huikang commented Aug 22, 2023

@fredwangwang , I think the proposed solution makes sense. Will you be able to make a PR?

In addition, I'd like to know more about your use case. If I understand correctly, to trigger this leak you seem to keep adding state.addCheckLocked(...) through the service register API, leading to the leak of the DeferCheck from the previous check with the same ID. If so, I am wondering why the check is being added repeatedly. Please correct me if I am wrong. Thanks.

@fredwangwang

Hi @huikang

Will you be able to make a PR?

No, too lazy to file a proper PR with tests. Think I've done my fair share hehe.

you seem to keep adding state.addCheckLocked(...) through the service register

and I wondered the same. We are using a pretty standard setup with Nomad that uses the Nomad and Consul integration. Almost all of our services are registered using the service block in Nomad jobs.

So there could be a bug in Nomad or Consul; that's for you guys to figure out :) TBH I am less interested in that, as frequent service/check registration, though annoying, should be idempotent and must NOT cause goroutine leaks and system instability.


huikang commented Aug 22, 2023

@fredwangwang, no problem. I will create a PR based on your investigation. Thanks.

@fredwangwang

thanks for the PR @huikang!

there is another related issue that needs some attention -- not as severe as the leaking, but it's something :) #18429 (comment)

@suikast42

Same here
