
kube-controller-manager memory leak #112319

Closed · xigang opened this issue Sep 8, 2022 · 27 comments
Labels: kind/bug, sig/api-machinery, triage/accepted

Comments

@xigang (Contributor) commented Sep 8, 2022

What happened?

Our cluster has 5,600 nodes, and kube-controller-manager memory usage reached 197.5 GiB. After restarting kube-controller-manager, memory usage dropped to only about 15 GiB. It looks like the controller manager's memory is leaking.

[screenshot: kube-controller-manager memory usage]

Issues reporting the same problem: #102718 and #102565

What did you expect to happen?

kube-controller-manager memory usage should stay at a stable level.

How can we reproduce it (as minimally and precisely as possible)?

NONE

Anything else we need to know?

NONE

Kubernetes version

Client Version: version.Info{
Major:"1", 
Minor:"17+", 
GitVersion:"v1.17.4", 
GitCommit:"f769ba94a8435eb3b446c5d39d7504823224a6f4", 
GitTreeState:"clean", 
BuildDate:"2020-06-22T02:50:15Z", 
GoVersion:"go1.14.2", 
Compiler:"gc", 
Platform:"linux/amd64"}

Cloud provider

NONE

OS version

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Install tools

NONE

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@xigang added the kind/bug label Sep 8, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Sep 8, 2022
@xigang (Contributor, Author) commented Sep 8, 2022

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Sep 8, 2022
@xigang (Contributor, Author) commented Sep 8, 2022

/cc @thockin @wojtek-t @deads2k

@xigang (Contributor, Author) commented Sep 8, 2022

The pmap command reports the memory map of the kube-controller-manager process:
controllermanager.txt (attached)
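
A report like the attachment can be produced roughly as follows (a sketch; the exact invocation used here is not shown in the issue, and running the binary directly on the host is an assumption):

# Extended per-mapping view; the RSS column makes large anonymous regions obvious.
pmap -x "$(pidof kube-controller-manager)" > controllermanager.txt

# Quick look at the biggest resident mappings.
sort -k3 -n -r controllermanager.txt | head -20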

@leilajal (Contributor) commented Sep 8, 2022

/cc @jpbetz
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Sep 8, 2022
@aojea (Member) commented Sep 8, 2022

> GitVersion:"v1.17.4",

That is an old version; you need to move to one of the supported versions.

A golang pprof profile should give you a better understanding of the problem.
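
For reference, a heap or CPU profile can be pulled roughly like this (a sketch, assuming --profiling=true, the default, and that the controller manager's debug endpoint is reachable from the master; 10252 was the insecure port on 1.17, while newer releases serve only the secure port 10257 and require a bearer token):

# Interactive in-use heap profile straight from the live process.
go tool pprof "http://127.0.0.1:10252/debug/pprof/heap"

# Save a 30-second CPU profile for later analysis.
curl -s -o cpu.pprof "http://127.0.0.1:10252/debug/pprof/profile?seconds=30"

In practice the heap and goroutine endpoints are cheap to scrape even on large clusters; only CPU profiling adds noticeable overhead, and only while it is running.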

@xigang (Contributor, Author) commented Sep 9, 2022

> GitVersion:"v1.17.4",
>
> That is an old version; you need to move to one of the supported versions.
>
> A golang pprof profile should give you a better understanding of the problem.

@aojea My understanding is that collecting a golang pprof profile would have some performance impact on kube-controller-manager in a 5k-node cluster. Is that right?

@xigang (Contributor, Author) commented Sep 13, 2022

> GitVersion:"v1.17.4",
>
> That is an old version; you need to move to one of the supported versions.
>
> A golang pprof profile should give you a better understanding of the problem.

@aojea Here is the kube-controller-manager profile, taken when its memory usage reached 250 GiB. I suspect there is a memory leak in the node-lifecycle-controller.

Memory used by the controller manager, as reported by the top command:

[screenshot: top output]

Flame graph:

[image: flame graph]

Heap profile:

[attachment: heap profile file "out"]
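
The attached heap profile can be inspected offline with go tool pprof, for example (a sketch; "out" refers to the attachment above):

# Largest in-use allocators, ranked by space currently held.
go tool pprof -inuse_space -top out

# Interactive flame graph and source view in a browser.
go tool pprof -inuse_space -http=:8080 out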

@kkkkun (Member) commented Oct 6, 2022

Could you provide a CPU profile? It may be useful, @xigang.

@xigang (Contributor, Author) commented Oct 7, 2022

> Could you provide a CPU profile? It may be useful, @xigang.

@kkkkun It has not reproduced recently. At the time, my guess was that it was caused by a goroutine leak, so I checked the goroutine pprof.

[image: goroutine profile (profile002)]
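
One cheap way to confirm or rule out a goroutine leak is to watch the total goroutine count over time (a sketch, with the same endpoint assumptions as the pprof example above):

# The first line of the debug=1 goroutine dump reads "goroutine profile: total N";
# a total that only ever grows under steady load points at a goroutine leak.
while true; do
  echo -n "$(date) "
  curl -s "http://127.0.0.1:10252/debug/pprof/goroutine?debug=1" | head -n 1
  sleep 60
done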

@sxllwx (Member) commented Oct 26, 2022

I don't think this issue is related to #102565 or #102718.

#102565 and #102718 reflect heavy memory use when performing List operations on objects.

From the kube-controller-manager pprof you provided, most of the memory is consumed in the Unmarshal of individual objects (*corev1.Node).

FYI: my guess is that a large number of Nodes in the cluster briefly jittered, producing a burst of Node watch events.

@sxllwx (Member) commented Oct 26, 2022

Emm... did this heavy memory use go on for a long time?

@xigang (Contributor, Author) commented Oct 26, 2022

> Did this heavy memory use go on for a long time?

The problem lasted for a few days; currently the average kube-controller-manager memory usage across the cluster is 32 GiB.

@sxllwx (Member) commented Oct 31, 2022

@xigang

According to the information provided so far (heap-pprof inuse_space):

We can draw the following conclusions:

  1. 2.77 GB of memory usage comes from List: #102718 (client-go memory leak) and #113305 (Reflector retains initial list result for a long time) report the reflector (client-go) memory usage problem when performing List operations.
  2. 135.14 GB of memory usage comes from Watch, of which 129.88 GB is used for Node.Unmarshal.

Compared with the 2.77 GB, I think the 135.14 GB of memory consumption is the bigger concern.

I noticed that the Go version used by your kube-controller-manager is go1.14.2. Go 1.16 and later changed the option used to return freed memory to the OS from MADV_FREE to MADV_DONTNEED; for the difference between these two flags, see golang/go#42894.

Suggestions:

  • (Highly recommended) Upgrade your Kubernetes version to pick up this change; you will also get better community support.
  • If upgrading really is not possible in the short term, you can try setting the environment variable GODEBUG=madvdontneed=1 to change go1.14.2's default memory-release behavior, then watch whether that resolves the problem (see the sketch below).
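
For example, with a kubeadm-style static pod the variable could be injected like this (a sketch; the manifest path, the metrics port, and the systemd unit name are assumptions about a typical deployment):

# Add GODEBUG=madvdontneed=1 to the kube-controller-manager container spec, e.g.:
#
#   env:
#   - name: GODEBUG
#     value: madvdontneed=1
#
# kubelet restarts the static pod automatically once the manifest changes.
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml

# If the controller manager runs as a systemd-managed binary instead, set
# Environment="GODEBUG=madvdontneed=1" in the unit's [Service] section, then:
sudo systemctl daemon-reload && sudo systemctl restart kube-controller-manager

# Afterwards, RSS should track the Go heap more closely; the runtime's own view
# is exposed via the standard Prometheus Go collector on /metrics:
curl -s http://127.0.0.1:10252/metrics | grep -E 'go_memstats_heap_(inuse|idle|released)_bytes'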

@xigang (Contributor, Author) commented Nov 2, 2022

@sxllwx Thank you very much. The memory usage of kube-controller-manager is normal now, but it is not clear what triggered the memory increase. The cluster's Kubernetes version may not be upgraded in the short term; if kube-controller-manager memory usage becomes abnormal again, we will try the second suggestion.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 31, 2023
@vaibhav2107 (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 15, 2023
@xigang (Contributor, Author) commented May 20, 2023

/close

@k8s-ci-robot (Contributor)

@xigang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lowang-bh (Member)

> @sxllwx Thank you very much. The memory usage of kube-controller-manager is normal now, but it is not clear what triggered the memory increase. The cluster's Kubernetes version may not be upgraded in the short term; if kube-controller-manager memory usage becomes abnormal again, we will try the second suggestion.

Hi @xigang, I am interested in this problem. What did you do before it became normal?

@xigang (Contributor, Author) commented Oct 25, 2023

> Hi @xigang, I am interested in this problem. What did you do before it became normal?

@lowang-bh The daemonset controller was processing Node events slowly, which triggered the OOM. We made some optimizations; see #121474.

@xigang (Contributor, Author) commented Oct 25, 2023

/reopen

@k8s-ci-robot (Contributor)

@xigang: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Oct 25, 2023
@xigang (Contributor, Author) commented Oct 25, 2023

cc @leilajal @aojea Please take a look. See: #121474

@sxllwx (Member) commented Oct 26, 2023

> If upgrading really is not possible in the short term, you can try setting the environment variable GODEBUG=madvdontneed=1 to change go1.14.2's default memory-release behavior, then watch whether that resolves the problem.

Have you adopted this advice yet?

@xigang (Contributor, Author) commented Oct 26, 2023

> Have you adopted this advice yet?

Not yet; those settings have not been adjusted.

@xigang (Contributor, Author) commented Jan 6, 2024

/close

@k8s-ci-robot (Contributor)

@xigang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
