
kube-controller-manager memory leak #112319

Closed · xigang opened this issue Sep 8, 2022 · 27 comments
Labels: kind/bug, sig/api-machinery, triage/accepted

Comments

@xigang (Contributor) commented Sep 8, 2022

What happened?

Our cluster has 5,600 nodes, and kube-controller-manager memory usage reached 197.5 GiB. After restarting kube-controller-manager, memory usage dropped to only about 15 GiB. It looks like the controller manager's memory is leaking.

[screenshot: kube-controller-manager memory usage]

Issues reporting the same problem: #102718 and #102565

What did you expect to happen?

kube-controller-manager memory usage should stay at a stable level.

How can we reproduce it (as minimally and precisely as possible)?

NONE

Anything else we need to know?

NONE

Kubernetes version

Client Version: version.Info{
Major:"1", 
Minor:"17+", 
GitVersion:"v1.17.4", 
GitCommit:"f769ba94a8435eb3b446c5d39d7504823224a6f4", 
GitTreeState:"clean", 
BuildDate:"2020-06-22T02:50:15Z", 
GoVersion:"go1.14.2", 
Compiler:"gc", 
Platform:"linux/amd64"}

Cloud provider

NONE

OS version

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Install tools

NONE

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@xigang added the kind/bug label Sep 8, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Sep 8, 2022
@xigang (Contributor, Author) commented Sep 8, 2022

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Sep 8, 2022
@xigang (Contributor, Author) commented Sep 8, 2022

/cc @thockin @wojtek-t @deads2k

@xigang (Contributor, Author) commented Sep 8, 2022

The pmap command reports the memory map of the kube-controller-manager process:
controllermanager.txt (attached)
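
A report like the attachment can be produced roughly as follows (a sketch; the exact invocation used here is not shown in the issue, and running the binary directly on the host is an assumption):

# Extended per-mapping view; the RSS column makes large anonymous regions obvious.
pmap -x "$(pidof kube-controller-manager)" > controllermanager.txt

# Quick look at the biggest resident mappings.
sort -k3 -n -r controllermanager.txt | head -20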

@leilajal (Contributor) commented Sep 8, 2022

/cc @jpbetz
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Sep 8, 2022
@aojea (Member) commented Sep 8, 2022

> GitVersion:"v1.17.4",

That is an old version; you need to move to one of the supported versions.

A golang pprof profile should give you a better understanding of the problem.
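
For reference, a heap or CPU profile can be pulled roughly like this (a sketch, assuming --profiling=true, the default, and that the controller manager's debug endpoint is reachable from the master; 10252 was the insecure port on 1.17, while newer releases serve only the secure port 10257 and require a bearer token):

# Interactive in-use heap profile straight from the live process.
go tool pprof "http://127.0.0.1:10252/debug/pprof/heap"

# Save a 30-second CPU profile for later analysis.
curl -s -o cpu.pprof "http://127.0.0.1:10252/debug/pprof/profile?seconds=30"

In practice the heap and goroutine endpoints are cheap to scrape even on large clusters; only CPU profiling adds noticeable overhead, and only while it is running.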

@xigang (Contributor, Author) commented Sep 9, 2022

> GitVersion:"v1.17.4",
>
> That is an old version; you need to move to one of the supported versions.
>
> A golang pprof profile should give you a better understanding of the problem.

@aojea My understanding is that collecting a golang pprof profile would have some performance impact on kube-controller-manager in a 5k-node cluster. Is that right?

@xigang (Contributor, Author) commented Sep 13, 2022

> GitVersion:"v1.17.4",
>
> That is an old version; you need to move to one of the supported versions.
>
> A golang pprof profile should give you a better understanding of the problem.

@aojea Here is the kube-controller-manager profile, taken when its memory usage reached 250 GiB. I suspect there is a memory leak in the node-lifecycle-controller.

Memory used by the controller manager, as reported by the top command:

[screenshot: top output]

Flame graph:

[image: flame graph]

Heap profile:

[attachment: heap profile file "out"]
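
The attached heap profile can be inspected offline with go tool pprof, for example (a sketch; "out" refers to the attachment above):

# Largest in-use allocators, ranked by space currently held.
go tool pprof -inuse_space -top out

# Interactive flame graph and source view in a browser.
go tool pprof -inuse_space -http=:8080 out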

@kkkkun (Member) commented Oct 6, 2022

Could you provide a CPU profile? It may be useful, @xigang.

@xigang (Contributor, Author) commented Oct 7, 2022

> Could you provide a CPU profile? It may be useful, @xigang.

@kkkkun It has not reproduced recently. At the time, my guess was that it was caused by a goroutine leak, so I checked the goroutine pprof.

[image: goroutine profile (profile002)]
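
One cheap way to confirm or rule out a goroutine leak is to watch the total goroutine count over time (a sketch, with the same endpoint assumptions as the pprof example above):

# The first line of the debug=1 goroutine dump reads "goroutine profile: total N";
# a total that only ever grows under steady load points at a goroutine leak.
while true; do
  echo -n "$(date) "
  curl -s "http://127.0.0.1:10252/debug/pprof/goroutine?debug=1" | head -n 1
  sleep 60
done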

@sxllwx (Member) commented Oct 26, 2022

I don't think this issue is related to #102565 or #102718.

#102565 and #102718 reflect heavy memory use when performing List operations on objects.

From the kube-controller-manager pprof you provided, most of the memory is consumed in the Unmarshal of individual objects (*corev1.Node).

FYI: my guess is that a large number of Nodes in the cluster briefly jittered, producing a burst of Node watch events.

@sxllwx (Member) commented Oct 26, 2022

Emm... did this heavy memory use go on for a long time?

@xigang (Contributor, Author) commented Oct 26, 2022

> Did this heavy memory use go on for a long time?

The problem lasted for a few days; currently the average kube-controller-manager memory usage across the cluster is 32 GiB.

@sxllwx (Member) commented Oct 31, 2022

@xigang

According to the information provided so far (heap-pprof inuse_space):

We can draw the following conclusions:

  1. 2.77 GB of memory usage comes from List: #102718 (client-go memory leak) and #113305 (Reflector retains initial list result for a long time) report the reflector (client-go) memory usage problem when performing List operations.
  2. 135.14 GB of memory usage comes from Watch, of which 129.88 GB is used for Node.Unmarshal.

Compared with the 2.77 GB, I think the 135.14 GB of memory consumption is the bigger concern.

I noticed that the Go version used by your kube-controller-manager is go1.14.2. Go 1.16 and later changed the option used to return freed memory to the OS from MADV_FREE to MADV_DONTNEED; for the difference between these two flags, see golang/go#42894.

Suggestions:

  • (Highly recommended) Upgrade your Kubernetes version to pick up this change; you will also get better community support.
  • If upgrading really is not possible in the short term, you can try setting the environment variable GODEBUG=madvdontneed=1 to change go1.14.2's default memory-release behavior, then watch whether that resolves the problem (see the sketch below).
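
For example, with a kubeadm-style static pod the variable could be injected like this (a sketch; the manifest path, the metrics port, and the systemd unit name are assumptions about a typical deployment):

# Add GODEBUG=madvdontneed=1 to the kube-controller-manager container spec, e.g.:
#
#   env:
#   - name: GODEBUG
#     value: madvdontneed=1
#
# kubelet restarts the static pod automatically once the manifest changes.
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml

# If the controller manager runs as a systemd-managed binary instead, set
# Environment="GODEBUG=madvdontneed=1" in the unit's [Service] section, then:
sudo systemctl daemon-reload && sudo systemctl restart kube-controller-manager

# Afterwards, RSS should track the Go heap more closely; the runtime's own view
# is exposed via the standard Prometheus Go collector on /metrics:
curl -s http://127.0.0.1:10252/metrics | grep -E 'go_memstats_heap_(inuse|idle|released)_bytes'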

@xigang (Contributor, Author) commented Nov 2, 2022

@sxllwx Thank you very much. The memory usage of kube-controller-manager is normal now, but it is not clear what triggered the memory increase. The cluster's Kubernetes version may not be upgraded in the short term; if kube-controller-manager memory usage becomes abnormal again, we will try the second suggestion.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 31, 2023
@vaibhav2107 (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 15, 2023
@xigang (Contributor, Author) commented May 20, 2023

/close

@k8s-ci-robot (Contributor)

@xigang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lowang-bh (Member)

> @sxllwx Thank you very much. The memory usage of kube-controller-manager is normal now, but it is not clear what triggered the memory increase. The cluster's Kubernetes version may not be upgraded in the short term; if kube-controller-manager memory usage becomes abnormal again, we will try the second suggestion.

Hi @xigang, I am interested in this problem. What did you do before it became normal?

@xigang (Contributor, Author) commented Oct 25, 2023

> Hi @xigang, I am interested in this problem. What did you do before it became normal?

@lowang-bh The daemonset controller was processing Node events slowly, which triggered the OOM. We made some optimizations; see #121474.

@xigang (Contributor, Author) commented Oct 25, 2023

/reopen

@k8s-ci-robot (Contributor)

@xigang: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Oct 25, 2023
@xigang (Contributor, Author) commented Oct 25, 2023

cc @leilajal @aojea Please take a look. See: #121474

@sxllwx (Member) commented Oct 26, 2023

> If upgrading really is not possible in the short term, you can try setting the environment variable GODEBUG=madvdontneed=1 to change go1.14.2's default memory-release behavior, then watch whether that resolves the problem.

Have you adopted this advice yet?

@xigang (Contributor, Author) commented Oct 26, 2023

> Have you adopted this advice yet?

Not yet; those settings have not been adjusted.

@xigang (Contributor, Author) commented Jan 6, 2024

/close

@k8s-ci-robot (Contributor)

@xigang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
