
Possible memory leak in API server #8383

Closed
RemcodM opened this issue Jan 2, 2024 · 10 comments

@RemcodM

RemcodM commented Jan 2, 2024

Over Christmas and the new year, we saw high memory pressure in our Kubernetes cluster. After some research, we figured out that calico-apiserver was eating away most of the memory on the affected nodes. Based on our memory metrics, we believe there is a memory leak in calico-apiserver.

[Screenshot 2024-01-02 11:44: calico-apiserver memory usage metrics]

We upgraded to Calico 3.27.0

Expected Behavior

We upgraded to Calico 3.27.0 just before Christmas. Before the upgrade we were using Calico 3.25.x and we did not see this kind of RAM usage over time. We expect Calico 3.27.0 to behave in the same fashion.

Current Behavior

As is visible from the metrics shared above, all calico-apiserver replicas across multiple clusters eat more and more RAM over time. We see somewhat similar growth in CPU usage over time, but we suspect that is because the Go garbage collector has to perform more work over time.

Possible Solution

We have not yet figured out the cause of the memory leak.

As a first measure, we mitigated the problem by setting stricter memory limits for calico-apiserver. That causes calico-apiserver to be OOMKilled periodically, but it prevents our nodes from running out of memory.
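Roughly, such a limit can be set on the operator's APIServer resource rather than on the Deployment itself, since the operator manages the Deployment and reconciles manual edits away. A minimal sketch of that kind of override; the apiServerDeployment field and the 256Mi/512Mi values are assumptions, not recommendations, so check the APIServer CRD for your operator version:

  # Sketch: bump memory requests/limits on the calico-apiserver container via
  # the operator-managed APIServer resource (values are placeholders).
  kubectl patch apiservers.operator.tigera.io default --type=merge -p '
    {"spec": {"apiServerDeployment": {"spec": {"template": {"spec": {"containers": [
      {"name": "calico-apiserver",
       "resources": {"requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}}}
    ]}}}}}}'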

Steps to Reproduce (for bugs)

  1. Deploy calico-apiserver
  2. Watch the memory usage of the pod grow over time (see the monitoring sketch below).
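If metrics-server is available, the growth is easy to watch directly. A small sketch; the calico-apiserver namespace matches an operator-based install:

  # Sample calico-apiserver memory every 5 minutes; on an affected version the
  # working-set numbers climb steadily instead of levelling off.
  while true; do
    date
    kubectl top pods -n calico-apiserver --containers
    sleep 300
  done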

Context

Our Calico API server is deployed through the Tigera Operator. I do not think that is relevant to the problem, but I wanted to mention it anyway.

Your Environment

  • Calico version: 3.27.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes (EKS) 1.28
  • Operating System and version: Bottlerocket OS 1.17.0 (aws-k8s-1.28)
    • Containerd: 1.6.25+bottlerocket
    • Kubelet: v1.28.4-eks-d91a302
@lepicodon

Hi,

Same here with Kubernetes 1.28.5 and Calico 3.27.0 on Rocky Linux 9.3.
I have been observing this issue since my last update:

[Screenshot: memory usage graph]

@gareji

gareji commented Jan 5, 2024

Hi, same here with Kubernetes v1.28.4 installed with kubeadm and Calico 3.27.0 installed with the Helm tigera-operator chart, all default settings.

calico-apiserver calico-apiserver-6b4865847c-l792r 2m 5292Mi
calico-apiserver calico-apiserver-6b4865847c-r4bq8 3m 5292Mi

@tiredpixel

Same here for one of my clients running Kubernetes 1.28.4 clusters with Calico 3.27.0 and Containerd 1.6.26 on Ubuntu 22.04.3.

Those clusters, too, were updated just before Christmas, and now nodes need restarting every few days. Calico was installed and upgraded via Helm.

The issue did not occur on Kubernetes 1.27.3 clusters with Calico 3.26.1 and Containerd 1.6.21 (Kubernetes was upgraded via 1.27.8 at the same time as the Calico and Containerd upgrades).

@hjiawei
Contributor

hjiawei commented Jan 8, 2024

We believe this issue is related to the APIServerTracing feature gate being turned on by default in k8s 1.27. In the dependent k8s apiserver library, when this feature is on, otelgrpc options are added as part of the tracingOpts. These options are passed into newETCD3Client, which is instantiated every 2 seconds (by default) within newETCD3Prober for etcd server health probing.

Unfortunately, even with the noop TracerProvider, we noticed growing memory usage in the OpenTelemetry internal int64/float64 histograms. As an extension apiserver we do not configure external etcd servers, so the health check is disabled in #8394 (along with some other unused profiling and metrics). We have been running this fix over the weekend and it seems to stabilize the memory usage.

@caseydavenport
Member

Thanks @hjiawei!

For anyone watching, this PR includes the fix for v3.28: #8394
This cherry-pick adds the fix to v3.27.1 (date TBD): #8396

@ankitabhopatkar13

Thanks for the fix!

Has the date for releasing this been finalised? Our nodes are running out of memory. We have limited the resources for Calico, but it would be nice to get this fix deployed soon.

@BroderPeters

Same here.
From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?

@gareji

gareji commented Feb 1, 2024

> Same here. From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?

I have downgraded back to 3.26 without any problem and it resolved the memory issue. I have an operator installation.
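For an operator installation managed with Helm, the rollback is essentially a helm upgrade pinned to the older chart version. A sketch only: the release name, repo alias, namespace, and exact 3.26 patch version below are assumptions and should match whatever was originally deployed, and since downgrades are not something the Calico docs really cover, it is worth trying on a non-critical cluster first:

  # Sketch: pin the tigera-operator chart back to a 3.26 release.
  helm repo update
  helm upgrade calico projectcalico/tigera-operator \
    --version v3.26.4 \
    --namespace tigera-operator \
    --reuse-values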

@BroderPeters

> Same here. From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?
>
> I have downgraded back to 3.26 without any problem and it resolved the memory issue. I have an operator installation.

Thanks for the fast reply! I will give it a try if no fix release is expected soon.

@fasaxc
Member

fasaxc commented Feb 20, 2024

The fix for this is now released in v3.27.2 (v3.27.1 was never released publicly due to a build issue). Please open a new issue if you can still hit it on v3.27.2.
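For anyone checking whether the fixed build has actually rolled out to a cluster, the running image tags are a quick sanity check; a sketch, assuming the calico-apiserver namespace created by the operator:

  # List the calico-apiserver pods and their image tags; expect v3.27.2 or later.
  kubectl get pods -n calico-apiserver \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'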
