
Possible memory leak in API server #8383

Closed
RemcodM opened this issue Jan 2, 2024 · 10 comments

@RemcodM

RemcodM commented Jan 2, 2024

Over Christmas and the new year, we saw high memory pressure in our Kubernetes cluster. After some research, we figured out that calico-apiserver was eating away most of the memory on the affected nodes. Based on our memory metrics, we believe there is a memory leak in calico-apiserver.

[Screenshot 2024-01-02 11:44: calico-apiserver memory usage metrics]

We upgraded to Calico 3.27.0

Expected Behavior

We upgraded to Calico 3.27.0 just before Christmas. Before the upgrade we were using Calico 3.25.x and we did not see this kind of RAM usage over time. We expect Calico 3.27.0 to behave in the same fashion.

Current Behavior

As is visible from the metrics shared above, all calico-apiserver replicas across multiple clusters eat more and more RAM over time. We see somewhat similar growth in CPU usage over time, but we suspect that is because the Go garbage collector has to perform more work over time.

Possible Solution

We have not yet figured out the cause of the memory leak.

As a first measure, we mitigated the problem by setting stricter memory limits for calico-apiserver. That causes calico-apiserver to be OOMKilled periodically, but it prevents our nodes from running out of memory.
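Roughly, such a limit can be set on the operator's APIServer resource rather than on the Deployment itself, since the operator manages the Deployment and reconciles manual edits away. A minimal sketch of that kind of override; the apiServerDeployment field and the 256Mi/512Mi values are assumptions, not recommendations, so check the APIServer CRD for your operator version:

  # Sketch: bump memory requests/limits on the calico-apiserver container via
  # the operator-managed APIServer resource (values are placeholders).
  kubectl patch apiservers.operator.tigera.io default --type=merge -p '
    {"spec": {"apiServerDeployment": {"spec": {"template": {"spec": {"containers": [
      {"name": "calico-apiserver",
       "resources": {"requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}}}
    ]}}}}}}'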

Steps to Reproduce (for bugs)

  1. Deploy calico-apiserver
  2. Watch the memory usage of the pod grow over time (see the monitoring sketch below).
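If metrics-server is available, the growth is easy to watch directly. A small sketch; the calico-apiserver namespace matches an operator-based install:

  # Sample calico-apiserver memory every 5 minutes; on an affected version the
  # working-set numbers climb steadily instead of levelling off.
  while true; do
    date
    kubectl top pods -n calico-apiserver --containers
    sleep 300
  done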

Context

Our Calico API server is deployed through the Tigera Operator. I do not think that is relevant to the problem, but I wanted to mention it anyway.

Your Environment

  • Calico version: 3.27.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes (EKS) 1.28
  • Operating System and version: Bottlerocket OS 1.17.0 (aws-k8s-1.28)
    • Containerd: 1.6.25+bottlerocket
    • Kubelet: v1.28.4-eks-d91a302
@lepicodon

Hi,

Same here with Kubernetes 1.28.5 and Calico 3.27.0 on Rocky Linux 9.3.
I have been observing this issue since my last update:

[Screenshot: memory usage graph]

@gareji

gareji commented Jan 5, 2024

Hi, same here with Kubernetes v1.28.4 installed with kubeadm and Calico 3.27.0 installed with the Helm tigera-operator chart, all default settings.

calico-apiserver calico-apiserver-6b4865847c-l792r 2m 5292Mi
calico-apiserver calico-apiserver-6b4865847c-r4bq8 3m 5292Mi

@tiredpixel

Same here for one of my clients running Kubernetes 1.28.4 clusters with Calico 3.27.0 and Containerd 1.6.26 on Ubuntu 22.04.3.

Those clusters, too, were updated just before Christmas, and now nodes need restarting every few days. Calico was installed and upgraded via Helm.

The issue did not occur on Kubernetes 1.27.3 clusters with Calico 3.26.1 and Containerd 1.6.21 (Kubernetes was upgraded via 1.27.8 at the same time as the Calico and Containerd upgrades).

@hjiawei
Contributor

hjiawei commented Jan 8, 2024

We believe this issue is related to the APIServerTracing feature gate being turned on by default in k8s 1.27. In the dependent k8s apiserver library, when this feature is on, otelgrpc options are added as part of the tracingOpts. These options are passed into newETCD3Client, which is instantiated every 2 seconds (by default) within newETCD3Prober for etcd server health probing.

Unfortunately, even with the noop TracerProvider, we noticed growing memory usage in the OpenTelemetry internal int64/float64 histograms. As an extension apiserver we do not configure external etcd servers, so the health check is disabled in #8394 (along with some other unused profiling and metrics). We have been running this fix over the weekend and it seems to stabilize the memory usage.

@caseydavenport
Member

Thanks @hjiawei!

For anyone watching, this PR includes the fix for v3.28: #8394
This cherry-pick adds the fix to v3.27.1 (date TBD): #8396

@ankitabhopatkar13

Thanks for the fix!

Has the date for releasing this been finalised? Our nodes are running out of memory. We have limited the resources for Calico, but it would be nice to get this fix deployed soon.

@BroderPeters

Same here.
From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?

@gareji

gareji commented Feb 1, 2024

> Same here. From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?

I have downgraded back to 3.26 without any problem and it resolved the memory issue. I have an operator installation.
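For an operator installation managed with Helm, the rollback is essentially a helm upgrade pinned to the older chart version. A sketch only: the release name, repo alias, namespace, and exact 3.26 patch version below are assumptions and should match whatever was originally deployed, and since downgrades are not something the Calico docs really cover, it is worth trying on a non-critical cluster first:

  # Sketch: pin the tigera-operator chart back to a 3.26 release.
  helm repo update
  helm upgrade calico projectcalico/tigera-operator \
    --version v3.26.4 \
    --namespace tigera-operator \
    --reuse-values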

@BroderPeters

> Same here. From the issue description, I assume a downgrade would work around the issue as well (if downgrading is even possible that easily here; I haven't really worked that deeply with Calico yet). Did anyone try that?
>
> I have downgraded back to 3.26 without any problem and it resolved the memory issue. I have an operator installation.

Thanks for the fast reply! I will give it a try if no fix release is expected soon.

@fasaxc
Member

fasaxc commented Feb 20, 2024

The fix for this is now released in v3.27.2 (v3.27.1 was never released publicly due to a build issue). Please open a new issue if you can still hit it on v3.27.2.
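For anyone checking whether the fixed build has actually rolled out to a cluster, the running image tags are a quick sanity check; a sketch, assuming the calico-apiserver namespace created by the operator:

  # List the calico-apiserver pods and their image tags; expect v3.27.2 or later.
  kubectl get pods -n calico-apiserver \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'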
