OpenTelemetry Collector keeps crashing with OOMKilled status #4010
-
When the pod OOMs, you can look at the event to see which cgroup limit was hit. Requests don't matter at all for OOM; are you setting container limits on the collector container? Unless you have implemented some form of target sharding (like we are working on in open-telemetry/prometheus-interoperability-spec#60), each replica of your collector scrapes all of the targets in your config, so adding more replicas won't decrease memory usage unless you also split your scrape configs in some way. Without dynamic target sharding, you have to split them manually, for example by giving each collector deployment its own subset of scrape jobs (a rough sketch follows).
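For example, here is a rough sketch of that kind of manual split, assuming two separate collector deployments, each with its own config file; the file names are made up and the per-job scrape settings are your existing ones, elided here:

```yaml
# otel-collector-heavy.yaml (illustrative name): deployment A scrapes only the heavy jobs
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics   # your existing KSM scrape config, unchanged
        - job_name: cadvisor             # your existing cAdvisor scrape config, unchanged
# processors/exporters/service sections stay the same as today
---
# otel-collector-light.yaml (illustrative name): deployment B scrapes the rest
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: istiod
        - job_name: envoy-stats
        - job_name: skywalking
# processors/exporters/service sections stay the same as today
```

Each deployment then only pays the memory cost of its own targets.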
Both KSM and cAdvisor produce a lot of metrics. If there are any you don't need, dropping them with metric_relabel_configs (action: drop) can reduce memory usage significantly. In my experience, the memory_limiter processor is a very effective tool for replacing OOM, which is a complete failure, with dropping metrics, which is only a partial failure. I'd recommend using it even after you mitigate this problem, so that when you run out of memory in the future the collector just drops some metrics instead of falling over.
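A minimal sketch of both suggestions combined in one collector config; the limit values, the dropped-metric regex, and the logging exporter are illustrative placeholders to adapt, not recommendations:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics        # your existing job, otherwise unchanged
          metric_relabel_configs:
            # Drop series you never query; this regex is only an example.
            - source_labels: [__name__]
              regex: "kube_pod_status_.*"
              action: drop

processors:
  memory_limiter:
    # Illustrative values; tune them to sit below the container's memory limit.
    check_interval: 1s
    limit_mib: 700
    spike_limit_mib: 150

exporters:
  logging: {}          # stand-in; use your real exporter here

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]   # memory_limiter should be first in the chain
      exporters: [logging]
```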
-
I have an OpenTelemetry Collector set up to scrape 5 endpoints:
- skywalking metrics
- kubernetes cadvisor
- kube-state-metrics
- istiod
- envoy-stats
I found that the OTel Collector pod keeps crashing with OOMKilled status, i.e. an out-of-memory issue.
So I figured either there is too much for it to collect or it needs more memory.
So I tried the following:
Increased the replicas - I thought more pods to process the data should help.
OUTCOME: they all now crash with the same OOMKilled status.
Increased their memory request - I increased this from 400Mi to 800Mi (see the resources sketch further down).
OUTCOME: nothing changed; they all still crash with OOMKilled.
Checking the node resources with `kubectl top nodes` while the pods are crashing shows the node at approximately 60% memory, so it has not overshot the memory on the actual machine.
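For reference, this is roughly what the collector container's resources block looks like after that change; the memory limit value below is only illustrative:

```yaml
# Excerpt from the collector Deployment spec; the limit value is illustrative.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector:0.29.0
    resources:
      requests:
        memory: 800Mi   # raised from 400Mi
      limits:
        memory: 800Mi   # illustrative; the OOM kill is enforced against the limit, not the request
```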
Question: Have you seen this issue before, and do you have any advice on how to handle it?
Do I have the collector set up with too many jobs, and should I maybe separate them?
I am using the following image on the k8s cluster:
image: otel/opentelemetry-collector:0.29.0
Thanks