Metrics memory leak v1.12.0/v0.35.0 and up #3765
What telemetry are you generating over these time periods? What instruments are you using and what attributes are you using?
Every 5s a health/live check is performed, which uses an int64 counter (+1) and an int64 histogram (+duration). Each function call does the same, with differently named counters/histograms. The non-health/live check functions add a few constant attributes, such as the package name, but those functions are not currently called frequently - perhaps once a day. The health/live check function, which is the one contributing 99.9% of metrics, adds just the request path (either /livez or /healthz).
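For context, the instrumentation described above corresponds roughly to the sketch below. The package, instrument, and attribute names are illustrative, and the current metric API is used rather than the exact v0.35.0 signatures.

```go
package health

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var (
	meter = otel.Meter("health")

	// One counter and one histogram per check, both with low-cardinality attributes.
	checkCount, _    = meter.Int64Counter("health.check.count")
	checkDuration, _ = meter.Int64Histogram("health.check.duration")
)

// recordCheck runs every 5s; path is either "/livez" or "/healthz".
func recordCheck(ctx context.Context, path string, start time.Time, ok bool) {
	attrs := metric.WithAttributes(
		attribute.String("path", path),
		attribute.Bool("success", ok),
	)
	checkCount.Add(ctx, 1, attrs)
	checkDuration.Record(ctx, time.Since(start).Milliseconds(), attrs)
}
```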
Are you able to obtain and share pprof heap dumps from the process when memory use is elevated?
I'll need a few days to find some time to enable pprof, update the package version, and capture the heap dumps - stay tuned.
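For reference, one common way to expose pprof from a Go service is a sketch like the following; the address and port are arbitrary choices, not part of the reporter's setup.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Heap dumps can then be captured with, for example:
	//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```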
Over the course of some 6 hours I took 5 heap dumps using pprof; see the attachment. In those 6 hours only the live and health checks were called, every 5 seconds, which resulted in ~4 counters and ~4 histograms being updated with constant attributes (namely, a string containing the suffix of the URL - either /livez or /healthz - and a boolean indicating whether the call was a success - which it always was). It seems there are some functions worth looking at: prometheus.MakeLabelPairs and attribute.computeDistinctFixed.
I deployed a fresh instance and will let it run until Monday so I can collect a new sample to give you even more to work with. If there are other pprof extracts you're interested in let me know, and I'll see what I can do.
If you could get data with
My initial thoughts
I'm interested to see what the
I called the tool with
Another update:
This seems to indicate that the memory use is coming from the Prometheus client used by the Prometheus exporter. I'm not sure how much we can do to alleviate that, given we do not control that code. We should double-check that we are supporting the optimal label-creation path as best we can. @dashpole is there anything that seems obvious to you?
Are you able to provide the output of curling the Prometheus endpoint? Reading the implementation, I don't see anything that would obviously be the culprit here.
I think we're getting close to the solution. Curling the Prometheus endpoint shows the "http_server" metrics repeated across many blocks, and the main difference between each repeated block is in the attributes.
What is your current suggestion in my situation? Hold off on updating until #3744 is fixed, or do I need to change something on my end?
You could create a view for the instrument that filters out the attribute. Otherwise, yeah, I think the fix to #3744 is in order to fully remove the high-cardinality attribute.
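A minimal sketch of such a view, assuming a recent version of the SDK; the instrument name and attribute key below are placeholders, not necessarily the ones involved here.

```go
package telemetry

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newFilteredView drops one attribute key from a single instrument while
// keeping all other attributes intact.
func newFilteredView() sdkmetric.View {
	return sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.duration"},
		sdkmetric.Stream{
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "net.sock.peer.port" // drop this key, keep the rest
			},
		},
	)
}
```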
@bthomee Would it be possible for you to provide the setup code in your service that creates the exporter and meter provider?
The following extract shows how we set up the exporter and meter provider:
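(The verbatim extract is not reproduced here; the sketch below shows a comparable setup, not the reporter's actual code, and reuses the placeholder view from the earlier sketch.)

```go
package telemetry

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider wires the Prometheus exporter into an SDK meter provider
// and registers the attribute-filtering view.
func newMeterProvider() (*sdkmetric.MeterProvider, error) {
	exporter, err := prometheus.New()
	if err != nil {
		return nil, err
	}
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(exporter),
		sdkmetric.WithView(newFilteredView()),
	)
	otel.SetMeterProvider(provider)
	return provider, nil
}
```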
The new addition we made, based on @MrAlias' suggestion, is adding that filtered view, which has worked like a charm. |
Was this resolved in 1.17.0? |
Description
After upgrading from v1.11.2/v0.34.0 to v1.12.0/v0.35.0, our pods showed a never-ending increase in memory consumption over time. Reverting to the former version brought memory use back to a stable level. Once v1.13.0/v0.36.0 was released I tried upgrading to that version, but got the same result. Once again, reverting to v1.11.2/v0.34.0 stopped the memory increase.
Refer to the screenshot below, which shows the memory use of one of our pods on AWS EKS.
Environment
Steps To Reproduce
Upgrade from v1.11.2/v0.34.0 to v1.12.0/v0.35.0 or later, and change the creation of instruments from syncint64 etc. to their corresponding instrument versions (sketched below).
Expected behavior
Memory stays stable.
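The instrument-creation change mentioned in the steps above maps roughly onto the following sketch; the instrument names are illustrative (matching the earlier sketch), and the new form uses the current metric API.

```go
package health

import "go.opentelemetry.io/otel/metric"

// newInstruments sketches the migration: before v0.35.0, instruments came from
// per-kind providers, e.g. meter.SyncInt64().Counter("health.check.count");
// from v0.35.0 on they are created directly on the Meter.
func newInstruments(meter metric.Meter) (metric.Int64Counter, metric.Int64Histogram, error) {
	counter, err := meter.Int64Counter("health.check.count")
	if err != nil {
		return nil, nil, err
	}
	histogram, err := meter.Int64Histogram("health.check.duration")
	if err != nil {
		return nil, nil, err
	}
	return counter, histogram, nil
}
```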