[exporter/datadog] Memory leak in trace stats module #30828
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Thanks for reporting. Can you please test this with the collector after setting trace_buffer?
Could you also please share the full collector config?
Our full collector config is massive. Here is the config for the Datadog exporter:

```yaml
datadog:
  metrics:
    resource_attributes_as_tags: true
    instrumentation_scope_metadata_as_tags: true
    summaries:
      mode: noquantiles
  traces:
    compute_stats_by_span_kind: false
    peer_tags_aggregation: false
    # Buffer traces to avoid "Payload in channel full" errors
    trace_buffer: 10
  host_metadata:
    enabled: false
  sending_queue:
    queue_size: 200
```
@sirianni 👋 QQ: what version of the Datadog Agent are you using?
We are currently running Datadog Agent
Thanks @dineshg13 - I will experiment with the change. Can you help me understand how increasing trace_buffer is expected to help here?
I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits, so it's possible it's exhibiting a similar issue and we just haven't noticed.
@dineshg13 setting
@sirianni Are you using the Datadog connector? It would be helpful if you could share the full collector config.
Yes we are. The full config is quite complicated because we have applications that dual-write traces to both our OTel Collector and the DD Agent. We are doing this to allow for a graceful transition from the DD Agent to the OTel Collector. What we are trying to accomplish for OpenCensus is:
Effectively we're doing something like this:

```yaml
pipelines:
  traces/opencensus:
    receivers:
      - opencensus
    exporters:
      - datadog/trace_stats_connector
  traces/otel:
    receivers:
      - otlp
    exporters:
      - datadog
  metrics:
    receivers:
      - datadog/trace_stats_connector
    exporters:
      - datadog
```

I noticed that the
@sirianni Thanks for the details. We have fixed the datadogconnector to support compute_stats_by_span_kind and peer_tags_aggregation. It should be available in the next release. Are you using the feature gate in the connector? We should enable it. Do you expect to compute stats in your traces/otel pipeline?
Yes, but these are computed by the
No, but the feature gate looks to make things more efficient by encoding metrics as raw bytes. To be clear, the symptoms here look like a memory leak and not just "high usage". So while enabling the feature gate may make it take longer before the heap is exhausted, it would not seem to fix the root cause of the leak itself. Also, in our case the heap was consumed by
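For reference, collector feature gates are enabled with the `--feature-gates` startup flag. Below is a minimal sketch of how that might look in a Kubernetes container spec; the gate name `connector.datadogconnector.performance` is an assumption based on the "raw bytes" description above and is not confirmed anywhere in this thread.

```yaml
# Illustrative container spec only; the gate name is an assumption.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.90.1
    args:
      - --config=/etc/otelcol/config.yaml
      - --feature-gates=connector.datadogconnector.performance
```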
**Description:** Datadog Connector is creating two instances of the Trace Agent: one in the traces-to-metrics pipeline and another in the traces-to-traces pipeline. This PR separates the traces-to-traces connector, simplifying the logic and avoiding unnecessary serialization.
**Link to tracking Issue:** #30828 #30487
@dineshg13 Are there fixes other than #31026 identified for this issue?
@kamal-narayan, for now this is the only fix that has been shipped. We are still load testing this PR, and will report back here once we have more results.
**Connector NOT in trace-to-trace pipeline**

@sirianni the stats code (
We've load tested the collector with multiple cardinalities, but were not able to reproduce the memory leak. The two noteworthy points of the load tests were:

During the slow growth in heap usage from the screenshot, were there any new hosts being spun up sending data to that collector? I'm trying to understand if there was an increase in cardinality during that time period, which would lead to the stats module using more memory. In the graphs that you shared, the collector memory seems to stabilise at some point. At that point, did the cardinality being sent to this collector reach its peak?

Are you able to run a test with the collector by giving it more memory? I'm interested in whether the memory stabilises. If you experience symptoms of a memory leak, would you be able to generate new profiles and also output traces via the file exporter? This would allow us to run our load tests using the traces that you shared to try to reproduce the memory leak.

**Connector IN trace-to-trace pipeline**

In this case, we were able to identify and reproduce a memory leak when the connector is used in the trace-to-trace pipeline. This was fixed in the following PR, which will be part of the next collector release.
@sirianni Based on the config you shared, you are not using the connector in a trace-to-trace pipeline, so you should not be affected by this memory leak.
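To make the distinction above concrete, here is a minimal sketch of the two pipeline shapes, assuming a connector instance named `datadog/connector` (the names are illustrative, not the reporter's actual config). Only the second shape, where the connector also bridges traces to traces, was affected by the leak fixed in the PR.

```yaml
# Sketch only; component names are illustrative.
connectors:
  datadog/connector:

# Shape 1: connector used only to derive metrics from traces (traces -> metrics).
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog, datadog/connector]
    metrics:
      receivers: [datadog/connector]
      exporters: [datadog]

# Shape 2: the connector also bridges traces -> traces (the case that leaked):
#   traces/in:  receivers: [otlp]               exporters: [datadog/connector]
#   traces/out: receivers: [datadog/connector]  exporters: [datadog]
#   metrics:    receivers: [datadog/connector]  exporters: [datadog]
```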
Still happening here too 😢 |
@diogotorres97 please see: #30908 (comment)
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Component(s)
exporter/datadog
What happened?
Description
We are observing a memory leak in the Datadog Exporter which appears to be in the trace/apm stats component.
The pattern we see is that when the collector restarts, memory grows steadily until GOMEMLIMIT is reached. We consider this a "leak" rather than a large working set because of the slow growth in heap usage.
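For context, GOMEMLIMIT is the Go runtime's soft memory limit, typically passed to the collector as an environment variable. A minimal sketch of how it might be set in a Kubernetes pod spec follows; the values are illustrative and not taken from this report.

```yaml
# Illustrative only: GOMEMLIMIT set somewhat below the container memory limit
# so the Go GC becomes aggressive before the kubelet OOM-kills the pod.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.90.1
    env:
      - name: GOMEMLIMIT
        value: "1600MiB"
    resources:
      limits:
        memory: 2Gi
```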
Looking at a pprof of the collector process, we see that the majority of heap is used by the datadog-agent/trace/stats module. We have already disabled the compute_stats_by_span_kind and peer_tags_aggregation flags in the Datadog Exporter.
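For anyone wanting to reproduce this analysis, here is a minimal sketch of capturing a heap profile from the collector using the contrib pprof extension; the endpoint shown is assumed to be the extension's default.

```yaml
extensions:
  pprof:
    endpoint: localhost:1777   # assumed default pprof endpoint

service:
  extensions: [pprof]

# A heap profile can then be pulled with, for example:
#   go tool pprof http://localhost:1777/debug/pprof/heap
```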
Collector version
v0.90.1
Additional context
Seems similar to #15720
When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.