Possible "memory-leak" in KafkaStreamsMetrics #2843
Comments
Hi, I wonder about the status of this issue - did you manage to figure out the root cause or a way the issue could be solved? Any help needed? :)
@JorgenRingen Thank you very much for catching and reporting this, and special thanks for the detailed description and the reproducer. I was able to repro this using the reproducer, and I was able to write a test that also reproduces the issue, see: 32e2e76
@jonatan-ivanov: Thanks for the update. Tested 1.7.6, but unfortunately we see the same problem after the application has been running for some time and after there have been some broker restarts and rebalances. Haven't had time to look much into it yet. It seems like the registeredMeters list has a lot of references to ImmutableTag:
@JorgenRingen I'm checking this but I haven't been able to repro it (I will continue investigating next week). What I found so far:
Could you please double-check that you are using Spring Boot 2.5.7 and Micrometer 1.7.6?
I was able to locate and fix why the scheduler stops; I'm going to merge the fix next week.
Great. It definitely was 1.7.6; a little bit unsure about 2.5.7, unfortunately. A little bit weird, but broker restarts actually triggered the memory increase. We were testing rolling restarts of brokers, and the 1.7.6 apps OOM'ed or spiked in memory usage. Don't know if that's a clue; it's hard to repro as broker restarts don't happen very frequently. Broker restarts also cause a lot of rebalances. Will test the fix next week! :)
@JorgenRingen I fixed the scheduler issue (you can try
By restarting Kafka, I did not see any increase; by restarting the other app I can see some increase, but meters also get removed and after a point the increase stops (also, it does not really have a visible impact on the heap).
Hi, sorry for the late reply. Will test the latest version. We upgraded to Spring Boot 2.6.x, so I guess the fix is in Micrometer 1.8.x as well. It's hard to reproduce locally as mentioned; this happened during a lot of broker restarts. Haven't been able to repro locally, unfortunately.
Spring Boot 2.6.1 is the latest; it should bring in Micrometer 1.8.0. 1.8.0 contains the memory leak fix (as does 1.7.6), but could you please try 1.8.1, which contains the scheduler fix (#2879) too?
Yes, will try 1.8.1 and do some broker restarts in test-env.
Hi @jonatan-ivanov,
Had it running for a couple of days now in test-env, where there are scheduled broker restarts a couple of times a week. The heap size increases after a while and gravitates towards 100%. When doing a threaddump
It's kind of tricky to repro locally as it takes some time before the memory usage goes up. Our test-env has 12 brokers, which is a little bit hard to get up and running on my mac :)
It seems to be
Here are some screenshots from our monitoring of pods running 1.8.1:
Same app running with kafka-streams metrics disabled for several days:
Don't know how to proceed, tbh. The "registeredMeters" map was introduced in this issue: https://github.com/micrometer-metrics/micrometer/pull/2424/files
@JorgenRingen I think it should be checked here whether the leak is in the way metric objects are handled in the Micrometer classes, or whether the leak is in the Kafka client classes themselves. We were also seeing an increase in memory usage due to Metrics objects. In our scenario we are using Kafka consumers with wildcard topic subscriptions, and we are regularly adding and deleting topics. We noticed that the topic-related metrics concerning deleted topics are not cleared in the Kafka consumers. Also, Kafka-node-specific metrics don't get deleted when a node isn't used anymore. I think it could be helpful to have a closer look at the heap dumps in VisualVM.
Getting these numbers on a heap dump from early in the application lifecycle and on one from later on could show whether there is an increase. To gain more insight into the actual metrics in the Kafka client objects, this query can be used:
(Could also be applied to KafkaProducer and KafkaAdminClient objects.)
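As a complementary, coarser check alongside the heap dump queries, the Kafka clients' public metrics() API and the Micrometer registry can also be compared at runtime to see which side is actually growing between two points in time. A hedged sketch only (names like streams and registry are assumed to exist in the application; this is not from the original comment):

```java
// Hedged sketch: log how many metrics the Kafka client itself reports via its public
// metrics() API vs. how many Kafka-prefixed meters Micrometer currently holds.
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.streams.KafkaStreams;

public class MetricCountLogger {

    public static void log(KafkaStreams streams, MeterRegistry registry) {
        int kafkaMetricCount = streams.metrics().size();           // metrics held by the Kafka client
        long micrometerKafkaMeters = registry.getMeters().stream() // meters registered in Micrometer
                .filter(meter -> meter.getId().getName().startsWith("kafka."))
                .count();
        System.out.printf("kafka client metrics=%d, micrometer kafka meters=%d%n",
                kafkaMetricCount, micrometerKafkaMeters);
    }
}
```

If the first number stays roughly stable while the second keeps growing, the accumulation is on the Micrometer side rather than in the Kafka client.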
Thanks for the advice @calohmn - we ran consumer/producer queries on heap dumps from the same pod on k8s (Micrometer version 1.8.2), one taken when it starts (JVM uptime ~4 mins, around 40% heap occupied) and one after 11 hours (JVM uptime ~12 hours, 90+% heap occupied). In both dumps the number of metrics for a particular thread is almost the same for the query
But what is evident from the dumps: the number of instances of ImmutableTag keeps growing:
second dump (JVM uptime 11 hours 42 mins)
upd:
It seems that some Meters are not removed from registeredMeters even though they are removed from the registry.
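One way such a tracking set could be kept consistent with the registry is Micrometer's onMeterRemoved callback. A minimal sketch of that general idea only (this is not the actual fix that was later released):

```java
// Sketch only: keep a binder-internal tracking set in sync with the MeterRegistry,
// so the set cannot retain meters that the registry has already dropped.
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class TrackedMeters {

    private final Set<Meter> registeredMeters = ConcurrentHashMap.newKeySet();

    TrackedMeters(MeterRegistry registry) {
        // Whenever a meter is removed from the registry, also drop it from the tracking set.
        registry.config().onMeterRemoved(registeredMeters::remove);
    }

    void track(Meter meter) {
        registeredMeters.add(meter);
    }

    int size() {
        return registeredMeters.size();
    }
}
```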
@ghmulti Can you somehow reproduce it locally?
I find this even more interesting since the two should be "symmetrical". I made a change that should improve this a little.
Unfortunately I cannot reproduce it locally - but I will gladly test it again once your change is released (it is easily reproducible in our staging environment). Seems very promising 👍
@ghmulti I released 1.7.9 and 1.8.3; could you please check if you still see the same behavior?
@ghmulti Thank you! The heap utilization of 1.8.3 definitely seems promising; let's see if it stays that way. :) @JorgenRingen Could you please check if 1.8.3 fixed anything on your side?
Actually, @ghmulti and I are on the same "side"/company :-D
Cool. Let me close the issue then; please let me know if you are still experiencing it and we can reopen.
This issue might better be assigned to the 1.7.9 milestone, as it seems that the complete fix was shipped in the 1.7.9 release.
Describe the bug
When using KafkaStreamsMetrics, heap usage seems to be ever-increasing for objects of type io.micrometer.core.instrument.ImmutableTag.

Internally, KafkaMetrics has a Set holding "registeredMeters":
https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L92

Every minute a scheduler runs KafkaMetrics#checkAndBindMetrics, which retrieves all metrics from the metric supplier (KafkaStreams#metrics) and checks if currentMeters.equals(metrics.keySet()). If false, the metrics returned from the metric supplier are passed to KafkaMetrics#bindMeter, which registers a new meter and adds it to the registeredMeters set (https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L221). Any existing meter with the same name/tags is not removed, so every time this happens, the set grows.

This would not be a problem if the metric supplier (KafkaStreams#metrics) returned the same metrics on each call, but this can actually vary a lot. After rebalances, for example, KafkaStreams#metrics returns "consumer-fetch-manager" metrics for a while. Every time the result differs from the previous call, all kafka-streams metrics (~3,000-4,000) are added to the set.
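The accumulation pattern can be condensed into a short sketch (simplified names and binding logic; this is not the actual KafkaMetrics source, and metricsSupplier/registry stand in for KafkaStreams#metrics and the Micrometer registry):

```java
// Simplified sketch of the pattern described above; NOT the actual KafkaMetrics code.
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class LeakySketch {

    private final MeterRegistry registry;
    private final Supplier<Map<MetricName, ? extends Metric>> metricsSupplier;
    private final Set<Meter> registeredMeters = new HashSet<>();
    private Set<MetricName> currentMeters = new HashSet<>();

    LeakySketch(MeterRegistry registry, Supplier<Map<MetricName, ? extends Metric>> metricsSupplier) {
        this.registry = registry;
        this.metricsSupplier = metricsSupplier;
    }

    // Invoked by a scheduler every minute.
    void checkAndBindMetrics() {
        Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
        if (!currentMeters.equals(metrics.keySet())) {
            currentMeters = new HashSet<>(metrics.keySet());
            for (Metric metric : metrics.values()) {
                // A new meter is registered and tracked, but stale entries are never removed,
                // so registeredMeters only grows whenever the supplier's result changes
                // (e.g. after a rebalance).
                registeredMeters.add(bindMeter(metric));
            }
        }
    }

    private Meter bindMeter(Metric metric) {
        // Placeholder binding; the real binder builds gauges/counters with tags from the Kafka metric.
        return Gauge.builder("kafka." + metric.metricName().name(), metric,
                        m -> m.metricValue() instanceof Number
                                ? ((Number) m.metricValue()).doubleValue()
                                : Double.NaN)
                .register(registry);
    }
}
```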
Here is a heap dump of an app that has been running for a couple of days and gone through a lot of rebalances and broker restarts. It eventually OOMs:
I guess the fix here would be to clean up the existing registeredMeters set before adding new ones. The set was introduced in this issue: #2018
6032646
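A minimal sketch of that cleanup idea, reusing the fields from the sketch above (illustrative only - not the change that was eventually merged, and matching meters by name only is a simplification; real code would have to compare name and tags):

```java
// Illustrative cleanup: before re-binding, drop meters whose backing Kafka metric is gone,
// both from the MeterRegistry and from the tracking set.
void checkAndBindMetrics() {
    Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
    if (!currentMeters.equals(metrics.keySet())) {
        Set<String> currentNames = new HashSet<>();
        for (MetricName name : metrics.keySet()) {
            currentNames.add("kafka." + name.name());
        }
        registeredMeters.removeIf(meter -> {
            if (!currentNames.contains(meter.getId().getName())) {
                registry.remove(meter); // remove from the registry...
                return true;            // ...and removeIf drops it from the tracking set
            }
            return false;
        });

        currentMeters = new HashSet<>(metrics.keySet());
        for (Metric metric : metrics.values()) {
            registeredMeters.add(bindMeter(metric));
        }
    }
}
```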
Environment
To Reproduce
Added a sample app. Run multiple instances, restart one instance to trigger rebalances, and watch io.micrometer.core.instrument.binder.kafka.KafkaMetrics.registeredMeters increase after scheduled calls to "checkAndBindMetrics":
https://github.com/JorgenRingen/micrometer_1_7_5_high_heap_usage
Expected behavior
Heap usage should not gradually increase when the return value of KafkaStreams#metrics changes over time; it should remain stable.
Any additional context
Don't know why we're suddenly experiencing this issue; we have been using KafkaStreamsMetrics for some time, but an upgrade to kafka-streams 2.8.x and more variance between calls to KafkaStreams#metrics might be the cause.