
Possible "memory-leak" in KafkaStreamsMetrics #2843

Closed
JorgenRingen opened this issue Oct 24, 2021 · 21 comments
Closed

Possible "memory-leak" in KafkaStreamsMetrics #2843

JorgenRingen opened this issue Oct 24, 2021 · 21 comments
Assignees
Labels
bug A general bug module: micrometer-core An issue that is related to our core module performance Issues related to general performance regression A regression from a previous release
Milestone

Comments

JorgenRingen commented Oct 24, 2021

Describe the bug
When using KafkaStreamsMetrics, heap usage seems to increase indefinitely due to objects of type io.micrometer.core.instrument.ImmutableTag.

Internally, KafkaMetrics holds a Set of "registeredMeters":

private final Set<Meter> registeredMeters = ConcurrentHashMap.newKeySet();

https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L92

Every minute a scheduler runs KafkaMetrics#checkAndBindMetrics, which retrieves all metrics from the metric supplier (KafkaStreams#metrics) and checks whether currentMeters.equals(metrics.keySet()). If they differ, the metrics returned from the supplier are passed to KafkaMetrics#bindMeter, which registers a new meter and adds it to the registeredMeters set (https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L221). Any existing meter with the same name/tags is not removed, so every time this happens the set grows.

This would not be a problem if the metric supplier (KafkaStreams#metrics) returned the same metrics on each call, but the result can actually vary a lot. After rebalances, for example, KafkaStreams#metrics returns "consumer-fetch-manager" metrics for a while. Every time the result differs from the previous call, all kafka-streams metrics (roughly 3,000-4,000) are added to the set again.
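
To make the pattern concrete, here is a simplified sketch of that logic (not the actual Micrometer source; the class, the meter construction, and the naming are approximations for illustration only):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Simplified sketch of the check-and-bind pattern described above (illustration only).
class KafkaMetricsLeakSketch {
    private final Supplier<Map<MetricName, ? extends Metric>> metricsSupplier; // e.g. kafkaStreams::metrics
    private final Set<MetricName> currentMeters = new HashSet<>();
    private final Set<Meter> registeredMeters = ConcurrentHashMap.newKeySet();

    KafkaMetricsLeakSketch(Supplier<Map<MetricName, ? extends Metric>> metricsSupplier) {
        this.metricsSupplier = metricsSupplier;
    }

    // Called by the scheduler every minute.
    void checkAndBindMetrics(MeterRegistry registry) {
        Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
        if (!currentMeters.equals(metrics.keySet())) {
            currentMeters.clear();
            currentMeters.addAll(metrics.keySet());
            metrics.forEach((name, metric) -> bindMeter(registry, name, metric));
        }
    }

    private void bindMeter(MeterRegistry registry, MetricName name, Metric metric) {
        Meter meter = Gauge.builder(name.group() + "." + name.name(), metric,
                        m -> ((Number) m.metricValue()).doubleValue())
                .register(registry);
        registeredMeters.add(meter); // meters are only ever added here, never removed, so the set keeps growing
    }
}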

Here is a heap dump of an app that has been running for a couple of days and has gone through a lot of rebalances and broker restarts. It eventually OOMs:

[heap dump screenshots]

I guess the fix here would be to clean up the existing registeredMeters set before adding new ones. The set was introduced in issue #2018 (commit 6032646).
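
For illustration, a minimal sketch of what such a clean-up could look like (hypothetical, continuing the sketch above; this is not the fix that was eventually merged):

// Hypothetical clean-up step: before binding the new snapshot, drop meters whose
// Kafka metric is no longer reported, from both the registry and registeredMeters.
void checkAndBindMetrics(MeterRegistry registry) {
    Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
    if (!currentMeters.equals(metrics.keySet())) {
        Set<String> reportedNames = new HashSet<>();
        metrics.keySet().forEach(name -> reportedNames.add(name.group() + "." + name.name()));
        registeredMeters.removeIf(meter -> {
            if (!reportedNames.contains(meter.getId().getName())) {
                registry.remove(meter); // keep the registry and registeredMeters in sync
                return true;
            }
            return false;
        });
        currentMeters.clear();
        currentMeters.addAll(metrics.keySet());
        metrics.forEach((name, metric) -> bindMeter(registry, name, metric));
    }
}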

Environment

  • Micrometer version: 1.7.5
  • Micrometer registry: prometheus
  • OS: macOS, ubuntu
  • Java version: openjdk version "1.8.0_302"

To Reproduce
A sample app is provided. Run multiple instances, restart one instance to trigger rebalances, and watch io.micrometer.core.instrument.binder.kafka.KafkaMetrics.registeredMeters grow after the scheduled calls to checkAndBindMetrics:
https://github.com/JorgenRingen/micrometer_1_7_5_high_heap_usage

Expected behavior
Heap usage should not gradually increase when the return value of KafkaStreams#metrics changes over time; it should remain stable.

Any additional context
We don't know why we're suddenly experiencing this issue; we have been using KafkaStreamsMetrics for some time. An upgrade to kafka-streams 2.8.x, and with it more variance between calls to KafkaStreams#metrics, might be the cause.

JorgenRingen added the bug label Oct 24, 2021
shakuzen added the module: micrometer-core label Oct 26, 2021

ghmulti commented Nov 9, 2021

Hi, I wonder about the status of this issue: did you manage to figure out the root cause or how it could be solved? Any help needed? :)

jonatan-ivanov (Member) commented:

@JorgenRingen Thank you very much for catching and reporting this, and special thanks for the detailed description and the reproducer. I was able to reproduce the issue using it, and I also wrote a test that reproduces it, see: 32e2e76

jonatan-ivanov added this to the 1.6.13 milestone Nov 9, 2021
jonatan-ivanov added the performance and regression labels Nov 10, 2021

JorgenRingen (Author) commented:

@jonatan-ivanov: Thanks for the update. Tested 1.7.6, but unfortunately we see the same problem after the application has been running for some time and after there have been some broker restarts and rebalances. Haven't had time to look into it much yet.

[screenshot]

The registeredMeters set seems to hold a lot of references to ImmutableTag:
[heap dump screenshots]

jonatan-ivanov (Member) commented:

@JorgenRingen I'm checking this, but I haven't been able to reproduce it yet (I will continue investigating next week).

What I found so far:

  • Your sample app binds KafkaMetrics twice: Spring Boot does it once, and your config does it again (see the sketch below)
  • For some reason, after a while (it can happen pretty quickly) the scheduler in KafkaMetrics stops (it seems to have something to do with restarting the second app)
  • The stopped scheduler can explain why I don't see KafkaMetrics leaking
  • When the scheduler is running, I see that it is removing metrics that are no longer available

Could you please double check if you are using Spring Boot 2.5.7 and Micrometer 1.7.6?
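
(For reference, this is the kind of double binding meant in the first bullet above, sketched with a hypothetical manual registration: if Spring Boot already binds Kafka Streams metrics through its auto-configuration, a manual binding like this creates a second KafkaMetrics instance with its own scheduler and its own registeredMeters set.)

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaStreamsMetrics;
import org.apache.kafka.streams.KafkaStreams;

class ManualBindingSketch {
    // Hypothetical example only: binding manually in addition to the auto-configured
    // binding means every Kafka Streams metric is registered and tracked twice.
    void bindManually(KafkaStreams kafkaStreams, MeterRegistry registry) {
        new KafkaStreamsMetrics(kafkaStreams).bindTo(registry);
    }
}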

jonatan-ivanov (Member) commented:

I was able to locate and fix the reason the scheduler stops; I'm going to merge the fix next week.

JorgenRingen commented Nov 25, 2021

Great. It definitely was 1.7.6; a little unsure about 2.5.7, unfortunately. A little weird, but broker restarts actually triggered the memory increase: we were testing rolling restarts of brokers, and the 1.7.6 apps OOM'ed or spiked in memory usage. Don't know if that's a clue; it's hard to reproduce since broker restarts don't happen very frequently. Broker restarts also cause a lot of rebalances. Will test the fix next week! :)

jonatan-ivanov (Member) commented:

@JorgenRingen I fixed the scheduler issue (you can try 1.7.7-SNAPSHOT), but I'm still not able to reproduce the issue. Here's what I'm doing:

  1. I start Kafka in a Docker container using this docker-compose file: https://github.com/jonatan-ivanov/local-services/blob/main/kafka/docker-compose.yml
  2. I removed streamsBuilderCustomizer from your reproducer (so that KafkaMetrics won't be registered twice)
  3. I upgraded Spring Boot to 2.5.7 in your sample (changing the parent pom version)
  4. I'm using a local build of Micrometer (identical to 1.7.7-SNAPSHOT, but with some added logging)
  5. I start two instances of the reproducer
  6. I monitor one of them with VisualVM and restart the other every 1-5 minutes; I'm also restarting the Kafka container

By restarting Kafka, I did not see any increase. By restarting the other app, I can see some increase, but meters also get removed, and after a point the increase stops (also, it does not really have a visible impact on the heap).

JorgenRingen (Author) commented:

Hi, sorry for the late reply. Will test the latest version. We upgraded to Spring Boot 2.6.x, so I guess the fix is in Micrometer 1.8.x as well. As mentioned, it's hard to reproduce locally; this actually happened during a lot of broker restarts. Haven't been able to reproduce it locally, unfortunately.

jonatan-ivanov commented Dec 9, 2021

Spring Boot 2.6.1 is the latest; it should bring in Micrometer 1.8.0. 1.8.0 contains the memory leak fix (as does 1.7.6), but could you please try 1.8.1, which contains the scheduler fix (#2879) too?

JorgenRingen (Author) commented:

Yes, will try 1.8.1 and do some broker restarts in test-env.

JorgenRingen commented Dec 16, 2021

Hi @jonatan-ivanov,

We've had it running for a couple of days now in the test env, where there are scheduled broker restarts a couple of times a week. The heap size increases after a while and gravitates towards 100%. In a heap dump, io.micrometer.core.instrument.ImmutableTag seems to take up around 20% of total memory.

It's kind of tricky to reproduce locally, as it takes some time before the memory usage goes up. Our test env has 12 brokers, which is a little hard to get up and running on my Mac :)

It seems to be KafkaStreamsMetrics#registeredMeters that retains all the metrics.
[heap dump screenshots]

Here are some screenshots from our monitoring of pods running 1.8.1:

Example 1: [memory usage graph]

Example 2: [memory usage graph]

Same app running with kafka-streams metrics disabled for several days: [memory usage graph]

Don't know how to proceed, to be honest. The registeredMeters set was introduced in this PR: https://github.com/micrometer-metrics/micrometer/pull/2424/files
Maybe there are alternative ways to solve this. Happy to test out a snapshot.

calohmn commented Jan 12, 2022

@JorgenRingen I think it should be checked whether the leak is in the way metric objects are handled in the Micrometer classes, or whether it is in the Kafka client classes themselves.

We were also seeing an increase in memory usage due to metrics objects. In our scenario we are using Kafka consumers with wildcard topic subscriptions, and we are regularly adding and deleting topics. We noticed that topic-related metrics for deleted topics are not cleared in the Kafka consumers. Also, Kafka-node-specific metrics don't get deleted when a node isn't used anymore.

I think it could be helpful to have a closer look at the heap dumps in VisualVM. The following queries can be used in the OQL Console to check the number of metrics in the Kafka client objects:

select { producer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.producer.KafkaProducer c

select { consumer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.consumer.KafkaConsumer c

select { admin: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.admin.KafkaAdminClient c

Getting these numbers on a heap dump from early in the application lifecycle and on one from later on could show whether there is an increase.

To gain more insight into the actual metrics in the Kafka client objects, this query can be used:

select { consumer: c.clientId.toString(), metrics: map(filter(toArray(c.metrics.metrics.table), "it != null"),
  function (t) {
     var res = t.key.name.toString() + " [";
     /* t.key.tags.table[0] with client-id omitted */
     if (t.key.tags.table[1] != null) {
        res += t.key.tags.table[1].key.toString() + "=" + t.key.tags.table[1].value.toString() + ", ";
     }
     if (t.key.tags.table[2] != null) {
        res += t.key.tags.table[2].key.toString() + "=" + t.key.tags.table[2].value.toString() + ", ";
     }
     res += "]";
     return res;
  }) }
  from org.apache.kafka.clients.consumer.KafkaConsumer c

(Could also be applied for KafkaProducer and KafkaAdminClient objects.)

ghmulti commented Jan 25, 2022

Thanks for the advice @calohmn - we ran the consumer/producer queries on heap dumps from the same pod on k8s (Micrometer version 1.8.2), one taken shortly after startup (JVM uptime ~4 mins, around 40% heap occupied) and one after 11 hours (JVM uptime ~12 hours, 90+% heap occupied).
[heap usage screenshot]

In both dumps the number of metrics per Kafka Streams consumer thread is almost the same for the query

select { consumer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.consumer.KafkaConsumer c

(400-450 metrics per consumer thread), which, I guess, means that the Kafka consumer/producer metrics are not the problem (the KafkaConsumer -> metrics -> metrics concurrent map is not growing over time).

But what is evident from the dumps is that the number of ImmutableTag instances keeps growing:

First dump (JVM uptime 3 min 43 sec), heap size 112,856,814 B:
io.micrometer.core.instrument.ImmutableTag 85,978 instances (4.9%)

Second dump (JVM uptime 11 h 42 min), heap size 223,229,433 B:
io.micrometer.core.instrument.ImmutableTag 720,358 instances (16.1%)

Update: this must be related to registeredMeters in KafkaStreamsMetrics:

select count(km.registeredMeters.map.table, "it != null") from io.micrometer.core.instrument.binder.kafka.KafkaStreamsMetrics km

= 21457 (dump 1)
= 175712 (dump 2)

It seems that some Meters are not removed from registeredMeters even though they are removed from the registry.

jonatan-ivanov added a commit that referenced this issue Feb 3, 2022
jonatan-ivanov (Member) commented:

@ghmulti Can you somehow reproduce it locally?
Also, it is interesting that on your heap utilization graph everything seems fine for the first 9 hours, and then 4 hours later it is over 90%.

seems that some "Meter"s are not removed from registeredMeters even though they are removed from registry

I find this even more interesting since the two should be "symmetrical". I made a change that should improve this a little.

ghmulti commented Feb 9, 2022

Unfortunately I cannot reproduce it locally, but I will gladly test it again once your change is released (it is easily reproducible in our staging environment). Seems very promising 👍

jonatan-ivanov (Member) commented:

@ghmulti I released 1.7.9 and 1.8.3; could you please check if you still see the same behavior?
The expectation is that registeredMeters and the registry should be symmetrical: either we have a leak in both, or in neither.
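
For illustration, the symmetry described here amounts to something like the following (a hypothetical sketch reusing the names from the earlier sketch in this thread, not the change that was shipped):

// Hypothetical sketch: removing a stale meter from the registry and from
// registeredMeters in the same step keeps the two collections symmetrical.
void removeStaleMeter(MeterRegistry registry, Meter meter) {
    registry.remove(meter.getId()); // removes the meter from the registry
    registeredMeters.remove(meter); // without this, the set keeps referencing the meter's Id and its ImmutableTags
}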

ghmulti commented Feb 18, 2022

The behaviour in 1.8.3 seems to have changed:
[heap usage graph]
Heap usage is not reaching 100% anymore and seems to be cleaned up properly. We'll continue observations and take a closer look at the heap dump 👏

jonatan-ivanov (Member) commented:

@ghmulti Thank you! The heap utilization on 1.8.3 definitely seems promising; let's see if it stays that way. :)

@JorgenRingen Could you please check if 1.8.3 fixed anything on your side?

JorgenRingen (Author) commented:

Actually, @ghmulti and I are on the same "side"/company :-D

jonatan-ivanov (Member) commented:

Cool. Let me close the issue then; please let me know if you are still experiencing it and we can reopen.

izeye (Contributor) commented Mar 12, 2022

This issue might better be assigned to the 1.7.9 milestone, as it seems the complete fix shipped in the 1.7.9 release.

shakuzen modified the milestones: 1.6.13, 1.7.9 Mar 15, 2022