
Possible "memory-leak" in KafkaStreamsMetrics #2843

Closed
JorgenRingen opened this issue Oct 24, 2021 · 21 comments
Closed

Possible "memory-leak" in KafkaStreamsMetrics #2843

JorgenRingen opened this issue Oct 24, 2021 · 21 comments
Assignees
Labels
bug A general bug module: micrometer-core An issue that is related to our core module performance Issues related to general performance regression A regression from a previous release
Milestone

Comments

JorgenRingen commented Oct 24, 2021

Describe the bug
When using KafkaStreamsMetrics, heap usage seems to increase indefinitely due to objects of type io.micrometer.core.instrument.ImmutableTag.

Internally, KafkaMetrics holds a Set of "registeredMeters":

private final Set<Meter> registeredMeters = ConcurrentHashMap.newKeySet();

https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L92

Every minute a scheduler runs KafkaMetrics#checkAndBindMetrics, which retrieves all metrics from the metric supplier (KafkaStreams#metrics) and checks whether currentMeters.equals(metrics.keySet()). If they differ, the metrics returned from the supplier are passed to KafkaMetrics#bindMeter, which registers a new meter and adds it to the registeredMeters set (https://github.com/micrometer-metrics/micrometer/blob/1.7.x/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L221). Any existing meter with the same name/tags is not removed, so every time this happens the set grows.

This would not be a problem if the metric supplier (KafkaStreams#metrics) returned the same metrics on each call, but the result can actually vary a lot. After rebalances, for example, KafkaStreams#metrics returns "consumer-fetch-manager" metrics for a while. Every time the result differs from the previous call, all kafka-streams metrics (roughly 3,000-4,000) are added to the set again.
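
To make the pattern concrete, here is a simplified sketch of that logic (not the actual Micrometer source; the class, the meter construction, and the naming are approximations for illustration only):

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Simplified sketch of the check-and-bind pattern described above (illustration only).
class KafkaMetricsLeakSketch {
    private final Supplier<Map<MetricName, ? extends Metric>> metricsSupplier; // e.g. kafkaStreams::metrics
    private final Set<MetricName> currentMeters = new HashSet<>();
    private final Set<Meter> registeredMeters = ConcurrentHashMap.newKeySet();

    KafkaMetricsLeakSketch(Supplier<Map<MetricName, ? extends Metric>> metricsSupplier) {
        this.metricsSupplier = metricsSupplier;
    }

    // Called by the scheduler every minute.
    void checkAndBindMetrics(MeterRegistry registry) {
        Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
        if (!currentMeters.equals(metrics.keySet())) {
            currentMeters.clear();
            currentMeters.addAll(metrics.keySet());
            metrics.forEach((name, metric) -> bindMeter(registry, name, metric));
        }
    }

    private void bindMeter(MeterRegistry registry, MetricName name, Metric metric) {
        Meter meter = Gauge.builder(name.group() + "." + name.name(), metric,
                        m -> ((Number) m.metricValue()).doubleValue())
                .register(registry);
        registeredMeters.add(meter); // meters are only ever added here, never removed, so the set keeps growing
    }
}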

Here is a heap dump of an app that has been running for a couple of days and has gone through a lot of rebalances and broker restarts. It eventually OOMs:

[heap dump screenshots]

I guess the fix here would be to clean up the existing registeredMeters set before adding new ones. The set was introduced in issue #2018 (commit 6032646).
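
For illustration, a minimal sketch of what such a clean-up could look like (hypothetical, continuing the sketch above; this is not the fix that was eventually merged):

// Hypothetical clean-up step: before binding the new snapshot, drop meters whose
// Kafka metric is no longer reported, from both the registry and registeredMeters.
void checkAndBindMetrics(MeterRegistry registry) {
    Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
    if (!currentMeters.equals(metrics.keySet())) {
        Set<String> reportedNames = new HashSet<>();
        metrics.keySet().forEach(name -> reportedNames.add(name.group() + "." + name.name()));
        registeredMeters.removeIf(meter -> {
            if (!reportedNames.contains(meter.getId().getName())) {
                registry.remove(meter); // keep the registry and registeredMeters in sync
                return true;
            }
            return false;
        });
        currentMeters.clear();
        currentMeters.addAll(metrics.keySet());
        metrics.forEach((name, metric) -> bindMeter(registry, name, metric));
    }
}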

Environment

  • Micrometer version: 1.7.5
  • Micrometer registry: prometheus
  • OS: macOS, ubuntu
  • Java version: openjdk version "1.8.0_302"

To Reproduce
A sample app is provided. Run multiple instances, restart one instance to trigger rebalances, and watch io.micrometer.core.instrument.binder.kafka.KafkaMetrics.registeredMeters grow after the scheduled calls to checkAndBindMetrics:
https://github.com/JorgenRingen/micrometer_1_7_5_high_heap_usage

Expected behavior
Heap usage should not gradually increase when the return value of KafkaStreams#metrics changes over time; it should remain stable.

Any additional context
We don't know why we're suddenly experiencing this issue; we have been using KafkaStreamsMetrics for some time. An upgrade to kafka-streams 2.8.x, and with it more variance between calls to KafkaStreams#metrics, might be the cause.

JorgenRingen added the bug label Oct 24, 2021
shakuzen added the module: micrometer-core label Oct 26, 2021

ghmulti commented Nov 9, 2021

Hi, I wonder about the status of this issue: did you manage to figure out the root cause or how it could be solved? Any help needed? :)

jonatan-ivanov (Member) commented:

@JorgenRingen Thank you very much for catching and reporting this, and special thanks for the detailed description and the reproducer. I was able to reproduce the issue using it, and I also wrote a test that reproduces it, see: 32e2e76

jonatan-ivanov added this to the 1.6.13 milestone Nov 9, 2021
jonatan-ivanov added the performance and regression labels Nov 10, 2021

JorgenRingen (Author) commented:

@jonatan-ivanov: Thanks for the update. Tested 1.7.6, but unfortunately we see the same problem after the application has been running for some time and after there have been some broker restarts and rebalances. Haven't had time to look into it much yet.

[screenshot]

The registeredMeters set seems to hold a lot of references to ImmutableTag:
[heap dump screenshots]

jonatan-ivanov (Member) commented:

@JorgenRingen I'm checking this, but I haven't been able to reproduce it yet (I will continue investigating next week).

What I found so far:

  • Your sample app binds KafkaMetrics twice: Spring Boot does it once, and your config does it again (see the sketch below)
  • For some reason, after a while (it can happen pretty quickly) the scheduler in KafkaMetrics stops (it seems to have something to do with restarting the second app)
  • The stopped scheduler can explain why I don't see KafkaMetrics leaking
  • When the scheduler is running, I see that it is removing metrics that are no longer available

Could you please double check if you are using Spring Boot 2.5.7 and Micrometer 1.7.6?
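
(For reference, this is the kind of double binding meant in the first bullet above, sketched with a hypothetical manual registration: if Spring Boot already binds Kafka Streams metrics through its auto-configuration, a manual binding like this creates a second KafkaMetrics instance with its own scheduler and its own registeredMeters set.)

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaStreamsMetrics;
import org.apache.kafka.streams.KafkaStreams;

class ManualBindingSketch {
    // Hypothetical example only: binding manually in addition to the auto-configured
    // binding means every Kafka Streams metric is registered and tracked twice.
    void bindManually(KafkaStreams kafkaStreams, MeterRegistry registry) {
        new KafkaStreamsMetrics(kafkaStreams).bindTo(registry);
    }
}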

jonatan-ivanov (Member) commented:

I was able to locate and fix the reason the scheduler stops; I'm going to merge the fix next week.

JorgenRingen commented Nov 25, 2021

Great. It definitely was 1.7.6; a little unsure about 2.5.7, unfortunately. A little weird, but broker restarts actually triggered the memory increase: we were testing rolling restarts of brokers, and the 1.7.6 apps OOM'ed or spiked in memory usage. Don't know if that's a clue; it's hard to reproduce since broker restarts don't happen very frequently. Broker restarts also cause a lot of rebalances. Will test the fix next week! :)

jonatan-ivanov (Member) commented:

@JorgenRingen I fixed the scheduler issue (you can try 1.7.7-SNAPSHOT), but I'm still not able to reproduce the issue. Here's what I'm doing:

  1. I start Kafka in a Docker container using this docker-compose file: https://github.com/jonatan-ivanov/local-services/blob/main/kafka/docker-compose.yml
  2. I removed streamsBuilderCustomizer from your reproducer (so that KafkaMetrics won't be registered twice)
  3. I upgraded Spring Boot to 2.5.7 in your sample (changing the parent pom version)
  4. I'm using a local build of Micrometer (identical to 1.7.7-SNAPSHOT, but with some added logging)
  5. I start two instances of the reproducer
  6. I monitor one of them with VisualVM and restart the other every 1-5 minutes; I'm also restarting the Kafka container

By restarting Kafka, I did not see any increase. By restarting the other app, I can see some increase, but meters also get removed, and after a point the increase stops (also, it does not really have a visible impact on the heap).

JorgenRingen (Author) commented:

Hi, sorry for the late reply. Will test the latest version. We upgraded to Spring Boot 2.6.x, so I guess the fix is in Micrometer 1.8.x as well. As mentioned, it's hard to reproduce locally; this actually happened during a lot of broker restarts. Haven't been able to reproduce it locally, unfortunately.

jonatan-ivanov commented Dec 9, 2021

Spring Boot 2.6.1 is the latest; it should bring in Micrometer 1.8.0. 1.8.0 contains the memory leak fix (as does 1.7.6), but could you please try 1.8.1, which contains the scheduler fix (#2879) too?

JorgenRingen (Author) commented:

Yes, will try 1.8.1 and do some broker restarts in test-env.

JorgenRingen commented Dec 16, 2021

Hi @jonatan-ivanov,

We've had it running for a couple of days now in the test env, where there are scheduled broker restarts a couple of times a week. The heap size increases after a while and gravitates towards 100%. In a heap dump, io.micrometer.core.instrument.ImmutableTag seems to take up around 20% of total memory.

It's kind of tricky to reproduce locally, as it takes some time before the memory usage goes up. Our test env has 12 brokers, which is a little hard to get up and running on my Mac :)

It seems to be KafkaStreamsMetrics#registeredMeters that retains all the metrics.
[heap dump screenshots]

Here are some screenshots from our monitoring of pods running 1.8.1:

Example 1: [memory usage graph]

Example 2: [memory usage graph]

Same app running with kafka-streams metrics disabled for several days: [memory usage graph]

Don't know how to proceed, to be honest. The registeredMeters set was introduced in this PR: https://github.com/micrometer-metrics/micrometer/pull/2424/files
Maybe there are alternative ways to solve this. Happy to test out a snapshot.

calohmn commented Jan 12, 2022

@JorgenRingen I think it should be checked whether the leak is in the way metric objects are handled in the Micrometer classes, or whether it is in the Kafka client classes themselves.

We were also seeing an increase in memory usage due to metrics objects. In our scenario we are using Kafka consumers with wildcard topic subscriptions, and we are regularly adding and deleting topics. We noticed that topic-related metrics for deleted topics are not cleared in the Kafka consumers. Also, Kafka-node-specific metrics don't get deleted when a node isn't used anymore.

I think it could be helpful to have a closer look at the heap dumps in VisualVM. The following queries can be used in the OQL Console to check the number of metrics in the Kafka client objects:

select { producer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.producer.KafkaProducer c

select { consumer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.consumer.KafkaConsumer c

select { admin: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.admin.KafkaAdminClient c

Getting these numbers on a heap dump from early in the application lifecycle and on one from later on could show whether there is an increase.

To gain more insight into the actual metrics in the Kafka client objects, this query can be used:

select { consumer: c.clientId.toString(), metrics: map(filter(toArray(c.metrics.metrics.table), "it != null"),
  function (t) {
     var res = t.key.name.toString() + " [";
     /* t.key.tags.table[0] with client-id omitted */
     if (t.key.tags.table[1] != null) {
        res += t.key.tags.table[1].key.toString() + "=" + t.key.tags.table[1].value.toString() + ", ";
     }
     if (t.key.tags.table[2] != null) {
        res += t.key.tags.table[2].key.toString() + "=" + t.key.tags.table[2].value.toString() + ", ";
     }
     res += "]";
     return res;
  }) }
  from org.apache.kafka.clients.consumer.KafkaConsumer c

(Could also be applied for KafkaProducer and KafkaAdminClient objects.)

ghmulti commented Jan 25, 2022

Thanks for the advice @calohmn - we ran the consumer/producer queries on heap dumps from the same pod on k8s (Micrometer version 1.8.2), one taken shortly after startup (JVM uptime ~4 mins, around 40% heap occupied) and one after 11 hours (JVM uptime ~12 hours, 90+% heap occupied).
[heap usage screenshot]

In both dumps the number of metrics per Kafka Streams consumer thread is almost the same for the query

select { consumer: c.clientId.toString(), metrics: count(c.metrics.metrics.table, "it != null") } from org.apache.kafka.clients.consumer.KafkaConsumer c

(400-450 metrics per consumer thread), which, I guess, means that the Kafka consumer/producer metrics are not the problem (the KafkaConsumer -> metrics -> metrics concurrent map is not growing over time).

But what is evident from the dumps is that the number of ImmutableTag instances keeps growing:

First dump (JVM uptime 3 min 43 sec), heap size 112,856,814 B:
io.micrometer.core.instrument.ImmutableTag 85,978 instances (4.9%)

Second dump (JVM uptime 11 h 42 min), heap size 223,229,433 B:
io.micrometer.core.instrument.ImmutableTag 720,358 instances (16.1%)

Update: this must be related to registeredMeters in KafkaStreamsMetrics:

select count(km.registeredMeters.map.table, "it != null") from io.micrometer.core.instrument.binder.kafka.KafkaStreamsMetrics km

= 21457 (dump 1)
= 175712 (dump 2)

It seems that some Meters are not removed from registeredMeters even though they are removed from the registry.

jonatan-ivanov added a commit that referenced this issue Feb 3, 2022
jonatan-ivanov (Member) commented:

@ghmulti Can you somehow reproduce it locally?
Also, it is interesting that on your heap utilization graph everything seems fine for the first 9 hours, and then 4 hours later it is over 90%.

seems that some "Meter"s are not removed from registeredMeters even though they are removed from registry

I find this even more interesting since the two should be "symmetrical". I made a change that should improve this a little.

ghmulti commented Feb 9, 2022

Unfortunately I cannot reproduce it locally, but I will gladly test it again once your change is released (it is easily reproducible in our staging environment). Seems very promising 👍

jonatan-ivanov (Member) commented:

@ghmulti I released 1.7.9 and 1.8.3; could you please check if you still see the same behavior?
The expectation is that registeredMeters and the registry should be symmetrical: either we have a leak in both, or in neither.
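
For illustration, the symmetry described here amounts to something like the following (a hypothetical sketch reusing the names from the earlier sketch in this thread, not the change that was shipped):

// Hypothetical sketch: removing a stale meter from the registry and from
// registeredMeters in the same step keeps the two collections symmetrical.
void removeStaleMeter(MeterRegistry registry, Meter meter) {
    registry.remove(meter.getId()); // removes the meter from the registry
    registeredMeters.remove(meter); // without this, the set keeps referencing the meter's Id and its ImmutableTags
}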

ghmulti commented Feb 18, 2022

The behaviour in 1.8.3 seems to have changed:
[heap usage graph]
Heap usage is not reaching 100% anymore and seems to be cleaned up properly. We'll continue observations and take a closer look at the heap dump 👏

jonatan-ivanov (Member) commented:

@ghmulti Thank you! The heap utilization on 1.8.3 definitely seems promising; let's see if it stays that way. :)

@JorgenRingen Could you please check if 1.8.3 fixed anything on your side?

JorgenRingen (Author) commented:

Actually, @ghmulti and I are on the same "side"/company :-D

jonatan-ivanov (Member) commented:

Cool. Let me close the issue then; please let me know if you are still experiencing it and we can reopen.

izeye (Contributor) commented Mar 12, 2022

This issue might better be assigned to the 1.7.9 milestone, as it seems the complete fix shipped in the 1.7.9 release.

shakuzen modified the milestones: 1.6.13, 1.7.9 Mar 15, 2022