Potential memory leak in resolver-dns #2080
Comments
That heap usage graph doesn't really look like it's growing to me. I see that there was more memory used in region B than region A, but it looks like the usage decreased after that so I wouldn't describe it as a leak. In fact, I would guess region B covers one or more reconnection events. The allocation you highlighted would then be several subchannels getting allocated corresponding to the list of addresses returned by the DNS resolution, and then the dip after the end of that region would be the unused subchannels getting garbage collected after one has a connection established.

With that being said, if you still have reason to believe this is a leak, figuring this out would be a lot easier with the heap dump information like you shared in the previous issue. In that issue, the big list of
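One way to check whether region B really lines up with reconnection and re-resolution activity is grpc-js's built-in tracing, which is controlled by environment variables. A minimal sketch, assuming the variables are visible before grpc-js is first loaded; the `node server.js` invocation is a placeholder for however the affected service is actually started:

```ts
// Enable grpc-js's built-in debug tracing for DNS resolution and subchannel
// lifecycle events. The variables are read when grpc-js is loaded, so the
// safest place to set them is the environment that launches the process:
//
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=dns_resolver,subchannel node server.js
//
// Setting them in code also works, as long as it happens before the first
// require of @grpc/grpc-js:
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'dns_resolver,subchannel';

const grpc = require('@grpc/grpc-js'); // loaded after the env vars are set
```

Bursts of subchannel creation in the trace output around region B would support the reconnection explanation.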
Thanks for getting back so quickly! The usage decreased afterwards because I rolled the change back after this. I can try to gather heap dump information like I did in the previous issue; that's proving a bit more difficult with how things are currently set up -- will provide an update soon.
OK, I think I may have misunderstood the profile information. I interpreted it as allocations that occurred during the highlighted time spans, but I guess it's actually all objects on the heap during that time span, with stack traces of what performed the allocation. I see now why it looks like a memory leak, especially knowing that the graph after the end of B does not profile memory of the same process.
What exactly are you doing to destroy clients? You should be calling close() on them.
Sorry, I updated the original so it's clearer. Yes, the affected process is within the highlighted time intervals; everything else is from a different process. Yep, we are calling close to destroy the clients when we are done. Specifically (with details commented out):
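The snippet originally posted here did not survive in this copy of the thread. As a stand-in, here is a hedged sketch of the pattern being described: a pooled-client factory whose destroy step calls close(). The target address and option values are placeholders, not the reporter's real configuration:

```ts
import * as grpc from '@grpc/grpc-js';

// Illustrative stand-in for the omitted snippet: a pooled-client factory whose
// destroy step calls close(). Target address and options are placeholders.
function createClient(): grpc.Client {
  return new grpc.Client(
    'my-service.internal:50051',       // hypothetical target
    grpc.credentials.createInsecure(),
    { 'grpc.enable_channelz': 0 }      // option the reporter mentions in the issue body
  );
}

function destroyClient(client: grpc.Client): void {
  // Releases the client's channel (and its subchannels); skipping this is a
  // common way for pooled clients to pile up on the heap.
  client.close();
}
```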
@murgatroid99 I set up a test environment on our non-production server (with no traffic) and can confirm that memory is growing. I tried to get a heap dump to analyze, but due to the infrastructure it's a bit more limited compared to my other screenshot. This one is from our Datadog dashboards tracking memory. The purple line is our server running on grpc-js 1.3.8 and the blue line is the server running on grpc-js 1.5.10. The purple line has traffic, but the blue line does not. The blue line dies early because I terminated the service before it crashed on its own. This one is my attempt at getting a heap dump (very limited, since I had to go through a lot of workarounds to even get it). Let me know if you need any more details; I'll try my best, but this is what we have to work with/work around.
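When a full profiler setup is hard to wire into the infrastructure, Node's built-in v8.writeHeapSnapshot() can produce a snapshot that loads in Chrome DevTools. A minimal sketch, assuming it is acceptable to add a signal handler and write the file to the container's filesystem:

```ts
import { writeHeapSnapshot } from 'v8';

// Write a .heapsnapshot file (openable in Chrome DevTools > Memory) whenever
// the process receives SIGUSR2, e.g. `kill -USR2 <pid>` inside the pod.
process.on('SIGUSR2', () => {
  const file = writeHeapSnapshot(); // defaults to a timestamped Heap-*.heapsnapshot in cwd
  console.log(`Heap snapshot written to ${file}`);
});
```

Recent Node releases can do the same without a code change via the --heapsnapshot-signal=SIGUSR2 command-line flag.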
How many of those were there? If there is no traffic, what gRPC operations actually happen? Does it make any requests using the client pool? I propose an experiment: modify your client creation code to add the option
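The name of the suggested option was cut off in this copy of the thread; given the later references to a "keepalive modification", it was presumably one of the keepalive channel arguments. A hedged sketch of how such options are passed when constructing a client, with arbitrary values and a placeholder target rather than the option actually requested:

```ts
import * as grpc from '@grpc/grpc-js';

// Standard keepalive-related channel arguments in grpc-js; which one was
// actually suggested in the comment above is not recoverable from this copy.
const client = new grpc.Client(
  'my-service.internal:50051',                // hypothetical target
  grpc.credentials.createInsecure(),
  {
    'grpc.keepalive_time_ms': 20_000,         // send a keepalive ping every 20s
    'grpc.keepalive_timeout_ms': 10_000,      // wait up to 10s for the ping ack
    'grpc.keepalive_permit_without_calls': 1, // ping even with no active calls
  }
);
```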
In the final heap dump, there were about 1500 of them. I guess "no traffic" may be wrong here - it might just be very little traffic. We have some keepalive calls every now and then for pod health, and with dd-trace running we have some of those as well. Besides whatever else is going on in the background (unsure what else), our clients are initialized on startup, so it may be re-creating and destroying them constantly? I can try adding the option and getting an update.
A couple of things are interesting here. I wouldn't expect to see a large number of
I am experiencing a similar problem. I'm still investigating, but what I find interesting in resolver-dns.ts is that the flag
@bartslinger Good catch. I don't think that would cause a memory leak, but I'll fix it anyway.
Thanks for the update. I'm still unable to reproduce my problem reliably in 1.6.5, but it definitely seems to be gone since 1.6.6.
It looks like (in 1.6.5) when
We saw a similar issue when we pushed the
We were able to reproduce the behavior locally (see the
@murgatroid99 - would it be difficult to add tests around the change you made in #2098? i.e., to prevent future regressions?
The requested tests have been added in #2105. @sam-la-compass Can you check if the latest version of grpc-js fixes the original bug for you?
@murgatroid99 Sorry for the delay! Got pulled into a few other things. The latest version of grpc-js looks promising (from my testing) and I'm waiting to test it in our staging/prod environments. I tested grpc-js 1.5.10 with the keepalive modification, which looks like it just delayed the leak (instead of spiking up, it slowly went up).
Sorry if I was not clear. The objective of the keepalive change was not to resolve the leak but to get more information by seeing whether it affected the details of the heap dump. However, I think the problems discovered since then are more likely to be the cause of the problem.
@murgatroid99 Gotcha, makes sense. I think you're right: the issues fixed as a result of this are likely the cause of the problems I ran into.
I am experiencing similar memory leaks with this library v1.7.2. Using it indirectly as a dependency of
Memory just keeps going up and up without being released. I'm seeing a lot of failed DNS resolutions for the IPv6 protocol.
@sasha7 Can you share more details of what you are observing? It would help to include both the heap dumps, if you have them, and the logs that are showing "a lot of failed dns resolutions for ipv6 protocol".
@murgatroid99 Yes, I can share some Datadog graphs and the live heap over the last 24h.
In the third image, the tooltip for "addTrace (channelz.js)" seems to be covering up the information about the top three contributors to the heap size. Can you say what those top three items are, or share another screenshot that shows them? The top one in particular seems to be a very large fraction of the heap.

I think I can partially explain the failed DNS requests: those addresses look like they are supposed to be an IPv6 address plus a port, but the syntax is wrong: an IPv6 address needs to be enclosed in square brackets (e.g. [2001:db8::1]:50051) when it is combined with a port.
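To make the bracket rule concrete, a small sketch with made-up addresses (the 2001:db8:: prefix is the IPv6 documentation range):

```ts
import * as grpc from '@grpc/grpc-js';

// Hypothetical addresses using the IPv6 documentation prefix 2001:db8::/32.
// Without brackets, the trailing ':50051' reads as just another group of the
// IPv6 address, so the target cannot be parsed as address + port.
const badTarget = '2001:db8::1:50051';

// Correct form: bracket the IPv6 address, then append the port.
const goodTarget = '[2001:db8::1]:50051';

const client = new grpc.Client(goodTarget, grpc.credentials.createInsecure());
```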
Problem Description
Previously, we had an issue where upgrading @grpc/grpc-js from 1.3.x to 1.5.x introduced a channelz memory leak (fixed in this issue for 1.5.10).
Upgrading to 1.5.10 locally seems to be fine and I have noticed no issues. However, when we upgraded our staging/production environments, a memory leak seems to have come back, with the only difference being the update from @grpc/grpc-js 1.3.x to 1.5.10.
Using Datadog's continuous profiler, I wasn't sure if this was the root issue, but there is definitely a growing heap.
Again, we are running a production service with a single grpc-js server that creates multiple grpc-js clients. The clients are created and destroyed using lightning-pool.
Channelz is disabled when we initialize the server/clients with 'grpc.enable_channelz': 0 (for both server and clients).

Reproduction Steps
The reproduction steps are still the same as before, except I guess this time the service is under staging/production load?
Create a single grpc-js server that calls grpc-js clients as needed from a pool resource with channelz disabled. In our case, the server is running and when requests are made, we acquire a client via the pool (factory created once as a singleton) to make a request. These should be able to handle concurrent/multiple requests.
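For illustration, a hedged sketch of that request lifecycle. The pool interface here is a deliberately generic stand-in rather than lightning-pool's actual API, and all names are placeholders:

```ts
import * as grpc from '@grpc/grpc-js';

// Generic stand-in for the pool; lightning-pool's real factory/pool API differs,
// but the acquire -> call -> release lifecycle described above is the same idea.
interface ClientPool {
  acquire(): Promise<grpc.Client>;
  release(client: grpc.Client): Promise<void>;
}

async function handleRequest(pool: ClientPool): Promise<void> {
  const client = await pool.acquire(); // pool/factory created once as a singleton
  try {
    // ... issue the unary or streaming call on the acquired client ...
  } finally {
    // Always hand the client back; the pool's destroy hook should call
    // client.close() when it eventually retires the client.
    await pool.release(client);
  }
}
```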
Environment
Additional Context
Checking out the profiler with Heap Live Size, it looks like there is a growing heap size for backoff-timeout.js, resolver-dns.js, load-balancer-child-handler.js, load-balancer-round-robin.js and channel.ts. I let it run for about 2.5 hours and I am comparing the heap profiles from the first 30 minutes and the last 30 minutes to see what has changed. When comparing with @grpc/[email protected], these look like they aren't used.
I see that 1.6.x made some updates to some timers and was wondering if that could be related.
Happy to provide more context or help as needed.
NOTE: Clarifying the graph: the problem starts and ends within the highlighted intervals. Everything else is from a different process after rolling the package back.
(Detail view of the other red section from above)