Potential memory leak in resolver-dns #2080

Closed
sam-la-compass opened this issue Apr 5, 2022 · 21 comments

@sam-la-compass

sam-la-compass commented Apr 5, 2022

Problem Description
Previously, we had an issue where upgrading @grpc/grpc-js from 1.3.x to 1.5.x introduced a channelz memory leak (fixed in this issue for 1.5.10).

Upgrading to 1.5.10 locally seems to be fine and I have noticed no issues. However, when we upgraded our staging/production environments, the memory leak seems to have come back, with the only difference being the update from @grpc/grpc-js 1.3.x to 1.5.10.

Using Datadog's continuous profiler, I couldn't confirm whether this is the root issue, but there is definitely a growing heap.

Again, we are running a production service with a single grpc-js server that creates multiple grpc-js clients. The clients are created and destroyed using lightning-pool.

Channelz is disabled when we initialize the server/clients with 'grpc.enable_channelz': 0 (for server and clients)
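
For reference, the option is passed as a channel argument on both the server and the clients, roughly like this (a minimal sketch; grpc.Client stands in for our generated client constructor and the address is made up):

import * as grpc from '@grpc/grpc-js';

// Channelz disabled on the server...
const server = new grpc.Server({ 'grpc.enable_channelz': 0 });

// ...and on each client (grpc.Client is a placeholder for our generated client).
const client = new grpc.Client('localhost:50051', grpc.credentials.createInsecure(), {
  'grpc.enable_channelz': 0,
});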

Reproduction Steps
The reproduction steps are still the same as before, except that this time the service is under staging/production load.

Create a single grpc-js server that calls grpc-js clients as needed from a pool resource with channelz disabled. In our case, the server is running and when requests are made, we acquire a client via the pool (factory created once as a singleton) to make a request. These should be able to handle concurrent/multiple requests.
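
A rough sketch of that request path, assuming lightning-pool's createPool/acquire/release API (clientFactory is the factory shown later in this thread; doSomething and the pool options are placeholders):

import { createPool } from 'lightning-pool';

// Pool built from the create/destroy/reset/validate factory shown further down.
const pool = createPool(clientFactory, { max: 10 });

async function handleRequest(request: unknown): Promise<unknown> {
  const client = await pool.acquire();        // grab a pooled grpc-js client
  try {
    return await client.doSomething(request); // placeholder RPC on our client
  } finally {
    pool.release(client);                     // hand the client back to the pool
  }
}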

Environment

  • OS Name: macOS (locally testing) and running on AWS EKS clusters (production)
  • Node Version: 14.16.0
  • Package Name and Version: @grpc/[email protected]

Additional Context
Checking out the profiler with Heap Live Size, it looks like there is a growing heap size for backoff-timeout.js, resolver-dns.js, load-balancer-child-handler.js, load-balancer-round-robin.js, and channel.ts. I let it run for about 2.5 hours, and I am comparing the heap profiles from the first 30 minutes and the last 30 minutes to see what has changed.
When comparing with @grpc/[email protected], these look like they aren't used.

I see that 1.6.x made some updates to some timers; I was wondering if that could be related.

Happy to provide more context or help as needed.

NOTE: To clarify the graph, the problem starts and ends within the highlighted intervals. Everything else is from a different process and from rolling the package back.

Screen Shot 2022-04-05 at 3 29 42 PM

(Detail view of the other red section from above)
Screen Shot 2022-04-05 at 3 41 34 PM

@murgatroid99
Member

That heap usage graph doesn't really look like it's growing to me. I see that there was more memory used in region B than region A, but it looks like the usage decreased after that so I wouldn't describe it as a leak.

In fact, I would guess region B covers one or more reconnection events. The allocation you highlighted would then be several subchannels getting allocated corresponding to the list of addresses returned by the DNS resolution, and then the dip after the end of that region would be the unused subchannels getting garbage collected after one has a connection established.

With that being said, if you still have reason to believe this is a leak, figuring this out would be a lot easier with the heap dump information like you shared in the previous issue. In that issue, the big list of ServerHttp2Session plus the whole reference chain for one of them helped a lot with narrowing down the cause of the leak. Alternatively, a self-contained test script that demonstrates a steadily growing heap even when garbage collection is forced would allow me to collect that information myself.

@sam-la-compass
Author

Thanks for getting back so quickly! The usage decreased afterwards because I rolled the change back.

I can try to gather heap dump information like I did in the previous issue; that's proving a bit more difficult with how things are currently set up -- will provide an update soon.

@murgatroid99
Member

OK, I think I may have misunderstood the profile information. I interpreted it as allocations that occurred during the highlighted time spans, but I guess it's actually all objects on the heap during that time span, with stack traces of what performed the allocation. I see now why it looks like a memory leak, especially knowing that the graph after the end of B does not profile memory of the same process.

The clients are created and destroyed using lightning-pool.

What exactly are you doing to destroy clients? You should be calling close on the clients when you are done with them.

@sam-la-compass
Author

Sorry, I updated the original so it's clearer. Yes, the affected process is within the highlighted time intervals; everything else is from a different process.

Yep, we are calling close to destroy the clients when we are done. Specifically (with some details omitted):

const clientFactory: PoolFactory<OurClient<IOurClient>> = {
  // Create a new grpc-js client with channelz disabled.
  create: () => {
    return createOurClient(CLIENT_ADDRESS, grpc.credentials.createInsecure(), {
      'grpc.enable_channelz': 0,
    });
  },
  // Close the underlying channel when the pool destroys the client.
  destroy: (client) => {
    (client as any).client.close();
  },
  reset: () => {},
  validate: () => {},
};

@sam-la-compass
Author

@murgatroid99 I set up a test environment on our non-production server (with no traffic) and can confirm growing memory. I tried to get a heap dump to analyze, but due to the infrastructure, it's a bit more limited compared to my other screenshot.
The problem wasn't as obvious since there's no traffic (we might have to work around this or wait for an update if we want traffic), but it looks like there's something going on with TimersList, Timeout, or BackoffTimeout (timer related?). Sorry that it's not as detailed as the other one.

This one is from our Datadog dashboards tracking memory. The purple line is our server running on grpc-js 1.3.8 and the blue line is the server running on grpc-js 1.5.10. The purple line has traffic, but the blue line does not. The blue line dies early because I terminated the service before it crashed on its own.
(Screenshot: Datadog memory graph comparing the two services)

This one is my attempt at getting a heap dump (very limited since I had to go through a lot of workarounds to even get it)
Screen Shot 2022-04-12 at 6 06 03 PM

Let me know if you need any more details. I'll try my best, but this is what we have to work with/around.

@murgatroid99
Member

How many of those TimersList items are in that heap dump? Is there anything else in the heap dump that repeats that many times?

If there is no traffic, what gRPC operations actually happen? Does it make any requests using the client pool?

I propose an experiment: modify your client creation code to add the option 'grpc.keepalive_time_ms': 60_000 and grab a similar heap dump.
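
For example, based on the factory you shared above, the change would look roughly like this (only the added option is new):

const clientFactory: PoolFactory<OurClient<IOurClient>> = {
  create: () => {
    return createOurClient(CLIENT_ADDRESS, grpc.credentials.createInsecure(), {
      'grpc.enable_channelz': 0,
      // Experiment: send a keepalive ping every 60 seconds.
      'grpc.keepalive_time_ms': 60_000,
    });
  },
  destroy: (client) => {
    (client as any).client.close();
  },
  reset: () => {},
  validate: () => {},
};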

@sam-la-compass
Author

In the final heap dump, there were about 1500 TimersList objects and roughly 3-4k in total across Timeout, TimersList, and BackoffTimeout. Other things that stand out are some Datadog objects, but that is probably just because we are tracing all requests, so they align with everything else. Another thing is a growing array in ChannelzCallTracker in channelz.js (unsure if this has any impact?). Otherwise, it's not obvious what else may be growing from the dumps I currently have.

I guess "no traffic" may be wrong here - it might just be very little traffic. We have some keep-alive calls every now and then for pod health, and with dd-trace running we have some of those as well. Besides whatever else is going on in the background (unsure what else), our clients are initialized on startup, so the pool may be re-creating and destroying them constantly?

I can try adding the option and getting an update.

@murgatroid99
Member

A couple of things are interesting here. I wouldn't expect to see a large number of BackoffTimeout objects, and ChannelzCallTracker doesn't own an array, so I also wouldn't expect to see a large number of those. Plus, both are defined within grpc-js, so their information might be more informative. Can you share the details of both of them?

@bartslinger
Contributor

I am experiencing a similar problem. I'm still investigating, but what I find interesting in resolver-dns.ts is that the flag this.continueResolving is never set back to false (except in destroy).

@murgatroid99
Member

@bartslinger Good catch. I don't think that would cause a memory leak, but I'll fix it anyway.

@bartslinger
Contributor

Thanks for the update. I'm still unable to reproduce my problem reliably in 1.6.5, but it definitely seems to be gone since 1.6.6.

@bartslinger
Contributor

It looks like (in 1.6.5), when this.continueResolving is set to true and never set back to false, the following happens:

startResolutionWithBackoff() ends up scheduling itself twice, resulting in a kind of exponential growth in pending resolutions (a rough sketch follows the list):

  1. Through backoff.runOnce()
  2. Through startNextResolutionTimer()
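
To illustrate, here is a simplified sketch of that scheduling pattern (not the actual grpc-js source; the timer durations and the class name are made up):

// If continueResolving stays true, every resolution attempt schedules two
// follow-up attempts, one via the backoff timer and one via the
// next-resolution timer, so the number of pending timers grows geometrically.
class DnsResolverSketch {
  private continueResolving = true; // never reset to false (except in destroy) in 1.6.5

  private startResolutionWithBackoff() {
    // Path 1: re-resolve when the backoff timer fires (backoff.runOnce() in grpc-js).
    setTimeout(() => {
      if (this.continueResolving) this.startResolutionWithBackoff();
    }, 1_000);
    // Path 2: re-resolve again via the next-resolution timer.
    this.startNextResolutionTimer();
  }

  private startNextResolutionTimer() {
    setTimeout(() => {
      if (this.continueResolving) this.startResolutionWithBackoff();
    }, 30_000);
  }

  updateResolution() {
    this.startResolutionWithBackoff();
  }
}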

@chrskrchr

I'm still unable to reproduce my problem reliably in 1.6.5, but it definitely seems to be gone since 1.6.6.

We saw a similar issue when we pushed @grpc/[email protected] out to our production environment. DNS resolver failures now caused the request to fail fast as expected per #2092, but we saw a steady increase in DNS resolutions over time that ultimately caused the process to OOM and die:

Screen Shot 2022-04-19 at 2 30 45 PM

We were able to reproduce the behavior locally (see the dnsResolverLeak branch of this test repo) by issuing a bunch of requests to a bogus address and then keeping the test server alive. We continued to see Looking up DNS hostname ... entries in the logs well after the test requests had all completed. The good news is that we're no longer able to reproduce this behavior in 1.6.6.
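
A minimal repro along those lines might look roughly like the following (a hedged sketch, not the actual test repo; the bogus target, method path, and identity serializers are placeholders):

import * as grpc from '@grpc/grpc-js';

// Hypothetical unresolvable target; any bogus hostname should do.
const client = new grpc.Client('does-not-exist.invalid:50051', grpc.credentials.createInsecure());

function callOnce(): Promise<void> {
  return new Promise((resolve) => {
    client.makeUnaryRequest(
      '/test.Test/Ping',               // placeholder method path
      (arg: Buffer) => arg,            // identity serializer
      (arg: Buffer) => arg,            // identity deserializer
      Buffer.alloc(0),
      { deadline: Date.now() + 2000 },
      () => resolve()                  // the call is expected to fail; ignore the error
    );
  });
}

async function main() {
  await Promise.all(Array.from({ length: 50 }, () => callOnce()));
  // Keep the process alive and watch the logs (GRPC_TRACE=dns_resolver,
  // GRPC_VERBOSITY=DEBUG) for continued "Looking up DNS hostname" entries.
  setInterval(() => {}, 60_000);
}

main();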

@murgatroid99 - would it be difficult to add tests around the change you made in #2098, i.e., to prevent future regressions?

@murgatroid99
Member

The requested tests have been added in #2105.

@sam-la-compass Can you check if the latest version of grpc-js fixes the original bug for you?

@sam-la-compass
Author

@murgatroid99 Sorry for the delay! Got pulled into a few other things. The latest version of grpc-js looks promising (from my testing), and I'm waiting to test it in our staging/prod environments.

I tested grpc-js 1.5.10 with the keepalive modification, which looks like it just delayed the leak (instead of spiking up, memory slowly went up).
Will update once services go onto staging/prod with the latest grpc-js.

@murgatroid99
Member

Sorry if I was not clear. The objective of the keepalive change was not to resolve the leak but to get more information by seeing whether it affected the details of the heap dump. However, I think the problems discovered since then are more likely to be the cause of the issue.

@sam-la-compass
Author

@murgatroid99 Gotcha, makes sense. I think you're right; the issues fixed as a result of this are likely the cause of the problems I ran into.
I've deployed grpc-js 1.6.6 and no longer see any memory leaks! It's been running for several hours now and things look normal (whereas before it would have spiked up already).
Thanks so much again!

@sasha7

sasha7 commented Nov 13, 2022

I am experiencing similar memory leaks with this library at v1.7.2, using it indirectly as a dependency of @google-cloud/recaptcha-enterprise:

─┬ @google-cloud/[email protected]
  └─┬ [email protected]
    └── @grpc/[email protected]

Memory just keeps going up and up without being released. I'm seeing a lot of failed DNS resolutions for the IPv6 protocol.

@murgatroid99
Member

@sasha7 Can you share more details of what you are observing? It would help to include both heap dumps if you have them and the logs that are showing "a lot of failed dns resolutions for ipv6 protocol"?

@sasha7

sasha7 commented Nov 15, 2022

@sasha7 Can you share more details of what you are observing? It would help to include both heap dumps if you have them and the logs that are showing "a lot of failed dns resolutions for ipv6 protocol"?

@murgatroid99 Yes, I can share some Datadog graphs and the live heap over the last 24h.
Screenshot 2022-11-15 at 09 11 13

Screenshot 2022-11-15 at 09 01 22

Screenshot 2022-11-15 at 09 09 06

@murgatroid99
Member

In the third image, the tooltip for "addTrace (channelz.js)" seems to be covering up the information about the top three contributors to the heap size. Can you say what those top three items are or share another screenshot that shows them? The top one in particular seems to be a very large fraction of the heap.

I think I can partially explain the failed DNS requests: those addresses look like they are supposed to be an IPv6 address plus a port, but the syntax is wrong; an IPv6 address needs to be enclosed in square brackets ([]) to use a port with it. For example, the proper syntax to represent the address in the top log is [2a00:1450:400f:801::200a]:443. You can fix that if you know what the source of those addresses is, but I am not sure why gRPC still does not treat it as an IPv6 address anyway.
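
For reference, a target with a bracketed IPv6 address and port could be passed like this (a minimal sketch using the base grpc.Client; any generated client accepts the same target string):

import * as grpc from '@grpc/grpc-js';

// The brackets disambiguate the address's own colons from the port separator.
const target = '[2a00:1450:400f:801::200a]:443';

const client = new grpc.Client(target, grpc.credentials.createSsl());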
