Potential memory leak in resolver-dns #2080
Comments
That heap usage graph doesn't really look like it's growing to me. I see that there was more memory used in region B than region A, but it looks like the usage decreased after that so I wouldn't describe it as a leak. In fact, I would guess region B covers one or more reconnection events. The allocation you highlighted would then be several subchannels getting allocated corresponding to the list of addresses returned by the DNS resolution, and then the dip after the end of that region would be the unused subchannels getting garbage collected after one has a connection established.

With that being said, if you still have reason to believe this is a leak, figuring this out would be a lot easier with the heap dump information like you shared in the previous issue. In that issue, the big list of
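One way to check whether region B really lines up with reconnection and re-resolution activity is grpc-js's built-in tracing, which is controlled by environment variables. A minimal sketch, assuming the variables are visible before grpc-js is first loaded; the `node server.js` invocation is a placeholder for however the affected service is actually started:

```ts
// Enable grpc-js's built-in debug tracing for DNS resolution and subchannel
// lifecycle events. The variables are read when grpc-js is loaded, so the
// safest place to set them is the environment that launches the process:
//
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=dns_resolver,subchannel node server.js
//
// Setting them in code also works, as long as it happens before the first
// require of @grpc/grpc-js:
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'dns_resolver,subchannel';

const grpc = require('@grpc/grpc-js'); // loaded after the env vars are set
```

Bursts of subchannel creation in the trace output around region B would support the reconnection explanation.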
Thanks for getting back so quickly! The usage decreased afterwards because I rolled the change back after this. I can try to gather heap dump information like I did in the previous issue; that's proving a bit more difficult with how things are currently set up -- will provide an update soon.
OK, I think I may have misunderstood the profile information. I interpreted it as allocations that occurred during the highlighted time spans, but I guess it's actually all objects on the heap during that time span, with stack traces of what performed the allocation. I see now why it looks like a memory leak, especially knowing that the graph after the end of B does not profile memory of the same process.
What exactly are you doing to destroy clients? You should be calling close() on them.
Sorry, I updated the original so it's clearer. Yes, the affected process is within the highlighted time intervals; everything else is from a different process. Yep, we are calling close to destroy the clients when we are done. Specifically (with details commented out):
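The snippet originally posted here did not survive in this copy of the thread. As a stand-in, here is a hedged sketch of the pattern being described: a pooled-client factory whose destroy step calls close(). The target address and option values are placeholders, not the reporter's real configuration:

```ts
import * as grpc from '@grpc/grpc-js';

// Illustrative stand-in for the omitted snippet: a pooled-client factory whose
// destroy step calls close(). Target address and options are placeholders.
function createClient(): grpc.Client {
  return new grpc.Client(
    'my-service.internal:50051',       // hypothetical target
    grpc.credentials.createInsecure(),
    { 'grpc.enable_channelz': 0 }      // option the reporter mentions in the issue body
  );
}

function destroyClient(client: grpc.Client): void {
  // Releases the client's channel (and its subchannels); skipping this is a
  // common way for pooled clients to pile up on the heap.
  client.close();
}
```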
@murgatroid99 I set up a test environment on our non-production server (with no traffic) and can confirm that memory is growing. I tried to get a heap dump to analyze, but due to the infrastructure it's a bit more limited compared to my other screenshot. This one is from our Datadog dashboards tracking memory. The purple line is our server running on grpc-js 1.3.8 and the blue line is the server running on grpc-js 1.5.10. The purple line has traffic, but the blue line does not. The blue line dies early because I terminated the service before it crashed on its own. This one is my attempt at getting a heap dump (very limited, since I had to go through a lot of workarounds to even get it). Let me know if you need any more details; I'll try my best, but this is what we have to work with/work around.
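When a full profiler setup is hard to wire into the infrastructure, Node's built-in v8.writeHeapSnapshot() can produce a snapshot that loads in Chrome DevTools. A minimal sketch, assuming it is acceptable to add a signal handler and write the file to the container's filesystem:

```ts
import { writeHeapSnapshot } from 'v8';

// Write a .heapsnapshot file (openable in Chrome DevTools > Memory) whenever
// the process receives SIGUSR2, e.g. `kill -USR2 <pid>` inside the pod.
process.on('SIGUSR2', () => {
  const file = writeHeapSnapshot(); // defaults to a timestamped Heap-*.heapsnapshot in cwd
  console.log(`Heap snapshot written to ${file}`);
});
```

Recent Node releases can do the same without a code change via the --heapsnapshot-signal=SIGUSR2 command-line flag.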
How many of those were there? If there is no traffic, what gRPC operations actually happen? Does it make any requests using the client pool? I propose an experiment: modify your client creation code to add the option
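The name of the suggested option was cut off in this copy of the thread; given the later references to a "keepalive modification", it was presumably one of the keepalive channel arguments. A hedged sketch of how such options are passed when constructing a client, with arbitrary values and a placeholder target rather than the option actually requested:

```ts
import * as grpc from '@grpc/grpc-js';

// Standard keepalive-related channel arguments in grpc-js; which one was
// actually suggested in the comment above is not recoverable from this copy.
const client = new grpc.Client(
  'my-service.internal:50051',                // hypothetical target
  grpc.credentials.createInsecure(),
  {
    'grpc.keepalive_time_ms': 20_000,         // send a keepalive ping every 20s
    'grpc.keepalive_timeout_ms': 10_000,      // wait up to 10s for the ping ack
    'grpc.keepalive_permit_without_calls': 1, // ping even with no active calls
  }
);
```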
In the final heap dump, there were about 1500 of them. I guess "no traffic" may be wrong here - it might just be very little traffic. We have some keepalive calls every now and then for pod health, and with dd-trace running we have some of those as well. Besides whatever else is going on in the background (unsure what else), our clients are initialized on startup, so it may be re-creating and destroying them constantly? I can try adding the option and getting an update.
A couple of things are interesting here. I wouldn't expect to see a large number of
I am experiencing a similar problem. I'm still investigating, but what I find interesting in resolver-dns.ts is that the flag
@bartslinger Good catch. I don't think that would cause a memory leak, but I'll fix it anyway.
Thanks for the update. I'm still unable to reproduce my problem reliably in 1.6.5, but it definitely seems to be gone since 1.6.6.
It looks like (in 1.6.5) when
We saw a similar issue when we pushed the
We were able to reproduce the behavior locally (see the
@murgatroid99 - would it be difficult to add tests around the change you made in #2098? i.e., to prevent future regressions?
The requested tests have been added in #2105. @sam-la-compass Can you check if the latest version of grpc-js fixes the original bug for you?
@murgatroid99 Sorry for the delay! Got pulled into a few other things. The latest version of grpc-js looks promising (from my testing) and I'm waiting to test it in our staging/prod environments. I tested grpc-js 1.5.10 with the keepalive modification, which looks like it just delayed the leak (instead of spiking up, it slowly went up).
Sorry if I was not clear. The objective of the keepalive change was not to resolve the leak but to get more information by seeing whether it affected the details of the heap dump. However, I think the problems discovered since then are more likely to be the cause of the problem.
@murgatroid99 Gotcha, makes sense. I think you're right: the issues fixed as a result of this are likely the cause of the problems I ran into.
I am experiencing similar memory leaks with this library v1.7.2. Using it indirectly as a dependency of
Memory just keeps going up and up without being released. I'm seeing a lot of failed DNS resolutions for the IPv6 protocol.
@sasha7 Can you share more details of what you are observing? It would help to include both the heap dumps, if you have them, and the logs that are showing "a lot of failed dns resolutions for ipv6 protocol".
@murgatroid99 Yes, I can share some Datadog graphs and the live heap over the last 24h.
In the third image, the tooltip for "addTrace (channelz.js)" seems to be covering up the information about the top three contributors to the heap size. Can you say what those top three items are, or share another screenshot that shows them? The top one in particular seems to be a very large fraction of the heap.

I think I can partially explain the failed DNS requests: those addresses look like they are supposed to be an IPv6 address plus a port, but the syntax is wrong: an IPv6 address needs to be enclosed in square brackets (e.g. [2001:db8::1]:50051) when it is combined with a port.
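To make the bracket rule concrete, a small sketch with made-up addresses (the 2001:db8:: prefix is the IPv6 documentation range):

```ts
import * as grpc from '@grpc/grpc-js';

// Hypothetical addresses using the IPv6 documentation prefix 2001:db8::/32.
// Without brackets, the trailing ':50051' reads as just another group of the
// IPv6 address, so the target cannot be parsed as address + port.
const badTarget = '2001:db8::1:50051';

// Correct form: bracket the IPv6 address, then append the port.
const goodTarget = '[2001:db8::1]:50051';

const client = new grpc.Client(goodTarget, grpc.credentials.createInsecure());
```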
Problem Description
Previously, we had an issue where upgrading @grpc/grpc-js from 1.3.x to 1.5.x introduced a channelz memory leak (fixed in this issue for 1.5.10).
Upgrading to 1.5.10 locally seems to be fine and I have noticed no issues. However, when we upgraded our staging/production environments, a memory leak seems to have come back, with the only difference being the update from @grpc/grpc-js 1.3.x to 1.5.10.
Using Datadog's continuous profiler, I wasn't sure if this was the root issue, but there is definitely a growing heap.
Again, we are running a production service with a single grpc-js server that creates multiple grpc-js clients. The clients are created and destroyed using lightning-pool.
Channelz is disabled when we initialize the server/clients with 'grpc.enable_channelz': 0 (for both server and clients).

Reproduction Steps
The reproduction steps are still the same as before, except I guess this time the service is under staging/production load?
Create a single grpc-js server that calls grpc-js clients as needed from a pool resource with channelz disabled. In our case, the server is running and when requests are made, we acquire a client via the pool (factory created once as a singleton) to make a request. These should be able to handle concurrent/multiple requests.
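For illustration, a hedged sketch of that request lifecycle. The pool interface here is a deliberately generic stand-in rather than lightning-pool's actual API, and all names are placeholders:

```ts
import * as grpc from '@grpc/grpc-js';

// Generic stand-in for the pool; lightning-pool's real factory/pool API differs,
// but the acquire -> call -> release lifecycle described above is the same idea.
interface ClientPool {
  acquire(): Promise<grpc.Client>;
  release(client: grpc.Client): Promise<void>;
}

async function handleRequest(pool: ClientPool): Promise<void> {
  const client = await pool.acquire(); // pool/factory created once as a singleton
  try {
    // ... issue the unary or streaming call on the acquired client ...
  } finally {
    // Always hand the client back; the pool's destroy hook should call
    // client.close() when it eventually retires the client.
    await pool.release(client);
  }
}
```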
Environment
Additional Context
Checking out the profiler with Heap Live Size, it looks like there is a growing heap size for backoff-timeout.js, resolver-dns.js, load-balancer-child-handler.js, load-balancer-round-robin.js and channel.ts. I let it run for about 2.5 hours and I am comparing the heap profiles from the first 30 minutes and the last 30 minutes to see what has changed. When comparing with @grpc/[email protected], these look like they aren't used.
I see that 1.6.x made some updates to some timers and was wondering if that could be related.
Happy to provide more context or help as needed.
NOTE: Clarifying the graph: the problem starts and ends within the highlighted intervals. Everything else is from a different process after rolling the package back.
(Detail view of the other red section from above)