grpc-js 1.4.1 - Channelz tracing uses unbounded memory which causes OOM #1941
It's not supposed to do traces on a per-request basis. The traces are all about connection state, which shouldn't change more quickly when there are more requests. Because of that, if anything, caching the client should decrease the number of traces that occur, so I wouldn't recommend changing that.

It might help to see what information is actually getting stored in channelz in your case. You can run a channelz server with code like this:

```js
const grpc = require('@grpc/grpc-js');

const server = new grpc.Server();
grpc.addAdminServicesToServer(server);
// Modify the first argument to suit your environment
server.bindAsync('localhost:0', grpc.ServerCredentials.createInsecure(), (error, port) => {
  console.log(`Serving channelz on port ${port}`);
  server.start();
});
```

Then you can access it with any channelz client pointed at that port. In the meantime, I will push out an update that rotates out old channelz trace events on a per-channel/subchannel/server basis.
Thank you for the swift and helpful response! We have had issues with pushing too many concurrent requests down the same channel, which seems to (at least implicitly) affect channel status, and we also force round-robin on - could that be adding to the volume of trace events? I will find some time to inspect the channelz logs.
I would be interested to see a trace log for one of your tests, which can be output by setting the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.
This took a little longer than I expected, but grpc-js version 1.4.2 is out with two relevant changes: each channel, subchannel, and server now stores only a limited number of trace events at any one time, and channelz stats tracking can be disabled entirely for a channel and its subchannels, or for a server, with the channel option 'grpc.enable_channelz' set to 0.
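For example, a client constructed directly with grpc.Client can pass the option in its third constructor argument. This is a minimal sketch; the target address is a placeholder, and the option key is the 'grpc.enable_channelz' name mentioned above:

```js
const grpc = require('@grpc/grpc-js');

// The third constructor argument is the channel options object; setting
// 'grpc.enable_channelz' to 0 disables channelz stats tracking for this
// channel and its subchannels.
const client = new grpc.Client(
  'my.backend.example.com:443', // placeholder target
  grpc.credentials.createSsl(),
  { 'grpc.enable_channelz': 0 }
);
```

Generated service clients extend grpc.Client and take the same options object as their third constructor argument, so the same key can be passed wherever the client is created; the Server constructor accepts an options object as well.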
If that change does solve the problem, it would still help to see a log to understand what caused the problem in the first place.
Posting this here as a potential FYI, as it may be a bug with the Google Ads API (or us messing something up), but we see a difference in behaviour between setting the channelz-disabling channel option and not setting it.
No, now that you pointed it out, I found a bug with the handling of that option in grpc-js. I will fix that. In the meantime, without that option, do you still see the unbounded memory growth?
We are still seeing significant additional memory usage over the baseline (so far about 2.5x memory usage, from 800MiB to 2GiB), but we do now utilise a static pool of grpc clients, so there may be many multiples of channels and subchannels logging. I'm waiting to see whether it tops out or keeps growing in prod, but we may still hit our memory limit, so bear with me on that. I'll try to get a sample of some logs in the meantime.
The bug with that option is fixed in version 1.4.3.
Hi, we're in the same boat, but we also see an increase in memory with 1.4.3 (also Node 16). Our case is an Express server that serves GraphQL and calls an internal grpc service for each GraphQL request. I tried specifying the channel option as well. @murgatroid99, what logs would be helpful for you?
@murgatroid99 I am casting around in the dark, so I really apologise if this is way off and a distraction, but is this line the culprit? grpc-node/packages/grpc-js/src/subchannel.ts, line 558 (commit cb29b6a).
There's a guarded invocation in the same function that assigns the output to a ref, but then this unguarded invocation at the end of the function does nothing with the output. I'm wondering if the unguarded version should be there at all?
I don't seem to get a memory leak if I comment out the unguarded line when channelz is disabled via the channel option.
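To illustrate what I mean, here is a hypothetical stand-in sketch of the pattern (not the actual grpc-js source): the guarded call keeps its handle so the entry can be removed later, while the unguarded call registers an entry whose handle is discarded, so it can never be unregistered.

```js
// Hypothetical stand-in for the channelz registration described above;
// this is not the real subchannel.ts / channelz code.
const channelzRegistry = new Map();
let nextId = 0;

function registerChannelzSocket(info) {
  const id = nextId++;
  channelzRegistry.set(id, info);
  return id; // the "ref" a caller needs in order to unregister the entry later
}

function createSession(channelzEnabled) {
  let channelzSocketRef = null;

  if (channelzEnabled) {
    // Guarded call: the ref is kept, so the entry can be removed later.
    channelzSocketRef = registerChannelzSocket({ kind: 'socket' });
  }

  // ...connection setup would happen here...

  // Unguarded call at the end of the function: it runs even when channelz is
  // disabled, and its return value is discarded, so this entry can never be
  // unregistered. That leaks one entry per connection attempt.
  registerChannelzSocket({ kind: 'socket' });

  return channelzSocketRef;
}

// Every "connection attempt" grows the registry by one orphaned entry:
for (let i = 0; i < 3; i++) createSession(false);
console.log(channelzRegistry.size); // 3, even though channelz was disabled
```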
You're right, that line shouldn't be there at all, which explains why I missed it when adding the guards. Its output never gets recorded, so it never gets unregistered. I have published a new version, 1.4.4, that removes it.

@Gilwe Even with the extra line mentioned in the last comment, setting the channel option to disable channelz should have had a significant impact on the rate of the memory leak if you set the option.

If you still have problems with the new version, as stated in my earlier comment, the logs that would help here are client trace logs, which you can get by running your code with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.
1.4.4 has completely fixed the memory leak for me with channelz tracing disabled, thank you! Not sure if the above was to me or @Gilwe, but I had been setting the option.

In terms of the actual channelz tracing itself, in our case it looks like it might be frequent TRANSIENT_FAILURE state transitions that fill up the logs. This is a set of entries from a completely idle application I left running by accident on my local machine just now, with no traffic going through it:

Is that expected behaviour? From my side, given that we no longer experience a memory leak and we have the option of disabling channelz tracing, I consider my issue completely resolved - but I'm more than happy to continue to contribute logs or testing if there is more that can be done. Thank you again for your help @murgatroid99
It looks like the server is telling the client to go away after the connection is idle for 75 seconds. The client behavior when handling that is normal, specifically for the round_robin LB policy you have enabled.

However, the total channelz memory usage should be bounded for the duration of that log. There's no mention of any channel construction, so the number of channels is constant. I see 4 instances of subchannels getting constructed, and 4 instances of them getting removed (when the refcount goes to 0), so the total number of subchannels is approximately constant. Each subchannel should register a socket with channelz when it transitions to CONNECTING and unregister it when it transitions to TRANSIENT_FAILURE or IDLE, so the number of registered sockets should be constant as well. And with the changes in 1.4.3, the number of log statements retained for each is strictly limited.

Have you checked whether you still see the memory leak with channelz enabled as of 1.4.4, and if you do, can you share a log with channelz enabled over a timespan in which memory significantly increases?
Problem description
Up front: I wouldn't necessarily call this a bug, but it has caused an issue in our code.
We use a single grpc-js client to service calls to the Google Ads API. It is instantiated when the server is initialized and used throughout the life of the program to try to reduce memory leakage. This works quite well, and memory use is relatively stable, but when grpc-js is upgraded to 1.4.1 (from 1.3.7), memory usage starts to increase linearly even though our implementation is unchanged. This seems to be caused by Date objects (and others) being created when events are appended to the ChannelzTrace object associated with our channel.
This might otherwise be a small amount of memory, but we service a very large amount of traffic through this server, and it eventually causes an OOM event.
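As a rough illustration of how that adds up (hypothetical code, not the actual ChannelzTrace implementation): if every connectivity event appends an object holding a Date to a per-channel array that is never trimmed, a long-lived cached channel accumulates them for the life of the process.

```js
// Hypothetical illustration only; not the real ChannelzTrace class.
class UnboundedTrace {
  constructor() {
    this.events = [];
  }

  addTraceEvent(severity, description) {
    // Each event allocates a Date (plus the wrapping object) and is retained
    // forever, so a long-lived cached channel grows without bound.
    this.events.push({ timestamp: new Date(), severity, description });
  }
}

const trace = new UnboundedTrace();
trace.addTraceEvent('CT_INFO', 'Connectivity state change to CONNECTING');
trace.addTraceEvent('CT_INFO', 'Connectivity state change to READY');
console.log(trace.events.length); // keeps climbing over the process lifetime
```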
Should we cease caching our grpc client? What would be the recommended lifecycle of a client in that case? Or is there a way to opt out of tracing?
Reproduction steps
Create a single grpc Client object, fire traffic at it at a rate of 5 requests per second for 10 minutes, and observe memory usage climbing without reaching a cap.
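A minimal sketch of such a reproduction, assuming a hypothetical echo.proto with an echo.Echo service exposing an Echo method, and a reachable test server on localhost:50051 (all placeholders for whatever client and backend you actually use):

```js
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// 'echo.proto', the echo package, and the server address are placeholders;
// substitute whatever generated client and backend you actually use.
const pkgDef = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(pkgDef);

// One long-lived, cached client, as in the setup described above.
const client = new proto.echo.Echo('localhost:50051', grpc.credentials.createInsecure());

// Fire roughly 5 requests per second.
setInterval(() => {
  client.Echo({ message: 'ping' }, (err) => {
    if (err) console.error(err.code, err.details);
  });
}, 200);

// Log heap usage every 30 seconds to observe the trend.
setInterval(() => {
  const mib = process.memoryUsage().heapUsed / (1024 * 1024);
  console.log(`heapUsed: ${mib.toFixed(1)} MiB`);
}, 30000);
```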
Environment
Additional context
I'm not sure what would be most helpful but I am happy to provide anything required.