After upgrade from v8 to v12.16.1 heap usage becomes erratic #32737
Comments
I infer that you enforce some kind of hard memory limit? Does it keep running when you start node with --max-old-space-size? The garbage collector can grow the JS heap at will (a time/space trade-off) unless you clamp it. There is a hard-coded limit, but it's way beyond 500 MB.
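For reference (this snippet is not part of the original comment), the effective limit the GC is working against can be read directly from V8's heap statistics; --max-old-space-size lowers or raises heap_size_limit. A minimal sketch:

```js
// Minimal sketch: print V8's current heap limit and usage.
const v8 = require('v8');

const MB = 1024 * 1024;
const { heap_size_limit, total_heap_size, used_heap_size } = v8.getHeapStatistics();
console.log(`heap_size_limit: ${(heap_size_limit / MB).toFixed(1)} MB`);
console.log(`total_heap_size: ${(total_heap_size / MB).toFixed(1)} MB`);
console.log(`used_heap_size:  ${(used_heap_size / MB).toFixed(1)} MB`);
```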
We are setting up a load-test environment to see if we can reproduce this and try out solutions.
V8 might have changed how its garbage collector works. Between Node.js v8 and v12, V8 went through 14 major versions with lots of changes; at least one GC change is that it now runs on a separate thread. If you want to see how GC behaves over time you can use --trace-gc (not in production, though); it will output a summary every time the GC runs.
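(Not part of the original comment.) --trace-gc is normally passed on the command line, e.g. node --trace-gc server.js. If restarting with different arguments is inconvenient, the same V8 flag can also be toggled at runtime via the built-in v8 module; the 60-second window below is purely illustrative:

```js
// Sketch: enable GC tracing for 60 seconds at runtime, then switch it off.
const v8 = require('v8');

v8.setFlagsFromString('--trace_gc');       // start printing a summary line on every GC
setTimeout(() => {
  v8.setFlagsFromString('--notrace_gc');   // stop tracing again
}, 60 * 1000).unref();
```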
I used --trace-gc. For example, this is at 1084810 ms:
Then the next one is at 1191919 ms (with a whole lot of Scavenge runs in between):
So this took 107109 ms between sweeps. If I look around the same time in the other run:
And the next sweep:
Just 4796 ms in between. Thanks, this reinforces that we should definitely use --max-old-space-size.
I'll close this out but I can move it to nodejs/help if you have follow-up questions. Cheers.
Sure, thanks for your help.
Version: v12.16.1 (but v12.13 on production)
Platform: macOS Catalina 10.15.4 / Darwin 19.4.0 (Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64); production runs Docker with node:12.13.0-alpine
Subsystem: runtime, heap, garbage collection (?)
Description:
We recently upgraded our production servers from Docker containers running Node v8 to containers running Node v12.13 (node:12.13.0-alpine). At first all seemed fine, but then we started noticing pod restarts by Kubernetes due to OOM kills. Since the upgrade, memory usage increases over time, sometimes in steep inclines, until it reaches ~500 MB, at which point the pods are killed by Kubernetes. For example:
In the image above you can see 'staircase' increases in pod memory usage; these are multiple instances of the same service. It turns out that our other services also show increasing memory usage, though this depends on the number of requests they serve.
What steps will reproduce the bug?
To find out what is going on, I did the following: I recorded heapUsed over time under the same load on two Node versions (a sketch of such a logger is shown after the two runs below).
v12.16.1:
v13.12.0:
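(Not from the original report.) Below is a minimal sketch of the kind of heapUsed logging assumed above; the file name heap-usage.csv and the 10-second interval are hypothetical choices for illustration:

```js
// heap-log.js -- hypothetical helper, not the reporter's actual test script.
// Appends process memory figures every 10 seconds so heapUsed can be plotted over time.
const fs = require('fs');

const MB = 1024 * 1024;
const out = fs.createWriteStream('heap-usage.csv', { flags: 'a' });
out.write('timestamp_ms,rss_mb,heap_total_mb,heap_used_mb\n');

setInterval(() => {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  out.write([
    Date.now(),
    (rss / MB).toFixed(1),
    (heapTotal / MB).toFixed(1),
    (heapUsed / MB).toFixed(1),
  ].join(',') + '\n');
}, 10000).unref(); // unref() so the logger never keeps an otherwise idle process alive
```

Running this alongside the load test in each Node version and plotting heap_used_mb against timestamp_ms should produce curves comparable to the ones described.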
Observations:
In v12.16.1, heapUsed is quite stable at first, but then it suddenly starts spiking even though the load remains the same.
In v13.12.0 I do not see similar behaviour; heapUsed remains quite stable and well below the 500 MB mark.
I hope the fact that this shows up in one Node.js version but not the other demonstrates that it is not a leak or erratic behaviour in our own code.
How often does it reproduce? Is there a required condition?
On production we see it on all services upgraded to v12, and after each restart the pattern repeats. I can reproduce the graphs using the method described above.
What is the expected behavior?
A stable heapUsed like in v13, and no spikes in memory usage causing OOM kills.
What do you see instead?
On production we see staircase behaviour; in the test we see huge spikes and erratic heapUsed.
Additional information
I am seeking guidance on where to go from here. Since v12 is an LTS version, the behaviour above could affect more people, and v13 should not be used in production.
Our service is quite complicated at the moment (gRPC + Redis + DynamoDB + Zipkin + Prometheus metrics), but I could try removing these dependencies one by one until I have a small reproduction.
On the other hand, someone might already recognise this pattern and want to try a workaround/patch or inspect a heap dump first.
Please let me know how I can help further,
Kind regards,
Wilfred