.NET Core applications get oom killed on Kubernetes/OpenShift #10739
Comments
@tmds thank you for the investigation! I just got back from my vacation and I'll look into fixing it soon.
@janvorli I am also back from vacation :) If you want, you can assign the issue to me.
@tmds thank you, I'm gladly accepting your offer :-)
@janvorli can we backport this to 2.1? Should I do a PR targeting …?
Yes please @tmds! And if I may be so bold, I'm looking for assistance here: https://stackoverflow.com/questions/51983312/net-core-on-linux-lldb-sos-plugin-diagnosing-memory-issue. Would this be a worthy issue on dotnet/coreclr just yet?
PR to backport to 2.1: dotnet/coreclr#19650. @kierenj, yes, you can create an issue for that in the coreclr repo.
Excellent, this will be great for me in 2.1. On that issue: in fact I was using 2.0, and memory usage is way, way down on 2.1, so no need there. Thank you!
I tested the recent PR for 2.1 (19650) in our application and saw a significant memory use reduction. The charts here are from Amazon ECS and are relative to the soft memory limit of 384MB (which is why they can show more than 100%). The hard memory limit for the cgroup is 1024MB.

Background memory use has remained stable at around 300MB for the last 12h or so, compared to around 420MB for the unpatched application. The difference is more pronounced under load, where in production we are regularly bouncing close to the 2048MB cgroup limit at the moment (we do significant logging and other I/O, so roughly half our production memory use is page cache).

For ages we thought we had a memory leak, but after scratching our heads for some time trying to find one, I finally found this ticket, which seems to fix our issue. 🍾 🎆 Thanks very much for your work! 👍 👍 👍
Issue description

We have been investigating why .NET Core applications are killed by OpenShift because they exceed their assigned memory.
OpenShift/Kubernetes informs the app of its memory limit via the sysfs `limit_in_bytes` file. This is detected by .NET Core: https://github.com/dotnet/coreclr/blob/08d39ddf02c81c99bd49c19b808c855235cbabdc/src/pal/src/misc/cgroup.cpp#L25
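For illustration, here is a minimal sketch of reading that limit, assuming cgroup v1 with the memory controller mounted at the conventional /sys/fs/cgroup/memory path (the actual coreclr code resolves the mount point and cgroup path from /proc/mounts and /proc/self/cgroup rather than hard-coding them):

```cpp
#include <cstdio>
#include <cinttypes>

// Read the cgroup v1 memory limit. A value close to UINT64_MAX means
// no limit was set for this cgroup.
static bool ReadCGroupMemoryLimit(uint64_t* limit)
{
    FILE* file = fopen("/sys/fs/cgroup/memory/memory.limit_in_bytes", "r");
    if (file == nullptr)
        return false;

    bool result = fscanf(file, "%" SCNu64, limit) == 1;
    fclose(file);
    return result;
}

int main()
{
    uint64_t limit;
    if (ReadCGroupMemoryLimit(&limit))
        printf("cgroup memory limit: %" PRIu64 " bytes\n", limit);
    else
        printf("no cgroup v1 memory limit found\n");
    return 0;
}
```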
Memory is then monitored by the OOM killer based on the sysfs `usage_in_bytes` file. .NET Core is using `statm` for this: https://github.com/dotnet/coreclr/blob/08d39ddf02c81c99bd49c19b808c855235cbabdc/src/pal/src/misc/cgroup.cpp#L24

`usage_in_bytes` includes RSS and CACHE, while `statm` is only RSS. So memory in the cache is a reason to get OOM killed, but .NET Core doesn't use it to detect when to do a GC.
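To make the mismatch concrete, here is a small sketch (not coreclr code) that prints both numbers side by side. It assumes a single-process container with cgroup v1 mounted at /sys/fs/cgroup/memory; `usage_in_bytes` counts RSS plus page cache for the whole cgroup, while the second field of /proc/self/statm is resident pages only:

```cpp
#include <cstdio>
#include <cinttypes>
#include <unistd.h>

// Cgroup-level usage: RSS + page cache for every process in the cgroup.
static uint64_t ReadCGroupUsage()
{
    uint64_t usage = 0;
    FILE* file = fopen("/sys/fs/cgroup/memory/memory.usage_in_bytes", "r");
    if (file != nullptr)
    {
        if (fscanf(file, "%" SCNu64, &usage) != 1)
            usage = 0;
        fclose(file);
    }
    return usage;
}

// Process-level RSS: /proc/self/statm reports pages; the second field
// is the resident set size.
static uint64_t ReadStatmRss()
{
    uint64_t sizePages = 0, residentPages = 0;
    FILE* file = fopen("/proc/self/statm", "r");
    if (file != nullptr)
    {
        if (fscanf(file, "%" SCNu64 " %" SCNu64, &sizePages, &residentPages) != 2)
            residentPages = 0;
        fclose(file);
    }
    return residentPages * (uint64_t)sysconf(_SC_PAGESIZE);
}

int main()
{
    printf("usage_in_bytes (RSS + cache): %" PRIu64 "\n", ReadCGroupUsage());
    printf("statm RSS only:               %" PRIu64 "\n", ReadStatmRss());
    return 0;
}
```

Under I/O-heavy workloads the first number can sit far above the second, which is exactly the gap the OOM killer sees but the GC does not.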
We should change the implementation so it is also aware of `usage_in_bytes` when measuring the memory load of the system, as sketched below.

CC @janvorli
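As a hypothetical sketch of what that change could look like (not the actual coreclr implementation), the memory load could be derived from the cgroup's own accounting, so that cached pages count toward the load and trigger a GC before the OOM killer steps in:

```cpp
#include <cstdio>
#include <cinttypes>

// Helper: read a single unsigned 64-bit value from a sysfs/procfs file.
static bool ReadUInt64File(const char* path, uint64_t* value)
{
    FILE* file = fopen(path, "r");
    if (file == nullptr)
        return false;
    bool ok = fscanf(file, "%" SCNu64, value) == 1;
    fclose(file);
    return ok;
}

// Memory load as a percentage of the cgroup limit, computed from
// usage_in_bytes (RSS + cache) instead of statm (RSS only).
// Returns -1 when no cgroup limit is available, in which case a
// runtime would fall back to its existing statm-based measurement.
static int GetCGroupMemoryLoad()
{
    uint64_t usage, limit;
    if (!ReadUInt64File("/sys/fs/cgroup/memory/memory.usage_in_bytes", &usage) ||
        !ReadUInt64File("/sys/fs/cgroup/memory/memory.limit_in_bytes", &limit) ||
        limit == 0)
        return -1;
    return (int)(usage * 100 / limit);
}

int main()
{
    printf("memory load: %d%%\n", GetCGroupMemoryLoad());
    return 0;
}
```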