.NET Core applications get oom killed on Kubernetes/OpenShift #10739
Comments
@tmds thank you for the investigation! I just got back from my vacation and I'll look into fixing it soon.
@janvorli I am also back from vacation :) If you want, you can assign the issue to me.
@tmds thank you, I'm gladly accepting your offer :-)
@janvorli can we backport this to 2.1? Should I do a PR targeting …?
Yes please @tmds! And if I may be so bold, I'm looking for assistance here: https://stackoverflow.com/questions/51983312/net-core-on-linux-lldb-sos-plugin-diagnosing-memory-issue. Would this be a worthy issue on dotnet/coreclr just yet?
PR to backport to 2.1: dotnet/coreclr#19650. @kierenj, yes, you can create an issue for that in the coreclr repo.
Excellent, this will be great for me in 2.1. On that issue: in fact I was using 2.0, and memory usage is way, way down on 2.1, so no need there. Thank you!
I tested the recent PR for 2.1 (19650) in our application and saw a significant memory use reduction. The charts here are from Amazon ECS and are relative to the soft memory limit of 384MB (which is why they can show more than 100%). The hard memory limit for the cgroup is 1024MB.

Background memory use has remained stable at around 300MB for the last 12h or so, compared to around 420MB for the unpatched application. The difference is more pronounced under load, where in production we are regularly bouncing close to the 2048MB cgroup limit at the moment (we do significant logging and other I/O, so roughly half our production memory use is page cache).

For ages we thought we had a memory leak, but after scratching our heads for some time trying to find one, I finally found this ticket, which seems to fix our issue. 🍾 🎆 Thanks very much for your work! 👍 👍 👍
Issue description

We have been investigating why .NET Core applications are killed by OpenShift because they exceed their assigned memory.
OpenShift/Kubernetes informs the app of its memory limit via the sysfs `limit_in_bytes` file. This is detected by .NET Core: https://github.com/dotnet/coreclr/blob/08d39ddf02c81c99bd49c19b808c855235cbabdc/src/pal/src/misc/cgroup.cpp#L25
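For illustration, here is a minimal sketch of reading that limit, assuming cgroup v1 with the memory controller mounted at the conventional /sys/fs/cgroup/memory path (the actual coreclr code resolves the mount point and cgroup path from /proc/mounts and /proc/self/cgroup rather than hard-coding them):

```cpp
#include <cstdio>
#include <cinttypes>

// Read the cgroup v1 memory limit. A value close to UINT64_MAX means
// no limit was set for this cgroup.
static bool ReadCGroupMemoryLimit(uint64_t* limit)
{
    FILE* file = fopen("/sys/fs/cgroup/memory/memory.limit_in_bytes", "r");
    if (file == nullptr)
        return false;

    bool result = fscanf(file, "%" SCNu64, limit) == 1;
    fclose(file);
    return result;
}

int main()
{
    uint64_t limit;
    if (ReadCGroupMemoryLimit(&limit))
        printf("cgroup memory limit: %" PRIu64 " bytes\n", limit);
    else
        printf("no cgroup v1 memory limit found\n");
    return 0;
}
```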
Memory is then monitored by the OOM killer based on the sysfs `usage_in_bytes` file. .NET Core is using `statm` for this: https://github.com/dotnet/coreclr/blob/08d39ddf02c81c99bd49c19b808c855235cbabdc/src/pal/src/misc/cgroup.cpp#L24

`usage_in_bytes` includes RSS and CACHE, while `statm` is only RSS. So memory in the cache is a reason to get OOM killed, but .NET Core doesn't use it to detect when to do a GC.
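To make the mismatch concrete, here is a small sketch (not coreclr code) that prints both numbers side by side. It assumes a single-process container with cgroup v1 mounted at /sys/fs/cgroup/memory; `usage_in_bytes` counts RSS plus page cache for the whole cgroup, while the second field of /proc/self/statm is resident pages only:

```cpp
#include <cstdio>
#include <cinttypes>
#include <unistd.h>

// Cgroup-level usage: RSS + page cache for every process in the cgroup.
static uint64_t ReadCGroupUsage()
{
    uint64_t usage = 0;
    FILE* file = fopen("/sys/fs/cgroup/memory/memory.usage_in_bytes", "r");
    if (file != nullptr)
    {
        if (fscanf(file, "%" SCNu64, &usage) != 1)
            usage = 0;
        fclose(file);
    }
    return usage;
}

// Process-level RSS: /proc/self/statm reports pages; the second field
// is the resident set size.
static uint64_t ReadStatmRss()
{
    uint64_t sizePages = 0, residentPages = 0;
    FILE* file = fopen("/proc/self/statm", "r");
    if (file != nullptr)
    {
        if (fscanf(file, "%" SCNu64 " %" SCNu64, &sizePages, &residentPages) != 2)
            residentPages = 0;
        fclose(file);
    }
    return residentPages * (uint64_t)sysconf(_SC_PAGESIZE);
}

int main()
{
    printf("usage_in_bytes (RSS + cache): %" PRIu64 "\n", ReadCGroupUsage());
    printf("statm RSS only:               %" PRIu64 "\n", ReadStatmRss());
    return 0;
}
```

Under I/O-heavy workloads the first number can sit far above the second, which is exactly the gap the OOM killer sees but the GC does not.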
We should change the implementation so it is also aware of `usage_in_bytes` when measuring the memory load of the system, as sketched below.

CC @janvorli
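As a hypothetical sketch of what that change could look like (not the actual coreclr implementation), the memory load could be derived from the cgroup's own accounting, so that cached pages count toward the load and trigger a GC before the OOM killer steps in:

```cpp
#include <cstdio>
#include <cinttypes>

// Helper: read a single unsigned 64-bit value from a sysfs/procfs file.
static bool ReadUInt64File(const char* path, uint64_t* value)
{
    FILE* file = fopen(path, "r");
    if (file == nullptr)
        return false;
    bool ok = fscanf(file, "%" SCNu64, value) == 1;
    fclose(file);
    return ok;
}

// Memory load as a percentage of the cgroup limit, computed from
// usage_in_bytes (RSS + cache) instead of statm (RSS only).
// Returns -1 when no cgroup limit is available, in which case a
// runtime would fall back to its existing statm-based measurement.
static int GetCGroupMemoryLoad()
{
    uint64_t usage, limit;
    if (!ReadUInt64File("/sys/fs/cgroup/memory/memory.usage_in_bytes", &usage) ||
        !ReadUInt64File("/sys/fs/cgroup/memory/memory.limit_in_bytes", &limit) ||
        limit == 0)
        return -1;
    return (int)(usage * 100 / limit);
}

int main()
{
    printf("memory load: %d%%\n", GetCGroupMemoryLoad());
    return 0;
}
```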