Process keeps getting killed for violating docker cgroup mem constraint with server GC #852
A similar, but very simple, case would be to run the following code in Docker with an 11 MB memory constraint:
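The snippet itself did not survive in this copy of the thread; a stand-in sketch of that kind of small allocation loop (buffer sizes and counts are illustrative, not the original code) would be:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Keep only a few MB of buffers alive at a time (well under the
        // 11 MB cgroup limit) and drop them periodically so they become
        // garbage; per the report above, server GC still lets the container
        // hit the limit while workstation GC does not.
        var buffers = new List<byte[]>();
        for (int i = 0; ; i++)
        {
            buffers.Add(new byte[64 * 1024]);
            if (buffers.Count > 64)
            {
                buffers.Clear();
            }
        }
    }
}
```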
It also gets killed because the cgroup runs out of memory with server GC:
The log below shows some real-life dotnet Docker containers (with different memory limits) running in Docker Swarm being killed because of cgroup violations. I suspect that something like the file cache counts toward the cgroup limit, but that dotnet does not account for any cache when calculating the target heap size for server GC.
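One way to check how much of the cgroup limit is consumed by cache rather than by the process itself, assuming a cgroup v1 host (typical for Docker at the time), is to read the cgroup's memory statistics from inside the container:

```sh
# cgroup v1 layout assumed; both values count against the same limit,
# so a large "cache" figure leaves less headroom for the managed heap.
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat
```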
What happens if you set COMPlus_gcConcurrent=0?
@Maoni0 COMPlus_gcConcurrent=0 looks similar to not setting gcServer to 1. It looks like I have to disable swapping (which we have done on our production servers); otherwise the extra pages for the cgroup will just be swapped out. I guess the difference is that server GC waits almost until the cgroup is maxed out before it starts a GC, and in some cases that is too late, so it violates the cgroup memory limit. As far as I know there is no way to set a target heap size other than through the cgroup limit (which will actually kill the process...).
Sorry, I meant if you set COMPlus_gcConcurrent=0 and COMPlus_gcServer=1. Do you see any difference from just setting COMPlus_gcServer=1?
@Maoni0 It crashes just the same way (120 MiB limit shown below, but I also got it to crash with 158 MiB):
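The command and its output were not captured here; a sketch of how such a run could be launched, using the image from the issue description and an assumed publish directory, is:

```sh
# 120 MiB cgroup limit, server GC, concurrent/background GC disabled.
docker run --rm -m 120m \
  -e COMPlus_gcServer=1 \
  -e COMPlus_gcConcurrent=0 \
  -v "$PWD/publish:/app" \
  microsoft/dotnet:2.0.5-runtime \
  dotnet /app/Repro.dll
```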
And then the log says:
Do you know if you did any GCs at all, or does it get killed at the startup stage? If you have gcConcurrent enabled, then since server GC on 64-bit has such large segment sizes, even the GC's own data structures would take tens of MB. But without concurrent GC it should be able to run (unless pinning is prevalent and we cannot retract the heap in time). It would make debugging a lot more productive if you could collect a trace: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md. Without any perf data it's hard for me to say what's happening.
Hi @Maoni0. The GC runs for all three generations. I was not able to install the profiler due to a missing dependency. However, by modifying the program below to:
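The modified code is missing from this copy of the thread; a helper like the following, called once per loop iteration, could produce the kind of output quoted below (the helper name and wording are made up, not the original modification):

```csharp
using System;

static class GcLog
{
    // Call once per iteration; the last lines printed before the kill show
    // whether, and how recently, each generation was collected.
    public static void Report(int iteration)
    {
        Console.WriteLine(
            $"iteration {iteration}: gen0={GC.CollectionCount(0)} " +
            $"gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)} " +
            $"heap={GC.GetTotalMemory(false)} bytes");
    }
}
```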
The last output from each of the runs is:
And
This suggests that there is no problem running the GC, but that it is started too late. I also see from the log that no GC is run after iteration 1000 (actually iteration ~700 was the last GC), which is the iteration where it starts to free old buffers. Note that I have to disable swapping (swapoff -a) to trigger the OOM. Are you able to reproduce?
@Maoni0 A more to-the-point repro code is shown below:
When the code looks like:
This issue is now very old. Do people still see this behavior with 3.x? We know that memory limits were a problem with 1.x and 2.x.
We are facing some similar problems at the moment. I'm not a dotnet dev, but I am quite sure we switched to version 3.x and are still facing these issues. Any updates or hints on this?
We face the same issue. Switching from .NET Core 2.2 to 3.1 made it even worse for our application.
I'm sorry I didn't follow up on this... This would call for some investigation on our side, and I will get back to you next week on how soon we can start the investigation. Thanks for your patience.
FYI: Simple way to run this test: https://gist.github.com/richlander/12df3cf119acf9b5d423b935e0d8f819
Thanks @richlander. Update: @ivdiazsa is working on a repro.
Believe we have a repro and will investigate.
I am getting started on this issue. I am wondering if I am reproducing it correctly, because I am seeing a normal crash from an ordinary out-of-memory scenario when I try to run @richlander's code. This is suspicious. When I launched the docker command, this was the last line in the output before I got the prompt:
I moved on and ran the scenario, and I got this:
Alright, so we ran out of memory and crashed. Now let's take a look at the stack trace:
This looks very much like a standard stack backtrace from a program crashing with an out-of-memory exception. Am I missing something here?
The first comment on this issue says it doesn't OOM with workstation GC but does with server GC, so that's the issue that needs investigation.
I have repeated the experiment using server GC. For both the workstation GC case and the server GC case, the process crashed normally, like it would with a System.OutOfMemoryException.
It's fine to close the issue based on your findings. @hpbieker, please reopen if this still repros on 5.0.
I am fine with that since I no longer work on this. I think the main point was that the server GC did not schedule the garbage collection because it was not aware of the cgroup constraint set by Docker. Throwing System.OutOfMemoryException is better than getting an OOM kill from the kernel, but I do think that the server GC should be able to collect the memory in time to prevent such an exception in this particular case.
OK, moving to 6.0 for now. @cshung, can you continue looking into whether the GC can be more aggressive in collecting to avoid OOMs?
I experience the same issue: an ASP.NET Core container with server GC gets OOMKilled, so this still reproduces.
The first step in investigating a perf issue is to collect data. Would it be possible for you to collect a trace for us to look at?
We have not heard from anyone that this is reproducing recently, so we will close this issue for now. Please let us know if there are dumps or traces that we could use to investigate, should this happen to reproduce again on .NET 5+.
When running the small program below with the environment variable COMPlus_gcServer=1 and a memory constraint of 90 MB, it keeps getting killed by the kernel because it violates the cgroup memory constraint. When running with COMPlus_gcServer=0, it works fine. It looks to me like the server GC tries to match the cgroup memory limit, but misses by a few pages (unmanaged or kernel memory?) for some reason.
If I do not include the FileStream creation, the program is not killed.
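The program itself is not preserved in this copy of the issue; a stand-in sketch matching the description (an allocation loop plus the FileStream creation mentioned above; file name and buffer sizes are illustrative, not the original code) would be:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Per the description above, creating the FileStream is what tips
        // the server GC case over the 90 MB cgroup limit; without it the
        // process is not killed.
        using (var stream = new FileStream("dummy.tmp", FileMode.Create))
        {
            var buffers = new byte[16][];
            for (int i = 0; ; i++)
            {
                // Rolling window of ~1 MB buffers: old ones become garbage
                // that the GC must reclaim before the cgroup limit is hit.
                buffers[i % buffers.Length] = new byte[1024 * 1024];
            }
        }
    }
}
```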
The kernel log is listed below.
### Some details
Docker image: microsoft/dotnet:2.0.5-runtime