
Process keeps getting killed for violating docker cgroup mem constraint with server GC #852

Closed
hpbieker opened this issue Mar 13, 2018 · 26 comments

@hpbieker

When running the small program below with the environment variable COMPlus_gcServer=1 and a memory constraint of 90M, it keeps getting killed by the kernel for violating the cgroup memory constraint. When running with COMPlus_gcServer=0, it works fine. It looks to me like the server GC tries to match the cgroup memory limit, but misses by a few pages (unmanaged or kernel memory?) for some reason.

        static void Main(string[] args)
        {
            int num = 1000;
            var list = new List<object>(num);
            for (int i = 0; ; i++)
            {
                var length = 80 * 1024;
                var buffer = new byte[length];
                using (var stream = new FileStream("/tmp/test", FileMode.CreateNew))
                {
                }
                File.Delete("/tmp/test");

                if (list.Count < num)
                     list.Add(buffer);
                else
                     list[num - 1] = buffer;
            }
        }

If I do not include the FileStream creation, the program is not killed.

The kernel log is listed below.

dotnet invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
dotnet cpuset=033312b058a18b97fb43a0f7c76cc16c4a1233023fc12d20b13cf62896ab3422 mems_allowed=0
CPU: 3 PID: 19800 Comm: dotnet Tainted: G           O    4.4.104-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000001 ffffffff812d68a4 ffff880282410c00 ffffffff817abaf5
 ffffffff81152675 0000000000000086 0000000000000002 0000000000016700
 0000000000000046 ffff880282410c00 ffffffff817abaf5 ffff8800db277e38
Call Trace:
 [<ffffffff812d68a4>] ? dump_stack+0x5a/0x6f
 [<ffffffff81152675>] ? dump_header+0x5c/0x1b9
 [<ffffffff8110ba48>] ? oom_kill_process+0x83/0x34a
 [<ffffffff8114d661>] ? mem_cgroup_iter+0x1b/0x1a1
 [<ffffffff8114ecab>] ? mem_cgroup_out_of_memory+0x20b/0x22c
 [<ffffffff8158f703>] ? _raw_spin_unlock_irqrestore+0xb/0xc
 [<ffffffff8114f54f>] ? mem_cgroup_oom_synchronize+0x25a/0x26c
 [<ffffffff8114b9f3>] ? mem_cgroup_is_descendant+0x1d/0x1d
 [<ffffffff8110bff9>] ? pagefault_out_of_memory+0x1a/0x73
 [<ffffffff81591b78>] ? page_fault+0x28/0x30
Task in /docker/033312b058a18b97fb43a0f7c76cc16c4a1233023fc12d20b13cf62896ab3422 killed as a result of limit of /docker/033312b058a18b97fb43a0f7c76cc16c4a1233023fc12d20b13cf62896ab3422
memory: usage 91944kB, limit 92160kB, failcnt 12371
memory+swap: usage 184320kB, limit 184320kB, failcnt 22
kmem: usage 3240kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/033312b058a18b97fb43a0f7c76cc16c4a1233023fc12d20b13cf62896ab3422: cache:0KB rss:88704KB rss_huge:10240KB mapped_file:0KB dirty:0KB writeback:0KB swap:92376KB inactive_anon:44532KB active_anon:44168KB inactive_file:0KB active_file:0KB unevictable:4KB
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[19800]     0 19800  5102336    27141     170      11    23109             0 dotnet
Memory cgroup out of memory: Kill process 19800 (dotnet) score 1094 or sacrifice child
Killed process 19800 (dotnet) total-vm:20409344kB, anon-rss:88380kB, file-rss:20184kB

Some details:

$ uname -a
Linux default 4.4.104-boot2docker #1 SMP Fri Dec 8 19:23:27 UTC 2017 x86_64 GNU/Linux
$ docker info
Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 247
Server Version: 17.09.1-ce
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 228
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 896gr7p74gkp0b7f8wpm2cyjw
 Is Manager: true
 ClusterID: ln33cteihinqb7y34715b37un
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.99.100
 Manager Addresses:
  192.168.99.100:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.4.104-boot2docker
Operating System: Boot2Docker 17.09.1-ce (TCL 7.2); HEAD : e7de9ae - Fri Dec  8 19:41:36 UTC 2017
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 9.514GiB
Name: default
ID: L7OH:SH27:OOUQ:ATGD:H75B:AAYK:ZWUR:3ZHK:55ZR:JLHM:MTL5:XC6Z
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 52
 Goroutines: 162
 System Time: 2018-03-13T11:25:04.790227243Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Docker image: microsoft/dotnet:2.0.5-runtime

@hpbieker
Author

A similar, but very simple, case is to run the following code in Docker with an 11 MB memory constraint:

        static void Main(string[] args)
        {
            for (int i = 0; ; i++)
            {
                var length = 80 * 1024;
                var buffer = new byte[length];
            }
        }

It also gets killed because the cgroup runs out of memory when using server GC:

dotnet invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
dotnet cpuset=ba3e9b629c3a45b6c2aa5a86e5f8013176f536525f83f797c2c7a60f43c28f88 mems_allowed=0
CPU: 2 PID: 31598 Comm: dotnet Tainted: G           O    4.4.104-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000001 ffffffff812d68a4 ffff8800da1de2c0 ffffffff817abaf5
 ffffffff81152675 0000000000000082 0000000000000000 0000000000016700
 0000000000000046 ffff8800da1de2c0 ffffffff817abaf5 ffff88006384fe38
Call Trace:
 [<ffffffff812d68a4>] ? dump_stack+0x5a/0x6f
 [<ffffffff81152675>] ? dump_header+0x5c/0x1b9
 [<ffffffff8110ba48>] ? oom_kill_process+0x83/0x34a
 [<ffffffff8114d661>] ? mem_cgroup_iter+0x1b/0x1a1
 [<ffffffff8114ecab>] ? mem_cgroup_out_of_memory+0x20b/0x22c
 [<ffffffff8158f703>] ? _raw_spin_unlock_irqrestore+0xb/0xc
 [<ffffffff8114f54f>] ? mem_cgroup_oom_synchronize+0x25a/0x26c
 [<ffffffff8114b9f3>] ? mem_cgroup_is_descendant+0x1d/0x1d
 [<ffffffff8110bff9>] ? pagefault_out_of_memory+0x1a/0x73
 [<ffffffff81591b78>] ? page_fault+0x28/0x30
Task in /docker/ba3e9b629c3a45b6c2aa5a86e5f8013176f536525f83f797c2c7a60f43c28f88 killed as a result of limit of /docker/ba3e9b629c3a45b6c2aa5a86e5f8013176f536525f83f797c2c7a60f43c28f88
memory: usage 9708kB, limit 11264kB, failcnt 83
memory+swap: usage 22528kB, limit 22528kB, failcnt 68
kmem: usage 2380kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/ba3e9b629c3a45b6c2aa5a86e5f8013176f536525f83f797c2c7a60f43c28f88: cache:20KB rss:7308KB rss_huge:0KB mapped_file:16KB dirty:0KB writeback:0KB swap:12820KB inactive_anon:3748KB active_anon:3520KB inactive_file:0KB active_file:0KB unevictable:4KB
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[31598]     0 31598  5135102     6364      88      10     3199             0 dotnet
Memory cgroup out of memory: Kill process 31598 (dotnet) score 1715 or sacrifice child
Killed process 31598 (dotnet) total-vm:20540408kB, anon-rss:6988kB, file-rss:18468kB

@hpbieker
Author

The log below shows some real-life dotnet Docker containers (with different memory limits) running in Docker Swarm being killed for cgroup violations. I suspect that something like the page cache counts towards the cgroup limit, but that dotnet does not account for it when calculating the target heap size for server GC.

Mar 12 18:09:55 machine-1 kernel: [615694.511492] Memory cgroup stats for /docker/7e29cfc63f603344cea233e2c2f74d8a55ea9ed7ad3c6db0ae8c042cc75f2af8: cache:5664KB rss:249452KB rss_huge:30720KB mapped_file:5660KB dirty:0KB writeback:0KB inactive_anon:5660KB active_anon:249448KB inactive_file:0KB active_file:4KB unevictable:4KB
Mar 12 18:09:55 machine-1 kernel: [615694.511487] memory: usage 262144kB, limit 262144kB, failcnt 2472
Mar 12 18:02:34 machine-1 kernel: [615253.943304] memory: usage 262144kB, limit 262144kB, failcnt 3078
Mar 12 18:02:34 machine-1 kernel: [615253.943308] Memory cgroup stats for /docker/43bf7b26f090e575d16f38d7a96e5d59fa4ee372cf91a495e609fb1a33a49435: cache:5660KB rss:247920KB rss_huge:172032KB mapped_file:5660KB dirty:0KB writeback:0KB inactive_anon:5660KB active_anon:247896KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar 12 08:14:10 machine-2 kernel: [605042.258969] Memory cgroup stats for /docker/33f177ae268f4f3061d4e5c2fda0003cfe1f1c2e70290c9bf66624491e5de7b5: cache:17836KB rss:234968KB rss_huge:129024KB mapped_file:17796KB dirty:0KB writeback:0KB inactive_anon:17796KB active_anon:234964KB inactive_file:40KB active_file:0KB unevictable:4KB
Mar 12 08:14:10 machine-2 kernel: [605042.258966] memory: usage 262144kB, limit 262144kB, failcnt 2097
Mar  9 11:04:37 machine-2 kernel: [356083.473516] memory: usage 262144kB, limit 262144kB, failcnt 3241
Mar  9 11:04:37 machine-2 kernel: [356083.473523] Memory cgroup stats for /docker/60de2d1bd1e61c8e0842727067e22d633824d6aa2f4bceb23b77495d4e3f3c5a: cache:2828KB rss:247784KB rss_huge:194560KB mapped_file:2708KB dirty:0KB writeback:0KB inactive_anon:2708KB active_anon:247780KB inactive_file:112KB active_file:8KB unevictable:4KB
Mar  9 09:34:29 machine-2 kernel: [350675.900071] Memory cgroup stats for /docker/892fd62eff1b2a5233baf5b537cf3fcc6d75c0802e2fed51fdac101e21478974: cache:2708KB rss:248628KB rss_huge:198656KB mapped_file:2708KB dirty:0KB writeback:0KB inactive_anon:2708KB active_anon:248572KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  9 09:34:29 machine-2 kernel: [350675.900060] memory: usage 262144kB, limit 262144kB, failcnt 113
Mar  9 08:33:41 machine-2 kernel: [347028.032878] memory: usage 786432kB, limit 786432kB, failcnt 869
Mar  9 08:33:41 machine-2 kernel: [347028.032882] Memory cgroup stats for /docker/0d3e55c193a21ad56cfc1fe3dc40931c421063bafe90e979119cd16b5dec882a: cache:2388KB rss:772760KB rss_huge:356352KB mapped_file:2388KB dirty:0KB writeback:0KB inactive_anon:2384KB active_anon:772744KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  8 11:38:53 machine-2 kernel: [271744.658403] memory: usage 524288kB, limit 524288kB, failcnt 6329
Mar  8 11:38:53 machine-2 kernel: [271744.658406] Memory cgroup stats for /docker/a74ced8a45f75bfe0dd6a02b73aae0f0c1644bd944ef2cf04fb985027a825784: cache:9040KB rss:510360KB rss_huge:126976KB mapped_file:9040KB dirty:0KB writeback:0KB inactive_anon:9040KB active_anon:510304KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  8 09:36:03 machine-1 kernel: [239283.676509] Memory cgroup stats for /docker/abcba4e60c1ccf801d0bbe26e2a6174a9999f5a7e1297b8cb335cc99f3699bc4: cache:10456KB rss:505120KB rss_huge:157696KB mapped_file:10396KB dirty:0KB writeback:0KB inactive_anon:10396KB active_anon:505068KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  8 09:36:03 machine-1 kernel: [239283.676506] memory: usage 524288kB, limit 524288kB, failcnt 15329
Mar  8 09:15:10 machine-2 kernel: [263122.367478] Memory cgroup stats for /docker/1cba62acf2205cbbdc87fedefb6642d6108fdd4b03f7171e227191193015039a: cache:1544KB rss:187024KB rss_huge:145408KB mapped_file:1528KB dirty:0KB writeback:0KB inactive_anon:1528KB active_anon:186996KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  8 09:15:10 machine-2 kernel: [263122.367471] memory: usage 196608kB, limit 196608kB, failcnt 4931
Mar  8 09:14:10 machine-2 kernel: [263062.196489] memory: usage 131072kB, limit 131072kB, failcnt 1456
Mar  8 09:14:10 machine-2 kernel: [263062.196500] Memory cgroup stats for /docker/9f2ab973a1079cad5f58d37a372004fc36c2e64bb652e25193a91539bc88b46c: cache:1528KB rss:121488KB rss_huge:81920KB mapped_file:1528KB dirty:0KB writeback:0KB inactive_anon:1528KB active_anon:121472KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 22:08:45 machine-1 kernel: [198048.172532] Memory cgroup stats for /docker/4012195d36fde5e6246c884f9e26f9ded11a48367554f1505ee535ca90c84270: cache:7796KB rss:50268KB rss_huge:8192KB mapped_file:7796KB dirty:0KB writeback:0KB inactive_anon:7796KB active_anon:50260KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 22:08:45 machine-1 kernel: [198048.172528] memory: usage 131072kB, limit 131072kB, failcnt 12012
Mar  7 20:45:29 machine-2 kernel: [218143.659049] Memory cgroup stats for /docker/a7213e002c5d3c29811fb7bb12e3ea20bfef959e6918a0bcd402ab0cf94f20d7: cache:10460KB rss:506860KB rss_huge:14336KB mapped_file:10396KB dirty:0KB writeback:0KB inactive_anon:10396KB active_anon:506856KB inactive_file:56KB active_file:8KB unevictable:4KB
Mar  7 20:45:29 machine-2 kernel: [218143.659042] memory: usage 524288kB, limit 524288kB, failcnt 67599
Mar  7 20:27:53 machine-1 kernel: [191996.507980] memory: usage 524288kB, limit 524288kB, failcnt 12928
Mar  7 20:27:53 machine-1 kernel: [191996.507984] Memory cgroup stats for /docker/76e09b63c6aa20b7d08bdf685ed0242d61f9682eea6252ed275d7406065747cc: cache:10436KB rss:506632KB rss_huge:61440KB mapped_file:10340KB dirty:0KB writeback:0KB inactive_anon:10340KB active_anon:506628KB inactive_file:68KB active_file:28KB unevictable:4KB
Mar  7 15:48:38 machine-1 kernel: [175242.202768] memory: usage 524288kB, limit 524288kB, failcnt 24611
Mar  7 15:48:38 machine-1 kernel: [175242.202772] Memory cgroup stats for /docker/afc4671650a8f1da4ce20ef9227422b3a2f42d0093d98439f3d4f776938f8fb8: cache:9896KB rss:506724KB rss_huge:212992KB mapped_file:9892KB dirty:0KB writeback:0KB inactive_anon:9892KB active_anon:506676KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 15:23:24 machine-2 kernel: [198819.529854] Memory cgroup stats for /docker/98357916fcfeb84b0a8510e1b2a865a680a16c42f79cf0838693bae185388aa0: cache:17788KB rss:237108KB rss_huge:174080KB mapped_file:17788KB dirty:0KB writeback:0KB inactive_anon:17788KB active_anon:237088KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 15:23:24 machine-2 kernel: [198819.529851] memory: usage 262144kB, limit 262144kB, failcnt 116
Mar  7 14:48:18 machine-2 kernel: [196713.488994] Memory cgroup stats for /docker/2d41cf7b433614c7b05d542635095d8c26333a73a2cce6603d6a0373357f448d: cache:2708KB rss:250724KB rss_huge:151552KB mapped_file:2708KB dirty:0KB writeback:0KB inactive_anon:2708KB active_anon:250684KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 14:48:18 machine-2 kernel: [196713.488987] memory: usage 262144kB, limit 262144kB, failcnt 362
Mar  7 14:17:31 machine-2 kernel: [194866.928944] Memory cgroup stats for /docker/c176e6ae8e36d0e9fab93a5461834fff882053c33c88d5af770bc482fd19c544: cache:2384KB rss:777012KB rss_huge:296960KB mapped_file:2384KB dirty:0KB writeback:0KB inactive_anon:2384KB active_anon:777004KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 14:17:31 machine-2 kernel: [194866.928940] memory: usage 786432kB, limit 786432kB, failcnt 12
Mar  7 12:52:48 machine-1 kernel: [164692.754829] memory: usage 524288kB, limit 524288kB, failcnt 11175
Mar  7 12:52:48 machine-1 kernel: [164692.754833] Memory cgroup stats for /docker/13648822c79d570a7fc41186a50a8cba23431bcf12a0e0d10238cec1d83d10a9: cache:9272KB rss:509240KB rss_huge:34816KB mapped_file:9152KB dirty:0KB writeback:0KB inactive_anon:9152KB active_anon:509236KB inactive_file:112KB active_file:8KB unevictable:4KB
Mar  7 12:31:40 machine-2 kernel: [188515.960230] Memory cgroup stats for /docker/6f21a3608aaf28729fc5b6927798dc85d6bdd800cebdeaf9c36c2d569df63e5e: cache:2724KB rss:247244KB rss_huge:122880KB mapped_file:2708KB dirty:0KB writeback:0KB inactive_anon:2708KB active_anon:247240KB inactive_file:16KB active_file:0KB unevictable:4KB
Mar  7 12:31:40 machine-2 kernel: [188515.960222] memory: usage 262108kB, limit 262144kB, failcnt 2068
Mar  7 09:22:58 machine-2 kernel: [177194.970573] memory: usage 524288kB, limit 524288kB, failcnt 8701
Mar  7 09:22:58 machine-2 kernel: [177194.970577] Memory cgroup stats for /docker/6b3ed1b54da38c542375595d0de9bca9ae9bcdcdae564ba8edd2beeb34f37bd1: cache:9276KB rss:509576KB rss_huge:167936KB mapped_file:9156KB dirty:0KB writeback:0KB inactive_anon:9156KB active_anon:509572KB inactive_file:120KB active_file:0KB unevictable:4KB
Mar  7 09:21:10 machine-2 kernel: [177086.963299] Memory cgroup stats for /docker/54f641ef1fc34d4d18f56ac95433398c221dfde6f5acb0f8b9847e493df36780: cache:2232KB rss:251312KB rss_huge:202752KB mapped_file:2232KB dirty:0KB writeback:0KB inactive_anon:2232KB active_anon:251300KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 09:21:10 machine-2 kernel: [177086.963292] memory: usage 262144kB, limit 262144kB, failcnt 39
Mar  7 03:18:13 machine-1 kernel: [130219.828702] Memory cgroup stats for /docker/60d84b45ae9a7a89cb7811a2d504a873c56f23a5245944664c47cfe2e511ee64: cache:7796KB rss:58656KB rss_huge:14336KB mapped_file:7796KB dirty:0KB writeback:0KB inactive_anon:7796KB active_anon:58652KB inactive_file:0KB active_file:0KB unevictable:4KB
Mar  7 03:18:13 machine-1 kernel: [130219.828698] memory: usage 131072kB, limit 131072kB, failcnt 20259

@Maoni0
Member

Maoni0 commented Mar 14, 2018

what happens if you set COMPlus_gcConcurrent=0?

@hpbieker
Author

@Maoni0 COMPlus_gcConcurrent=0 looks similar to not setting gcServer to 1.

It looks like I have to disable swapping (which we have done on our production servers); otherwise the extra pages for the cgroup will just be swapped out. I guess the difference is that with gcServer the GC waits almost until the cgroup is maxed out before it starts a collection, and in some cases that is too late, so the cgroup memory limit is violated. As far as I know there is no way to set a target heap size other than via the cgroup limit (which, when exceeded, actually kills the process..).
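
For reference, a small diagnostic sketch to double-check what limit the runtime should be seeing (this assumes cgroup v1 mounted at the conventional path inside the container, as on this boot2docker host; it is only a sketch, not part of the repro):

        // Sketch only, not part of the repro: prints the cgroup v1 memory limit
        // that the runtime is expected to honor for this container.
        // (Needs using System; and using System.IO;)
        static void PrintCgroupLimit()
        {
            const string limitPath = "/sys/fs/cgroup/memory/memory.limit_in_bytes";
            if (File.Exists(limitPath) &&
                long.TryParse(File.ReadAllText(limitPath).Trim(), out long limitBytes))
            {
                Console.WriteLine("cgroup memory limit: {0} MiB", limitBytes / (1024 * 1024));
            }
            else
            {
                Console.WriteLine("No cgroup v1 memory limit found at " + limitPath);
            }
        }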

@Maoni0
Member

Maoni0 commented Mar 14, 2018

sorry, I meant if you set COMPlus_gcConcurrent=0 and COMPlus_gcServer=1. do you see any difference from just setting COMPlus_gcServer=1?

@hpbieker
Author

@Maoni0 It crashes just the same way (120 MiB limit shown below, but I also got it to crash with 158 MiB):

$ cat Program.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

namespace test
{
    class Program
    {
        static void Main(string[] args)
        {
            int num = 1000;
            var list = new List<object>(num);
            for (int i = 0; ; i++)
            {
                var length = 80 * 1024;
                var buffer = new byte[length];
                using (var stream = new FileStream("/tmp/test", FileMode.CreateNew))
                {
                }
                File.Delete("/tmp/test");

                if (list.Count < num)
                     list.Add(buffer);
                else
                     list[num - 1] = buffer;
            }
        }
    }
}

$ cat Dockerfile
FROM microsoft/dotnet:2.0-sdk AS build-env
WORKDIR /app

# Copy csproj and restore as distinct layers
COPY *.csproj ./
RUN dotnet restore

# Copy everything else and build
COPY . ./
RUN dotnet publish -c Release -o out

# Build runtime image
FROM microsoft/dotnet-nightly:2.1.0-preview1-runtime-stretch-slim
WORKDIR /app
COPY --from=build-env /app/out .
ENTRYPOINT ["dotnet", "test.dll"]
$ docker build . -t test && docker run -it -e COMPlus_gcServer=1 -e COMPlus_gcConcurrent=0 -m 120M test
Sending build context to Docker daemon  555.5kB
Step 1/10 : FROM microsoft/dotnet:2.0-sdk AS build-env
 ---> 9969575612df
Step 2/10 : WORKDIR /app
 ---> Using cache
 ---> 3c9e127e122c
Step 3/10 : COPY *.csproj ./
 ---> Using cache
 ---> 215d77811b59
Step 4/10 : RUN dotnet restore
 ---> Using cache
 ---> 375d7efe8c1f
Step 5/10 : COPY . ./
 ---> Using cache
 ---> 7feb46800f5d
Step 6/10 : RUN dotnet publish -c Release -o out
 ---> Using cache
 ---> 1244e49321c1
Step 7/10 : FROM microsoft/dotnet-nightly:2.1.0-preview1-runtime-stretch-slim
 ---> d35ff734c498
Step 8/10 : WORKDIR /app
 ---> Using cache
 ---> e000ea409ba5
Step 9/10 : COPY --from=build-env /app/out .
 ---> Using cache
 ---> c680bbb27d96
Step 10/10 : ENTRYPOINT dotnet test.dll
 ---> Using cache
 ---> cc713efd0f87
Successfully built cc713efd0f87
Successfully tagged test:latest
SECURITY WARNING: You are building a Docker image from Windows against a non-Windows Docker host. All files and directories added to build context will have '-rwxr-xr-x' permissions. It is recommended to double check and reset permissions for sensitive files and directories.

And then the log says:

dotnet invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
dotnet cpuset=987c2dbab2794d3838d4ee9fbed726cf4abf9a1ebee3c528cf566510ae7457a4 mems_allowed=0
CPU: 2 PID: 25566 Comm: dotnet Tainted: G           O    4.4.104-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 0000000000000001 ffffffff812d68a4 ffff8800dbb1b040 ffffffff817abaf5
 ffffffff81152675 ffffffff81090d61 ffff880290d16700 ffff8800dbb1b040
 ffff880281d02080 ffff8800dbb1b040 ffffffff817abaf5 ffff8802825d7e38
Call Trace:
 [<ffffffff812d68a4>] ? dump_stack+0x5a/0x6f
 [<ffffffff81152675>] ? dump_header+0x5c/0x1b9
 [<ffffffff81090d61>] ? finish_task_switch+0x107/0x190
 [<ffffffff8110ba48>] ? oom_kill_process+0x83/0x34a
 [<ffffffff8114d661>] ? mem_cgroup_iter+0x1b/0x1a1
 [<ffffffff8114ecab>] ? mem_cgroup_out_of_memory+0x20b/0x22c
 [<ffffffff8158f703>] ? _raw_spin_unlock_irqrestore+0xb/0xc
 [<ffffffff8114f54f>] ? mem_cgroup_oom_synchronize+0x25a/0x26c
 [<ffffffff8114b9f3>] ? mem_cgroup_is_descendant+0x1d/0x1d
 [<ffffffff8110bff9>] ? pagefault_out_of_memory+0x1a/0x73
 [<ffffffff81591b78>] ? page_fault+0x28/0x30
Task in /docker/987c2dbab2794d3838d4ee9fbed726cf4abf9a1ebee3c528cf566510ae7457a4 killed as a result of limit of /docker/987c2dbab2794d3838d4ee9fbed726cf4abf9a1ebee3c528cf566510ae7457a4
memory: usage 122880kB, limit 122880kB, failcnt 58
memory+swap: usage 122880kB, limit 245760kB, failcnt 0
kmem: usage 2600kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/987c2dbab2794d3838d4ee9fbed726cf4abf9a1ebee3c528cf566510ae7457a4: cache:24KB rss:120256KB rss_huge:22528KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:4KB active_anon:116136KB inactive_file:0KB active_file:4KB unevictable:4KB
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[25566]     0 25566  5093943    35182     134       9        0             0 dotnet
Memory cgroup out of memory: Kill process 25566 (dotnet) score 1149 or sacrifice child
Killed process 25566 (dotnet) total-vm:20375772kB, anon-rss:119976kB, file-rss:20752kB

@Maoni0
Member

Maoni0 commented Mar 14, 2018

do you know if it did any GCs at all, or does it get killed while it's still in the start-up stage? if you have gcConcurrent enabled, then since server GC on 64-bit has such large segment sizes, even the GC's own data structures would take tens of MB. but without concurrent it should be able to run (unless pinning is prevalent and we cannot retract the heap in time).

it would make debugging a lot more productive if you could collect a trace: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md. without any perf data it's hard for me to say what's happening.

@hpbieker
Author

hpbieker commented Mar 16, 2018

Hi @Maoni0. The GC runs for all three generations. I was not able to install the profiler due to a missing dependency. However, I modified the program to the following:

        static void Main(string[] args)
        {
            int num = 1000;
            var list = new List<object>(num);
            for (int i = 0; ; i++)
            {
                Console.WriteLine("Itr: {0}, Gen2: {1}, Gen1: {2}, Gen0: {3}, {4}", i, GC.CollectionCount(2), GC.CollectionCount(1), GC.CollectionCount(0), GC.GetTotalMemory(false));
                var length = 80 * 1024;
                var buffer = new byte[length];
                using (var stream = new FileStream("/tmp/test", FileMode.CreateNew))
                {
                }
                File.Delete("/tmp/test");

                if (list.Count < num)
                     list.Add(buffer);
                else
                     list[num - 1] = buffer;
            }
        }

The last lines of output from each run are:

$ docker build . -t test && docker run -it -e COMPlus_gcServer=1 -e COMPlus_gcConcurrent=1 -m 120M test
[..]
Itr: 1121, Gen2: 1, Gen1: 2, Gen0: 4, 92130624
Itr: 1122, Gen2: 1, Gen1: 2, Gen0: 4, 92212592
Itr: 1123, Gen2: 1, Gen1: 2, Gen0: 4, 92294560
Itr: 1124, Gen2: 1, Gen1: 2, Gen0: 4, 92376528
Itr: 1125, Gen2: 1, Gen1: 2, Gen0: 4, 92458496
Itr: 1126, Gen2: 1, Gen1: 2, Gen0: 4, 92540464

And

$ docker build . -t test && docker run -it -e COMPlus_gcServer=1 -e COMPlus_gcConcurrent=0 -m 120M test
[..]
Itr: 1168, Gen2: 1, Gen1: 2, Gen0: 4, 95994360
Itr: 1169, Gen2: 1, Gen1: 2, Gen0: 4, 96076328
Itr: 1170, Gen2: 1, Gen1: 2, Gen0: 4, 96158296
Itr: 1171, Gen2: 1, Gen1: 2, Gen0: 4, 96240264
Itr: 1172, Gen2: 1, Gen1: 2, Gen0: 4, 96322232
Itr: 1173, Gen2: 1, Gen1: 2, Gen0: 4, 96404200

This suggests that there is no problem running the GC, but that it waits too long before starting. I also see from the log that no GC runs after iteration 1000 (actually iteration ~700 was the last GC), which is the iteration where old buffers start being freed. Note that I have to disable swapping (swapoff -a) to trigger the OOM. Are you able to reproduce?
(Of course, the number of iterations differs from run to run because of the timing of the GC.)

@hpbieker
Author

@Maoni0 A more to-the-point repro is shown below:

$ docker build . -t test && docker run -it -e COMPlus_gcServer=1 -e COMPlus_gcConcurrent=1 -m 120M test
[...]
Itr: 1132, Gen2: 1, Gen1: 2, Gen0: 6, 92975712
Itr: 1133, Gen2: 1, Gen1: 2, Gen0: 6, 93057680
Itr: 1134, Gen2: 1, Gen1: 2, Gen0: 6, 93139648
Itr: 1135, Gen2: 1, Gen1: 2, Gen0: 6, 93221616
Itr: 1136, Gen2: 1, Gen1: 2, Gen0: 6, 93307056
Itr: 1137, Gen2: 1, Gen1: 2, Gen0: 6, 93389024
[killed with OOM]
memory: usage 122880kB, limit 122880kB, failcnt 54
memory+swap: usage 122880kB, limit 245760kB, failcnt 0
kmem: usage 1140kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/e27f77ee502e02a079695ae281c84854e3f33ba8a83812deb1349aa1c5440750: cache:20KB rss:121720KB rss_huge:22528KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:4KB active_anon:119656KB inactive_file:0KB active_file:0KB unevictable:4KB

When the code looks like:

        static void Main(string[] args)
        {
            int num = 1000;
            var list = new List<object>(num);
            for (int i = 0; i < 10200; i++)
            {
                Console.WriteLine("Itr: {0}, Gen2: {1}, Gen1: {2}, Gen0: {3}, {4}", i, GC.CollectionCount(2), GC.CollectionCount(1), GC.CollectionCount(0), GC.GetTotalMemory(false));
                var length = 80 * 1024;
                var buffer = new byte[length];

                if (list.Count < num)
                     list.Add(buffer);
                else
                     list[num - 1] = buffer;
            }
        }

@jkotas jkotas transferred this issue from dotnet/coreclr Dec 13, 2019
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 13, 2019
@richlander
Member

This issue is now very old. Do people still see this behavior with 3.x? We know that memory limits were a problem with 1.x and 2.x.
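
If you are testing on 3.x, a quick sketch like the one below (using the GC.GetGCMemoryInfo API added in 3.0) shows whether the runtime picked up the container limit and which GC flavor is running; that would help narrow down any remaining reports:

        // Sketch for .NET Core 3.0+ only: print the GC's view of the container limits.
        // (Needs using System; and using System.Runtime;)
        static void PrintGcView()
        {
            GCMemoryInfo info = GC.GetGCMemoryInfo();
            Console.WriteLine("Server GC:                    {0}", GCSettings.IsServerGC);
            Console.WriteLine("TotalAvailableMemoryBytes:    {0}", info.TotalAvailableMemoryBytes);
            Console.WriteLine("HighMemoryLoadThresholdBytes: {0}", info.HighMemoryLoadThresholdBytes);
            Console.WriteLine("MemoryLoadBytes:              {0}", info.MemoryLoadBytes);
            Console.WriteLine("HeapSizeBytes:                {0}", info.HeapSizeBytes);
        }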

@chucky2305

We are facing similar problems at the moment. I'm not a dotnet dev, but I am quite sure we switched to version 3.x and are still facing these issues. Any updates or hints on this?

@effyteva

effyteva commented Feb 2, 2020

We face the same issue. Switching from .NET Core 2.2 to 3.1 made it even worse for our application.
Are there any recommendations for temporary solutions?
Waiting for .NET Core 5.0 (November 2020) isn't a great solution...
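
For now we are considering capping the heap with the 3.x GC hard-limit settings (for example the COMPlus_GCHeapHardLimitPercent environment variable), switching back to workstation GC, or a crude watchdog along the lines of the untested sketch below (.NET Core 3.0+ only), which forces a compacting gen2 collection when the reported memory load approaches the GC's own high-memory threshold:

    // Rough, untested sketch (.NET Core 3.0+ only): force a compacting gen2 GC when
    // the GC reports memory load close to its high-memory threshold, instead of
    // waiting for the cgroup OOM killer.
    using System;
    using System.Runtime;
    using System.Threading;
    using System.Threading.Tasks;

    static class MemoryPressureWatchdog
    {
        public static Task Start(CancellationToken token) => Task.Run(async () =>
        {
            while (!token.IsCancellationRequested)
            {
                GCMemoryInfo info = GC.GetGCMemoryInfo();
                // Trigger when within ~10% of the threshold the GC itself uses.
                if (info.MemoryLoadBytes > info.HighMemoryLoadThresholdBytes * 9 / 10)
                {
                    GCSettings.LargeObjectHeapCompactionMode =
                        GCLargeObjectHeapCompactionMode.CompactOnce;
                    GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
                }
                await Task.Delay(TimeSpan.FromSeconds(1));
            }
        });
    }

(Started once at application startup, e.g. MemoryPressureWatchdog.Start(CancellationToken.None).) Is any of this reasonable, or is there a better interim answer?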

@Gladskih

Related

@Maoni0 Maoni0 removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2020
@Maoni0
Member

Maoni0 commented Apr 4, 2020

I'm sorry I didn't follow up on this... this would call for some investigation on our side, and I will get back to you next week on how soon we can start the investigation. thanks for your patience.

@richlander
Member

FYI: Simple way to run this test:

https://gist.github.com/richlander/12df3cf119acf9b5d423b935e0d8f819

@Maoni0
Member

Maoni0 commented Apr 8, 2020

thanks @richlander.

update: @ivdiazsa is working on a repro.

@mangod9 mangod9 added this to the 5.0.0 milestone Jul 13, 2020
@mangod9
Member

mangod9 commented Jul 29, 2020

We believe we have a repro and will investigate.

@cshung cshung self-assigned this Aug 21, 2020
@cshung
Member

cshung commented Aug 21, 2020

I am getting started on this issue. I am wondering if I am reproducing the issue correctly, because I see a normal crash from an ordinary out-of-memory scenario when I run @richlander's code.

This is suspicious: when I launch the docker command, this is the last line in the output before I get the prompt.

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.

I moved on and ran the scenario, and got this:

...
Itr: 1255, Gen2: 1, Gen1: 2, Gen0: 5, 92711976
Out of memory.
Aborted (core dumped)

Alright, so we ran out of memory and crashed. Now let's take a look at the stack trace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7ae5535 in __GI_abort () at abort.c:79
#2  0x00007ffff715edce in PROCAbort () from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
#3  0x00007ffff715ed1c in PROCEndProcess(void*, unsigned int, int) ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
#4  0x00007ffff6eafaed in UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*) ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
#5  0x00007ffff6eafb76 in DispatchManagedException(PAL_SEHException&, bool) ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
#6  0x00007ffff6e29be2 in JIT_NewArr1(CORINFO_CLASS_STRUCT_*, long) ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
#7  0x00007fff7daf3da1 in ?? ()
#8  0x6c6962697461706d in ?? ()
#9  0x000003e841797469 in ?? ()
#10 0x00007fff43596db0 in ?? ()
#11 0x00007fff435aae68 in ?? ()
#12 0x00000000058d9678 in ?? ()
#13 0x0000000400000000 in ?? ()
#14 0x00007fff435aae50 in ?? ()
#15 0x0000000300000004 in ?? ()
#16 0x00007fff435aae38 in ?? ()
#17 0x0000000200000001 in ?? ()
#18 0x00007fff435aae20 in ?? ()
#19 0x0000000100000000 in ?? ()
#20 0x00007fff435aae08 in ?? ()
#21 0x00007fff435aadc8 in ?? ()
#22 0x00007fff42009c38 in ?? ()
#23 0x00007fff42009cb0 in ?? ()
#24 0x0000000000000001 in ?? ()
#25 0x00007fff43596db0 in ?? ()
#26 0x0000046f00014000 in ?? ()
#27 0x00007fff42009cb0 in ?? ()
#28 0x000003e800000000 in ?? ()
#29 0x00007fff42009050 in ?? ()
#30 0x00007fffffffdd70 in ?? ()
#31 0x00007ffff6ebc1d3 in CallDescrWorkerInternal ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0-preview.7.20364.11/libcoreclr.so
Backtrace stopped: frame did not save the PC
(gdb)

This looks very much like the standard stack backtrace of a program that crashed with an out-of-memory exception.

Am I missing something here?

@Maoni0
Member

Maoni0 commented Aug 23, 2020

the first comment on this issue says it doesn't OOM with workstation GC (WKS) but does with server GC (SVR). so that's the issue that needs investigation.

@cshung
Member

cshung commented Aug 27, 2020

I have repeated the experiment using server GC. In both the workstation GC case and the server GC case, the process crashed normally with a System.OutOfMemoryException. I cannot reproduce the case where the process is terminated by the OOM killer.

@mangod9
Member

mangod9 commented Aug 27, 2020

It's fine to close the issue based on your findings. @hpbieker, please reopen if this still repros on 5.0.

@hpbieker
Author

I am fine with that since I no longer work on this. I think the main point was that the server GC did not schedule the garbage collection because it was not aware of the cgroup constraint set by Docker. Throwing a System.OutOfMemoryException is better than getting OOM-killed by the kernel, but I do think that the server GC should be able to collect the memory in time to prevent such an exception in this particular case.

@mangod9
Member

mangod9 commented Aug 28, 2020

Ok, moving to 6.0 for now. @cshung, can you continue looking into whether the GC can be more aggressive in collecting to avoid OOMs?

@mangod9 mangod9 modified the milestones: 5.0.0, 6.0.0 Aug 28, 2020
@andrii-litvinov

I experience the same issue: an ASP.NET Core container with server GC gets OOMKilled, so this is still a problem.

@Maoni0
Member

Maoni0 commented Feb 10, 2021

the first step in investigating a perf issue is to collect data. would it be possible for you to collect a trace for us to look at?
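
if running perfcollect inside the container is difficult, even a minimal in-process EventListener (rough sketch below, it just subscribes to the runtime's GC events; keep an instance alive for the lifetime of the process, e.g. create one at the top of Main) would tell us when GCs happen relative to the kill:

    // Rough sketch: log the runtime's GC-related events from inside the process by
    // subscribing to the "Microsoft-Windows-DotNETRuntime" EventSource (GC keyword 0x1).
    using System;
    using System.Diagnostics.Tracing;

    sealed class GcEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
            {
                EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
            }
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            if (eventData.EventName != null && eventData.EventName.StartsWith("GC"))
            {
                Console.WriteLine("{0:O} {1}", DateTime.UtcNow, eventData.EventName);
            }
        }
    }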

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Apr 9, 2021
@cshung
Member

cshung commented Jul 16, 2021

We have not heard from anyone that this is reproducing recently, so we will close this issue for now. Please let us know if there are dumps or traces that we could use to investigate if this happens to reproduce again on .NET 5+.

@cshung cshung closed this as completed Jul 16, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 15, 2021