Correct crun's behavior when it runs inside a cgroup v2 container #923

Closed
skepticoitusInteruptus opened this issue May 19, 2022 · 34 comments · Fixed by #931

@skepticoitusInteruptus

skepticoitusInteruptus commented May 19, 2022

Description

When run inside a cgroup v2 container, crun attempts an apparently cgroup v1-like operation.

My spike:

  • Evaluate the behavior parity of the runc and crun container runtimes for cgroup v2 support.1

My tests:

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm

/system.me # podman run -it skepticoital/mem_limit:hmmm

/system.me # dmesg | grep -i killed

Steps to reproduce the issue:

  1. Using Docker from a physical host machine with cgroup v2 support2, run a rootful Alpine container that is, itself, also configured with cgroup v2 support:3
docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm
  2. Inside the container, observe the container's enabled controllers:
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids
  3. Observe the container's cgroup.type:
cat /sys/fs/cgroup/cgroup.type
domain threaded
  4. Observe there are controller interface files in the container's /sys/fs/cgroup dir:4
ls /sys/fs/cgroup/
...cpu.max...hugetlb.1GB.max...io.stat...memory.max...pids.max...rdma.max
  5. Observe that the value in bytes of the container's /sys/fs/cgroup/memory.max file equals the value in megabytes of the host's docker run -m switch:
cat /sys/fs/cgroup/memory.max
1073741824
  6. Exercise crun's interaction with the container's cgroup v2 memory controller:
podman run -it skepticoital/mem_limit:hmmm

  7. Observe that crun chokes, reporting an EOPNOTSUPP (guessing: it wants to mod its parent cgroup?):5
WARN[0005] Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup path /libpod_parent/conmon: write /sys/fs/cgroup/cgroup.subtree_control: operation not supported

Describe the results you received:

Error: OCI runtime error: crun: writing file `/sys/fs/cgroup/cgroup.subtree_control`: Not supported

Describe the results you expected:

  • To behave — out of the box — the way runc behaves under identical constraints (see Workaround)

Additional information you deem important:

  • The root of cgroup v2's hierarchy is the host machine (i.e., where the kernel is)
  • The kernel considers my system.me container process to be an immediate child of the root cgroup
  • Control Group v2 — The Linux Kernel Admin Guide:
    • "When a process forks a child process, the new process is born into the cgroup that the forking process belongs to at the time of the operation…" — Processes

    • "Marking a cgroup threaded makes it join the resource domain of its parent as a threaded cgroup…The root…serves as the resource domain for the entire subtree…" — Threads

    • "Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled…" — Enabling and Disabling

    • "…the controller interface files - anything which doesn't start with ‚cgroup.‘ are owned by the parent rather than the cgroup itself" — Enabling and Disabling

    • "Resources are distributed top-down…" — Top-down Constraint

    • "…only domain cgroups which don't contain any processes can have domain controllers enabled in their ‚cgroup.subtree_control‘ files" — No Internal Process Constraint

    • "To control resource distribution of a cgroup, the cgroup must create children and transfer all its processes to the children before enabling controllers in its ‚cgroup.subtree_control‘ file" — No Internal Process Constraint

  • My Alpine distro is not inited by systemd6
  • My spike is not for a rootless container use case
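
To make the "No Internal Process Constraint" quoted above concrete, here is a minimal sketch of the dance it requires before a populated cgroup can delegate domain controllers (the paths assume the container's cgroup namespace root; the init name is arbitrary):

# A populated cgroup cannot enable domain controllers for its children.
# Create a leaf, move the current shell into it, then delegate from the
# now process-free parent.
mkdir /sys/fs/cgroup/init
echo $$ > /sys/fs/cgroup/init/cgroup.procs
echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control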

Workaround

  • By configuring Podman, via the container's /etc/containers/containers.conf, to replace crun with runc (a sketch of that configuration follows the output below), the container's podman run... test behaves as expected:
podman run -it skepticoital/mem_limit:hmmm

Allocated = 0 to 1 MB
Allocated = 1 to 2 MB
...
Allocated = 420 to 421 MB
...
Allocated = 933 to 934 MB
<Killed>
...
dmesg | grep -i killed
...
Memory cgroup out of memory: Killed process 42 (mem_limit) total-vm:962004kB...
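
For reference, a minimal sketch of that containers.conf change (the [engine] table and runtime key are the stock containers.conf names; add or edit these lines in the container's /etc/containers/containers.conf):

# hypothetical sketch: select runc as Podman's OCI runtime
[engine]
runtime = "runc"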

Output of podman version:

podman version 4.1.0

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-r1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: ad24dda9f2b11fd974e510713e0923f810ea19c6'
  cpuUtilization:
    idlePercent: 99.67
    systemPercent: 0.22
    userPercent: 0.12
  cpus: 4
  distribution:
    distribution: alpine
    version: 3.16.0_alpha20220328
  eventLogger: file
  hostname: 76aab6ccf14e
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.10.102.1-microsoft-standard-WSL2
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 10450141184
  memTotal: 12926758912
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.5-r0
    path: /usr/bin/crun
    version: |-
      crun version 1.4.5
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-r0
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.2
  swapFree: 4294967296
  swapTotal: 4294967296
  uptime: 19h 41m 19.05s (Approximately 0.79 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 269490393088
  graphRootUsed: 4095729664
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.0
  Built: 1652313655
  BuiltTime: Thu May 12 00:00:55 2022
  GitCommit: 6c6d79e5cd2b9dd69b78913a88c062126ff5e11c
  GoVersion: go1.18.2
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.0

Package info:

apk list podman
podman-4.1.0-r1 x86_64 {podman} (Apache-2.0) [installed]

Additional environment details:

uname -a

Linux 10bald424b38 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 Linux
cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.16.0_alpha20220328
PRETTY_NAME="Alpine Linux edge"
...




1 Relates to Podman issue #14236

2 The host's root cgroup MUST have all 6 cgroup v2 controllers enabled

3 The child cgroup will inherit hugetlb io memory rdma from the host

4 "…enabling [a controller] creates the controller's interface files in the child cgroups…" — The Linux Kernel Control Group v2

5 "Operations which fail due to invalid topology use EOPNTSUPP as the errno…" — Threads

6 Considering systemd as a dependency is off the table

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 24, 2022

Hey @giuseppe, @n1hility, @mheon, @rhatdan 👋  #nudge

If there's anything you fellahs can fill me in on (insights, corrections, advice?), please holler.

If there are any questions I need to answer, shoot.

I look forward to being able to close whatever gaps there might be in my understanding of the issue I'm observing.

TIA.

@giuseppe
Member

runc has an additional check so that it does not enable cgroup v2 controllers that do not support the threaded cgroup type (that is, the memory controller).
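
A rough shell illustration of the kind of guard being described (a sketch only, not runc's actual code; the threaded-capable controller set shown is the one visible in the OP's cgroup.subtree_control):

CG=/sys/fs/cgroup
case "$(cat $CG/cgroup.type)" in
  *threaded*) controllers="cpu cpuset pids" ;;                 # controllers that support threaded mode
  *)          controllers="$(cat $CG/cgroup.controllers)" ;;   # plain domain: anything goes
esac
for c in $controllers; do
  echo "+$c" > "$CG/cgroup.subtree_control" 2>/dev/null || echo "skipped $c"
done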

Not sure what the entrypoint in your image is doing and why you need it, but if I do something like:

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm
# mkdir /sys/fs/cgroup/init
# echo 1 > /sys/fs/cgroup/init/cgroup.procs
# podman run -it skepticoital/mem_limit:hmmm
Allocated 1049 to 1050 MB
Done!

giuseppe added a commit to giuseppe/crun that referenced this issue May 26, 2022
if moving a process fails with EOPNOTSUPP, then change the target
cgroup type to threaded and attempt the migration again.

Closes: containers#923

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Member

giuseppe commented May 26, 2022

opened a PR:

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 26, 2022

Thanks for looking into this @giuseppe 👍

"...runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type (that is the memory controller)..."

I'm sorry. It's not clear to me right now how that is applicable to my spike. If you have handy a URL to a resource you could share that explains that feature, I'd appreciate that. TIA.

"…Not sure what the entrypoint in your image is doing…"

It's a kind of very simple init. It's doing two things:

  1. Execute a script that does the same thing my Step 6 here does: "Switch the stock Alpine 3.15's default cgroup filesystem from its original v1 support to v2"
  2. Drop into /bin/sh in the container

"…why you need it…"

I need it for the setup step for my test. I need my test to…

  • "Evaluate the behavior parity of the runc and crun container runtimes for cgroup v2 support."1

The way my test does that evaluation is to execute a simple C program that mocks memory load.2

For my test, a PASS would be if the out-of-memory killer kills the mem_limit process before mem_limit allocated more than the amount of memory specified by the -m values of the outermost docker run command; 1024m in my original example above.

...Allocated 1049 to 1050 MB...

As described in the "Workaround" section of my OP above, the expected outcome is that the process is killed by cgroup v2 resource control before it can ever reach Done!

In other words, the -m 1024m limit set on your outermost podman run (and on my original docker run) is expected to apply to the nested podman run that attempts to allocate more than the prescribed 1024m.

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Done!

The outcome you're reporting there would be a FAIL in the test scenario described in my OP, given that 1050 MB is greater than -m 1024m.
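
Put as a quick check, that PASS criterion looks roughly like this (a sketch; the image name and the dmesg pattern are taken from elsewhere in this thread):

limit=$(cat /sys/fs/cgroup/memory.max)      # 1073741824 for -m 1024m
podman run --rm skepticoital/mem_limit:hmmm
if dmesg | grep -qi 'memory cgroup out of memory'; then
  echo "PASS: the memory controller killed mem_limit at the ${limit}-byte limit"
else
  echo "FAIL: mem_limit ran past the ${limit}-byte limit without being killed"
fi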

So my questions at this point are:

  1. How unreasonable are the above expectations?
  2. Is the use case that my reproducer attempts to model, atypical in your opinion?

TIA for your answers @giuseppe.





 1 Apologies for being redundant and quoting myself

 2 The skepticoital:mem_limit container allocates memory up to a hard coded total of 1050 MB

@giuseppe
Copy link
Member

how is the memory allocated by the C program?

@skepticoitusInteruptus
Author

"...how is the memory allocated by the C program?..."

To see the complete, original implementation, do a Ctrl+F for mem-limit.c on this page.1

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Also, I would expect that with your --entrypoint /bin/sh there, the Alpine container that's instantiated would be running with Alpine's default cgroup v1. My spike is about evaluating v2.




 1 The skepticoital:mem_limit container allocates memory up to a hard coded total of 1050 MB instead of 50

@n1hility
Member

n1hility commented May 26, 2022

# podman run --entrypoint /bin/sh --rm -it -m 1024m --memory 1024m --privileged skepticoital/system.me:hmmm

Also, I would expect that with your --entrypoint /bin/sh there, the Alpine container that's instantiated would be running with Alpine's default cgroup v1. My spike is about evaluating v2.

The kernel and the mount on the host are what determine cgroup v1 vs. v2, not the container. The container bootstrap just creates a cgroup namespace of whatever the host has.
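
A quick way to confirm which hierarchy is actually mounted, on the host or inside a container:

stat -fc %T /sys/fs/cgroup      # cgroup2fs means the unified (v2) hierarchy; tmpfs means the legacy v1 layout
grep cgroup /proc/self/mounts   # or inspect the mount table directly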

@skepticoitusInteruptus
Author

"...The kernel and mount on the host is what determines cgroupv1 vs v2 not the container..."

TIL 🎓

U Rawk, @n1hility 🎸

@skepticoitusInteruptus
Author

Say @flouthoc, @n1hility? 👋

To give @giuseppe a break from my nagging questions, I'd be cool with either of you fellahs fielding, on his behalf, these questions that are still outstanding…

  1. How unreasonable are the above expectations?
  2. Is the use case that my reproducer attempts to model, atypical in your opinion?

TIA.

@flouthoc
Collaborator

Hi @skepticoitusInteruptus

How unreasonable are the above expectations?

After reading the context in the issue above, I think yes: in any case, the memory usage of the nested container should be capped by the parent container. I doubt that the cgroup will be mounted correctly inside the nested container with the example you have shared, but even in the worst case I think the max memory will always be capped by what is provided by the parent container.

A small example should verify this

sudo podman run --memory 500m --memory-swap 500m --rm -it --privileged quay.io/containers/podman:latest bash
# Inside the container
[root@7e0a58f2e066 /]# podman run --rm -it progrium/stress --vm 1 --vm-bytes 600M --timeout 1s
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 629145600 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: FAIL: [1] (416) <-- worker 2 got signal 9
stress: WARN: [1] (418) now reaping child worker processes
stress: FAIL: [1] (422) kill error: No such process
stress: FAIL: [1] (452) failed run completed in 1s

But since I am not entirely sure about the nested cgroup v2 behavior here I'll wait for others to confirm.

@giuseppe
Member

For my test, a PASS would be if the out-of-memory killer kills the mem_limit process before mem_limit allocated more than the amount of memory specified by the -m values of the outermost docker run command; 1024m in my original example above.

sorry my mistake. I forgot to specify the --memory-swap option.

If I specify that, as in the original report, then I get your expected result:

$ podman run -it skepticoital/mem_limit:hmmm || echo failed
...
Allocated 986 to 987 MB
Allocated 987 to 988 MB
Allocated 988 to 989 MB
Allocated 989 to 990 MB
failed

@skepticoitusInteruptus
Author

Hey @flouthoc 👋

Ahhh! So this comment must be what you were referring to in our discussion? I didn't see this until two minutes ago. My apologies for my confusion 😕

although I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared

Even given this (from my OP)?


2. Inside the container, observe the container's enabled controllers:

cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu pids
  3. Observe the container's cgroup.type:
cat /sys/fs/cgroup/cgroup.type
domain threaded
  4. Observe there are controller interface files in the container's /sys/fs/cgroup dir:4
ls /sys/fs/cgroup/
...cpu.max...hugetlb.1GB.max...io.stat...memory.max...pids.max...rdma.max
  5. Observe that the value in bytes of the container's /sys/fs/cgroup/memory.max file equals the value in megabytes of the host's docker run -m switch:
cat /sys/fs/cgroup/memory.max
1073741824

That's just to note the evidence that convinced me, at least, that the nested container in my OP is correctly configured for cgroup v2.

"A small example should verify this"

sudo podman run --memory 500m --memory-swap 500m --rm -it --privileged quay.io/containers/podman:latest bash
…

Awesome! I will try that myself at some point. For now though I'll note, for the sake of completeness, that my original command above ran docker run… instead.

As for my second question:

"Is the use case that my reproducer attempts to model, atypical in your opinion?"

I guess I'll just have to be happy with my own speculative answer: No, it's not atypical.

@skepticoitusInteruptus
Author

Hey 👋

If I specify that, as in the original report, then I get your expected result:

$ podman run -it skepticoital/mem_limit:hmmm || echo failed
...
Allocated 986 to 987 MB
Allocated 987 to 988 MB
Allocated 988 to 989 MB
Allocated 989 to 990 MB
failed

That's awesome, @giuseppe! I will try that (podman run -m 1024…) myself at some point.

In the meantime though I'll note, for the sake of completeness, that my original reproducer above ran docker run -m 1024… instead.

Might that difference be enough to result in me getting the error I reported above and you not getting that same error with podman run -m 1024… as the outer container?

Error: OCI runtime error: crun: writing file `/sys/fs/cgroup/cgroup.subtree_control`: Not supported

@n1hility
Member

Hey @flouthoc 👋

Ahhh! So this comment must be what you were referring to in our discussion? I didn't see this until two minutes ago. My apologies for my confusion 😕

although I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared

Even given this (from my OP)?

Per earlier discussion, there is no need to unmount and remount /sys/fs/cgroup: it's a cgroup namespace, so it's already mounted for you. BTW The reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'.
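
A minimal sketch of the flip being described, run inside the container while the shell is still sitting in the namespace-root cgroup:

cat /sys/fs/cgroup/cgroup.type                       # domain
echo +cpu > /sys/fs/cgroup/cgroup.subtree_control    # threaded controller enabled on a populated cgroup
cat /sys/fs/cgroup/cgroup.type                       # domain threaded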

"Is the use case that my reproducer attempts to model, atypical in your opinion?"

I guess I'll just have to be happy with my own speculative answer: No, it's not atypical.

Hard to answer this one. I don't follow what your use-case is. I understand that you are nesting containers and testing memory limiting when nested, but you haven't mentioned the use-case behind it.

@skepticoitusInteruptus
Author

Hey 👋

"…BTW The reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'…"

TIL even more 🎓

"…I don't follow what your use-case is…"

OK if I refer you to item number 7 on my menu of reasons to keep schtum?

I'm contractually obligated to limit what I reveal in public forums about my org's use cases, to only what is strictly sufficient to resolve an issue.

I sincerely want to reciprocate and be as helpful to you @n1hility as you have been to me, though.

That's why I feel bad that "cgroup v2-controlled nested containers" isn't sufficient for you.

So hopefully sharing this link with you will suffice. That issue lists docker and podman commands that are more or less similar to ours.

I just happened to stumble across that issue without even looking for it.

I imagine if I were to proactively search for them, I might find more convincing evidence that similar "cgroup v2-controlled nested containers" use cases are not all that novel after all.

"…you haven't mentioned the use-case behind it…"

You'll just have to take my word for it, @n1hility. The abstractions I shared in my reproducers are pretty decent representative models of the problems we need to solve; in my opinion they are, anyway.

@n1hility
Member

n1hility commented May 28, 2022

Hey 👋

"…BTW The reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'…"

TIL even more 🎓

"…I don't follow what your use-case is…"

OK if I refer you to item number 7 on my menu of reasons to keep schtum?

I'm contractually obligated to limit what I reveal in public forums about my org's use cases, to only what is strictly sufficient to resolve an issue.

Ah sure, been there. In that case I can give a general answer. While there are certainly legitimate cases to nest container engines, life is simpler if you can avoid it and stick to a single flat engine namespace on your host. An example where it can make sense is a CI infrastructure where your host OS image is locked down, but you want a different container engine to orchestrate your tests. One thing to keep in mind is that nested cgroup tree limits can overcommit, so setting a limit on the parent can lead to surprising results (containers can get killed while within their individual limits because the total exceeds the parent limit).
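
A hypothetical illustration of that overcommit point, reusing the images from this thread (the numbers are made up for the example):

# Outer container started with -m 1024m; the nested limits add up to more than that.
podman run -d -m 768m --name a skepticoital/mem_limit:hmmm
podman run -d -m 768m --name b skepticoital/mem_limit:hmmm
# 768m + 768m > 1024m: either container can be OOM-killed even though
# neither ever exceeds its own 768m cap.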

@skepticoitusInteruptus
Author

Thanks @giuseppe, @n1hility and @flouthoc (cc: @rhatdan)

I've now reread all of your responses with a lot more attention to detail than I had time to do yesterday afternoon when I first read them.

Other follow-on questions arose after reading your responses. I will spare you and not bombard you with all of those questions today though.

Instead, I will just ask you all one single follow-on question. Before I ask the question though:

Context

Given that I...

  • Refactor the skepticoital/system.me container to not do anything whatsoever regarding configuring|initializing|unmounting|mounting cgroups
  • Run the refactored skepticoital/system.me container with the following very specific command:
docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm
  • Inside the Docker-instantiated container constructed from skepticoital/system.me, run the following very specific command:
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

@n1hility
Member

n1hility commented May 28, 2022

Thanks @giuseppe, @n1hility and @flouthoc (cc: @rhatdan)

You're welcome!

docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm


  • Inside the Docker-instantiated container constructed from skepticoital/system.me, run the following very specific command:

/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1

Change this to add the following before your podman command

mkdir /sys/fs/cgroup/init
echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

The first two commands move your process from the namespace root to a leaf node, which prevents the internal process constraint from flipping your cgroup to domain threaded. Domain is really what you want here, and it has the added benefit of not requiring a release with #931 (which allows crun to operate when domain threaded is in use)

Alternatively instead of relocating your process you can disable cgroups usage by podman for the nested podman since the parent is enforcing the 2048 in this policy configuration. It's the creation of the cgroup and enabling subtree controllers that triggers the flip (when the process is in the root namespace)

podman run --cgroups disabled --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

In this mode you will observe that allocating over 2048 will kill the container, since its part of the docker group that has the 2048 limit.

@skepticoitusInteruptus
Author

Change this to add the following before your podman command

echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

That's fantabulous @n1hility 👍

I will give that a shot at some point later.

🎓 In the meantime, please can I get you to edu-muh-cate me on this:

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

@skepticoitusInteruptus
Author

Ahhh. So sorry. I need to clean my glasses.

Just saw this...

"…In this mode you will observe that allocating over 2048 will kill the container, since its part of the docker group that has the 2048 limit…"

@n1hility
Member

Change this to add the following before your podman command

echo $$ > /sys/fs/cgroup/init/cgroup.procs
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s

That's fantabulous @n1hility 👍

I will give that a shot at some point later.

🎓 In the meantime, please can I get you to edu-muh-cate me on this:

One, single, follow-on question

  1. What outcome should I expect if I execute the above very specific commands?

TIA.

Without either of the options I mentioned (note: I posted an edit adding an alternative I forgot to mention, disabling cgroups with podman), you will get a failure, because without #931 crun will fail attempting to create a domain child under a domain threaded root (cgroups disallows this). After #931 it should work, but it will be less ideal since you will be using domain threaded when you don't really need it; it adds additional restrictions / semantics.

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 28, 2022

Howdy do, @n1hility 👋

"…In this mode you will observe that allocating over 2048 1024 will kill the container, since its part of the docker group that has the 2048 1024 limit…"

I fixed (what I presume is) a typo for you there.

Given, verbatim, all of the refactors,1 preconditions, setup and very specific commands I listed in that Context section above, this is the actual outcome I observe…

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmmm
…
/system.me # mkdir /sys/fs/cgroup/init
/system.me # cat /sys/fs/cgroup/init/cgroup.procs
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs
…
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s
…
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 2147483648 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: dbug: [1] <-- worker 2 signalled normally
stress: info: [1] successful run completed in 1s
/system.me # dmesg | grep -i killed
/system.me # 

TL;DR: Given that the nested Podman's --vm-bytes 2048M is greater than the outer Docker's -m 1024m I expected to observe something like @flouthoc's kill error

…
stress: FAIL: [1] (416) <-- worker 2 got signal 9
stress: WARN: [1] (418) now reaping child worker processes
stress: FAIL: [1] (422) kill error: No such process
stress: FAIL: [1] (452) failed run completed in 1s

And/or something like…

…
dmesg | grep -i killed
…
Memory cgroup out of memory: Killed process 42 (mem_limit) total-vm:962004kB...

Or are my expectations mistaken?

crun issue 923 screencast #0




 1 The tag for the refactored reproducer image has four ms: skepticoital/system.me:hmmmm

@n1hility
Member

Howdy do, @n1hility 👋

"…In this mode you will observe that allocating over 2048 1024 will kill the container, since its part of the docker group that has the 2048 1024 limit…"

I fixed (what I presume is) a typo for you there.

Yes sorry, I should have said 1024

Given, verbatim, all of the refactors,1 preconditions, setup and very specific commands I listed in that Context section above, this is the actual outcome I observe…

my@host $ docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmmm
…
/system.me # mkdir /sys/fs/cgroup/init
/system.me # cat /sys/fs/cgroup/init/cgroup.procs
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs
…
/system.me # podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 1s
…
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 2147483648 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: dbug: [1] <-- worker 2 signalled normally
stress: info: [1] successful run completed in 1s
/system.me # dmesg | grep -i killed
/system.me # 

TL;DR: Given that the nested Podman's --vm-bytes 2048M is greater than the outer Docker's -m 1024m I expected to observe something like @flouthoc's kill error

It's probably not running long enough for it to get killed. Try bumping the timeout to 10s or something like that. If a process temporarily allocates over the limit but releases quickly, it may survive. That other mem_limit container you had earlier in the thread, which allocates and holds, should be more reliable at demonstrating a limit kill.
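
In other words, something like:

podman run --rm -it progrium/stress --vm 1 --vm-bytes 2048M --timeout 10s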

@skepticoitusInteruptus
Author

Hey @giuseppe 👋

"…runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type…"

If you have handy a URL to a resource you could share that explains that feature

One of my favorite old Greek sayings is, "The gods help those who help themselves" 🇬🇷

Please correct me if I've guessed wrong that one of these is what you were referring to:

@n1hility
Member

@skepticoitusInteruptus did bumping the timeout and trying mem_limit work for you?

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented May 31, 2022

"…did bumping the timeout and trying mem_limit work for you?…"

TL;DR

If I don't do anything whatsoever to initialize|configure|umount|mount cgroups in the skepticoital/system.me image, the -m 1024m and --memory-swap 1024m limits are totally ignored by any process running inside the outer Docker container.1


I have a hunch2 about what might be preventing those limits from being applied. But, I want to be careful not to (mis)lead y'all by the power of suggestion.

So if you and @giuseppe or @flouthoc were to independently arrive at that same hunch from each of your own investigations, that would be super helpful. Not just to me, personally; to the entire community!

Personally though, you all's help would certainly increase my confidence about what the next test cases of my investigation might be.

Instead of overworking y'all with reading, I'll share this recording; if it's any help to you fellahs at all:

crun issue 923 demo #1

Speaking of help, I gotta say: I hope Red Hat appreciate how unique your helpfulness in the containers issue trackers is @n1hility and are paying you your much deserved big bugs bucks 🥇

I know I can't express my ❤️-felt appreciation of your help, often enough. On this and 14236.

Muchas Thankyas es millionas 💯




 1 On an Alpine host in WSL2 on Windows 10; configured like here

 2 Based on the output of cat /proc/self/mounts I highlight in the recording

@n1hility
Member

n1hility commented May 31, 2022

@skepticoitusInteruptus Ah ha! I see the problem (thanks for the animated walkthrough, that was helpful - and for the kind words). Here is what is happening. When you run docker and it complains about the swap limit, what's happening is that, much like the podman issue in containers/podman#14236, it can't detect whether or not swap limiting should be employed, and falls back to not adjusting it, leaving it at max. Then when you run anything that exceeds the limit, it will just use swap instead of killing the process. (Note that once a podman release that includes containers/podman#14308 is available, podman will correctly detect swap in spite of being in the root cgroup.) To work around the docker scenario, you need a workaround similar to the one discussed in containers/podman#14236: create an initial cgroup of some kind on the host and run the docker command there.

From that point on, once in the container, you should see that both /sys/fs/cgroup/memory.max and memory.swap.max reflect the values you are passing to docker.
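
A sketch of that host-side workaround on the WSL2 Alpine host (assuming /sys/fs/cgroup is already mounted as cgroup2; the init cgroup name is arbitrary, and a fuller sequence appears in a later comment below):

mkdir /sys/fs/cgroup/init
echo +memory > /sys/fs/cgroup/cgroup.subtree_control    # delegate the memory controller to children
echo $$ > /sys/fs/cgroup/init/cgroup.procs              # park the host shell in a leaf cgroup
dockerd > /dev/null 2>&1 &
docker run -it -m 1024m --memory-swap 1024m --privileged skepticoital/system.me:hmmm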

@skepticoitusInteruptus
Author

Hey @n1hility 👋

"…create an initial cgroup of some kind on the host and run the docker command there…"

Correct'O'mondo!

That's a step I would have had to do in order to observe the original outcome I reported in my OP above.

It was my oversight not listing it as a "Step to reproduce". So much for my "very specific commands". Right? 😅

Also, so much for my hunch.1 That cgroup /sys/fs/cgroup cgroup2 … mount looks sketchy to me. I intend to take that one up with the WSL2 team at some point.

…If I don't do anything whatsoever to initialize|configure|umount|mount cgroups in the skepticoital/system.me image…

I think there is something worth sharing about that…

  • If I do initialize|configure|umount|mount cgroups in the skepticoital/system.me image (when I remember to run docker in a non-root cgroup before-hand), then when running in the outer Docker container inited by system.me, I've observed that I do get the expected Killed outcome; even though the nested Podman is running in a domain threaded subcgroup.2

"…When you run docker and it complains about the swap limit … it can't detect whether or not swap limiting should be employed, and falls-back to not adjusting it, leaving it at max…"

That "WARNING:…" from Docker about swap is the next little duckie in my sights 🦆 🦆 🦆 🦆 🔫

"…podman will correctly detect swap in spite of being in the root cgroup…"

I'll spare you Podman bros my questions about how reasonable it would be to expect Docker to do that too.

Brace yerselves @kolyshkin, @AkihiroSuda, @thaJeztah, … and the rest of you Docker bros and sisters 😁





 1 Based on the output of cat /proc/self/mounts I highlight in the recording

 2 Only works if Podman's runtime is runc though

@n1hility
Member

n1hility commented May 31, 2022

If I do initialize|configure|umount|mount cgroups in the skepticoital/system.me image (when I remember to run docker in a non-root cgroup before-hand), then when running in the outer Docker container inited by system.me, I've observed that I do get the expected Killed outcome; even though the nested Podman is running in a domain threaded subcgroup.2

@skepticoitusInteruptus I just checked this and want to check we are seeing the same thing. With all of the above podman on crun does work with limits. Does this work for you?:

PS C:\Users\jason> wsl --shutdown
PS C:\Users\jason> wsl -d Alpine
WIN10PC:/mnt/c/Users/jason# umount /sys/fs/cgroup/unified/
WIN10PC:/mnt/c/Users/jason# umount /sys/fs/cgroup
WIN10PC:/mnt/c/Users/jason# mount -t cgroup2 cgroup /sys/fs/cgroup
WIN10PC:/mnt/c/Users/jason# mkdir /sys/fs/cgroup/init
WIN10PC:/mnt/c/Users/jason# echo +memory > /sys/fs/cgroup/cgroup.subtree_control
WIN10PC:/mnt/c/Users/jason# echo $$ > /sys/fs/cgroup/init/cgroup.procs
WIN10PC:/mnt/c/Users/jason# dockerd > /dev/null 2>&1 &
WIN10PC:/mnt/c/Users/jason# docker run -it -m 100M --memory-swap 100M --privileged --entrypoint /bin/sh skepticoital/system.me:hmmm

No warning, since we created the cgroup, and the values are now what we expect (note that the cgroup's swap max = the container's memory-swap minus memory):

/system.me # cat /sys/fs/cgroup/memory.max
104857600
/system.me # cat /sys/fs/cgroup/memory.swap.max
0

Set up our cgroup to ensure that we don't get converted to domain threaded:

/system.me # mkdir /sys/fs/cgroup/init
/system.me # echo $$ > /sys/fs/cgroup/init/cgroup.procs

Now run nested podman using your mem limit container:

 /system.me # podman run skepticoital/mem_limit:hmmm
-snipped-
Allocated 44 to 45 MB
Allocated 45 to 46 MB
Allocated 46 to 47 MB
Allocated /system.me # echo $?
137
 dmesg | grep kill
[  178.622193] podman invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[  178.622248]  oom_kill_process.cold+0xb/0x10
[  178.622371] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=init,mems_allowed=0,oom_memcg=/docker/1fd0c04db8fe22ff55da0a157e626582f6302a1cf5a1e8579c721da54cef0822,task_memcg=/docker/1fd0c04db8fe22ff55da0a157e626582f6302a1cf5a1e8579c721da54cef0822/libpod_parent/libpod-eaa7d0e0ecb25fa0da4d48fe6bb840e1bec91065dd40d679ac36ec0e3d621e48,task=mem_limit,pid=490,uid=0

Rerun the nested mem_limit with a shell and double-check the cgroup type:

/system.me # podman run -it --entrypoint /bin/sh skepticoital/mem_limit:hmmm
/ # cat /sys/fs/cgroup/cgroup.type
domain
/ # /mem_limit
Starting ...
Allocated 0 to 1 MB
-snipped-
Allocated 73 to 74 MB
Allocated 74 to 75 MB
Killed
/ # cat /sys/fs/cgroup/cgroup.type
domain
/ # exit

Double-check we used crun:

/system.me # podman ps -a
CONTAINER ID  IMAGE                                  COMMAND     CREATED             STATUS                       PORTS       NAMES
70581260adac  docker.io/skepticoital/mem_limit:hmmm              About a minute ago  Exited (127) 33 seconds ago              nostalgic_darwin

/system.me # podman inspect nostalgic_darwin | grep OCIRuntime
          "OCIRuntime": "crun",

@skepticoitusInteruptus
Author

"…Does this work for you?:…"

Nah. But I have no idea why it doesn't 😕 If anything in this first recording jumps out at you, please holler…

crun issue 923 demo #2

What does work for me are the steps I listed and reported in my OP.1

This second recording demonstrates those steps and the expected outcome…2

crun issue 923 demo #3

And last, but not least, after deleting the cgroup dir I created in the previous recording, I create it again afresh.3

Then I follow all your steps you just listed…4

crun issue 923 demo #4





 1 With the Workaround of replacing crun with runc

 2 Using the 1st DockerHub skepticoital/system.me:hmmm image; I do umount|mount cgroups

 3 I don't umount|mount cgroups in this local skepticoital/system.me:hmmmm image

 4 Worked without replacing crun with runc

@n1hility
Member

n1hility commented Jun 1, 2022

"…Does this work for you?:…"

Nah. But I have no idea why it doesn't 😕 If anything in this first recording jumps out at you, please holler…

Ah, looks like an early step, echo +memory > /sys/fs/cgroup/cgroup.subtree_control, somehow got transposed to echo +memory > /sys/fs/cgroup/init/cgroup.subtree_control.

What does work for me are the steps I listed and reported in my OP.1

This second recording demonstrates those steps and the expected outcome…2

Cool. So once #931 lands in a release, threaded will work on crun. BTW, to add more color to the limitations: once you have a threaded cgroup you cannot create cgroups below it that reference non-threaded controllers like the memory controller, so anything that might create another container-like construct, or some manual cgroup usage in a container, might not behave as expected or might error.
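
Illustratively, from inside a cgroup whose cgroup.type is threaded or domain threaded (the expected rejection follows the invalid-topology rule quoted in the OP's footnotes):

echo +pids   > /sys/fs/cgroup/cgroup.subtree_control   # fine: pids supports the threaded mode
echo +memory > /sys/fs/cgroup/cgroup.subtree_control   # rejected: Operation not supported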

And last, but not least, after deleting the cgroup dir I created in the previous recording, I create it again afresh.3

Then I follow all your steps you just listed…4

Excellent 🎉

@skepticoitusInteruptus
Author

"…Ah looks like an early step echo +memory > /sys/fs/cgroup/cgroup.subtree_control somehow got transposed to echo +memory > /sys/fs/cgroup/init/cgroup.subtree_control"

Oops! "I see!", said the blind man 👓

…
WIN10PC:/mnt/c/Users/jason# mount -t cgroup2 cgroup /sys/fs/cgroup
…

Q: What is the intent of specifying cgroup there instead of cgroup2?

I'm sure there must be an advantage to doing it that way instead of the way I do it (mount -t cgroup2 cgroup2 …).

It's surprising to me that they're not both cgroup2. What's the effective difference?

TIA.

@skepticoitusInteruptus
Author

skepticoitusInteruptus commented Jun 1, 2022

Q: What is the intent of specifying cgroup there instead of cgroup2?

Eventually got around to reading the man pages for mount(8)

"…The proc filesystem is not associated with a special device, and when mounting it, an arbitrary keyword - for example, proc - can be used instead of a device specification…"

So I suppose that, presuming the cgroup2 file system type is a so-called pseudo file system type (like proc is), I'm gonna go out on a limb and guess…

A: Same difference.

Six of one, half a dozen of the other type deal ❔

@n1hility
Member

n1hility commented Jun 1, 2022

Q: What is the intent of specifying cgroup there instead of cgroup2?

Eventually got around to reading the man pages for mount(8)

"…The proc filesystem is not associated with a special device, and when mounting it, an arbitrary keyword - for example, proc - can be used instead of a device specification…"

So I suppose that presuming the cgroup2 file system type is a so-called pseudo file system type (like proc is), I'm gonna go out and a limb and guess …

A: Same difference.

Six of one, half a dozen of the other type deal ❔

Yes, that's right. The device spec can be named anything, for the same reason. The FS mount location (/sys/fs/cgroup) is the contract/API point that everything looks for.
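
For example, all of these mount the same unified hierarchy; only the device label differs:

mount -t cgroup2 cgroup  /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
mount -t cgroup2 none    /sys/fs/cgroup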
