Correct crun's behavior when it runs inside a cgroup v2 container #923
Hey @giuseppe, @n1hility, @mheon, @rhatdan 👋 #nudge If there's anything you fellahs can fill me in on (insights, corrections, advice?), please holler. If there are any questions I need to answer, shoot. I look forward to being able to close whatever gaps there might be in my understanding of the issue I'm observing. TIA.
runc has an additional check to not enable cgroup v2 controllers that do not support the threaded cgroup type (that is, the memory controller). Not sure what the entrypoint in your image is doing and why you need it, but if I do something like:
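For readers following along, here is a minimal, read-only sketch of my own (not giuseppe's commands) of why a runtime has to skip the memory controller once the target cgroup is threaded; it assumes a cgroup v2 mount at /sys/fs/cgroup:

```bash
# What kind of cgroup is the container's root? "domain" is the normal case;
# "domain threaded" / "threaded" is the topology this issue runs into.
cat /sys/fs/cgroup/cgroup.type

# Which controllers could be delegated from here at all.
cat /sys/fs/cgroup/cgroup.controllers

# Only cpu, cpuset, perf_event and pids support the threaded mode; memory,
# io, etc. are domain-only. That is why a runtime must skip enabling them
# once the target cgroup is threaded - otherwise the write to
# cgroup.subtree_control is refused (per the kernel docs, with EOPNOTSUPP).
```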
if moving a process fails with EOPNOTSUPP, then change the target cgroup type to threaded and attempt the migration again. Closes: containers#923 Signed-off-by: Giuseppe Scrivano <[email protected]>
opened a PR:
Thanks for looking into this @giuseppe 👍
I'm sorry. It's not clear to me right now how that is applicable to my spike. If you have a URL handy for a resource that explains that feature, I'd appreciate you sharing it. TIA.
It's a kind of very simple…
I need it for the setup step for my test. I need my test to…
The way my test does that evaluation is to execute a simple C program that mocks memory load.2 For my test, a…
As described in the "Workaround" section of my OP above, the expected outcome is that the process is killed and disallowed by cgroup v2 resource control from ever reaching … In other words, the expectation is that the…
The outcome you're reporting there would be a … So my questions at this point are:
TIA for your answers @giuseppe. — 1 Apologies for being redundant and quoting myself 2 The …
how is the memory allocated by the C program?
To see the complete, original implementation, do a…
Also, I would expect that with your … — 1 The …
The kernel and the mount on the host are what determine cgroup v1 vs v2, not the container. The container bootstrap just creates a cgroup namespace of whatever the host has.
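As a quick aside for anyone reproducing this, two read-only checks (my own sketch, not commands from this thread) confirm what the host kernel is providing and what a container's cgroup namespace sees:

```bash
# "cgroup2fs" means the unified cgroup v2 hierarchy is mounted here;
# "tmpfs" would indicate the legacy v1 layout of per-controller mounts.
stat -fc %T /sys/fs/cgroup

# On a pure cgroup v2 setup this prints a single "0::/<path>" line, both
# on the host and (namespaced) inside a container.
cat /proc/self/cgroup
```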
TIL 🎓 U Rawk, @n1hility 🎸
To give @giuseppe a break from my nagging questions, I'd be cool with either of you fellahs fielding, on his behalf, these questions that are still outstanding…
TIA.
After reading the context in the issue above, I think yes: in any case the memory usage of the nested container should be capped by the parent container. I doubt that cgroup will be mounted correctly inside the nested container with the example you have shared, but even in the worst case I think the max memory will always be capped by what is provided by the parent container. A small example should verify this:

sudo podman run --memory 500m --memory-swap 500m --rm -it --privileged quay.io/containers/podman:latest bash

# Inside the container
[root@7e0a58f2e066 /]# podman run --rm -it progrium/stress --vm 1 --vm-bytes 600M --timeout 1s
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting timeout to 1s
stress: dbug: [1] --> hogvm worker 1 [2] forked
stress: dbug: [2] allocating 629145600 bytes ...
stress: dbug: [2] touching bytes in strides of 4096 bytes ...
stress: FAIL: [1] (416) <-- worker 2 got signal 9
stress: WARN: [1] (418) now reaping child worker processes
stress: FAIL: [1] (422) kill error: No such process
stress: FAIL: [1] (452) failed run completed in 1s

But since I am not entirely sure about the nested cgroup v2 behavior here, I'll wait for others to confirm.
Sorry, my mistake. I forgot to specify the … If I specify that, as in the original report, then I get your expected result:
Hey @flouthoc 👋 Ahhh! So this comment must be what you were referring to in our discussion? I didn't see this until two minutes ago. My apologies for my confusion 😕
Even given this (from my OP)?
That's just to note the evidence that convinced me, at least, that the nested container in my OP is correctly configured for cgroup v2.
Awesome! I will try that myself at some point. For now though, I'll note, for the sake of completeness, that my original command above ran … As for my second question:
I guess I'll just have to be happy with my own speculative answer: No, it's not atypical.
Hey 👋
That's awesome, @giuseppe! I will try that (…). In the meantime though, I'll note, for the sake of completeness, that my original reproducer above ran … Might that difference be enough to result in me getting the error I reported above and you not getting that same error with …?
Per earlier discussion, there is no need to unmount and remount /sys/fs/cgroup: it's a cgroup namespace, so it's already mounted for you. BTW, the reason you end up with a threaded domain is because your script enables the cpu controller without moving the cgroup your init process is in, triggering the 'no internal process constraint'.
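To make that constraint concrete, here is a small sketch of my own (illustrative cgroup name, run as root on a cgroup v2 host, using the memory controller as the example domain controller):

```bash
# Create a scratch cgroup and put the current shell into it.
mkdir /sys/fs/cgroup/scratch
echo $$ > /sys/fs/cgroup/scratch/cgroup.procs

# Try to hand a domain controller to its (future) children. This write is
# expected to be rejected: a non-root cgroup that still contains processes
# may not distribute domain resources to its children.
# (Assumes memory is listed in /sys/fs/cgroup/scratch/cgroup.controllers.)
echo +memory > /sys/fs/cgroup/scratch/cgroup.subtree_control

# Moving the process into a leaf child first lifts the restriction.
mkdir /sys/fs/cgroup/scratch/leaf
echo $$ > /sys/fs/cgroup/scratch/leaf/cgroup.procs
echo +memory > /sys/fs/cgroup/scratch/cgroup.subtree_control
```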
Hard to answer this one. I don't follow what your use-case is. I understand that you are nesting containers and testing memory limiting when nested, but you haven't mentioned the use-case behind it.
Hey 👋
TIL even more 🎓
OK if I refer you to item number 7 on my menu of reasons to keep schtum? I'm contractually obligated to limit what I reveal in public forums about my org's use cases to only what is strictly sufficient to resolve an issue. I sincerely want to reciprocate and be as helpful to you @n1hility as you have been to me, though. That's why I feel bad that "cgroup v2-controlled nested containers" isn't sufficient for you. So hopefully sharing this link with you will suffice. That issue lists … I just happened to stumble across that issue without even looking for it. I imagine if I were to proactively search for them, I might find more convincing evidence that similar "cgroup v2-controlled nested containers" use cases are not all that novel after all.
You'll just have to take my word for it, @n1hility. The abstractions I shared in my reproducers are pretty decent representative models of the problems we need to solve; in my opinion they are, anyway.
Ah sure, been there. In that case I can give a general answer. While there are certainly legitimate cases to nest container engines, life is simpler if you can avoid it and stick to a single flat engine namespace on your host. An example where it can make sense is a CI infrastructure where your host OS image is locked down, but you want a different container engine to orchestrate your tests. One thing to keep in mind is that a nested cgroup tree limits overcommit, so setting a limit on the parent can lead to surprising results (containers can get killed that are within their individual limit because the total exceeds the parent limit).
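A tiny illustration of that overcommit point (my own sketch with made-up names and limits, run as root on a cgroup v2 host): the children's limits add up to more than the parent's, so a child can be OOM-killed while staying under its own memory.max.

```bash
# Parent capped at 500M, with memory delegated to its children.
# (Assumes the memory controller is available at this level.)
mkdir -p /sys/fs/cgroup/parent
echo +memory > /sys/fs/cgroup/parent/cgroup.subtree_control
echo 500M > /sys/fs/cgroup/parent/memory.max

# Two children that are individually "within limits" but overcommit the parent.
mkdir /sys/fs/cgroup/parent/a /sys/fs/cgroup/parent/b
echo 400M > /sys/fs/cgroup/parent/a/memory.max
echo 400M > /sys/fs/cgroup/parent/b/memory.max
# If a and b together try to use ~800M, the parent's 500M cap wins and the
# OOM killer reclaims from one of them, even though neither child exceeded
# its own 400M limit.
```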
Thanks @giuseppe, @n1hility and @flouthoc (cc: @rhatdan). I've now reread all of your responses with a lot more attention to detail than I had time to do yesterday afternoon when I first read them. Other follow-on questions arose after reading your responses. I will spare you and not bombard you with all of those questions today though. Instead, I will just ask you all one single follow-on question. Before I ask the question though:

Context

Given that I…
One, single, follow-on question
TIA.
You're welcome!
Change this to add the following before your podman command:

The first two commands move your process from the namespace root to a leaf node, which prevents the internal process constraint from flipping your cgroup to domain threaded. Domain is really what you want here, and it has the added benefit of not requiring a release with #931 (which allows crun to operate when domain threaded is in use). Alternatively, instead of relocating your process, you can disable cgroups usage by podman for the nested podman, since the parent is enforcing the 2048 in this policy configuration. It's the creation of the cgroup and enabling subtree controllers that triggers the flip (when the process is in the root namespace):

In this mode you will observe that allocating over 2048 will kill the container, since it's part of the docker group that has the 2048 limit.
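For concreteness, here is a sketch of what those two options could look like; the cgroup name, the limit figure, and the stress image are placeholders drawn from this discussion, not the exact commands n1hility posted:

```bash
# Option 1: move the shell out of the namespace root into a leaf cgroup first,
# so creating the nested container's cgroup doesn't trip the
# "no internal process constraint".
mkdir /sys/fs/cgroup/init
echo $$ > /sys/fs/cgroup/init/cgroup.procs
podman run --rm -it progrium/stress --vm 1 --vm-bytes 2500M --timeout 10s

# Option 2: let the parent's limit do the enforcement and skip cgroup
# creation in the nested engine entirely.
podman run --cgroups=disabled --rm -it progrium/stress --vm 1 --vm-bytes 2500M --timeout 10s
```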
That's fantabulous @n1hility 👍 I will give that a shot at some point later. 🎓 In the meantime, please can I get you to edu-muh-cate me on this:
TIA.
Ahhh. So sorry. I need to clean my glasses. Just saw this...
Without either of the options I mentioned (note I posted an edit adding an alternative of disabling cgroups with podman that I forgot to mention), you will get a failure, because without #931 crun will fail attempting to create a domain child under a domain threaded root (cgroups disallows this). After #931 it should work, but it will be less ideal since you will be using domain threaded when you don't really need it - it adds additional restrictions / semantics.
Howdy do, @n1hility 👋
I fixed (what I presume is) a typo for you there. Given, verbatim, all of the refactors,1 preconditions, setup and very specific commands I listed in that Context section above, this is the actual outcome I observe…
TL;DR: Given that the nested Podman's …
And/or something like…
Or are my expectations mistaken? — 1 The tag for the refactored reproducer image has four …
Yes sorry, I should have said 1024
It's probably not running long enough for it to get killed. Try bumping timeout to 10s or something like that. If a process temporarily allocates over but releases quickly it may survive. That other mem_limit container you had earlier in the thread that allocates and holds should be more reliable at demonstrating a limit kill.
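For instance, re-running the stress invocation quoted earlier in the thread with nothing changed but the --timeout value:

```bash
# Same command as above, but held for 10 seconds so the allocation sits
# against the limit long enough to be reclaimed or killed.
podman run --rm -it progrium/stress --vm 1 --vm-bytes 600M --timeout 10s
```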
Hey @giuseppe 👋
One of my favorite old Greek sayings is, "The gods help those who help themselves" 🇬🇷 Please correct me if I've guessed wrong that one of these is what you were referring to:
@skepticoitusInteruptus did bumping the timeout and trying mem_limit work for you?
TL;DR If I don't do anything whatsoever to initialize|configure|umount|mount cgroups in the …

I have a hunch2 about what might be preventing those limits from being applied. But I want to be careful not to (mis)lead y'all by the power of suggestion. So if you and @giuseppe or @flouthoc were to independently arrive at that same hunch from each of your own investigations, that would be super helpful. Not just to me, personally; to the entire community! Personally though, you all's help would certainly increase my confidence about what the next test cases of my investigation might be. Instead of overworking y'all with reading, I'll share this recording, if it's any help to you fellahs at all. Speaking of help, I gotta say: I hope Red Hat appreciate how unique your helpfulness in the containers issue trackers is, @n1hility, and are paying you your much deserved big … I know I can't express my ❤️-felt appreciation of your help often enough. On this and 14236. Muchas Thankyas es millionas 💯 — 1 On an Alpine host in WSL2 on Windows 10; configured like here 2 Based on the output of …
@skepticoitusInteruptus Ah ha! I see the problem (thanks for the animated walkthrough, that was helpful - and the kind words). Here is what is happening. When you run docker and it complains about the swap limit, what's happening is that, much like the podman issue in containers/podman#14236, it can't detect whether or not swap limiting should be employed, and falls back to not adjusting it, leaving it at max. Then when you run anything that exceeds the limit, it will just use swap instead of killing the process. (Note that once a podman release is available that includes containers/podman#14308, podman will correctly detect swap in spite of being in the root cgroup.) To work around the docker scenario, you need a similar workaround as discussed in containers/podman#14236: create an initial cgroup of some kind on the host and run the docker command there. From that point on, once in the container, you should see that both /sys/fs/cgroup/memory.max and memory.swap.max reflect the values you are passing to docker.
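A sketch of that workaround as I understand it (the host-side cgroup name and the 1g/2g figures are mine, not from the thread), assuming the host shell starts out in the root cgroup:

```bash
# On the host: create a leaf cgroup and move the current shell into it,
# so the engine can detect and apply the swap limit correctly.
sudo mkdir /sys/fs/cgroup/work
echo $$ | sudo tee /sys/fs/cgroup/work/cgroup.procs

# Run the container with both limits set...
docker run --rm -it -m 1g --memory-swap 2g alpine sh

# ...then inside the container both knobs should reflect the requested values
# (memory.swap.max is the swap portion, i.e. memory-swap minus memory).
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.swap.max
```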
Hey @n1hility 👋
Correct'O'mondo! That's a step I would've had to do in order to have observed the original outcome I reported in my OP above. It was my oversight not listing it as a "Step to reproduce". So much for my "very specific commands". Right? 😅 Also, so much for my hunch.1 That…
I think there is something worth sharing about that…
That "…"
I'll spare you Podman bros my questions about how reasonable it would be to expect Docker to do that too. Brace yerselves @kolyshkin, @AkihiroSuda, @thaJeztah, … and the rest of you Docker bros and sisters 😁 — 1 Based on the output of … 2 Only works if Podman's runtime is …
@skepticoitusInteruptus I just checked this and want to confirm we are seeing the same thing. With all of the above, podman on crun does work with limits. Does this work for you?

No warning since we created the cgroup, and values are now what we expect (note that cgroup swap max = container memory-swap - memory):

Set up our cgroup to ensure that we don't get converted to domain threaded:

Now run nested podman using your mem limit container:

Rerun nested mem limit with a shell and double-check the cgroup type:

Double-check we used crun:
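Since the command lines themselves did not survive in this copy of the thread, here is my best-guess consolidation of the sequence being described; every path, name, image and limit below is an assumption (the stress image stands in for the reporter's own mem_limit image), not n1hility's verbatim commands:

```bash
# On the host: create a cgroup, move the shell into it, then start the outer
# container with the limits and check the values it received.
sudo mkdir /sys/fs/cgroup/outer
echo $$ | sudo tee /sys/fs/cgroup/outer/cgroup.procs
sudo podman run --memory 1024m --memory-swap 1024m --rm -it --privileged \
    quay.io/containers/podman:latest bash
cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.swap.max

# Inside the outer container: set up a leaf cgroup so we don't get converted
# to domain threaded, then run the nested, memory-limited workload.
mkdir /sys/fs/cgroup/init
echo $$ > /sys/fs/cgroup/init/cgroup.procs
podman run --rm -it progrium/stress --vm 1 --vm-bytes 1500M --timeout 10s

# Re-run with a shell to double-check the cgroup type and the runtime in use.
podman run --rm -it --entrypoint sh progrium/stress
cat /sys/fs/cgroup/cgroup.type                      # expected: domain
podman info --format '{{.Host.OCIRuntime.Name}}'    # expected: crun
```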
Nah. But I have no idea why it doesn't 😕 If anything in this first recording jumps out at you, please holler… What does work for me are the steps I listed and reported in my OP.1 This second recording demonstrates those steps and the expected outcome…2 And last, but not least, after deleting the cgroup dir I created in the previous recording, I create it again afresh.3 Then I follow all the steps you just listed…4 — 1 With the Workaround of replacing … 2 Using the 1st DockerHub … 3 I don't umount|mount cgroups in this local … 4 Worked without replacing …
Ah, looks like an early step …
Cool. So once #931 lands in a release, threaded will work on crun. BTW, to add more color to the limitations: once you have a threaded controller you cannot create cgroups below it that reference non-threaded controllers like the memory controller, so anything that might create another container-like construct, or some manual cgroup usage in a container, might not behave as expected or may error.
Excellent 🎉
Oops! "I see!", said the blind man 👓
Q: What is the intent of specifying …? I'm sure there must be an advantage of doing it that way instead of the way I do it (…). It's surprising to me that they're not both … TIA.

Eventually got around to reading the man pages for …

So I suppose that, presuming the …

A: Same difference. Six of one, half a dozen of the other type deal ❔
Yes, that's right. The device spec can be named anything for the same reason. The FS mount location (/sys/fs/cgroup) is the contract/API point that everything looks for.
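In other words (my own example, not a command from the thread), the first field of a cgroup2 mount is an arbitrary label; only the filesystem type and the mount point matter:

```bash
# These are equivalent as far as the kernel and container tooling are
# concerned; "none", "cgroup2", or any other label works as the device spec.
mount -t cgroup2 none /sys/fs/cgroup
mount -t cgroup2 cgroup2 /sys/fs/cgroup
```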
Description

When run inside a cgroup v2 container, crun attempts an apparently cgroup v1-like operation.

My spike: evaluating the runc and crun container runtimes for cgroup v2 support.1

My tests: …
Steps to reproduce the issue:

1. …cgroup.type: …
2. …/sys/fs/cgroup dir:4 …
3. …the /sys/fs/cgroup/memory.max file equals the value in megabytes of the host's docker run -m switch: …
4. …crun's interaction with the container's cgroup v2 memory controller: crun chokes; reporting an EOPNOTSUPP (guessing: it wants to mod its parent cgroup?):5 …

Describe the results you received:
Describe the results you expected:

runc behaves under identical constraints (see Workaround).

Additional information you deem important:
…the system.me container process to be an immediate child of the root cgroup; no systemd6

Workaround

After replacing crun with runc in the container's /etc/containers/containers.conf, the container's podman run... test behaves as expected: …

Output of podman version: …

Output of podman info --debug: …

Package info: …

Additional environment details: …

—

1 Relates to Podman issue #14236
2 The host's root cgroup MUST have all 6 cgroup v2 controllers enabled
3 The child cgroup will inherit hugetlb io memory rdma from the host
4 "…enabling [a controller] creates the controller's interface files in the child cgroups…" — The Linux Kernel Control Group v2
5 "Operations which fail due to invalid topology use EOPNOTSUPP as the errno…" — Threads
6 Considering systemd as a dependency is off the table