POC for CAP_CHECKPOINT_RESTORE #5776

dsouzai · 2022-04-04T17:26:56Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

@mrunalp suggested I open this PoC PR for wider discussion with the CRI-O project. This PoC ensures that containers are able to make use of the privileges granted by CAP_CHECKPOINT_RESTORE. It also requires the following change to runc:

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 51660f5e..3cfd2bf1 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -577,6 +577,7 @@ func checkProcMount(rootfs, dest, source string) error {
                "/proc/loadavg",
                "/proc/slabinfo",
                "/proc/net/dev",
+               "/proc/sys/kernel/ns_last_pid",
        }
        for _, valid := range validProcMounts {

I would appreciate your thoughts/suggestions with regard to the behaviour for handling CAP_CHECKPOINT_RESTORE. Opening this as a draft PR as the primary intention is for discussion.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

openshift-ci · 2022-04-04T17:27:04Z

@dsouzai: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-04-04T17:27:07Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

e1d281e POC for CAP_CHECKPOINT_RESTORE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci · 2022-04-04T17:27:11Z

Hi @dsouzai. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-04-04T17:27:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dsouzai
To complete the pull request process, please assign nalind after the PR has been reviewed.
You can assign the PR to them by writing /assign @nalind in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adrianreber · 2022-04-04T18:27:44Z

I understand what you are trying to achieve, but this does not seem correct.

I do not think bind mounting a /proc entry from the outside of the container to the inside is a good idea. But I leave this to the container experts, but is not something I have ever seen. If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other. Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

dsouzai · 2022-04-04T18:48:46Z

If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

Yeah that's what we're trying to avoid; we want to run CRIU in an unprivileged container as nonroot.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other.

Well the kernel documentation says:

CAP_CHECKPOINT_RESTORE (since Linux 5.9)
              * Update /proc/sys/kernel/ns_last_pid (see
                [pid_namespaces(7)](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html));
              * employ the set_tid feature of [clone3(2)](https://man7.org/linux/man-pages/man2/clone3.2.html);
              * read the contents of the symbolic links in
                /proc/[pid]/map_files for other processes.

So we've assumed that there must be some connection between the cap and ns_last_pid.

Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

Yeah, at the moment CRIU does not support rootless CRIU. However, in our tests, we've made use of your currently open PR, as well as other changes we've made.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

We want to deploy a container that is not privileged and where the user inside the container is not root. To give some additional information regarding this, in the checkpoint run, we've done

setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu

and on the restore run, the deployment yaml's SecurityContext is:

securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]

The way the checkpoint happens is that the JVM exposes an API that allows a java program to self-checkpoint; we link the criu library when building the JVM. On restore, we use a script to launch criu restore.

adrianreber · 2022-04-05T06:50:01Z

If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

Yeah that's what we're trying to avoid; we want to run CRIU in an unprivileged container as nonroot.

Thanks for the context.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other.

Well the kernel documentation says:
CAP_CHECKPOINT_RESTORE (since Linux 5.9)
              * Update /proc/sys/kernel/ns_last_pid (see
                [pid_namespaces(7)](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html));
              * employ the set_tid feature of [clone3(2)](https://man7.org/linux/man-pages/man2/clone3.2.html);
              * read the contents of the symbolic links in
                /proc/[pid]/map_files for other processes.
So we've assumed that there must be some connection between the cap and ns_last_pid.

Well, you can write as non-root to /proc/sys/kernel/ns_last_pid if you have CAP_CHECKPOINT_RESTORE, but just because you have the capability you cannot assume that somebody wants to have /proc/sys/kernel/ns_last_pid. I understand your goal, but this connection you make does not seem like the right thing to do. Anyway, the biggest problem is to bind mount something from the host's /proc into the container's /proc. That seems wrong. Looking at a default config.json from runc I see:

		"readonlyPaths": [
			"/proc/asound",
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]

I guess that is your main problem. That /proc/sys is mounted read-only. But mounting /proc/sys read-write also sounds dangerous. Maybe if you target kernel's with clone3() you do not need to access /proc/sys/kernel/ns_last_pid at all. That sounds like an easier solution.

Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

Yeah, at the moment CRIU does not support rootless CRIU. However, in our tests, we've made use of your currently open PR, as well as other changes we've made.

Good to know. If you are interested in non-root CRIU it would be good if someone from your side could push that CRIU PR forward.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

We want to deploy a container that is not privileged and where the user inside the container is not root. To give some additional information regarding this, in the checkpoint run, we've done
setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu
and on the restore run, the deployment yaml's SecurityContext is:
securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]
The way the checkpoint happens is that the JVM exposes an API that allows a java program to self-checkpoint; we link the criu library when building the JVM. On restore, we use a script to launch criu restore.

The way you propose it here does not look correct from my point of view. But now I understand what you are trying to solve. Maybe you should try to target newer kernels with clone3() that would make your life easier as /proc/sys/kernel/ns_last_pid is no longer needed.

dsouzai · 2022-04-05T14:35:24Z

Well, you can write as non-root to /proc/sys/kernel/ns_last_pid if you have CAP_CHECKPOINT_RESTORE, but just because you have the capability you cannot assume that somebody wants to have /proc/sys/kernel/ns_last_pid.

Oh I see what you're saying - basically it should be that a user has to specify both CAP_CHECKPOINT_RESTORE and that they want ns_last_pid because they're using something like CRIU; otherwise they may just want the cap but never actually need to write to ns_last_pid.

If you are interested in non-root CRIU it would be good if someone from your side could push that CRIU PR forward.

@ymanton is, I believe, working towards that end.

I guess that is your main problem. That /proc/sys is mounted read-only. But mounting /proc/sys read-write also sounds dangerous.

Yeah I agree; when @mrunalp suggested opening the PoC PR, they acknowledged that there may be security implications, but it's a place to start to drive discussion towards what is the best way forward. FWIW though, in this PoC, /proc/sys is still mounted as RO; only /proc/sys/kernel/ns_last_pid is mounted as RW. The config.json for a container now has:

                {
                        "destination": "/proc/sys/kernel/ns_last_pid",
                        "type": "bind",
                        "source": "/proc/sys/kernel/ns_last_pid",
                        "options": [
                                "rw",
                                "bind"
                        ]
                },
...
                "maskedPaths": [
                        "/proc/acpi",
                        "/proc/kcore",
                        "/proc/keys",
                        "/proc/latency_stats",
                        "/proc/timer_list",
                        "/proc/timer_stats",
                        "/proc/sched_debug",
                        "/proc/scsi",
                        "/sys/firmware",
                        "/sys/dev"
                ],
                "readonlyPaths": [
                        "/proc/asound",
                        "/proc/bus",
                        "/proc/fs",
                        "/proc/irq",
                        "/proc/sys",
                        "/proc/sysrq-trigger"
                ]

Maybe you should try to target newer kernels with clone3() that would make your life easier as /proc/sys/kernel/ns_last_pid is no longer needed.

Interesting. How does that all work? Does CRIU have code that checks if clone3() is available, and if so does not go down the code path the tries to write to ns_last_pid? We've been rebasing our changes on CRIU v3.16.1; does that version already have such code? Or is it this in a much more recent CRIU version?

dsouzai · 2022-04-05T14:41:42Z

Looking online though, clone3() was added in kernel version 5.3, however many existing environments (e.g. RHEL8 which is only on 4.18) does not have clone3(). So we're also looking for a solution that is applicable to the broadest possible set of environments rather than having to wait N years(?) for the newer kernels to be more widely used.

adrianreber · 2022-04-05T15:11:11Z

Interesting. How does that all work? Does CRIU have code that checks if clone3() is available, and if so does not go down the
code path the tries to write to ns_last_pid? We've been rebasing our changes on CRIU v3.16.1; does that version already have > such code? Or is it this in a much more recent CRIU version?

Yes, CRIU looks if clone3() is available and does not touch /proc/sys/kernel/ns_last_pid.

Looking online though, clone3() was added in kernel version 5.3, however many existing environments (e.g. RHEL8 which is
only on 4.18) does not have clone3(). So we're also looking for a solution that is applicable to the broadest possible set of
environments rather than having to wait N years(?) for the newer kernels to be more widely used.

Yes it is not in RHEL8, but if it is important in can probably be backported. RHEL 9 is probably also not very far.

I still think mounting a /proc file from the host into the container is wrong. But that is not up to me to decide.

dsouzai · 2022-04-05T15:39:36Z

Yes, CRIU looks if clone3() is available and does not touch /proc/sys/kernel/ns_last_pid.

Nice, that's good to know for the future. That might also explain the discrepancy I was seeing between RHEL and my local Fedora VM.

I still think mounting a /proc file from the host into the container is wrong. But that is not up to me to decide.

I'm not wedded to the notion that we need to mount a /proc file from the host into the container; it was just the simplest way of proving a PoC to show that in something like RHEL8.5, as long as I can make ns_last_pid writable in an unprivileged container, I can restore my java process using CRIU as non-root.

While newer kernels have mechanisms that allow CRIU to bypass the need for ns_last_pid, because older kernels specify that a cap such as CAP_CHECKPOINT_RESTORE allows a process to write to ns_last_pid, I think container engines should provide users with a mechanism to make use of that. I don't think it's unreasonable that such abilities should be gated, but I do believe they should be available.

haircommander · 2022-04-07T14:33:56Z

/ok-to-test

openshift-ci-robot · 2022-04-07T15:00:27Z

@dsouzai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/kata-jenkins	`e1d281e`	link	true	`/test kata-containers`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci · 2022-04-07T15:00:28Z

@dsouzai: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/openshift-jenkins/e2e_fedora	`e1d281e`	link	true	`/test e2e_fedora`
ci/prow/e2e-gcp	`e1d281e`	link	true	`/test e2e-gcp`
ci/prow/images	`e1d281e`	link	true	`/test images`
ci/prow/e2e-agnostic	`e1d281e`	link	true	`/test e2e-agnostic`
ci/openshift-jenkins/critest_fedora	`e1d281e`	link	true	`/test critest_fedora`
ci/openshift-jenkins/e2e_features_fedora	`e1d281e`	link	true	`/test e2e_features_fedora`
ci/openshift-jenkins/integration_fedora	`e1d281e`	link	true	`/test integration_fedora`
ci/openshift-jenkins/e2e_crun	`e1d281e`	link	true	`/test e2e_crun`
ci/openshift-jenkins/integration_crun	`e1d281e`	link	true	`/test integration_crun`
ci/openshift-jenkins/integration_crun_cgroupv2	`e1d281e`	link	false	`/test integration_cgroupv2`
ci/openshift-jenkins/e2e_crun_cgroupv2	`e1d281e`	link	false	`/test e2e_cgroupv2`
ci/openshift-jenkins/e2e_features_rhel	`e1d281e`	link	true	`/test e2e_features_rhel`
ci/openshift-jenkins/critest_rhel	`e1d281e`	link	true	`/test critest_rhel`
ci/openshift-jenkins/e2e_rhel	`e1d281e`	link	true	`/test e2e_rhel`
ci/openshift-jenkins/integration_rhel	`e1d281e`	link	true	`/test integration_rhel`
ci/kata-jenkins	`e1d281e`	link	true	`/test kata-containers`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

dsouzai · 2022-04-07T17:39:28Z

Closing this PR; based on the discussion in the weekly meeting, the plan going forward is to open a PR for the runc repo to allow mounting of /proc/sys/kernel/ns_last_pid, and then it falls to the pod author to specify the hostpath volume mount if they have an application that needs to write to ns_last_pid (which means they'll also need to specify CAP_CHECKPOINT_RESTORE). However, there shouldn't be any change required for CRI-O.

Thanks for the discussion @haircommander :)

haircommander · 2022-04-07T17:44:23Z

Thanks for joining us @dsouzai!

POC for CAP_CHECKPOINT_RESTORE

e1d281e

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2022

openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 4, 2022

openshift-ci bot requested review from fgiudici and wgahnagl April 4, 2022 17:27

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 7, 2022

dsouzai closed this Apr 7, 2022

adrianreber mentioned this pull request Apr 8, 2022

Allow mounting of /proc/sys/kernel/ns_last_pid opencontainers/runc#3451

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC for CAP_CHECKPOINT_RESTORE #5776

POC for CAP_CHECKPOINT_RESTORE #5776

dsouzai commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

adrianreber commented Apr 4, 2022

dsouzai commented Apr 4, 2022 •

edited

Loading

adrianreber commented Apr 5, 2022

dsouzai commented Apr 5, 2022

dsouzai commented Apr 5, 2022

adrianreber commented Apr 5, 2022

dsouzai commented Apr 5, 2022

haircommander commented Apr 7, 2022

openshift-ci-robot commented Apr 7, 2022

openshift-ci bot commented Apr 7, 2022

dsouzai commented Apr 7, 2022

haircommander commented Apr 7, 2022

POC for CAP_CHECKPOINT_RESTORE #5776

POC for CAP_CHECKPOINT_RESTORE #5776

Conversation

dsouzai commented Apr 4, 2022

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

openshift-ci bot commented Apr 4, 2022

adrianreber commented Apr 4, 2022

dsouzai commented Apr 4, 2022 • edited Loading

adrianreber commented Apr 5, 2022

dsouzai commented Apr 5, 2022

dsouzai commented Apr 5, 2022

adrianreber commented Apr 5, 2022

dsouzai commented Apr 5, 2022

haircommander commented Apr 7, 2022

openshift-ci-robot commented Apr 7, 2022

openshift-ci bot commented Apr 7, 2022

dsouzai commented Apr 7, 2022

haircommander commented Apr 7, 2022

dsouzai commented Apr 4, 2022 •

edited

Loading