Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC for CAP_CHECKPOINT_RESTORE #5776

Closed
wants to merge 1 commit into from

Conversation

dsouzai
Copy link

@dsouzai dsouzai commented Apr 4, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

@mrunalp suggested I open this PoC PR for wider discussion with the CRI-O project. This PoC ensures that containers are able to make use of the privileges granted by CAP_CHECKPOINT_RESTORE. It also requires the following change to runc:

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 51660f5e..3cfd2bf1 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -577,6 +577,7 @@ func checkProcMount(rootfs, dest, source string) error {
                "/proc/loadavg",
                "/proc/slabinfo",
                "/proc/net/dev",
+               "/proc/sys/kernel/ns_last_pid",
        }
        for _, valid := range validProcMounts {

I would appreciate your thoughts/suggestions with regard to the behaviour for handling CAP_CHECKPOINT_RESTORE. Opening this as a draft PR as the primary intention is for discussion.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 4, 2022

@dsouzai: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 4, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 4, 2022

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

  • e1d281e POC for CAP_CHECKPOINT_RESTORE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 4, 2022

Hi @dsouzai. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 4, 2022
@openshift-ci openshift-ci bot requested review from fgiudici and wgahnagl April 4, 2022 17:27
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 4, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dsouzai
To complete the pull request process, please assign nalind after the PR has been reviewed.
You can assign the PR to them by writing /assign @nalind in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adrianreber
Copy link
Member

I understand what you are trying to achieve, but this does not seem correct.

I do not think bind mounting a /proc entry from the outside of the container to the inside is a good idea. But I leave this to the container experts, but is not something I have ever seen. If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other. Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

@dsouzai
Copy link
Author

dsouzai commented Apr 4, 2022

If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

Yeah that's what we're trying to avoid; we want to run CRIU in an unprivileged container as nonroot.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other.

Well the kernel documentation says:

CAP_CHECKPOINT_RESTORE (since Linux 5.9)
              * Update /proc/sys/kernel/ns_last_pid (see
                [pid_namespaces(7)](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html));
              * employ the set_tid feature of [clone3(2)](https://man7.org/linux/man-pages/man2/clone3.2.html);
              * read the contents of the symbolic links in
                /proc/[pid]/map_files for other processes.

So we've assumed that there must be some connection between the cap and ns_last_pid.

Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

Yeah, at the moment CRIU does not support rootless CRIU. However, in our tests, we've made use of your currently open PR, as well as other changes we've made.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

We want to deploy a container that is not privileged and where the user inside the container is not root. To give some additional information regarding this, in the checkpoint run, we've done

setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu

and on the restore run, the deployment yaml's SecurityContext is:

securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]

The way the checkpoint happens is that the JVM exposes an API that allows a java program to self-checkpoint; we link the criu library when building the JVM. On restore, we use a script to launch criu restore.

@adrianreber
Copy link
Member

If I want to be able to write to /proc/sys/kernel/ns_last_pid I just have to start, Podman in my tests, with --privileged.

Yeah that's what we're trying to avoid; we want to run CRIU in an unprivileged container as nonroot.

Thanks for the context.

I also think I understand the connection you made between CAP_CHECKPOINT_RESTORE and /proc/sys/kernel/ns_last_pid but that also does not look correct. But, both are independent of each other.

Well the kernel documentation says:

CAP_CHECKPOINT_RESTORE (since Linux 5.9)
              * Update /proc/sys/kernel/ns_last_pid (see
                [pid_namespaces(7)](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html));
              * employ the set_tid feature of [clone3(2)](https://man7.org/linux/man-pages/man2/clone3.2.html);
              * read the contents of the symbolic links in
                /proc/[pid]/map_files for other processes.

So we've assumed that there must be some connection between the cap and ns_last_pid.

Well, you can write as non-root to /proc/sys/kernel/ns_last_pid if you have CAP_CHECKPOINT_RESTORE, but just because you have the capability you cannot assume that somebody wants to have /proc/sys/kernel/ns_last_pid. I understand your goal, but this connection you make does not seem like the right thing to do. Anyway, the biggest problem is to bind mount something from the host's /proc into the container's /proc. That seems wrong. Looking at a default config.json from runc I see:

		"readonlyPaths": [
			"/proc/asound",
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]

I guess that is your main problem. That /proc/sys is mounted read-only. But mounting /proc/sys read-write also sounds dangerous. Maybe if you target kernel's with clone3() you do not need to access /proc/sys/kernel/ns_last_pid at all. That sounds like an easier solution.

Although we introduced CAP_CHECKPOINT_RESTORE in the kernel there is no support in CRIU for it. So adding CAP_CHECKPOINT_RESTORE does not help you with CRIU.

Yeah, at the moment CRIU does not support rootless CRIU. However, in our tests, we've made use of your currently open PR, as well as other changes we've made.

Good to know. If you are interested in non-root CRIU it would be good if someone from your side could push that CRIU PR forward.

What do you hope to achieve with support for CAP_CHECKPOINT_RESTORE like you did it in this PR?

We want to deploy a container that is not privileged and where the user inside the container is not root. To give some additional information regarding this, in the checkpoint run, we've done

setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu

and on the restore run, the deployment yaml's SecurityContext is:

securityContext:
  capabilities:
    add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]

The way the checkpoint happens is that the JVM exposes an API that allows a java program to self-checkpoint; we link the criu library when building the JVM. On restore, we use a script to launch criu restore.

The way you propose it here does not look correct from my point of view. But now I understand what you are trying to solve. Maybe you should try to target newer kernels with clone3() that would make your life easier as /proc/sys/kernel/ns_last_pid is no longer needed.

@dsouzai
Copy link
Author

dsouzai commented Apr 5, 2022

Well, you can write as non-root to /proc/sys/kernel/ns_last_pid if you have CAP_CHECKPOINT_RESTORE, but just because you have the capability you cannot assume that somebody wants to have /proc/sys/kernel/ns_last_pid.

Oh I see what you're saying - basically it should be that a user has to specify both CAP_CHECKPOINT_RESTORE and that they want ns_last_pid because they're using something like CRIU; otherwise they may just want the cap but never actually need to write to ns_last_pid.

If you are interested in non-root CRIU it would be good if someone from your side could push that CRIU PR forward.

@ymanton is, I believe, working towards that end.

I guess that is your main problem. That /proc/sys is mounted read-only. But mounting /proc/sys read-write also sounds dangerous.

Yeah I agree; when @mrunalp suggested opening the PoC PR, they acknowledged that there may be security implications, but it's a place to start to drive discussion towards what is the best way forward. FWIW though, in this PoC, /proc/sys is still mounted as RO; only /proc/sys/kernel/ns_last_pid is mounted as RW. The config.json for a container now has:

                {
                        "destination": "/proc/sys/kernel/ns_last_pid",
                        "type": "bind",
                        "source": "/proc/sys/kernel/ns_last_pid",
                        "options": [
                                "rw",
                                "bind"
                        ]
                },
...
                "maskedPaths": [
                        "/proc/acpi",
                        "/proc/kcore",
                        "/proc/keys",
                        "/proc/latency_stats",
                        "/proc/timer_list",
                        "/proc/timer_stats",
                        "/proc/sched_debug",
                        "/proc/scsi",
                        "/sys/firmware",
                        "/sys/dev"
                ],
                "readonlyPaths": [
                        "/proc/asound",
                        "/proc/bus",
                        "/proc/fs",
                        "/proc/irq",
                        "/proc/sys",
                        "/proc/sysrq-trigger"
                ]

Maybe you should try to target newer kernels with clone3() that would make your life easier as /proc/sys/kernel/ns_last_pid is no longer needed.

Interesting. How does that all work? Does CRIU have code that checks if clone3() is available, and if so does not go down the code path the tries to write to ns_last_pid? We've been rebasing our changes on CRIU v3.16.1; does that version already have such code? Or is it this in a much more recent CRIU version?

@dsouzai
Copy link
Author

dsouzai commented Apr 5, 2022

Looking online though, clone3() was added in kernel version 5.3, however many existing environments (e.g. RHEL8 which is only on 4.18) does not have clone3(). So we're also looking for a solution that is applicable to the broadest possible set of environments rather than having to wait N years(?) for the newer kernels to be more widely used.

@adrianreber
Copy link
Member

Interesting. How does that all work? Does CRIU have code that checks if clone3() is available, and if so does not go down the
code path the tries to write to ns_last_pid? We've been rebasing our changes on CRIU v3.16.1; does that version already have > such code? Or is it this in a much more recent CRIU version?

Yes, CRIU looks if clone3() is available and does not touch /proc/sys/kernel/ns_last_pid.

Looking online though, clone3() was added in kernel version 5.3, however many existing environments (e.g. RHEL8 which is
only on 4.18) does not have clone3(). So we're also looking for a solution that is applicable to the broadest possible set of
environments rather than having to wait N years(?) for the newer kernels to be more widely used.

Yes it is not in RHEL8, but if it is important in can probably be backported. RHEL 9 is probably also not very far.

I still think mounting a /proc file from the host into the container is wrong. But that is not up to me to decide.

@dsouzai
Copy link
Author

dsouzai commented Apr 5, 2022

Yes, CRIU looks if clone3() is available and does not touch /proc/sys/kernel/ns_last_pid.

Nice, that's good to know for the future. That might also explain the discrepancy I was seeing between RHEL and my local Fedora VM.

I still think mounting a /proc file from the host into the container is wrong. But that is not up to me to decide.

I'm not wedded to the notion that we need to mount a /proc file from the host into the container; it was just the simplest way of proving a PoC to show that in something like RHEL8.5, as long as I can make ns_last_pid writable in an unprivileged container, I can restore my java process using CRIU as non-root.

While newer kernels have mechanisms that allow CRIU to bypass the need for ns_last_pid, because older kernels specify that a cap such as CAP_CHECKPOINT_RESTORE allows a process to write to ns_last_pid, I think container engines should provide users with a mechanism to make use of that. I don't think it's unreasonable that such abilities should be gated, but I do believe they should be available.

@haircommander
Copy link
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 7, 2022
@openshift-ci-robot
Copy link

@dsouzai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/kata-jenkins e1d281e link true /test kata-containers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 7, 2022

@dsouzai: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/openshift-jenkins/e2e_fedora e1d281e link true /test e2e_fedora
ci/prow/e2e-gcp e1d281e link true /test e2e-gcp
ci/prow/images e1d281e link true /test images
ci/prow/e2e-agnostic e1d281e link true /test e2e-agnostic
ci/openshift-jenkins/critest_fedora e1d281e link true /test critest_fedora
ci/openshift-jenkins/e2e_features_fedora e1d281e link true /test e2e_features_fedora
ci/openshift-jenkins/integration_fedora e1d281e link true /test integration_fedora
ci/openshift-jenkins/e2e_crun e1d281e link true /test e2e_crun
ci/openshift-jenkins/integration_crun e1d281e link true /test integration_crun
ci/openshift-jenkins/integration_crun_cgroupv2 e1d281e link false /test integration_cgroupv2
ci/openshift-jenkins/e2e_crun_cgroupv2 e1d281e link false /test e2e_cgroupv2
ci/openshift-jenkins/e2e_features_rhel e1d281e link true /test e2e_features_rhel
ci/openshift-jenkins/critest_rhel e1d281e link true /test critest_rhel
ci/openshift-jenkins/e2e_rhel e1d281e link true /test e2e_rhel
ci/openshift-jenkins/integration_rhel e1d281e link true /test integration_rhel
ci/kata-jenkins e1d281e link true /test kata-containers

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dsouzai
Copy link
Author

dsouzai commented Apr 7, 2022

Closing this PR; based on the discussion in the weekly meeting, the plan going forward is to open a PR for the runc repo to allow mounting of /proc/sys/kernel/ns_last_pid, and then it falls to the pod author to specify the hostpath volume mount if they have an application that needs to write to ns_last_pid (which means they'll also need to specify CAP_CHECKPOINT_RESTORE). However, there shouldn't be any change required for CRI-O.

Thanks for the discussion @haircommander :)

@dsouzai dsouzai closed this Apr 7, 2022
@haircommander
Copy link
Member

Thanks for joining us @dsouzai!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: no Indicates the PR's author has not DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants