CRIU: Checkpoint/Restore Feature Status #14361

Open
ymanton opened this issue Jan 25, 2022 · 7 comments
Labels
criu Used to track CRIU snapshot related work

Comments

@ymanton
Member

ymanton commented Jan 25, 2022

CRIU-based Checkpoint/Restore Feature Status

Dependencies

Running as root or as user with root capabilities

Native

  • criu >= 3.14 (Older versions are untested.)
  • glibc with fixes to allow for disabling hardware exploitations
    • Fixes are in version >= 2.34 [1]
    • On RHEL, fixes are backported to 2.28-170.el8 [2]
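
The version minimums above can be checked with a small script. This is a minimal sketch, assuming `sort -V` (version sort) is available; the helper names `version_ge` and `check_version` are made up for illustration, and the versions passed in are hard-coded examples rather than detected values. A plain numeric comparison reports a false negative on RHEL 8, where the glibc fixes are backported to 2.28-170.el8.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), as provided by GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimums taken from the dependency list above.
MIN_CRIU="3.14"
MIN_GLIBC="2.34"

# check_version NAME DETECTED MINIMUM: prints a one-line verdict.
check_version() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

# Hard-coded example versions; on a real system the detected values could be
# taken from the output of `criu --version` and `ldd --version`.
check_version criu 3.16.1 "$MIN_CRIU"
check_version glibc 2.31 "$MIN_GLIBC"
```

The comparison is kept separate from version detection so that distro-specific backports can be handled as explicit exceptions.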

Containers

Docker
  • All of the above.
  • Container started with --privileged, or with explicitly defined security and capability options [3]
Podman
  • Same as Docker
OCP
  • Same requirements as Native above
  • Need to create a Role / RoleBinding to use the privileged SCC
  • Need to use privileged: true in the SecurityContext section of the pod spec

Running as user with only CAP_CHECKPOINT_RESTORE

Native

  • kernel with CAP_CHECKPOINT_RESTORE support
  • criu with CAP_CHECKPOINT_RESTORE support and additional fixes

Containers

Docker
  • All of the above.
  • Container started with --privileged, or with explicitly defined security and capability options [3] [4]
  • Docker with CAP_CHECKPOINT_RESTORE support
Podman
  • Does not appear to be possible: Podman needs to be invoked as root
OCP
  • Same requirements as Native above
  • Need a version of runc that includes opencontainers/runc#3451 (Allow mounting of /proc/sys/kernel/ns_last_pid)
  • Need to author a SCC definition that is more restrictive than the privileged SCC
  • Need to create a Role / RoleBinding to use newly authored SCC
  • Need to pass in the small set of capabilities [4] in the SecurityContext section of the pod spec
  • Need to specify a volume mount of type hostPath to mount /proc/sys/kernel/ns_last_pid read-write (when running Linux kernels that don't have clone3())
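
As a rough illustration of the last two OCP items, the script below prints a pod spec that adds the capabilities from footnote 4 and mounts /proc/sys/kernel/ns_last_pid through a hostPath volume. It is an untested sketch: the pod, container, image, service account, and volume names are all hypothetical, and the custom SCC, Role, and RoleBinding that the list also calls for are not shown.

```shell
#!/bin/sh
# Prints a pod spec sketch to stdout. All names (criu-app, criu-sa, app,
# my-criu-app, ns-last-pid) are hypothetical placeholders.
print_pod_spec() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: criu-app
spec:
  serviceAccountName: criu-sa
  containers:
  - name: app
    image: my-criu-app
    securityContext:
      capabilities:
        add: ["CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE"]
    volumeMounts:
    - name: ns-last-pid
      mountPath: /proc/sys/kernel/ns_last_pid
  volumes:
  - name: ns-last-pid
    hostPath:
      path: /proc/sys/kernel/ns_last_pid
      type: File
EOF
}

print_pod_spec
```

The service account is assumed to be the one bound to the newly authored SCC; the output could be redirected to a file and applied with the usual cluster tooling.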

Status of Dependencies

CRIU

Ubuntu

  • bionic (18.04LTS): ❌ 3.6-2: amd64 arm64 armhf ppc64el s390x
  • focal (20.04LTS): ❌ Not available
  • hirsute (21.04): ✔️ 3.14-1: amd64 arm64 armhf ppc64el s390x
  • impish (21.10): ✔️ 3.14-1: amd64 arm64 armhf ppc64el s390x
  • jammy (22.04): ✔️ 3.16.1-2: amd64 arm64 armhf ppc64el s390x
    • A previously fixed kernel bug prevents criu from working inside containers [5]

RHEL

  • RHEL 8.x: ✔️ 3.15: aarch64 ppc64le s390x x86_64
  • RHEL 9.x: ✔️ 3.15: aarch64 ppc64le s390x x86_64

Additional patches

Kernel CAP_CHECKPOINT_RESTORE Support

CAP_CHECKPOINT_RESTORE was added to Linux 5.9; 4.x kernels on RHEL seem to have backported it as well.
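
Because of such backports, the kernel version alone is not a reliable indicator. One way to probe for support, sketched below, is to compare the highest capability number the kernel reports in /proc/sys/kernel/cap_last_cap against 40, the number assigned to CAP_CHECKPOINT_RESTORE in linux/capability.h. The function name `kernel_has_checkpoint_restore` is made up for this example.

```shell
#!/bin/sh
# CAP_CHECKPOINT_RESTORE is capability number 40 in linux/capability.h.
CAP_CHECKPOINT_RESTORE_NR=40

# kernel_has_checkpoint_restore LAST_CAP: succeeds when the highest capability
# number known to the kernel is at least 40.
kernel_has_checkpoint_restore() {
    [ "$1" -ge "$CAP_CHECKPOINT_RESTORE_NR" ]
}

if [ -r /proc/sys/kernel/cap_last_cap ]; then
    last_cap=$(cat /proc/sys/kernel/cap_last_cap)
    if kernel_has_checkpoint_restore "$last_cap"; then
        echo "kernel supports CAP_CHECKPOINT_RESTORE (cap_last_cap=$last_cap)"
    else
        echo "kernel lacks CAP_CHECKPOINT_RESTORE (cap_last_cap=$last_cap)"
    fi
else
    echo "cannot read /proc/sys/kernel/cap_last_cap"
fi
```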

Ubuntu

  • bionic (18.04LTS): ❌
  • focal (20.04LTS): ✔️
  • hirsute (21.04): ✔️
  • impish (21.10): ✔️
  • jammy (22.04): ✔️

RHEL

  • RHEL 8.x: ✔️
  • RHEL 9.x: ✔️

System Libraries (libcap) CAP_CHECKPOINT_RESTORE Support

CAP_CHECKPOINT_RESTORE was added to libcap 2.43. libcap support is not strictly necessary: CAP_CHECKPOINT_RESTORE can be used as long as the kernel supports it. Without libcap support, however, you can't refer to the capability by name in tools like setcap.

Ubuntu

  • bionic (18.04LTS): ❌ 1:2.25-1.2: amd64 arm64 armhf i386 ppc64el s390x
  • focal (20.04LTS): ❌ 1:2.32-1: amd64 arm64 armhf i386 ppc64el s390x
  • hirsute (21.04): ✔️ 1:2.44-1build1: amd64 arm64 armhf i386 ppc64el s390x
  • impish (21.10): ✔️ 1:2.44-1build1: amd64 arm64 armhf i386 ppc64el s390x
  • jammy (22.04): ✔️ 1:2.44-1build2: amd64 arm64 armhf i386 ppc64el s390x

RHEL

  • RHEL 8.x: ❌ 2.26: aarch64 ppc64le s390x x86_64
  • RHEL 9.x: ✔️ 2.48: aarch64 ppc64le s390x x86_64

GLIBC with hardware exploitation fixes

Ubuntu

  • bionic (18.04LTS): ❌ 2.27-3ubuntu1.5: amd64 arm64 armhf i386 ppc64el s390x
  • focal (20.04LTS): ❌ 2.31-0ubuntu9.7: amd64 arm64 armhf i386 ppc64el s390x
  • hirsute (21.04): ❌ 2.33-0ubuntu5: amd64 arm64 armhf i386 ppc64el s390x
  • impish (21.10): ✔️ 2.34-0ubuntu3.2: amd64 arm64 armhf i386 ppc64el s390x
  • jammy (22.04): ✔️ 2.35-0ubuntu3: amd64 arm64 armhf i386 ppc64el s390x

RHEL

  • RHEL 8.x: ✔️ 2.28: aarch64 ppc64le s390x x86_64 [2]
  • RHEL 9.x: ✔️ 2.34: aarch64 ppc64le s390x x86_64

CRIU CAP_CHECKPOINT_RESTORE Support

See #14265 for details.

Docker CAP_CHECKPOINT_RESTORE Support

  • Not yet supported. It needs moby/moby@e6a3313, which has not been released yet, so Docker has to be built from source.

Status Matrix

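| Mode | Host OS | Container runtime | Container OS | CRIU version | Status |
|------|---------|-------------------|--------------|--------------|--------|
| Root | - | - | - | - | - |
| User | - | - | - | - | - |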

Footnotes

  1. We need to be able to disable glibc hardware exploitations which are not guaranteed to be portable across a checkpoint/restore. See https://github.com/eclipse-openj9/openj9/issues/14253.

  2. glibc fixes for hardware exploitations backport: https://bugzilla.redhat.com/show_bug.cgi?id=1937515

  3. In lieu of --privileged, Docker containers can be started with --cap-add=ALL --security-opt seccomp=unconfined --security-opt systempaths=unconfined --security-opt apparmor=unconfined. In lieu of unconfined, containers can be started with the options specified in https://github.com/eclipse-openj9/openj9/issues/15117.

  4. In lieu of --cap-add=ALL, if running on a kernel that has CAP_CHECKPOINT_RESTORE, the CRIU binary only needs to have cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip set on it, and the Docker command only needs the caps --cap-add=CHECKPOINT_RESTORE --cap-add=NET_ADMIN --cap-add=SYS_PTRACE.

  5. https://github.com/checkpoint-restore/criu/issues/860#issuecomment-1060809782
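
The Docker variants described in footnotes 3 and 4 can be laid out side by side. The script below is a dry-run sketch that only prints the command lines and does not run them. The image name and the criu path are hypothetical, and the "minimal" mode assumes that footnote 4 replaces only --cap-add=ALL, so the --security-opt flags from footnote 3 are kept.

```shell
#!/bin/sh
# Dry run: prints the commands instead of executing them.
IMAGE="my-criu-app"        # hypothetical image name
CRIU_BIN="/usr/sbin/criu"  # hypothetical path to the criu binary in the image

# Security options shared by the non-privileged variants (footnote 3).
SEC_OPTS="--security-opt seccomp=unconfined --security-opt systempaths=unconfined --security-opt apparmor=unconfined"

# docker_run_cmd MODE: prints a `docker run` line for the given mode.
docker_run_cmd() {
    case "$1" in
        privileged) opts="--privileged" ;;
        all-caps)   opts="--cap-add=ALL $SEC_OPTS" ;;
        minimal)    opts="--cap-add=CHECKPOINT_RESTORE --cap-add=NET_ADMIN --cap-add=SYS_PTRACE $SEC_OPTS" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    echo "docker run $opts $IMAGE"
}

# File capabilities for the criu binary (footnote 4); needs a libcap that
# knows cap_checkpoint_restore by name.
setcap_cmd() {
    echo "setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip $CRIU_BIN"
}

for mode in privileged all-caps minimal; do
    docker_run_cmd "$mode"
done
setcap_cmd
```

The setcap line would typically run as root while the image is built, so that a non-root user inside the container can later invoke criu with those capabilities.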

@ymanton ymanton added the criu Used to track CRIU snapshot related work label Jan 25, 2022
@tajila tajila added the beta Used to track items that will be included in a feature beta release label Jan 26, 2022
@vijaysun-omr
Contributor

vijaysun-omr commented Jan 27, 2022

We may need to mention dumb-init and any other prerequisites like that (are there any?) here too, so that it is all captured in one place.

As we discussed yesterday, the Docker command-line options that avoid --privileged should also be listed here, in a separate section of the summary above (I'm sure Younes/Irwin will do that when time permits, no rush). This ties in with actually trying those options with non-root CRIU inside the Liberty container, with no elevated capabilities, and seeing a successful restore, so that we know it does in fact work.

@vijaysun-omr
Contributor

@ymanton since you are in touch with the CRIU folks, are you able to find out more about the outlook for checkpoint-restore/criu#1155? I ask because it would make it easier for us to consume CRIU for our purposes, and it would also allow you to merge the commits you created that rely on that PR as a prerequisite.

@vijaysun-omr
Contributor

(Not urgent for our beta) @ymanton we may need to push checkpoint-restore/criu#1155 forward ourselves. To that end, you may want to plan, over the next month or two, how to add to or clean up the commits that are already there, as we discussed.

@abhishek179

abhishek179 commented Apr 22, 2022

Hi Vijay,

Will this new capability (cap_checkpoint_restore) eliminate the need to launch the Docker container with --privileged when criu checkpoint/restore is executed inside the container for a process tree?

Has anyone tested it with docker containers?

@vijaysun-omr
Contributor

@abhishek179 we are presently exploring alternatives to using --privileged when we start a container in a Kubernetes environment. See opencontainers/runc#3451 for some more information on this front.

We have also tested using Docker containers outside of Kubernetes using the alternative mentioned at #14361 (reference)

@abhishek179 are you interested in this work because of a use case that you have? If so, we would like to hear about it, to see if it ought to affect our design in some way.

@abhishek179

@vijaysun-omr Yes, I am interested in this feature. My use case is below.

Requirement: Checkpoint/restore a process tree within the Docker container namespace, from within the container, where the container is not launched with the --privileged switch.

Based on what I understood from this thread, with the new capability CAP_CHECKPOINT_RESTORE in newer kernels we should be able to run criu checkpoint/restore on a non-privileged Docker container from within the container itself, without any root access.

For existing/older kernels, I am attempting to checkpoint/restore with the following steps:

  1. Launch a container without --privileged and launch the application.
  2. Enter the container namespaces as root from the host with "nsenter -a -t %pid%", which will have all capabilities.
  3. Take a CRIU checkpoint of the process tree using the elevated credentials, then kill the container.
  4. Launch a new container with the /proc filesystem read-write and enter its namespaces as root again from the host.
  5. Run criu restore from the elevated shell and make sure to exit criu.

@dsouzai
Contributor

dsouzai commented Apr 26, 2022

> Based on what i understood from this thread that with this new capability "CAP_CHECKPOINT_RESTORE" in newer Kernels we should be able to take criu checkpoint/restore on a non-privileged docker container from within the container itself without any root access.

You would need CAP_CHECKPOINT_RESTORE AND checkpoint-restore/criu#1155 in order to do an unprivileged checkpoint. I think in our case we needed CAP_CHECKPOINT_RESTORE and NET_ADMIN. We set those caps on the criu binary (at image build time).

On the restore run, you have to give docker --cap-add=CHECKPOINT_RESTORE,NET_ADMIN --security-opt seccomp=unconfined and possibly also --security-opt systempaths=unconfined (though that last one may not be needed if your runc contains opencontainers/runc#3451 and you specify a volume mount for /proc/sys/kernel/ns_last_pid).

In our tests, the processes that checkpoint and restore are both non-root inside the container.

@tajila tajila removed the beta Used to track items that will be included in a feature beta release label Jun 29, 2022
5 participants