Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mounting of /proc/sys/kernel/ns_last_pid #3451

Merged
merged 1 commit into from
Apr 22, 2022

Conversation

dsouzai
Copy link
Contributor

@dsouzai dsouzai commented Apr 7, 2022

The CAP_CHECKPOINT_RESTORE linux capability provides the ability to update /proc/sys/kernel/ns_last_pid. However, because this "file" is under /proc, and by default both K8s and CRI-O specify that /proc/sys should be mounted as Read-Only, even with the capability specified, a process will not be able to write to ns_last_pid.

To get around this, a pod author can specify a volume mount and a host path to bind-mount /proc/sys/kernel/ns_last_pid. However, runc does not allow specifying mounts under /proc.

This PR adds /proc/sys/kernel/ns_last_pid to the validProcMounts string array to enable a pod author to mount ns_last_pid as read-write. The default remains unchanged; unless explicitly requested as a volume mount, ns_last_pid will remain read-only regardless of whether or not CAP_CHECKPOINT_RESTORE is specified.

The CAP_CHECKPOINT_RESTORE linux capability provides the ability to
update /proc/sys/kernel/ns_last_pid. However, because this file is under
/proc, and by default both K8s and CRI-O specify that /proc/sys should
be mounted as Read-Only, by default even with the capability specified,
a process will not be able to write to ns_last_pid.

To get around this, a pod author can specify a volume mount and a
hostpath to bind-mount /proc/sys/kernel/ns_last_pid. However, runc does
not allow specifying mounts under /proc.

This commit adds /proc/sys/kernel/ns_last_pid to the validProcMounts
string array to enable a pod author to mount ns_last_pid as read-write.
The default remains unchanged; unless explicitly requested as a volume
mount, ns_last_pid will remain read-only regardless of whether or not
CAP_CHECKPOINT_RESTORE is specified.

Signed-off-by: Irwin D'Souza <[email protected]>
@dsouzai
Copy link
Contributor Author

dsouzai commented Apr 7, 2022

fyi @mrunalp @haircommander

@kolyshkin
Copy link
Contributor

@dsouzai looks like your idea is to use criu from inside the container. Do you think it is feasible to do so?

@adrianreber
Copy link
Contributor

@kolyshkin see cri-o/cri-o#5776 for a discussion about possible use cases.

@dsouzai
Copy link
Contributor Author

dsouzai commented Apr 8, 2022

Do you think it is feasible to do so?

@kolyshkin Yeah, we have a prototype where we

  1. Build CRIU with the change to allow rootless checkpointing as well as other changes.
  2. Build the JVM that links the criu lib, and which exposes an API to allow java applications to self dump.
  3. Build a container in which the criu binary has the following set to it:
setcap cap_checkpoint_restore,cap_net_admin,cap_sys_ptrace=eip /usr/sbin/criu
  1. Start said container, in which the java application self checkpoints and the process ends. The container is then committed to create a restore image.
  2. Start new container based on the restore image wherein a script launches criu restore.

In K8s, I'm able to successfully run the restore container with the runc change in this PR and the following pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: liberty-criu
  namespace: criu
spec:
  replicas: 1
  selector:
    matchLabels:
      name: liberty-criu
  template:
    metadata:
      labels:
        name: liberty-criu
    spec:
      serviceAccount: criusvcacct
      serviceAccountName: criusvcacct
      containers:
        - name: liberty-criu
          image: localhost/liberty-criu:latest
          imagePullPolicy: Always
          volumeMounts:
          - mountPath: /proc/sys/kernel/ns_last_pid
            name: ns-last-pid-mount
          securityContext:
            capabilities:
              add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]
      volumes:
      - name: ns-last-pid-mount
        hostPath:
          path: /proc/sys/kernel/ns_last_pid
          type: File

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

@opencontainers/runc-maintainers PTAL

@kolyshkin
Copy link
Contributor

@AkihiroSuda PTAL

Copy link
Contributor

@rst0git rst0git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cyphar
Copy link
Member

cyphar commented May 26, 2022

@dsouzai sent a mail on oci-dev asking when this will be in a release -- given that the patch is very simple and a bugfix, maybe we should backport it to release-1.1?

@cyphar cyphar added the backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 label May 26, 2022
@kolyshkin kolyshkin added backport/1.1-done A PR in main branch which has been backported to release-1.1 and removed backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels May 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.1-done A PR in main branch which has been backported to release-1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants