Permission denied when container process executes close_range syscall #10337

Closed
smac89 opened this issue Apr 24, 2021 · 29 comments
Labels
kind/bug · locked - please file new issue/PR · stale-issue

Comments


smac89 commented Apr 24, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
I have an application running inside a container that uses the close_range syscall. When I run the container and the application makes that syscall, it fails with "Operation not permitted" (EPERM).

At first I was thinking this was a problem with the application, but after some investigating, I am starting to think this may be a podman issue and may have something to do with how it handles seccomp profiles.

Steps to reproduce the issue:

walk.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/close_range.h>
#include <linux/limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>

/* Show the contents of the symbolic links in /proc/self/fd */

static void
show_fds(void)
{
   DIR *dirp = opendir("/proc/self/fd");
   if (dirp  == NULL) {
       perror("opendir");
       exit(EXIT_FAILURE);
   }

   for (;;) {
       struct dirent *dp = readdir(dirp);
       if (dp == NULL)
           break;

       if (dp->d_type == DT_LNK) {
           char path[PATH_MAX], target[PATH_MAX];
           snprintf(path, sizeof(path), "/proc/self/fd/%s",
                    dp->d_name);

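           ssize_t len = readlink(path, target, sizeof(target));
           if (len == -1) {
               /* The entry can vanish between readdir() and readlink() */
               perror(path);
               continue;
           }
           printf("%s ==> %.*s\n", path, (int) len, target);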
       }
   }

   closedir(dirp);
}

int
main(int argc, char *argv[])
{
   for (int j = 1; j < argc; j++) {
       int fd = open(argv[j], O_RDONLY);
       if (fd == -1) {
           perror(argv[j]);
           exit(EXIT_FAILURE);
       }
       printf("%s opened as FD %d\n", argv[j], fd);
   }

   show_fds();

   printf("========= About to call close_range() =======\n");

   if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
       perror("close_range");
       exit(EXIT_FAILURE);
   }

   show_fds();
   exit(EXIT_SUCCESS);
}
  1. Copy the above source file (walk.c) to /tmp on your host machine

  2. Using buildah:

buildah bud --no-cache --platform linux/amd64 -f - /tmp <<'EOF'
FROM alpine:edge
RUN apk update && apk add --upgrade build-base libc-dev linux-headers
COPY walk.c /app/walk.c
RUN gcc -o /app/walk /app/walk.c
ENTRYPOINT ["/app/walk"]
EOF
  3. Run the resulting image with podman (replace 7bd46f9814bb with the ID of the built image)
podman run --rm -it 7bd46f9814bb /app/walk.c

Describe the results you received:

The result will look something like:

/app/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/0
/proc/self/fd/1 ==> /dev/pts/0
/proc/self/fd/2 ==> /dev/pts/0
/proc/self/fd/3 ==> /app/walk.c
/proc/self/fd/4 ==> /proc/1/fd
========= About to call close_range() =======
close_range: Operation not permitted

Describe the results you expected:

Now repeat the same process on your host Linux machine (assuming you are running at least kernel version 5.9).

The program should run successfully with an output similar to:

/tmp/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /tmp/walk.c
/proc/self/fd/4 ==> /proc/547032/fd
========= About to call close_range() =======
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /proc/547032/fd

This is what I expected inside the container

Additional information you deem important (e.g. issue happens only occasionally):

If you run the image with the option --security-opt seccomp=unconfined, everything works fine.

Does that mean Podman is simply blocking the close_range syscall? Where does Podman's default seccomp.json file live? I was under the impression that it uses the default profile from Docker, which whitelists the close_range syscall.
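For an application that has to keep working while close_range() is rejected, one common workaround is to fall back to closing descriptors one at a time. The following is a minimal sketch under stated assumptions, not a fix for the underlying problem: `close_from` is a hypothetical helper name, and it assumes /proc is mounted, as it is in the container output above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: close every descriptor >= lowfd.
 * Tries close_range() first. If the call is rejected with EPERM (for
 * example by a seccomp filter) or ENOSYS (kernel without close_range),
 * it walks /proc/self/fd and closes the entries individually.
 * Returns 0 on success, -1 on failure with errno set. */
static int
close_from(int lowfd)
{
#ifdef __NR_close_range
    if (syscall(__NR_close_range, (unsigned) lowfd, ~0U, 0U) == 0)
        return 0;
    if (errno != EPERM && errno != ENOSYS)
        return -1;
#endif

    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == NULL)
        return -1;

    int self = dirfd(dirp);   /* keep the directory stream's own fd open */
    struct dirent *dp;
    while ((dp = readdir(dirp)) != NULL) {
        char *end;
        long fd = strtol(dp->d_name, &end, 10);
        if (end == dp->d_name || *end != '\0')
            continue;         /* skips "." and ".." */
        if (fd >= lowfd && fd != self)
            close((int) fd);
    }
    closedir(dirp);
    return 0;
}
```

In walk.c, the direct syscall(__NR_close_range, 3, ~0U, 0) call could be swapped for close_from(3) to get the same effect with or without the filter in place.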

Output of podman version:

Version:      3.1.2
API Version:  3.1.2
Go Version:   go1.16.3
Git Commit:   51b8ddbc22cf5b10dd76dd9243924aa66ad7db39
Built:        Wed Apr 21 15:34:03 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.20.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.0.27-1
    path: /usr/bin/conmon
    version: 'conmon version 2.0.27, commit: 65fad4bfcb250df0435ea668017e643e7f462155'
  cpus: 12
  distribution:
    distribution: arcolinux
    version: unknown
  eventLogger: journald
  hostname: ArcoB
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 5.11.16-arch1-1
  linkmode: dynamic
  memFree: 18868236288
  memTotal: 41711120384
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 0.19.1-1
    path: /usr/bin/crun
    version: |-
      crun version 0.19.1
      commit: 1535fedf0b83fb898d449f9680000f729ba719f5
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.1.9-1
    version: |-
      slirp4netns version 1.1.9
      commit: 4e37ea557562e0d7a64dc636eff156f64927335e
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 32211202048
  swapTotal: 32211202048
  uptime: 6h 42m 15.48s (Approximately 0.25 days)
registries:
  search:
  - docker.io
  - ghcr.io
store:
  configFile: /home/chigozirim/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 1
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: /usr/bin/fuse-overlayfs is owned by fuse-overlayfs 1.5.0-1
      Version: |-
        fusermount3 version: 3.10.3
        fuse-overlayfs: version 1.5
        FUSE library version 3.10.3
        using FUSE kernel interface version 7.31
  graphRoot: /home/chigozirim/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 5
  runRoot: /run/user/1000/containers
  volumePath: /home/chigozirim/.local/share/containers/storage/volumes
version:
  APIVersion: 3.1.2
  Built: 1619040843
  BuiltTime: Wed Apr 21 15:34:03 2021
  GitCommit: 51b8ddbc22cf5b10dd76dd9243924aa66ad7db39
  GoVersion: go1.16.3
  OsArch: linux/amd64
  Version: 3.1.2

Package info (e.g. output of rpm -q podman or apt list podman):

Name                  : podman
Version               : 3.1.2-1
Description           : Tool and library for running OCI-based containers in
                        pods
URL                   : https://github.com/containers/libpod
Licenses              : Apache
Repository            : community
Installed Size        : 76.0 MB
Depends On            : cni-plugins conmon containers-common device-mapper
                        iptables libseccomp runc slirp4netns libsystemd
                        fuse-overlayfs libgpgme.so=11-64
Optional Dependencies : podman-docker: for Docker-compatible CLI [Installed]
                        btrfs-progs: support btrfs backend devices [Installed]
                        catatonit: --init flag support [Installed]
                        crun: support for unified cgroupsv2 [Installed]
Make Dependencies     : btrfs-progs go go-md2man git gpgme systemd
Packager              : Morten Linderud <[email protected]>
Build Date            : 2021-04-21
Install Date          : 2021-04-21
Install Reason        : Explicitly installed
Signatures            : Yes
Backup files          : /etc/cni/net.d/87-podman-bridge.conflist

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Member

mheon commented Apr 24, 2021

Doesn't look like Seccomp. Our default profile lives at https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json#L80 and you can see that close_range is in the list of allowed calls.

Author

smac89 commented Apr 25, 2021

@mheon Do you have any other explanation for this behavior?

The reason I brought up seccomp is that, as I said, using --security-opt seccomp=unconfined allows the container to run just fine. So why does this flag work if the problem has nothing to do with seccomp?

I've used strace in the real container: once with the flag and once without. With the flag, the strace log shows that the close_range syscall succeeds:

574   close_range(0, -1, CLOSE_RANGE_CLOEXEC <unfinished ...>
557   <... poll resumed>)               = 0 (Timeout)
559   futex(0x55b1c61a09b0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
574   <... close_range resumed>)        = 0

Without the flag, we get the following:

571   close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)
571   +++ exited with 127 +++

(The number beside each syscall is the process ID.)

Member

mheon commented Apr 25, 2021

Can you verify which profile is in use in the container you're running in? The default Podman profile does allow the syscall, so I have to assume your system may not be using the default.

Member

mheon commented Apr 25, 2021

The default profile should live at /usr/share/containers/seccomp.json. However, if an alternative is present at /etc/containers/seccomp.json we will use that one instead.
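The lookup order described here can be sketched in a few lines of C. This only mirrors the rule as stated in this comment (an override under /etc wins over the default under /usr/share); `pick_profile` is a hypothetical helper and not Podman's actual code.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not Podman's implementation: return override_path
 * if that file exists, otherwise default_path if it exists, otherwise NULL. */
static const char *
pick_profile(const char *override_path, const char *default_path)
{
    if (access(override_path, F_OK) == 0)
        return override_path;
    if (access(default_path, F_OK) == 0)
        return default_path;
    return NULL;
}
```

Calling pick_profile("/etc/containers/seccomp.json", "/usr/share/containers/seccomp.json") returns the path that would be preferred under this rule, which is the file worth inspecting first.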

Author

smac89 commented Apr 25, 2021

> Can you verify which profile is in use in the container you're running in? The default Podman profile does allow the syscall, so I have to assume your system may not be using the default.

How do I do this, please?

I did:

podman create <image_hash>
podman inspect <container_name>
The output:

[
    {
        "Id": "03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478",
        "Created": "2021-04-25T17:29:52.519720505-06:00",
        "Path": "/app/walk",
        "Args": [
            "/app/walk"
        ],
        "State": {
            "OciVersion": "1.0.2-dev",
            "Status": "configured",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Healthcheck": {
                "Status": "",
                "FailingStreak": 0,
                "Log": null
            }
        },
        "Image": "b8adaf3fbdcf539038216f0e061b638003ac8708cc6933177dd9f8dba0c4cd4e",
        "ImageName": "b8adaf3fbdc",
        "Rootfs": "",
        "Pod": "",
        "ResolvConfPath": "",
        "HostnamePath": "",
        "HostsPath": "",
        "StaticDir": "/home/chigozirim/.local/share/containers/storage/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata",
        "OCIRuntime": "crun",
        "ConmonPidFile": "/run/user/1000/containers/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata/conmon.pid",
        "Name": "priceless_meninsky",
        "RestartCount": 0,
        "Driver": "overlay",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "EffectiveCaps": [
            "CAP_CHOWN",
            "CAP_DAC_OVERRIDE",
            "CAP_FOWNER",
            "CAP_FSETID",
            "CAP_KILL",
            "CAP_NET_BIND_SERVICE",
            "CAP_SETFCAP",
            "CAP_SETGID",
            "CAP_SETPCAP",
            "CAP_SETUID",
            "CAP_SYS_CHROOT"
        ],
        "BoundingCaps": [
            "CAP_CHOWN",
            "CAP_DAC_OVERRIDE",
            "CAP_FOWNER",
            "CAP_FSETID",
            "CAP_KILL",
            "CAP_NET_BIND_SERVICE",
            "CAP_SETFCAP",
            "CAP_SETGID",
            "CAP_SETPCAP",
            "CAP_SETUID",
            "CAP_SYS_CHROOT"
        ],
        "ExecIDs": [],
        "GraphDriver": {
            "Name": "overlay",
            "Data": {
                "LowerDir": "/home/chigozirim/.local/share/containers/storage/overlay/899938f8a7d4f906eda9dda6f1a413cd792177f6cb2af01d18fd215eab659cd5/diff:/home/chigozirim/.local/share/containers/storage/overlay/30d61bb737bb9be7178afce441d0ca5098909a59001a0301d3b50544e659ace1/diff",
                "UpperDir": "/home/chigozirim/.local/share/containers/storage/overlay/dbc15944f329eec9343405100a0d3095cffd6b0ed5885f365cdfbb7e327817fc/diff",
                "WorkDir": "/home/chigozirim/.local/share/containers/storage/overlay/dbc15944f329eec9343405100a0d3095cffd6b0ed5885f365cdfbb7e327817fc/work"
            }
        },
        "Mounts": [],
        "Dependencies": [],
        "NetworkSettings": {
            "EndpointID": "",
            "Gateway": "",
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "MacAddress": "",
            "Bridge": "",
            "SandboxID": "",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": ""
        },
        "ExitCommand": [
            "/usr/bin/podman",
            "--root",
            "/home/chigozirim/.local/share/containers/storage",
            "--runroot",
            "/run/user/1000/containers",
            "--log-level",
            "warning",
            "--cgroup-manager",
            "systemd",
            "--tmpdir",
            "/run/user/1000/libpod/tmp",
            "--runtime",
            "crun",
            "--storage-driver",
            "overlay",
            "--storage-opt",
            "overlay.mount_program=/usr/bin/fuse-overlayfs",
            "--events-backend",
            "journald",
            "container",
            "cleanup",
            "03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478"
        ],
        "Namespace": "",
        "IsInfra": false,
        "Config": {
            "Hostname": "03803ce5d0f2",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "TERM=xterm",
                "container=podman"
            ],
            "Cmd": null,
            "Image": "b8adaf3fbdc",
            "Volumes": null,
            "WorkingDir": "/",
            "Entrypoint": "/app/walk",
            "OnBuild": null,
            "Labels": {
                "io.buildah.version": "1.20.1"
            },
            "Annotations": {
                "io.kubernetes.cri-o.TTY": "false",
                "io.podman.annotations.autoremove": "FALSE",
                "io.podman.annotations.init": "FALSE",
                "io.podman.annotations.privileged": "FALSE",
                "io.podman.annotations.publish-all": "FALSE"
            },
            "StopSignal": 15,
            "CreateCommand": [
                "podman",
                "create",
                "b8adaf3fbdc"
            ],
            "Umask": "0022"
        },
        "HostConfig": {
            "Binds": [],
            "CgroupManager": "systemd",
            "CgroupMode": "private",
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "k8s-file",
                "Config": null,
                "Path": "/home/chigozirim/.local/share/containers/storage/overlay-containers/03803ce5d0f2421e3e4a0778c9262d834f91ad3df96ff962147cb767667d4478/userdata/ctr.log",
                "Tag": "",
                "Size": "0B"
            },
            "NetworkMode": "slirp4netns",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": [],
            "CapDrop": [
                "CAP_AUDIT_WRITE",
                "CAP_MKNOD",
                "CAP_NET_RAW"
            ],
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": [],
            "GroupAdd": [],
            "IpcMode": "private",
            "Cgroup": "",
            "Cgroups": "default",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "private",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [],
            "Tmpfs": {},
            "UTSMode": "private",
            "UsernsMode": "",
            "ShmSize": 65536000,
            "Runtime": "oci",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "user.slice",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DiskQuota": 0,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": 0,
            "OomKillDisable": false,
            "PidsLimit": 2048,
            "Ulimits": [],
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "CgroupConf": null
        }
    }
]

Author

smac89 commented Apr 25, 2021

I've also checked the installed profile (both /usr/share/containers/seccomp.json and /etc/containers/seccomp.json are the same), and here it is:

seccomp.json

{
	"defaultAction": "SCMP_ACT_ERRNO",
	"archMap": [
		{
			"architecture": "SCMP_ARCH_X86_64",
			"subArchitectures": [
				"SCMP_ARCH_X86",
				"SCMP_ARCH_X32"
			]
		},
		{
			"architecture": "SCMP_ARCH_AARCH64",
			"subArchitectures": [
				"SCMP_ARCH_ARM"
			]
		},
		{
			"architecture": "SCMP_ARCH_MIPS64",
			"subArchitectures": [
				"SCMP_ARCH_MIPS",
				"SCMP_ARCH_MIPS64N32"
			]
		},
		{
			"architecture": "SCMP_ARCH_MIPS64N32",
			"subArchitectures": [
				"SCMP_ARCH_MIPS",
				"SCMP_ARCH_MIPS64"
			]
		},
		{
			"architecture": "SCMP_ARCH_MIPSEL64",
			"subArchitectures": [
				"SCMP_ARCH_MIPSEL",
				"SCMP_ARCH_MIPSEL64N32"
			]
		},
		{
			"architecture": "SCMP_ARCH_MIPSEL64N32",
			"subArchitectures": [
				"SCMP_ARCH_MIPSEL",
				"SCMP_ARCH_MIPSEL64"
			]
		},
		{
			"architecture": "SCMP_ARCH_S390X",
			"subArchitectures": [
				"SCMP_ARCH_S390"
			]
		}
	],
	"syscalls": [
		{
			"names": [
				"_llseek",
				"_newselect",
				"accept",
				"accept4",
				"access",
				"adjtimex",
				"alarm",
				"bind",
				"brk",
				"capget",
				"capset",
				"chdir",
				"chmod",
				"chown",
				"chown32",
				"clock_adjtime",
				"clock_adjtime64",
				"clock_getres",
				"clock_getres_time64",
				"clock_gettime",
				"clock_gettime64",
				"clock_nanosleep",
				"clock_nanosleep_time64",
				"clone",
				"close",
				"close_range",
				"connect",
				"copy_file_range",
				"creat",
				"dup",
				"dup2",
				"dup3",
				"epoll_create",
				"epoll_create1",
				"epoll_ctl",
				"epoll_ctl_old",
				"epoll_pwait",
				"epoll_pwait2",
				"epoll_wait",
				"epoll_wait_old",
				"eventfd",
				"eventfd2",
				"execve",
				"execveat",
				"exit",
				"exit_group",
				"faccessat",
				"faccessat2",
				"fadvise64",
				"fadvise64_64",
				"fallocate",
				"fanotify_mark",
				"fchdir",
				"fchmod",
				"fchmodat",
				"fchown",
				"fchown32",
				"fchownat",
				"fcntl",
				"fcntl64",
				"fdatasync",
				"fgetxattr",
				"flistxattr",
				"flock",
				"fork",
				"fremovexattr",
				"fsconfig",
				"fsetxattr",
				"fsmount",
				"fsopen",
				"fspick",
				"fstat",
				"fstat64",
				"fstatat64",
				"fstatfs",
				"fstatfs64",
				"fsync",
				"ftruncate",
				"ftruncate64",
				"futex",
				"futimesat",
				"get_robust_list",
				"get_thread_area",
				"getcpu",
				"getcwd",
				"getdents",
				"getdents64",
				"getegid",
				"getegid32",
				"geteuid",
				"geteuid32",
				"getgid",
				"getgid32",
				"getgroups",
				"getgroups32",
				"getitimer",
				"getpeername",
				"getpgid",
				"getpgrp",
				"getpid",
				"getppid",
				"getpriority",
				"getrandom",
				"getresgid",
				"getresgid32",
				"getresuid",
				"getresuid32",
				"getrlimit",
				"getrusage",
				"getsid",
				"getsockname",
				"getsockopt",
				"gettid",
				"gettimeofday",
				"getuid",
				"getuid32",
				"getxattr",
				"inotify_add_watch",
				"inotify_init",
				"inotify_init1",
				"inotify_rm_watch",
				"io_cancel",
				"io_destroy",
				"io_getevents",
				"io_setup",
				"io_submit",
				"ioctl",
				"ioprio_get",
				"ioprio_set",
				"ipc",
				"keyctl",
				"kill",
				"lchown",
				"lchown32",
				"lgetxattr",
				"link",
				"linkat",
				"listen",
				"listxattr",
				"llistxattr",
				"lremovexattr",
				"lseek",
				"lsetxattr",
				"lstat",
				"lstat64",
				"madvise",
				"memfd_create",
				"mincore",
				"mkdir",
				"mkdirat",
				"mknod",
				"mknodat",
				"mlock",
				"mlock2",
				"mlockall",
				"mmap",
				"mmap2",
				"mount",
				"move_mount",
				"mprotect",
				"mq_getsetattr",
				"mq_notify",
				"mq_open",
				"mq_timedreceive",
				"mq_timedsend",
				"mq_unlink",
				"mremap",
				"msgctl",
				"msgget",
				"msgrcv",
				"msgsnd",
				"msync",
				"munlock",
				"munlockall",
				"munmap",
				"name_to_handle_at",
				"nanosleep",
				"newfstatat",
				"open",
				"openat",
				"openat2",
				"open_tree",
				"pause",
				"pidfd_getfd",
				"pidfd_open",
				"pidfd_send_signal",
				"pipe",
				"pipe2",
				"pivot_root",
				"poll",
				"ppoll",
				"ppoll_time64",
				"prctl",
				"pread64",
				"preadv",
				"preadv2",
				"prlimit64",
				"pselect6",
				"pselect6_time64",
				"pwrite64",
				"pwritev",
				"pwritev2",
				"read",
				"readahead",
				"readlink",
				"readlinkat",
				"readv",
				"reboot",
				"recv",
				"recvfrom",
				"recvmmsg",
				"recvmsg",
				"remap_file_pages",
				"removexattr",
				"rename",
				"renameat",
				"renameat2",
				"restart_syscall",
				"rmdir",
				"rt_sigaction",
				"rt_sigpending",
				"rt_sigprocmask",
				"rt_sigqueueinfo",
				"rt_sigreturn",
				"rt_sigsuspend",
				"rt_sigtimedwait",
				"rt_tgsigqueueinfo",
				"sched_get_priority_max",
				"sched_get_priority_min",
				"sched_getaffinity",
				"sched_getattr",
				"sched_getparam",
				"sched_getscheduler",
				"sched_rr_get_interval",
				"sched_setaffinity",
				"sched_setattr",
				"sched_setparam",
				"sched_setscheduler",
				"sched_yield",
				"seccomp",
				"select",
				"semctl",
				"semget",
				"semop",
				"semtimedop",
				"send",
				"sendfile",
				"sendfile64",
				"sendmmsg",
				"sendmsg",
				"sendto",
				"setns",
				"set_robust_list",
				"set_thread_area",
				"set_tid_address",
				"setfsgid",
				"setfsgid32",
				"setfsuid",
				"setfsuid32",
				"setgid",
				"setgid32",
				"setgroups",
				"setgroups32",
				"setitimer",
				"setpgid",
				"setpriority",
				"setregid",
				"setregid32",
				"setresgid",
				"setresgid32",
				"setresuid",
				"setresuid32",
				"setreuid",
				"setreuid32",
				"setrlimit",
				"setsid",
				"setsockopt",
				"setuid",
				"setuid32",
				"setxattr",
				"shmat",
				"shmctl",
				"shmdt",
				"shmget",
				"shutdown",
				"sigaltstack",
				"signalfd",
				"signalfd4",
				"sigreturn",
				"socketcall",
				"socketpair",
				"splice",
				"stat",
				"stat64",
				"statfs",
				"statfs64",
				"statx",
				"symlink",
				"symlinkat",
				"sync",
				"sync_file_range",
				"syncfs",
				"sysinfo",
				"syslog",
				"tee",
				"tgkill",
				"time",
				"timer_create",
				"timer_delete",
				"timer_getoverrun",
				"timer_gettime",
				"timer_gettime64",
				"timer_settime",
				"timerfd_create",
				"timerfd_gettime",
				"timerfd_gettime64",
				"timerfd_settime",
				"timerfd_settime64",
				"times",
				"tkill",
				"truncate",
				"truncate64",
				"ugetrlimit",
				"umask",
				"umount",
				"umount2",
				"uname",
				"unlink",
				"unlinkat",
				"unshare",
				"utime",
				"utimensat",
				"utimensat_time64",
				"utimes",
				"vfork",
				"wait4",
				"waitid",
				"waitpid",
				"write",
				"writev"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"personality"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 0,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"personality"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 8,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"personality"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 131072,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"personality"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 131080,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"personality"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 4294967295,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {}
		},
		{
			"names": [
				"sync_file_range2"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"arches": [
					"ppc64le"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"arm_fadvise64_64",
				"arm_sync_file_range",
				"sync_file_range2",
				"breakpoint",
				"cacheflush",
				"set_tls"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"arches": [
					"arm",
					"arm64"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"arch_prctl"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"arches": [
					"amd64",
					"x32"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"modify_ldt"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"arches": [
					"amd64",
					"x32",
					"x86"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"s390_pci_mmio_read",
				"s390_pci_mmio_write",
				"s390_runtime_instr"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"arches": [
					"s390",
					"s390x"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"open_by_handle_at"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_DAC_READ_SEARCH"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"bpf",
				"fanotify_init",
				"lookup_dcookie",
				"perf_event_open",
				"quotactl",
				"setdomainname",
				"sethostname",
				"setns"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_ADMIN"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"chroot"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_CHROOT"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"delete_module",
				"init_module",
				"finit_module",
				"query_module"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_MODULE"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"get_mempolicy",
				"mbind",
				"set_mempolicy"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_NICE"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"acct"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_PACCT"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"kcmp",
				"process_madvise",
				"process_vm_readv",
				"process_vm_writev",
				"ptrace"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_PTRACE"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"iopl",
				"ioperm"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_RAWIO"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"settimeofday",
				"stime",
				"clock_settime",
				"clock_settime64"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_TIME"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"vhangup"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_TTY_CONFIG"
				]
			},
			"excludes": {}
		},
		{
			"names": [
				"socket"
			],
			"action": "SCMP_ACT_ERRNO",
			"args": [
				{
					"index": 0,
					"value": 16,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				},
				{
					"index": 2,
					"value": 9,
					"valueTwo": 0,
					"op": "SCMP_CMP_EQ"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {
				"caps": [
					"CAP_AUDIT_WRITE"
				]
			},
			"errnoRet": 22
		},
		{
			"names": [
				"socket"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 2,
					"value": 9,
					"valueTwo": 0,
					"op": "SCMP_CMP_NE"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {
				"caps": [
					"CAP_AUDIT_WRITE"
				]
			}
		},
		{
			"names": [
				"socket"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 0,
					"value": 16,
					"valueTwo": 0,
					"op": "SCMP_CMP_NE"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {
				"caps": [
					"CAP_AUDIT_WRITE"
				]
			}
		},
		{
			"names": [
				"socket"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [
				{
					"index": 2,
					"value": 9,
					"valueTwo": 0,
					"op": "SCMP_CMP_NE"
				}
			],
			"comment": "",
			"includes": {},
			"excludes": {
				"caps": [
					"CAP_AUDIT_WRITE"
				]
			}
		},
		{
			"names": [
				"socket"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": null,
			"comment": "",
			"includes": {
				"caps": [
					"CAP_AUDIT_WRITE"
				]
			},
			"excludes": {}
		}
	]
}

@mheon
Member

mheon commented Apr 26, 2021

Your Seccomp profile does include close_range in the list of allowed calls, so Podman and Libseccomp should not be generating profiles that block it. It's not conditional in any way, either - allowed without any checks.

@rhatdan
Member

rhatdan commented Apr 26, 2021

You should see the denied seccomp call in /var/log/audit/audit.log

ausearch -m seccomp -i

@smac89
Author

smac89 commented Apr 27, 2021

> You should see the denied seccomp call in /var/log/audit/audit.log
>
> ausearch -m seccomp -i

@rhatdan

I do:

----
type=SECCOMP msg=audit(2021-04-27 14:04:24.740:425) : auid=chigozirim uid=unknown(10099) gid=unknown(10099) ses=2 subj==unconfined pid=190649 comm=xfce4-terminal exe=/usr/bin/xfce4-terminal sig=SIG0 arch=x86_64 syscall=close_range compat=0 ip=0x7f76336d8a9d code=errno

Like I said, this only happens inside the container. On my host machine, the problem never occurs.

@rhatdan
Member

rhatdan commented Apr 27, 2021

Something is going wrong then: some kind of mismatch between what the OCI runtime understands close_range to be and what the kernel does.
You see close_range in /usr/share/containers/seccomp.json correct?

I just wrote a quick patch to podman info to show which seccomp.json file the tool is using.

@smac89
Author

smac89 commented Apr 27, 2021

> You see close_range in /usr/share/containers/seccomp.json correct?

Indeed I do

➜ grep -C4 'close_range' /usr/share/containers/seccomp.json
				"clock_nanosleep",
				"clock_nanosleep_time64",
				"clone",
				"close",
				"close_range",
				"connect",
				"copy_file_range",
				"creat",
				"dup",

@rhatdan
Member

rhatdan commented Apr 27, 2021

Are you using runc or crun?

@giuseppe ideas?

@smac89
Author

smac89 commented Apr 27, 2021

I am using crun. I can switch back to runc and test it. Let me do that.

The same issue occurs with runc.

@smac89
Author

smac89 commented Apr 27, 2021

Also when I switch to runc, the error is not detected by auditd (i.e. I don't see it in the logs), but when I strace the command, I see that it still ends at close_range:

[pid   706] close_range(0, -1, CLOSE_RANGE_CLOEXEC) = -1 EPERM (Operation not permitted)

@giuseppe
Member

giuseppe commented May 3, 2021

> @giuseppe ideas?

close_range is used by crun.

This is again the same EPERM vs ENOSYS issue we already faced a few months ago.

I think it is time we switch to using ENOSYS by default; the only issue AFAIK is that runc doesn't support it yet (opencontainers/runtime-spec#1087).

CC @kolyshkin
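
For readers wondering why the errno matters here, below is a minimal sketch, assuming a caller that treats only ENOSYS as "this syscall is unavailable, use the old code path". The helper names `is_unsupported` and `close_from` are hypothetical and are not taken from crun, runc, or any application mentioned in this thread. The fallback value 436 for `SYS_close_range` is the x86_64 number and is only used if the installed headers do not define it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436 /* assumption: x86_64 number, for older headers */
#endif

/* Hypothetical helper: does this errno mean "syscall not available"? */
int is_unsupported(int err)
{
    return err == ENOSYS;
}

/* Hypothetical helper: close every descriptor >= first.
 * Returns 0 on success, -1 with errno set on a hard failure. */
int close_from(unsigned int first)
{
    /* Raw syscall so the sketch does not depend on a libc wrapper. */
    if (syscall(SYS_close_range, first, ~0U, 0U) == 0)
        return 0;

    /* An EPERM produced by a seccomp filter ends up here: the caller
     * cannot tell it apart from a genuine permission problem. */
    if (!is_unsupported(errno))
        return -1;

    /* ENOSYS: fall back to closing descriptors one by one. */
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0)
        max = 1024; /* arbitrary fallback limit for this sketch */
    for (long fd = (long)first; fd < max; fd++)
        close((int)fd); /* EBADF for unused descriptors is expected */
    return 0;
}
```

With a filter that answers ENOSYS, `close_from(3)` would quietly take the loop; with one that answers EPERM, it returns -1, and that is the kind of failure reported in this issue.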

@paravz

paravz commented May 14, 2021

Seeing the same issue on F33; starting the container with --security-opt=seccomp=unconfined works around it.

$  grep -C4 'close_range' /usr/share/containers/seccomp.json
                                "clock_nanosleep",
                                "clock_nanosleep_time64",
                                "clone",
                                "close",
                                "close_range",
                                "connect",
                                "copy_file_range",
                                "creat",
                                "dup",

$  rpm -qf /usr/share/containers/seccomp.json
containers-common-1-10.fc33.noarch

$  rpm -q podman runc crun
podman-3.1.0-3.fc33.x86_64
runc-1.0.0-377.rc93.fc33.x86_64
crun-0.19.1-2.fc33.x86_64

#Edit: seccomp audit message:
audit[1112353]: SECCOMP auid=1000 uid=1000 gid=1000 ses=3 subj=system_u:system_r:container_init_t:s0:c344,c914 pid=1112353 comm="xfce4-terminal" exe="/usr/bin/xfce4-terminal" sig=0 arch=c000003e syscall=436 compat=0 ip=0x7f6184e4f15d code=0x50000
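
A note for anyone reading the raw record above: it gives the architecture, syscall, and seccomp action as numbers. The sketch below shows one way to check them, assuming the constant values from the Linux uapi headers (`AUDIT_ARCH_X86_64`, `__NR_close_range` on x86_64, `SECCOMP_RET_ACTION_FULL`, `SECCOMP_RET_ERRNO`). The `DEMO_*` macros and both function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Values copied by hand from the Linux uapi headers; verify them against
 * <linux/audit.h>, <asm/unistd_64.h> and <linux/seccomp.h> on your system. */
#define DEMO_AUDIT_ARCH_X86_64 0xC000003EU /* AUDIT_ARCH_X86_64 */
#define DEMO_NR_CLOSE_RANGE    436         /* __NR_close_range on x86_64 */
#define DEMO_RET_ACTION_FULL   0xFFFF0000U /* SECCOMP_RET_ACTION_FULL */
#define DEMO_RET_ERRNO         0x00050000U /* SECCOMP_RET_ERRNO */

/* Hypothetical helper: is (arch, syscall) the x86_64 close_range call? */
int record_is_close_range(uint32_t arch, long nr)
{
    return arch == DEMO_AUDIT_ARCH_X86_64 && nr == DEMO_NR_CLOSE_RANGE;
}

/* Hypothetical helper: did the filter answer with an errno, as opposed
 * to killing the process or allowing the call? */
int record_is_errno_action(uint32_t code)
{
    return (code & DEMO_RET_ACTION_FULL) == DEMO_RET_ERRNO;
}
```

For this record, `record_is_close_range(0xc000003e, 436)` and `record_is_errno_action(0x50000)` both hold, which matches the `syscall=close_range ... code=errno` form that `ausearch -i` printed earlier in the thread.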

@paravz

paravz commented May 14, 2021

@smac89 love your bug report, so easy to reproduce!

@giuseppe giuseppe transferred this issue from containers/podman May 14, 2021
@giuseppe giuseppe transferred this issue from containers/crun May 14, 2021
@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label May 14, 2021
@giuseppe
Member

We also need an updated libseccomp that knows about close_range, and apparently that is not available even upstream at the moment.

@rhatdan
Member

rhatdan commented May 20, 2021

@giuseppe Did you open a PR with libseccomp to add this?

@giuseppe
Member

I think at this point it is easier to fix it for good in our default seccomp profile, now that runc rc95 is out with the feature we need. Also, libseccomp uses some scripts to read all the syscalls from the kernel sources, so it is not necessary to update it manually.

@rhatdan
Member

rhatdan commented May 21, 2021

OK, what are our next steps then? Do we need a new PR to Podman? Containers-common?

@giuseppe
Member

PR opened here: containers/common#573

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@giuseppe
Member

This is fixed in c/common.

@rmsc

rmsc commented Jul 7, 2021

I'm still facing this same issue with containers/common-0.40.1:

$ pacman -Ss containers-common
community/containers-common 0.40.1-2 [installed]
    Configuration files and manpages for containers
$ podman run --rm -it a83749b0c3fdecb23737bcbc591262cbd8fc91f517b5d61106273d1965658320 /app/walk.c
/app/walk.c opened as FD 3
/proc/self/fd/0 ==> /dev/pts/0
/proc/self/fd/1 ==> /dev/pts/0
/proc/self/fd/2 ==> /dev/pts/0
/proc/self/fd/3 ==> /app/walk.c
/proc/self/fd/4 ==> /proc/1/fd
========= About to call close_range() =======
close_range: Operation not permitted

@rmsc

rmsc commented Jul 7, 2021

In my case it seems that a stale config file was probably to blame. Removing and reinstalling the files in /etc/containers fixed this for me. Sorry for the noise.

EDIT: the problem is actually still here.

@rmsc

rmsc commented Jul 7, 2021

I just triple checked, and I'm now in a very weird situation:

  • the problem is still 100% reproducible using the steps above by @smac89
  • the problem is completely gone on my own setup (which is puzzling).

I'm now hitting another error (Error: capset: Operation not permitted: OCI permission denied), but that is in a podman-in-podman situation, and easy to work around for now with --cap-drop all.

@erikarvstedt

Support for close_range has only recently been added to libseccomp:
seccomp/libseccomp@ac849e7

@idleroamer
Contributor

idleroamer commented Nov 13, 2021

In the dunfell Yocto build, even on podman 3.4.2 with seccomp/libseccomp@ac849e7, I still observe the same defect:
Error: OCI runtime error: invalid seccomp syscall 'close_range'

Never mind: updating crun from 0.10 to 0.19 fixed the issue.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023