youki need seccomp unconfined when runc/crun don't #2022

Closed
martadinata666 opened this issue Jun 8, 2023 · 4 comments · Fixed by #2029

@martadinata666

martadinata666 commented Jun 8, 2023

As the title says, for some reason youki needs seccomp=unconfined. The two containers I tested were mariadb and jellyfin.

Jellyfin log

jellyfin_server  | Failed to create CoreCLR, HRESULT: 0x80070008
jellyfin_server exited with code 137

Mariadb log

documize-db-1  | 2023-06-08 13:56:14+07:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.11.2+maria~debsid started.
documize-db-1  | 2023-06-08 13:56:14+07:00 [ERROR] [Entrypoint]: mysqld failed while attempting to check config
documize-db-1  | 	command was: mariadbd --character-set-server=utf8mb4 --collation-server=utf8mb4_bin --verbose --help --log-bin-index=/tmp/tmp.axLXmY9ftc
documize-db-1  | 	Can't initialize timers
documize-db-1 exited with code 0

Runtime:
Kernel 6.3
Ubuntu Jammy
Docker 24.0.2

Compiler
Rust 1.70

@utam0k
Member

utam0k commented Jun 8, 2023

👋 Hi, @martadinata666. Thanks for your report. Could you share the specific commands needed to reproduce the problem?

@martadinata666
Author

martadinata666 commented Jun 8, 2023

Hi, thanks for the response.

Let me start with a minimal compose file:

services:
  db:
    image: mariadb:lts
    environment:
      - MARIADB_ROOT_PASSWORD=mariadbnc
      - TZ=Asia/Jakarta
      - MARIADB_DATABASE=dummy
      - MARIADB_USER=dummyuser
      - MARIADB_PASSWORD=dummypass
    volumes:
      - db:/var/lib/mysql
    restart: unless-stopped

volumes:
  db:

In this setup, where Docker uses the runc runtime by default, the mariadb container starts correctly. Now on to the youki runtime:

services:
  db:
    image: mariadb:lts
    runtime: youki
    environment:
      - MARIADB_ROOT_PASSWORD=mariadbnc
      - TZ=Asia/Jakarta
      - MARIADB_DATABASE=dummy
      - MARIADB_USER=dummyuser
      - MARIADB_PASSWORD=dummypass
    volumes:
      - db:/var/lib/mysql
    restart: unless-stopped

volumes:
  db:

In this compose file I set the runtime to youki. The container should start correctly, but unfortunately it doesn't, just as in the original post. MariaDB throws an error about failing to initialize timers:

Attaching to mariadb-test-db-1
mariadb-test-db-1  | 2023-06-08 18:21:01+07:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.11.3+maria~ubu2204 started.
mariadb-test-db-1  | 2023-06-08 18:21:01+07:00 [ERROR] [Entrypoint]: mariadbd failed while attempting to check config
mariadb-test-db-1  | 	command was: mariadbd --verbose --help
mariadb-test-db-1  | 	Can't initialize timers
mariadb-test-db-1 exited with code 0

So I looked around and found

But just out of curiosity I followed the guide and updated my compose file to:

services:
  db:
    image: mariadb:lts
    runtime: youki
    security_opt:
      - seccomp=unconfined
    environment:
      - MARIADB_ROOT_PASSWORD=mariadbnc
      - TZ=Asia/Jakarta
      - MARIADB_DATABASE=dummy
      - MARIADB_USER=dummyuser
      - MARIADB_PASSWORD=dummypass
    volumes:
      - db:/var/lib/mysql
    restart: unless-stopped

volumes:
  db:

Unexpectedly, it starts correctly, so I don't know why youki needs a seccomp tweak when other runtimes don't. FYI, crun also starts the container correctly without the seccomp tweak. 🤔

@yihuaf yihuaf self-assigned this Jun 8, 2023
@yihuaf
Collaborator

yihuaf commented Jun 9, 2023

OK, so here is a preliminary investigation.

First of all, we need to update the OCI spec crate to the latest version. @utam0k, we may need your help to cut a new release of the crate. Specifically, we need this PR:

Docker's default seccomp profile maps clone3 to Error(ENOSYS). Without picking up errnoRet, the spec crate defaults the field to None, and youki turns a None value into EPERM. On newer systems, if clone3 returns EPERM, glibc fails without falling back to clone. The correct errno is ENOSYS.

I can confirm this is the cause: I got it working by directly overriding every errnoRet to ENOSYS instead of EPERM. That was only to verify that this is indeed the issue. The proper fix should come from fixing the OCI spec crate.
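To make the defaulting problem concrete, here is a minimal Python sketch (Python only for brevity; this is not youki's code). It assumes a simplified stand-in for the clone3 rule, and `parse_rule` and `resolve_errno` are hypothetical helpers that model a spec library that drops `errnoRet` and a runtime that turns a missing value into EPERM.

```python
import errno
from typing import Optional


def resolve_errno(errno_ret: Optional[int]) -> int:
    """Errno returned for an SCMP_ACT_ERRNO rule.

    Falls back to EPERM when the rule carries no explicit errnoRet,
    which is the defaulting behaviour described above.
    """
    return errno.EPERM if errno_ret is None else errno_ret


# Simplified stand-in for the clone3 rule in a default profile (assumption).
clone3_rule = {
    "names": ["clone3"],
    "action": "SCMP_ACT_ERRNO",
    "errnoRet": errno.ENOSYS,
}


def parse_rule(rule: dict, understands_errno_ret: bool) -> Optional[int]:
    """Hypothetical parser: an older spec library ignores the errnoRet field."""
    return rule.get("errnoRet") if understands_errno_ret else None


old = resolve_errno(parse_rule(clone3_rule, understands_errno_ret=False))
new = resolve_errno(parse_rule(clone3_rule, understands_errno_ret=True))
print(errno.errorcode[old], errno.errorcode[new])
```

With the field dropped the rule resolves to EPERM; once the parser picks it up, the same rule resolves to ENOSYS.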

Now, a little further down this rabbit hole, on a point that is only semi-related to this issue. Due to the nature of libseccomp, properly securing the sandbox requires a whitelist approach. In other words, dockerd's default seccomp profile denies all syscalls by default and enumerates the allowed ones. Currently, the default errno is EPERM. This is required because we want the same profile to keep working when new syscalls are introduced; otherwise, new syscalls could escape the policy. Therefore, Docker hardcodes a whitelist in its codebase.

However, this can become a problem in some cases. For example, clone3 was introduced fairly recently. An older version of Docker does not have clone3 in its whitelist. This works fine on kernel versions that predate clone3. On kernels that do have clone3, if the Docker whitelist is not updated, clone3 returns EPERM. This is a big issue because glibc does not fall back to clone when clone3 returns EPERM. The only legitimate value in this case is ENOSYS.
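The fallback behaviour can be modelled in a few lines. This is a rough Python sketch of the logic described above, not glibc's actual code; `create_thread` is a hypothetical function, and its argument stands for the errno that the seccomp filter returns for clone3 (0 meaning the call is allowed).

```python
import errno


def create_thread(clone3_errno: int) -> str:
    """Model of the clone3 -> clone fallback (illustrative only).

    Returns the name of the syscall that ends up creating the thread,
    or raises OSError when no fallback is attempted.
    """
    if clone3_errno == 0:
        return "clone3"
    if clone3_errno == errno.ENOSYS:
        # "Not implemented": retry with the older clone syscall.
        return "clone"
    # Any other errno, including EPERM, is treated as a real failure.
    raise OSError(clone3_errno, "thread creation failed")


print(create_thread(errno.ENOSYS))
try:
    create_thread(errno.EPERM)
except OSError as exc:
    print("failed:", errno.errorcode[exc.errno])
```

Only the ENOSYS path reaches the older syscall, which is why a filter that answers EPERM breaks programs that would otherwise have run fine.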

runc implemented a workaround which I don't quite like. The PR is here:

The real fix should be inside libseccomp, but the issue has been pending for a while and likely will not be fixed soon:

An alternative solution that was discussed is to make all unknown syscalls return ENOSYS. Here is the discussion from the runc side. For a number of different reasons, runc did not want to make the switch right away back then. The main issue is that Docker and containerd, not runc, send the whitelist seccomp profile, so runc is reluctant to override it unilaterally.

The same proposal is passed to docker and containerd.

With all of that said, youki could potentially do this. I think people across the different discussions agree that EPERM is not the right choice and that ENOSYS is. The debate is over who should make the override.
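As a rough illustration of what such an override could look like, here is a Python sketch over a profile represented as a plain dict. `override_default_errno` is a hypothetical helper and the example profile is made up; the sketch assumes the OCI runtime spec's `defaultAction` and `defaultErrnoRet` fields and is not what any runtime actually ships.

```python
import copy
import errno


def override_default_errno(profile: dict) -> dict:
    """Return a copy of the profile whose default deny action yields ENOSYS.

    Only profiles whose defaultAction is SCMP_ACT_ERRNO and that do not
    already set defaultErrnoRet are changed; per-syscall rules are untouched.
    """
    patched = copy.deepcopy(profile)
    if (patched.get("defaultAction") == "SCMP_ACT_ERRNO"
            and "defaultErrnoRet" not in patched):
        patched["defaultErrnoRet"] = errno.ENOSYS
    return patched


# Made-up whitelist profile: everything not listed is denied.
example_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    ],
}

patched_profile = override_default_errno(example_profile)
print(errno.errorcode[patched_profile["defaultErrnoRet"]])
```

The sketch leaves an explicit `defaultErrnoRet` alone, so a profile author who really wants EPERM can still ask for it.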

Some other references and reading, if you want to go down this rabbit hole with me lol.

Reference: https://medium.com/nttlabs/ubuntu-21-10-and-fedora-35-do-not-work-on-docker-20-10-9-1cd439d9921

I am brain-dumping all of this here before I lose the details. The short-term fix is to update the oci-spec-rs crate. I want to sleep on the long-term question of returning ENOSYS for unknown syscalls.

@yihuaf
Collaborator

yihuaf commented Jun 9, 2023

Interestingly, podman just went ahead and set the default seccomp action's errno return code to ENOSYS. https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json#L4


3 participants