This is a list of blocked syscalls.
We prefer to block syscalls rather than allow them. If a syscall does not seem useful to our users, we block it out of an abundance of caution. Many of the blocked syscalls may well be safe, but we think that blocking by default is the safer option. If a blocked syscall is both safe and useful, we are open to discussing adding it to the allow list.
We will only allow syscalls that are unlikely to lead to exploits.
Particularly dangerous syscalls will be marked with
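The default-deny policy described above can be pictured as a membership test against an allow list. The following is only a minimal sketch in plain C: the names `allowed_syscalls`, `syscall_is_allowed`, and `check_policy` are hypothetical, the three listed syscalls are placeholders rather than the real allow list, and the actual filter is built with libseccomp rather than a lookup like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical allow list, for illustration only. The real list is longer. */
static const long allowed_syscalls[] = {
    SYS_read,
    SYS_write,
    SYS_exit,
};

/* Default-deny: a syscall is permitted only if it is explicitly listed. */
bool syscall_is_allowed(long nr)
{
    size_t count = sizeof(allowed_syscalls) / sizeof(allowed_syscalls[0]);
    for (size_t i = 0; i < count; i++) {
        if (allowed_syscalls[i] == nr) {
            return true;
        }
    }
    return false; /* anything not listed is blocked */
}

/* Usage sketch: spot-check the policy for one allowed and one blocked call. */
void check_policy(void)
{
    assert(syscall_is_allowed(SYS_read));
    assert(!syscall_is_allowed(SYS_chroot));
}
```

The point of the design is that forgetting to list a syscall fails closed: an unlisted syscall is blocked rather than silently permitted.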
Not needed by demo programs, and is helpful for restricting networking.
See accept.
Requires CAP_SYS_PACCT privileges, and does not seem to be useful for user processes.
It is unlikely that a user program would have a need to add a key to the kernel's keyring, so best to block it out of caution.
Needs CAP_SYS_TIME privileges. Tunes the kernel's clock, which could be used nefariously.
Unimplemented system call. This system call is not implemented in the Linux kernel, and always returns -1.
Berkeley Packet Filter. Used for network packet filtering and for seccomp. User programs should not need to do either.
Set capabilities of calling thread. This syscall can grant the thread privileges to make additional syscalls, for example CAP_SYS_CHROOT. Could be used maliciously.
Change root directory. This is the big one to block, since it can be used to escape the jail.
See adjtime.
See adjtime.
Not intentionally blocked, but because libseccomp does not provide a SCMP_SYS macro for __clone2, we are not able to allow it.
enableseccomp.c:23:5: error: ‘__SNR___clone2’ undeclared here (not in a function)
23 | SCMP_SYS(__clone2),
| ^~~~~~~~
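One way to keep a syscall list compiling when the installed headers do not know a given syscall is to wrap the optional entries in preprocessor guards. The sketch below is an assumption, not the project's code: it uses the `SYS_` constants from `<sys/syscall.h>` instead of libseccomp's `SCMP_SYS`, the names `candidate_syscalls`, `candidate_count`, and `candidate_at` are hypothetical, and `close_range` is only an illustrative choice of a newer syscall. With libseccomp the guard would presumably test the `__SNR_` name that appears in the error message, but whether that works with the version used in the container is untested here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>

/*
 * Hypothetical candidate list. Entries that may be missing from older
 * headers are wrapped in a guard so the file still compiles without them.
 */
static const long candidate_syscalls[] = {
    SYS_read, /* assumed to be defined by every supported header set */
#ifdef SYS_close_range /* illustrative newer syscall; skipped if unknown */
    SYS_close_range,
#endif
};

/* Number of entries that survived the preprocessor guards. */
size_t candidate_count(void)
{
    return sizeof(candidate_syscalls) / sizeof(candidate_syscalls[0]);
}

/* Bounds-checked access to the list. */
long candidate_at(size_t index)
{
    assert(index < candidate_count());
    return candidate_syscalls[index];
}
```

A guarded entry simply drops out of the list on older headers, so the syscall stays blocked until the headers catch up.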
Create a loadable kernel module. This was removed in Linux 2.6, so no reason to use it.
Probably no good reason for user code to unload kernel modules.
Not intentionally blocked, but this is a new syscall that currently doesn't compile with the version of libseccomp used in the container. The plan is to allow this once libseccomp supports it.
See epoll_create.
User code probably shouldn't be loading kernel modules.
Literally no man pages for this syscall. Syscall number 431 in asm/unistd_64.h on AMD64 Linux.
Literally no man pages for this syscall.
Literally no man pages for this syscall.
Literally no man pages for this syscall.
Obsolete system call.
Not intentionally blocked, but this is a new syscall that currently doesn't compile with the version of libseccomp used in the container. The plan is to allow this once libseccomp supports it.
See futex
Retrieve exported kernel and module symbols. Users probably don't need to do anything with kernel modules.
Unimplemented system call. This system call is not implemented in the Linux kernel, and always returns -1.
Initialize a kernel module. A user function shouldn't need to do this.
Set port IO permissions. Turning on requires CAP_SYS_RAWIO permissions.
No man pages for this syscall.
Similar to ioperm, but deprecated because it is much slower than ioperm.
Compare if two processes share kernel resources such as virtual memory, file descriptors, etc. Requires the same privileges as ptrace.
Load a new kernel that will run on reboot.
See kexec_file_load
Manage kernel keyring.
No man page entries, so defaulting to blocking.
No man page entries, so defaulting to blocking.
No man page entries, so defaulting to blocking.
Hard links that lead outside of the chroot are a way of escaping a chroot jail. Since the process is already jailed, I'm not aware of a way for this to be exploited. I also don't think that it will have any value to user functions, so I will block it out of an abundance of caution.
See link
Listen for a connection on a socket. Could be used to run a webserver from within the enclave, possibly spoofing sentinel.
Move all pages in another process to another set of nodes. I'm suspicious of something that can affect memory of other processes, so for now we're blocking this one.
Create special files, e.g., the files that can be found within /dev. One of the goals of capejail is to block access to certain special files such as /dev/nsm. We want to be sure that a malicious process will not be able to create these files itself.
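To make "special file" concrete, the sketch below checks whether a path refers to a character or block device, which is the kind of node found under /dev. The helper name `is_device_node` is hypothetical and is not part of capejail; it only illustrates the file type in question and does not create anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return true if `path` exists and is a character or block device.
 * Hypothetical helper, for illustration only.
 */
bool is_device_node(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0) {
        return false; /* missing or inaccessible paths are not devices */
    }
    return S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode);
}
```

On a typical Linux system, `is_device_node("/dev/null")` is true and `is_device_node("/")` is false.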
See mknod
Attach a filesystem to a target path. Users probably won't need to mount any filesystems.
See mount
No manpage entries.
See migrate_pages
Interface with NFS daemon. This system call no longer exists in Linux since version 3.1.
No man pages.
Performance monitoring. Can be used to spy on other processes.
Gets a duplicate of another process's file descriptor. Requires ptrace privileges.
See here
See pidfd_getfd
Change the root mount.
Fairly new system call that is not yet available in the libseccomp version used in the container. Should be safe to allow when we upgrade the container.
Transfer data between address spaces.
Transfer data between address spaces.
Allows caller to observe the target process. Can be used to extract memory from any other process.
This syscall is not implemented.
User code probably shouldn't need to worry about kernel modules.
Manipulate disk quotas
See quotactl
Reboot the system.
Request a key from the kernel's key management facility.
We are already using seccomp to restrict syscalls that the user code can make. We want to avoid user code being able to potentially exploit subsequent calls to seccomp to re-enable previously disabled syscalls. While I am not aware of such an exploit, I don't want to rule it out.
Unimplemented syscall.
User code probably won't need to set the filesystem group ID.
User code probably won't need to set the filesystem user ID.
Let's avoid a possible exploit of user code trying to change its group ID to that of another user on the system.
Let's avoid a possible exploit of user code trying to change its group membership.
User code should not need to change the hostname.
Move the calling thread into a different namespace. We certainly don't want user code to be able to escape its namespace.
Requires CAP_SYS_NICE to be able to get a more favorable priority.
User code should not be changing its group ID.
User code should not be changing its group ID.
User code should not be changing its user ID.
User code should not be changing its user ID.
User code should not be adjusting its resource limits.
Let's block this to avoid the user running a web server within the enclave.
Set system time of day. User code should not need to do this.
Users should not be able to change their UID. User code should be restricted to the capejail user.
User code should not be able to shut down the virtual machine.
Disable swap area. Users shouldn't need to configure swap.
Enable swap area. Users shouldn't need to configure swap.
I'm hesitant to allow links in case there is an exploit that sets and follows links to escape the chroot jail.
See symlink
Commit file to disk. Shouldn't be necessary.
See sync
See sync
No longer exists in current kernels, removed with Linux version 5.5.
Read from kernel message ring buffer. Users shouldn't need this.
Unimplemented syscall.
Unmount a volume. Users should not be unmounting volumes from the filesystem.
See umount
Unimplemented syscall.