put xsnap worker in a seccomp jail #2386
Comments
@warner came up with a simplification in recent discussion: The parent opens a snapshot file for writing and passes this open fd to the child; when the child gets the "write a snapshot" instruction, it writes to this fd and then exits. How often do we want / need to write a snapshot and keep going anyway? Maybe never, to start with? |
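A minimal sketch of that handoff, assuming Linux and a worker created with `fork()` so it inherits the parent's open fd. The names `SNAPSHOT`, `worker`, and `snapshot_via_inherited_fd` are hypothetical and are not part of xsnap; the real worker would stream XS snapshot data rather than a fixed string.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for real snapshot data. */
static const char SNAPSHOT[] = "fake-snapshot-bytes";

/* Worker side: enter the jail, write the snapshot to the inherited fd, exit. */
static void worker(int fd) {
    /* Best effort in this sketch: if seccomp is unavailable we carry on unjailed. */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
    size_t len = sizeof SNAPSHOT - 1, off = 0;
    while (off < len) {
        ssize_t n = write(fd, SNAPSHOT + off, len - off);
        if (n <= 0) syscall(SYS_exit, 1);
        off += (size_t)n;
    }
    /* Raw exit(2): glibc's _exit() uses exit_group(2), which strict mode forbids. */
    syscall(SYS_exit, 0);
    for (;;) {} /* not reached */
}

/* Parent side. Returns 0 if `path` ends up holding exactly the snapshot bytes. */
int snapshot_via_inherited_fd(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fd); return -1; }
    if (pid == 0) worker(fd);
    close(fd); /* the parent no longer needs its copy */

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;

    /* Read the file back to confirm what the worker wrote. */
    char buf[64];
    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    size_t len = sizeof SNAPSHOT - 1;
    return (n == (ssize_t)len && memcmp(buf, SNAPSHOT, len) == 0) ? 0 : -1;
}
```

Because the worker exits right after writing, the parent can treat a clean exit status as the end-of-snapshot signal, so no framing is needed on the fd.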
XS recently started using the |
I'm liking this option more and more. We still need seccomp-bpf to enable the
I think it's the dominant use case, actually. Currently, we exclusively do snapshot-and-keep-going. |
Two other technologies we might apply:
Both are not native to linux, though: https://justine.lol/pledge/ is a project which aims to port the OpenBSD feature to Linux, in terms of seccomp-bpf. I think that can be done from userspace (i.e. the validator operator doesn't need to install a kernel module or something), but I'm not sure. https://docs.kernel.org/userspace-api/landlock.html is probably a kernel module and seems to be what backs |
I just discovered bubblewrap, which seems to provide a pretty thorough sandbox as a parent process. It might save some development time. OTOH, maybe it would just add complexity/risk. |
testing ideas with @siarhei-agoric : add a compile-time switch which exposes some benign syscall like We need to make sure libc doesn't cache the getpid (or find some other benign syscall to use). We do not want to share a powerful syscall like
Note: with vdso, The xsnap process could perform destructive testing of whether seccomp is available by spawning a child process at startup, wait for it to perform some tests (possibly killing itself in the process, by violating the rules), and then enabling/not seccomp itself based on the results. |
The getpid() cache was removed in glibc 2.25 [1, 2]. However, getpid() could still be implemented via the vDSO, which would also bypass the actual syscall [1]. Both the cache and the vDSO can be avoided by invoking syscall(SYS_getpid) directly [3]. |
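The destructive probe and the raw `syscall(SYS_getpid)` idea from the comments above could be combined roughly as follows. This is a sketch assuming Linux with glibc; `probe_seccomp_strict` and the exit codes 42 and 43 are made up for illustration.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Spawns a sacrificial child that enters strict mode and then makes a benign
 * but forbidden syscall. Returns:
 *    1  the child was killed, so strict mode is available and enforced
 *    0  strict mode could not be enabled on this system
 *   -1  anything else (fork failure, or the jail did not fire)
 */
int probe_seccomp_strict(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42);         /* not jailed, so a normal exit is fine */
        syscall(SYS_getpid);   /* raw syscall: skips any libc cache or vDSO path */
        syscall(SYS_exit, 43); /* reached only if the kernel let getpid through */
        for (;;) {}            /* not reached */
    }
    assert(pid > 0); /* parent path from here on */

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 42) return 0;
    return -1;
}
```

A worker could call something like this once at startup and decide whether to jail itself based on the result. A unit test could assert that the result is 1 on Linux CI.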
my assignment here looks pretty stale. I'm removing it. |
Snapshot streaming was addressed in #6363, so there is currently no remaining In the future we're still considering integrating a SQLite DB directly in the xsnap process, so that would re-introduce fs syscalls. |
if I understand correctly, seccomp operates on a per-thread basis, so it is conceivable to have a separate thread to handle the native SQLite interface in C. |
What is the Problem Being Solved?
`seccomp(2)` is a Linux kernel facility that allows a process to voluntarily give up access to nearly everything. Once invoked (in the original "strict" mode), the process can only use `read()`, `write()`, `_exit()`, and `sigreturn()`. Any other syscall causes the process to be killed. The program can only read/write pre-existing file descriptors; without `open()` it cannot create any new ones. The lack of `recvmsg()` means it cannot accept new file descriptors over a unix-domain socket either.

This limits the damage a compromised process could do to the system that hosts it. Its authority is limited to whatever actions will be taken on its behalf by the code at the other end of those file descriptors.
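To make the strict-mode behavior concrete, here is a small sketch, assuming Linux and glibc. A child process jails itself, writes to a pipe that was opened before the jail, and then attempts `open()`. The function name `strict_mode_demo` and the exit codes are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the jailed child was killed when it called open(), 0 if strict
 * mode could not be enabled, and -1 on any other outcome. Whatever the child
 * wrote to the pipe before dying is copied into msg (NUL-terminated).
 */
int strict_mode_demo(char *msg, size_t cap) {
    assert(msg != NULL);
    int p[2];
    if (cap == 0 || pipe(p) != 0) return -1;
    msg[0] = '\0';

    pid_t pid = fork();
    if (pid < 0) { close(p[0]); close(p[1]); return -1; }
    if (pid == 0) {
        close(p[0]);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42);                   /* no jail, so a normal exit is fine */
        if (write(p[1], "hello", 5) != 5) /* allowed: the fd predates the jail */
            syscall(SYS_exit, 44);
        open("/dev/null", O_RDONLY);     /* forbidden: the kernel sends SIGKILL */
        syscall(SYS_exit, 43);           /* reached only if the jail failed */
        for (;;) {}                      /* not reached */
    }

    close(p[1]);
    size_t off = 0;
    ssize_t n;
    while (off + 1 < cap && (n = read(p[0], msg + off, cap - 1 - off)) > 0)
        off += (size_t)n;
    msg[off] = '\0';
    close(p[0]);

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 42) return 0;
    return -1;
}
```

Note that the child leaves via a raw `exit` syscall rather than glibc's `_exit()`, because the latter uses `exit_group()`, which strict mode does not permit.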
Our `xsnap` vat worker is a good candidate for this sort of isolation. The parent process sends it pass-by-copy "deliveries" to begin each crank. The xsnap worker executes the delivery, updates its internal state, and emits some number of swingset-syscalls (not to be confused with the linux-syscalls that `seccomp(2)` restricts). Those swingset-syscalls are expressed as more pass-by-copy messages sent over a pipe to the parent process, as are their return values.

A memory safety error in the XS engine (stack overflow, buffer overflow, use-after-free) would allow supposedly-confined JavaScript vat objects to compromise the entire `xsnap` process. Without something like `seccomp(2)`, each process gets unrestricted access to the entire user account, which would allow it to modify the entire swingset state file (giving it full control over the swingset kernel and all the other vats it manages), as well as any secret access tokens or privileged hardware access available to that user account.

But with the isolation, such memory errors would only give the attacking object control over the same set of authorities as the vat worker as a whole. This includes forging messages from other objects within the same vat (as well as observing their internal secrets, if the vat could keep secrets to begin with, which is not true for chain-based vats). It also gets access to any nondeterminism that was meant to be reserved for the liveslots layer (e.g. access to WeakRef) or the supervisor installed into the worker.
Description of the Design
To implement this, we'd have `xsnap`'s C-based supervisor invoke `seccomp()` shortly after worker startup. At this point, the file descriptors are already opened, and the worker has switched into "react to messages over the pipe" mode.

We currently have `xsnap` write its snapshot files directly to disk (the piped message includes a filename to which the snapshot should be written), which obviously has to change, since `seccomp(2)` forbids `open()`. We must either send the snapshot data over the same message pipe used for commands and swingset-syscalls, or prepare a second pipe used just for snapshot data. The latter might be easier to manage, especially because the `xsnap` snapshot writing process does not know ahead of time how large the snapshot will be (you give it a `write()` function and it gets called a zillion times with tiny bits of data until the object graph has been traversed). I'm thinking the parent uses the command pipe to say "please write a snapshot", then starts copying any data from the snapshot pipe to a temporary file, counting bytes as it goes. Eventually the child sends a response on the command pipe that says "I finished writing NN bytes to the snapshot pipe", at which point the parent waits until the snapshot pipe's counter reaches that target, then closes and renames the file.

(the alternative would be to add framing to the messages sent over the snapshot pipe, so we could distinguish an EOF frame, but that would add some length-prefix overhead to each `write()`, and I expect those writes to be numerous and tiny, so the overhead could be significant)

(if we could open a new snapshot pipe for each act of snapshotting, we could use the pipe's EOF event for framing, but under `seccomp(2)` you can't)

At `xsnap` startup, we currently use a command-line argument to name a snapshot file to read from. We could keep doing this: we don't invoke `seccomp` until after we've finished loading the snapshot. For symmetry we could implement it the other way (use the snapshot pipe to write data into the child), but it seems unnecessary.

We might have some diagnostic instrumentation (like elapsed-time measurements) which would need to change under `seccomp`, because `gettimeofday()` is also forbidden. Some JS engines call functions like this constantly, but I think XS does not (they're conveniently conservative about platform expectations: the microcontrollers it runs on do not necessarily have a real-time clock at all).

seccomp-bpf
`seccomp` has a newer mode, in which the caller provides a Berkeley Packet Filter-syntax program, and this program gets to evaluate and approve each syscall. In this mode, we could allow direct writes to snapshot files while denying all other forms of IO. While possible, I'd prefer the strict mode, because:

`safePrefix/../../../etc/passwd` shenanigans

Security Considerations
This ticket is entirely about improving the security of our system by reducing authority given to the vat worker. Unless we manage to introduce a buffer overflow while modifying the C code to invoke `seccomp`, the result should be strictly safer than before. The biggest risk is accidentally killing the process when it does some benign-but-unexpected operation.

Compatibility Considerations
`seccomp()` is Linux-specific. If we use it, our validators will be limited to running on Linux kernels. It would be simple to sense the platform we're running on and only enable `seccomp` if it's available, removing that limitation, although we might prefer to mandate the improved security. We certainly want to enable developers to work on non-Linux platforms, but we could also accomplish that by just not using XS workers during development (which has other benefits, most notably for debugging). OT3H to get accurate metering information (which matches what happens on-chain), there should at least be an option for local development to use XS.

FreeBSD has a similar notion (I think "jail" is the keyword), but I'm not very familiar with it. I have no idea what Windows might do. Linux has a separate feature, misleadingly named "capabilities", which effectively breaks up `root` into a bitmap of allowed authorities, but linux-capabilities are rather coarse and probably wouldn't help us much. Linux (and some other Unix-ish systems) also has a notion of "namespaces", which can be used to isolate a process to a particular corner of the filesystem, as well as denying it the ability to observe other processes/etc. This could be used for isolation while still allowing limited file access. But our vat worker's platform needs are so modest, it seems appropriate to just lock down everything except the message pipe.

Test Plan
It would be nice to have a unit test which enables the jail, then attempts to make an illegal `open()` and watches the process get killed. However, we'd have to include special code in the worker to expose any such syscalls for the test to exercise, and that code would not serve any other purpose but to enable such a test. It's probably worth doing, but feels a bit weird.

If we elect to allow workers to run without `seccomp`, we should have a test for that too. This test will probably need to run on a non-Linux CI system to exercise both cases properly.

cc @dckc @michaelfig @erights