RFE: distinguish unknown syscalls #286
Comments
Hi @srd424. I want to make sure I understand what you are asking for in this issue ... it sounds to me that you would basically like to know if libseccomp "knows" about a given syscall, regardless of if that particular syscall is implemented on that arch/ABI, yes? If so, I believe you should be able to use |
That might make the filters .. long-winded! What I was hoping to do is to have a filter rule compare the syscall number to the highest known, and if greater, return However looking at the details for |
It might or might not be reasonable functionality to add to libseccomp as I guess the original problem may occur for multiple users of the library. It looks like it would (just?) be a question of generating a slightly more sophisticated default action.. |
I'm not quite sure what you mean here ... ? The call to
Okay, I think I'm beginning to understand what you are asking for now. You want the filter itself, not the application code, to take a certain action (return |
This comment in the second thread mentioned above made me smile 👍
After reading through the threads you mentioned, I think I'm on the same page. If someone (libseccomp, nspawn, whoever) could return I think the request is reasonable. I need to think some more if libseccomp can meet these needs, but I have no objections at this point. There are definitely opportunities to improve the end-user experience here. Thanks for the RFE. |
Yes, that sounds right. The opinion of the systemd folks is that EPERM is reasonable most of the time for denylisted syscalls, presumably as it conveys "not allowed" to the end user / admin. Hence the idea to distinguish between "new" and "old" syscalls and return ENOSYS for anything unrecognised. I assume we don't want to enumerate and test every single syscall in the BPF for performance reasons, so tracking a high water mark for the known syscall numbers per-arch seemed like a "best effort" way to go. Would be interesting to know what docker, podman, lxc etc. do with their seccomp filtering, to see if they would benefit. In the meantime I've PR'd a patch for nspawn that would allow logging of seccomp events, which would make debugging a little easier.
I agree with @drakenclimber, this request sounds reasonable, I think I just need some more time to think about possible solutions :) At a pretty basic level, this is similar to RFE #11, and in the end that may be the easiest way to implement this in a way that isn't terrible for applications: an application can specify a maximum supported kernel API version, e.g. v5.8 (obviously tokenized), as well as a given action for anything beyond and then libseccomp handles the rest. Would that work for you guys @srd424? |
Hi, this was discussed also in systemd/systemd#16739.
This would work great. In systemd/systemd-nspawn we'd want to return custom errnos for any explicitly allow-listed and deny-listed syscalls, EPERM for any others in the "supported kernel API version", and ENOSYS for any new ones. I think the implementation wouldn't be too complicated. For example for amd64, "known" syscalls can be expressed as |
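The three-tier policy described in this comment can be sketched as follows, in Python for readability rather than as BPF. This is only an illustration: `decide`, `is_known`, and `KNOWN_RANGES` are hypothetical names, and the ranges are sample values, not libseccomp's real per-arch table.

```python
import errno

# Illustrative "known" ranges (inclusive); NOT libseccomp's real table.
KNOWN_RANGES = [(0, 334), (424, 439)]

def is_known(nr, ranges=KNOWN_RANGES):
    """True if syscall number nr falls inside any known range."""
    return any(lo <= nr <= hi for lo, hi in ranges)

def decide(nr, allow, deny):
    """Return 0 to allow, or the errno the filter would return.

    allow: set of explicitly allowed syscall numbers
    deny:  dict mapping explicitly denied numbers to a custom errno
    """
    if nr in allow:
        return 0
    if nr in deny:
        return deny[nr]
    # Not listed: EPERM if the number was known when the filter was built,
    # ENOSYS otherwise so that libc fallbacks can kick in.
    return errno.EPERM if is_known(nr) else errno.ENOSYS
```

With these sample ranges, an unlisted number inside a known range gets EPERM, while a number in the hole or above the top of the table gets ENOSYS.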
#94 could be related too. |
I'm under-caffeinated this morning, but would having the ENOSYS handling then give us the possibility to turn large allowlists into small denylists for a possible performance win as well? |
As you allude to, the actual BPF is going to be both arch/ABI and kernel version specific. In the x86_64 example above the BPF isn't going to be too bad, but we may not be that lucky for other arches/versions. Regardless, this is now two issues that are effectively requesting the same thing so I think it's something we will want to do ... I'm just not going to start jumping up and down about how easy it is going to be just yet ;)
Sort of yes, sort of no. It involves ranges, but #94 is about caller specified argument ranges (which is still something I think we want to do, the PR just came in at a bad time and I think the API needs some tweaking) whereas what we are talking about are implicitly created syscall ranges which are generated by the library itself.
From an application perspective, e.g. systemd, if you are trying to block "new" syscalls then yes ... assuming we're talking about the same thing :) |
To be more specific .. at the moment anyone trying to securely block certain syscalls effectively has to allowlist, because you can't be sure what syscalls a newer kernel might add. If we can request libseccomp to automatically block unknown syscalls, that means we can safely switch to a small denylist instead? |
I sincerely hope we can get there, as that would be an absolutely awesome feature. For example, Docker is currently employing an allowlist and their default list is now ~240 syscalls (and always growing). The performance impact of such a large list can be prohibitive. Note that it can be somewhat mitigated by using the binary tree feature we added in v2.5. |
I don't see how this could work. Unknown to libseccomp and unknown to the denylist author will usually mean different things. Which means that the conceptual issue won't go away even if libseccomp has a clearer picture of the supported system calls internally. |
Good point - I guess we'd need well defined sets tagged by kernel version for that to work, which does seem to be being discussed a bit. |
The tables are fairly continuous:
>>> import numpy as np
>>> l = {int(s[1]):s[0] for s in (s.split() for s in open('syscalls-x86_64').readlines()) if len(s)>1}; x = np.array(sorted(l.keys())); np.diff(x)
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1])
>>> l = {int(s[1]):s[0] for s in (s.split() for s in open('syscalls-alpha').readlines()) if len(s)>1}; x = np.array(sorted(l.keys())); np.diff(x)
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1,
1, 1, 1, 2, 1, 1, 1, 3, 12, 3, 3, 1, 11, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 39, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 1, 1, 1])
>>> l = {int(s[1]):s[0] for s in (s.split() for s in open('syscalls-arm').readlines()) if len(s)>1}; x = np.array(sorted(l.keys())); np.diff(x)
array([1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 2, 1, 2, 3, 4,
1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1,
1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 3,
1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 3, 3, 1, 1, 2, 1, 1, 1,
1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> l = {int(s[1]):s[0] for s in (s.split() for s in open('syscalls-riscv64').readlines()) if len(s)>1}; x = np.array(sorted(l.keys())); np.diff(x)
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 16, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 130, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
I implemented a filter of "known" syscalls for systemd-nspawn in systemd/systemd#16819. systemd/systemd@158e30f has a dump of some libseccomp-generated programs. Those dumps are long, so I won't repeat them here. SCMP_FLTATR_CTL_OPTIMIZE makes the program more efficient, but also longer. Things could be made ~50 times shorter by using range comparisons.
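To make the range idea concrete, here is a small sketch that collapses a sorted table of syscall numbers into inclusive ranges. Each `np.diff` entry greater than 1 starts a new range, so a mostly continuous table needs only a handful of comparisons. `to_ranges` and the sample numbers are hypothetical, not taken from libseccomp.

```python
def to_ranges(numbers):
    """Collapse syscall numbers into inclusive (start, end) runs."""
    ranges = []
    for nr in sorted(set(numbers)):
        if ranges and nr == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], nr)   # extend the current run
        else:
            ranges.append((nr, nr))            # start a new run
    return ranges

# Hypothetical table: a dense block, a hole, then a second dense block.
sample = list(range(0, 10)) + list(range(20, 25))
print(to_ranges(sample))  # [(0, 9), (20, 24)]
```

A filter built from such ranges would need two bounds checks per range instead of one equality test per syscall.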
On Linux the major C libraries expect that syscalls that are blocked from running in the container runtime return ENOSYS to allow fallbacks to be used. Returning EPERM by default is not useful particularly for syscalls that would return EPERM for actual access restrictions e.g. the new faccessat2. The runtime-spec should set the standard and recommend ENOSYS be returned just like a kernel would that doesn't support that syscall. This allows C runtimes to fall back on other possible implementations given the userspace policies. Please see the upstream discussions: https://lwn.net/Articles/738694/ - Discusses fragility of syscall filtering. opencontainers/runc#2151 - glibc and musl request ENOSYS return for unknown syscalls. systemd/systemd#16739 - Discusses systemd-nspawn breakage with faccessat2. systemd/systemd#16819 - General policy for systemd-nspawn to return ENOSYS. seccomp/libseccomp#286 - Block unknown syscalls and return ENOSYS.
I only just found this thread, just chiming in to say that I've been thinking on similar lines and this is definitely something Docker/runc would like to have solved as well. Doing it with a maximum kernel version is probably the nicest way of doing it, because it means profile writers (and container runtimes) don't need to track syscalls added out-of-order or what the newest syscall was at the time of writing the profile. |
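The out-of-order problem mentioned here can be illustrated with a short sketch. The table below holds sample x86_64 values (name, number, kernel version that added the syscall) that should be double-checked before relying on them; `known_as_of` and `high_water_mark` are hypothetical helpers, not a libseccomp API.

```python
# Illustrative x86_64 data: (name, number, kernel version that added it).
# Treat these as sample values to double-check, not an authoritative table.
SYSCALLS = [
    ("close_range", 436, (5, 9)),
    ("openat2", 437, (5, 6)),
    ("pidfd_getfd", 438, (5, 6)),
    ("faccessat2", 439, (5, 8)),
]

def known_as_of(max_version, table=SYSCALLS):
    """Names of syscalls introduced at or before max_version."""
    return {name for name, _nr, added in table if added <= max_version}

def high_water_mark(max_version, table=SYSCALLS):
    """Largest syscall number introduced at or before max_version."""
    return max(nr for _name, nr, added in table if added <= max_version)
```

With a cutoff of `(5, 8)`, the high water mark is 439, yet number 436 lies below it and is not in the known set. A version-tagged table handles this case, while a plain "highest number" check would not.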
Triggered by a discussion (in June & Aug) on systemd-devel ..
systemd-nspawn chooses to return EPERM for non-whitelisted syscalls. However, this causes problems in cases like openat2, where libc checks for ENOSYS and falls back to a different implementation.

It seems to me a 'mostly right' solution could be to check if the syscall number falls within the range of defined syscalls that existed at the time seccomp was built. I'm sure there are corner cases (I know some archs do weird things), but if the tools that parse syscalls.csv etc could generate a simple #define for the max known syscall number that might be useful?
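A rough sketch of what such a generator could look like. It assumes a simplified, hypothetical `name,number` input rather than the real layout of syscalls.csv, and both the `max_known_define` helper and the `MAX_KNOWN_SYSCALL` macro name are made up for illustration.

```python
import csv
import io

def max_known_define(csv_text, macro="MAX_KNOWN_SYSCALL"):
    """Build a '#define' line for the highest syscall number in csv_text.

    Assumes hypothetical 'name,number' rows; rows with a missing or
    non-numeric number are skipped.
    """
    numbers = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[1].strip().isdigit():
            numbers.append(int(row[1]))
    if not numbers:
        raise ValueError("no syscall numbers found")
    return f"#define {macro} {max(numbers)}"

sample = "read,0\nwrite,1\nopenat2,437\nnot_on_this_arch,\n"
print(max_known_define(sample))  # #define MAX_KNOWN_SYSCALL 437
```

A real generator would need one such value per arch/ABI, since each has its own numbering.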