Introduce support for syscall filtering in containers #263
Replaces [RFC] Introduce support for syscall filtering in containers #237. Matt is back at school for the semester, so I want to drive this one home. I added a "seccomp" compilation option for platforms that do not support seccomp. I also added a test.go, but I can't seem to get it to run. I hacked up the Dockerfile to pass in the --tag seccomp call, but this is far from ideal.
@@ -68,6 +68,9 @@ type Config struct {
 	// RestrictSys will remount /proc/sys, /sys, and mask over sysrq-trigger as well as /proc/irq and
 	// /proc/bus
 	RestrictSys bool `json:"restrict_sys,omitempty"`
+
+	// Syscalls which will be restricted on container start
+	RestrictSyscalls []string `json:"restrict_syscalls,omitempty"`
Do you think we want to go ahead and create a struct for the values here, as we probably want to do things like prevent some flags to certain syscalls like clone?
OK, I switched to a struct that takes either an Architecture or a Syscall, with optional Args.
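For readers following along, here is a minimal sketch of what such a struct could look like. It is not the code from this PR: the type names, field names, JSON keys, and the `masked_eq` operator string are all assumptions made for illustration, and the validation rule simply encodes "either an Architecture or a Syscall, with optional Args" as described above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SyscallArg is a hypothetical constraint on one syscall argument.
// Field names and the Op vocabulary are illustrative, not from the PR.
type SyscallArg struct {
	Index uint   `json:"index"` // position of the argument (0-based)
	Op    string `json:"op"`    // comparison operator, e.g. "masked_eq"
	Value uint64 `json:"value"` // value to compare against
}

// SyscallRestriction is a hypothetical config entry that names either an
// architecture or a syscall, the latter with optional argument constraints.
type SyscallRestriction struct {
	Architecture string       `json:"architecture,omitempty"`
	Syscall      string       `json:"syscall,omitempty"`
	Args         []SyscallArg `json:"args,omitempty"`
}

// Validate enforces the "exactly one of Architecture or Syscall" rule and
// rejects argument constraints on architecture-only entries.
func (r SyscallRestriction) Validate() error {
	switch {
	case r.Architecture == "" && r.Syscall == "":
		return errors.New("restriction needs an architecture or a syscall")
	case r.Architecture != "" && r.Syscall != "":
		return errors.New("restriction cannot set both architecture and syscall")
	case r.Architecture != "" && len(r.Args) > 0:
		return errors.New("args only apply to syscall restrictions")
	}
	return nil
}

func main() {
	// 268435456 is 0x10000000, the value of CLONE_NEWUSER on Linux.
	raw := `[{"architecture":"x86_64"},
	         {"syscall":"clone","args":[{"index":0,"op":"masked_eq","value":268435456}]}]`
	var rs []SyscallRestriction
	if err := json.Unmarshal([]byte(raw), &rs); err != nil {
		panic(err)
	}
	for _, r := range rs {
		fmt.Printf("%+v valid=%v\n", r, r.Validate() == nil)
	}
}
```

Keeping Args on the syscall entry is what would allow restricting particular clone flags rather than the whole call, which is the use case raised in the review comment.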
The import path is giving me a 404: http://sourceforge.net/seccomp. I cannot go get it.
Same here. Also, I can't find the project on sourceforge.net.
I was just sort of making this part up. The library is actually at http://sourceforge.net/projects/libseccomp. Should I just specify this as sourceforge.net/projects/libseccomp?
@rhatdan That looks like the repo for the actual libseccomp code, not the Go wrapper library. I guess the requirement would be to put it anywhere such that doing a go get on the URL works with it. Putting it on GitHub might be the easiest. I am not sure if go get works with sourceforge.net URLs or not.
@mrunalp The problem with that is that the libseccomp maintainer, Paul Moore, wants to control the Go code. I will ping him to join this conversation and see if we can get it in a proper place.
@rhatdan Ahh, okay. SGTM.
The golang bindings do currently live inside the sf.net git repo, they are in the "working-golang" branch and tagged "go1" so that they can be fetched with the following command line:
@pcmoore When I try that go get, it prompts me for sf.net username/password. Is there a way to do an anonymous checkout? Also, I am not sure if there is a requirement for vendored code to be in master branch or not. @crosbymichael would know.
@mrunalp I don't know what to say, it works for me without any credentials:
@mrunalp @crosbymichael As far as the git branch goes, the bindings live in the working-golang branch as opposed to the master branch because they are still a work in progress and I'm not yet comfortable enough with the API to "release" the bindings. Once the bindings' API is stable I'll merge the working-golang branch into master.
@pcmoore I tried it on my laptop and go get worked. It did not work from inside my development VM, though. I guess that should be okay.
So in the meantime what are we supposed to do with this PR? Also it is common that you can
ssh-add -l It just hangs for me.
I will change the path to whatever is agreed upon, but I would like to have comments on the "Struct". Also Paul, I need to be able to handle the Syscall + Param calls. I think we will need Go bindings for this.
@rhatdan Regarding the hang, it does take some time for me, likely due to "go get" building the library?
@rhatdan Regarding bindings, we'll want to have golang bindings for everything that libseccomp supports, see the Python bindings. I'm just stuck dealing with the steaming pile that is audit at the moment, I likely won't have time to work on this for a bit.
+1 for
This PR introduces the ability to filter system calls on a per-container basis on Linux, using libseccomp to support multiple architectures.
This adds another layer of security between containers and the kernel. System calls which are unnecessary in a container or problematic from a security perspective can be restricted to prevent their use. Most of the truly problematic syscalls are already restricted by dropping capabilities; this adds an additional, finer-grained layer of protection.
There's a similar feature present in LXC already, with the significant difference that LXC uses a whitelist of system calls, whereas these patches use a blacklist. The blacklist approach ensures no difference in functionality to clients not explicitly aware of seccomp support (the restricted syscalls list in the container config is left empty, and the seccomp init function exits without taking action).
This PR adds a vendored library dependency (Go bindings for libseccomp) and a build dependency on libseccomp >= v2.1. The actual changes to libcontainer are fairly minimal; most of the delta is in the libseccomp bindings.
Presently missing: integration tests, documentation
Docker-DCO-1.1-Signed-off-by: Matt Heon <[email protected]> (github: mheon)
Docker-DCO-1.1-Signed-off-by: Dan Walsh <[email protected]> (github: rhatdan)
Hi, is there any reason why this uses a blacklist approach only? It's kind of weird. Should the user not be able to specify whether they want to run in blacklist or whitelist mode?
@jandre We could support both, but the whitelist would be a lot harder to put together.
Putting a whitelist together is certainly not an easy thing (and it will vary from app to app), but I imagine you could put seccomp into a "permit, but log unexpected behaviors" mode (similar to how you would train AppArmor or SELinux profiles by putting them in complain mode). There's a quick demo of this here: http://outflux.net/teach-seccomp/ (see syscall-reporter). Then, you could create tools that read the logfile and build a profile automagically (e.g., something like the AppArmor easyprof tool). Anyway, I think having it in blacklist mode by default is a good idea, but allowing the user to put it in whitelist mode gives a lot more flexibility. Let's say you were leveraging containers as a true 'sandboxed' compute cluster; this would be a great step toward making that happen securely in the future.
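As a rough illustration of the "read the logfile and build a profile" idea, the sketch below collects the distinct syscall names seen in a log and returns them as a sorted candidate whitelist. The log format (whitespace-separated fields, one of which is `syscall=<name>`) and the function name `profileFromLog` are assumptions made for this example; neither syscall-reporter nor the audit log is claimed to emit exactly this.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// profileFromLog scans log text for fields of the form "syscall=<name>"
// (a hypothetical format) and returns the distinct names, sorted.
func profileFromLog(log string) []string {
	const prefix = "syscall="
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		for _, field := range strings.Fields(sc.Text()) {
			if strings.HasPrefix(field, prefix) && len(field) > len(prefix) {
				seen[strings.TrimPrefix(field, prefix)] = true
			}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	log := "pid=1 syscall=read\npid=1 syscall=write\npid=2 syscall=read\n"
	fmt.Println(profileFromLog(log))
}
```

A profile built this way only covers the code paths exercised while logging, so it is a starting point for a whitelist rather than a complete one.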
Yes, I agree, although it is difficult to figure out when you have done enough testing before putting it in non-testing mode. SELinux and AppArmor are a little bit simpler in that I think the access is a little easier to understand. I don't think the patch currently allows us to specify what happens when a process is not allowed a syscall; I think this version will return EPERM. @mheon do you know?
I have a feeling that the default set of enabled syscalls should be a whitelist, and people can add/remove from that set (just as with capabilities). We've been burned by blacklists before, let's not do it again.
@jandre Unfortunately the permissive/reporting mode for seccomp isn't really the same as what exists with SELinux/AppArmor/etc.; there are a number of limitations, the most significant being that the individual container applications will likely need to be made aware of seccomp and ensure they don't overwrite the necessary signal handlers. There is currently no good permissive/reporting mode for seccomp filters.
@cyphar It is not that easy, and really syscall filtering is not that similar to capabilities or SELinux or AppArmor, which are all about whitelisting. All syscall filtering is doing is reducing the attack surface on the kernel. There are going to be huge holes that cannot be closed, like ioctl. The bottom line with this technology is that if we choose a blacklist, people will use it. We might even be able to slowly increase the blacklist. If we choose a whitelist, we will break lots of apps, and people will not take the time to figure out what to add, so they will run in --privileged mode, turning off ALL security including SELinux, Capabilities, Seccomp, UserNamespace... The other tools I have looked at that use seccomp (systemd-nspawn, qemu, and another whose name I cannot remember) use a blacklist. If you can really define the application that you are going to run, you can really lock it down with tools like SELinux/AppArmor and Seccomp, but when you have a general-purpose tool like Docker containers, I would argue it is almost impossible to successfully define a limited whitelist.
I 100% agree the default should be a blacklist (or maybe just have seccomp filtering disabled by default), but what I don't understand (and this is just simply a flag in libseccomp init) is why the user can't toggle between blacklist mode and whitelist mode, should they need to. You are making it seem like we have to do one or the other. lxc containers allow you to specify this in the first line of their config. See: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v2-blacklist.conf vs: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v1.conf I think in the general use case, you have to be as permissive as possible in order to avoid breaking existing applications. But in order to enable more sandbox-like use cases for Docker containers for those who wish to configure it, you will want the ability to throttle this as tightly as possible.
Agreed, we should make the seccomp patch handle both models. Hopefully libseccomp makes it easy to do this.
@rhatdan It's easy, you just pick the default action (allow for a blacklist, kill for a whitelist) when you call seccomp_init().
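To make the "default action" point concrete, here is a small pure-Go model of the decision logic. It does not call libseccomp or its Go bindings; the `Filter` type, the constructor names, and the choice of an errno action for blacklisted calls are assumptions for illustration (the thread above only guesses that the current patch returns EPERM). The only thing taken from the comment is the pairing: default allow for a blacklist, default kill for a whitelist.

```go
package main

import "fmt"

// Action models what happens to a syscall. These are illustrative
// stand-ins, not libseccomp constants.
type Action int

const (
	Allow Action = iota // let the syscall through
	Errno               // fail the syscall with an error (assumed: EPERM)
	Kill                // terminate the calling process
)

func (a Action) String() string {
	return [...]string{"allow", "errno", "kill"}[a]
}

// Filter is a hypothetical model: a default action plus per-syscall rules.
type Filter struct {
	Default Action
	Rules   map[string]Action
}

func newFilter(def, rule Action, names []string) Filter {
	f := Filter{Default: def, Rules: make(map[string]Action, len(names))}
	for _, n := range names {
		f.Rules[n] = rule
	}
	return f
}

// NewBlacklist allows everything except the listed syscalls.
func NewBlacklist(denied ...string) Filter { return newFilter(Allow, Errno, denied) }

// NewWhitelist kills on everything except the listed syscalls.
func NewWhitelist(allowed ...string) Filter { return newFilter(Kill, Allow, allowed) }

// Decide returns the rule for a syscall, falling back to the default action.
func (f Filter) Decide(name string) Action {
	if a, ok := f.Rules[name]; ok {
		return a
	}
	return f.Default
}

func main() {
	filters := []Filter{NewBlacklist("kexec_load"), NewWhitelist("read", "write")}
	for _, f := range filters {
		for _, name := range []string{"read", "kexec_load"} {
			fmt.Printf("default=%v %s -> %v\n", f.Default, name, f.Decide(name))
		}
	}
}
```

In this model an empty blacklist allows every call, which matches the PR description's claim that clients unaware of seccomp see no change in behavior.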
OK, I will have @mheon update the patch.
@rhatdan Surely if we can create a blacklist of syscalls, and we know what syscalls currently exist, then we can create a whitelist of syscalls? I understand that it might be an issue to actually create the whitelist, but when you're dealing with syscall filtering you're going to break some applications anyway. If you have the ability to specify which syscalls to explicitly allow (or disallow), then maintainers can trivially fix their runconfig options (it's just a matter of After all:
@cyphar It is important not to underestimate the difficulty of creating, and maintaining, a syscall whitelist for an application.
@pcmoore For some reason, my edit to that comment isn't being shown. But yeah, I understand that maintaining a syscall whitelist would not be an easy task. As long as we support both modes, people who have the time and resources to maintain a syscall whitelist can do so.
@cyphar, I would argue that your statement, "then maintainers can trivially fix their runconfig options (it's just a matter of straceing the binary)", is the problem. This is not the way Docker works. Docker does not support an alternate runconfig per image. It is one size fits all, although this is something I wish to fix in the future. When someone writes the apache image, they do not know what applications the user will run on top of the image. So how do they define a whitelist of which syscalls can or cannot be run within the container? When the end user or admin of the container image runs the app and it fails, and it will, they will not diagnose the problem by looking in some strange log file like /var/lib/audit/audit.log to realize the "foobar" syscall is blocked, then change their docker run command to include --security-opt seccomp:allow:foobar to get their app to run, only to find out that they also needed the ABC syscall. Lather, rinse, repeat; what they will Theoretically, if we had a way to allow the image to specify the command line to be used when the container is run, and the developer of the image did enough testing, then a whitelist approach might be possible.
Replacing this pull request with #384