Introduce support for syscall filtering in containers #263

rhatdan · 2014-11-14T16:03:07Z

This PR introduces the ability to filter system calls on a per-container basis on Linux, using libseccomp to support multiple architectures.

This adds another layer of security between containers and the kernel. System calls which are unnecessary in a container or problematic from a security perspective can be restricted to prevent their use. Most of the truly problematic syscalls are already restricted by dropping capabilities; this adds an additional, finer-grained layer of protection.

There's a similar feature present in LXC already, with the significant difference that LXC uses a whitelist of system calls, whereas these patches use a blacklist. The blacklist approach ensures no difference in functionality to clients not explicitly aware of seccomp support (the restricted syscalls list in the container config is left empty, and the seccomp init function exits without taking action).
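To make the "empty list means no action" behaviour concrete, here is a minimal sketch in plain Go. It is not the PR's code: `blockedSet` is a hypothetical helper, and the `Config` type below copies only the `RestrictSyscalls` field from the diff. It assumes the init path simply checks whether the list is empty before building any filter.

```go
package main

import "fmt"

// Config mirrors only the field this PR adds to libcontainer's Config;
// every other field is omitted for brevity.
type Config struct {
	RestrictSyscalls []string `json:"restrict_syscalls,omitempty"`
}

// blockedSet is a hypothetical helper (not from the PR). It returns the set
// of syscall names to deny and whether a filter needs to be installed at all.
// An empty list yields (nil, false), so callers can return early and leave
// the container's behaviour unchanged.
func blockedSet(c Config) (map[string]bool, bool) {
	if len(c.RestrictSyscalls) == 0 {
		return nil, false
	}
	set := make(map[string]bool, len(c.RestrictSyscalls))
	for _, name := range c.RestrictSyscalls {
		set[name] = true
	}
	return set, true
}

func main() {
	configs := []Config{
		{}, // client unaware of seccomp support
		{RestrictSyscalls: []string{"kexec_load", "init_module"}},
	}
	for _, c := range configs {
		set, install := blockedSet(c)
		fmt.Printf("install filter: %v, blocked syscalls: %d\n", install, len(set))
	}
}
```

The early return is what keeps existing clients unaffected: with no list there is nothing to load into the kernel.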

This PR adds a vendored library dependency (Go bindings for libseccomp) and a build dependency on libseccomp >= v2.1. The actual changes to libcontainer are fairly minimal; most of the delta is in the libseccomp bindings.

Presently missing: integration tests, documentation

Docker-DCO-1.1-Signed-off-by: Matt Heon [email protected] (github: mheon)
Docker-DCO-1.1-Signed-off-by: Dan Walsh [email protected] (github: rhatdan)

rhatdan · 2014-11-14T16:04:57Z

Replaces [RFC] Introduce support for syscall filtering in containers #237

Matt is back at school for the semester, so I want to drive this one home.

I added a "seccomp" compilation option for platforms that do not support seccomp.

I also added a test.go, but I can't seem to get it to run.

I hacked up the Dockerfile to pass in the --tag seccomp call, but this is far from ideal.

crosbymichael · 2014-11-19T01:44:14Z

config.go

@@ -68,6 +68,9 @@ type Config struct {
 	// RestrictSys will remount /proc/sys, /sys, and mask over sysrq-trigger as well as /proc/irq and
 	// /proc/bus
 	RestrictSys bool `json:"restrict_sys,omitempty"`
+
+	// Syscalls which will be restricted on container start
+	RestrictSyscalls []string `json:"restrict_syscalls,omitempty"`


Do you think we want to go ahead and create a struct for the values here? We will probably want to do things like prevent certain flags from being passed to syscalls such as clone.

Ok, I switched to a struct that takes either an Architecture or a Syscall, with optional Args.

crosbymichael · 2014-11-19T01:48:01Z

The import path is giving me a 404 http://sourceforge.net/seccomp

I cannot go get it

mrunalp · 2014-11-19T06:29:21Z

Same here. Also, can't find the project on sourceforge.net.

rhatdan · 2014-11-19T14:37:45Z

I was just sort of making this part up. The library is actually at:

http://sourceforge.net/projects/libseccomp

Should I just specify this as sourceforge.net/projects/libseccomp?

mrunalp · 2014-11-19T16:45:52Z

@rhatdan That looks like the repo for the actual libseccomp code, not the go wrapper library. I guess the requirement would be to put it anywhere such that doing a go get on the URL works with it. Putting it on github might be the easiest. I am not sure if go get works with sourceforge.net URLs or not.

rhatdan · 2014-11-19T17:02:41Z

@mrunalp The problem with that is the libseccomp maintainer Paul Moore wants to control the go code. I will ping him to join this conversation, and see if we can get it in a proper place.

mrunalp · 2014-11-19T17:05:48Z

@rhatdan Ahh, okay. SGTM.

pcmoore · 2014-11-19T21:45:50Z

The golang bindings do currently live inside the sf.net git repo, they are in the "working-golang" branch and tagged "go1" so that they can be fetched with the following command line:

go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

mrunalp · 2014-11-19T21:51:01Z

@pcmoore When I try that go get, it prompts me for sf.net username/password. Is there a way to do an anonymous checkout? Also, I am not sure if there is a requirement for vendored code to be in master branch or not. @crosbymichael would know.

pcmoore · 2014-11-19T22:16:22Z

@mrunalp I don't know what to say, it works for me without any credentials:

ssh-add -l
The agent has no identities.
go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang
echo $?
0

@mrunalp @crosbymichael As far as the git branch, the bindings live in the working-golang branch as opposed to the master branch because they are still a work in progress and I'm not yet comfortable enough with the API to "release" the bindings. Once the binding's API is stable I'll merge the working-golang branch into master.

mrunalp · 2014-11-20T00:35:03Z

@pcmoore I tried it on my laptop and go get worked. It did not work from inside my development vm, though. I guess that should be okay.

crosbymichael · 2014-11-20T00:35:24Z

So in the meantime, what are we supposed to do with this PR?

Also, it is common in Go that you can go get the code and that the URL matches the import path. That is not the case here, which can cause a lot of confusion.

rhatdan · 2014-11-20T12:01:10Z

ssh-add -l
2048 a0:41:29:44:1a:99:58:16:bc:01:eb:0b:f7:0b:9b:61 [email protected] (RSA)
4096 f1:72:89:1f:fa:a8:1c:82:2f:57:ea:8d:5b:6f:8d:87 dwalsh@redsox (RSA)
2048 10:23:12:10:74:e9:9e:54:fc:52:7d:8a:49:90:4d:51 [email protected] (RSA)
$ go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

It just hangs for me.

rhatdan · 2014-11-20T12:02:26Z

I will change the path to whatever is agreed upon, but I would like to have comments on the "Struct".

Also Paul, I need to be able to handle the Syscall + Param calls. I think we will need go bindings for this.

pcmoore · 2014-11-20T15:23:43Z

@rhatdan Regarding the hang, it does take some time for me, likely due to "go get" building the library?

time go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

real 0m45.908s
user 0m0.499s
sys 0m0.173s

@rhatdan Regarding bindings, we'll want to have golang bindings for everything that libseccomp supports, see the Python bindings. I'm just stuck dealing with the steaming pile that is audit at the moment, I likely won't have time to work on this for a bit.

cyphar · 2014-12-12T09:34:31Z

+1 for syscall filtering. We need all the security improvements and lock-downs we can get. :P


jandre · 2015-01-30T21:24:55Z

Hi, is there any reason why this uses a blacklist approach only? It's kind of weird. Should the user not be able to specify whether they want to run in blacklist or whitelist mode?

rhatdan · 2015-01-31T10:56:40Z

@jandre We could support both, but the whitelist would be a lot harder to put together.
I could see a mode where we drop all syscalls and then add them back in.

jandre · 2015-01-31T15:20:39Z

Putting a whitelist together is certainly not an easy thing (and it will vary from app to app), but I imagine you could put seccomp into a "permit, but log unexpected behaviors" mode (similar to how you would train AppArmor or SELinux profiles by putting them in complain mode). There's a quick demo of this here: http://outflux.net/teach-seccomp/ (see syscall-reporter). Then, you could create tools that read the logfile and build a profile automagically (e.g., something like the AppArmor easyprof tool).

Anyway, I think having it by default in blacklist mode is a good idea, but allowing the user to put it in whitelist mode gives a lot more flexibility. Let's say you were leveraging containers as a true 'sandboxed' compute cluster, this would be a great step into making that happen securely in the future.

rhatdan · 2015-01-31T18:14:05Z

Yes I agree, although it is difficult to figure out when you have done enough testing before putting it in non-testing mode. SELinux and AppArmor are a little bit simpler in that I think the access is a little easier to understand.

I don't think the patch currently allows us to specify what happens when a process is not allowed a syscall; I think this version will return EPERM.

@mheon do you know?

cyphar · 2015-02-01T01:00:59Z

I have a feeling that the default set of enabled syscalls should be a whitelist, and people can add/remove from that set (just as with capabilities). We've been burned by blacklists before, let's not do it again.

pcmoore · 2015-02-01T01:53:09Z

@jandre Unfortunately the permissive/reporting mode for seccomp isn't really the same as what exists with SELinux/AppArmor/etc. There are a number of limitations; the most significant is that the individual container applications will likely need to be made aware of seccomp and ensure they don't overwrite the necessary signal handlers.

There is currently no good permissive/reporting mode for seccomp filters.

rhatdan · 2015-02-01T11:51:20Z

@cyphar It is not that easy, and really syscall filtering is not that similar to capabilities, SELinux, or AppArmor, which are all about whitelisting. All syscall filtering is doing is reducing the attack surface on the kernel. There are going to be huge holes that cannot be closed, like ioctl.

The bottom line with this technology is that if we choose a blacklist, people will use it. We might even be able to slowly increase the blacklist. If we choose a whitelist we will break lots of apps, and people will not take the time to figure out what to add, so they will run in --privileged mode, turning off ALL security including SELinux, Capabilities, Seccomp, UserNamespace...

The other tools I have looked at that use seccomp (systemd-nspawn, qemu, and another whose name I cannot remember) use a blacklist.

If you can really define the application that you are going to run, you can really lock it down with tools like SELinux/AppArmor and Seccomp, but when you have a general purpose tool like Docker Containers, I would argue it is almost impossible to successfully define a limited whitelist.

jandre · 2015-02-01T17:05:25Z

I 100% agree the default should be a blacklist (or maybe just have seccomp filtering disabled by default), but what I don't understand (and this is just simply a flag in libseccomp init) is why the user can't toggle between blacklist mode and whitelist mode, should they need to. You are making it seem like we have to do one or the other. lxc containers allow you to specify this in the first line of their config. See: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v2-blacklist.conf vs: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v1.conf

I think in the general use case, you have to be as permissive as possible in order to avoid breaking existing applications. But in order to enable more sandbox-like use cases for Docker containers for those who wish to configure it, you will want the ability to throttle this as tightly as possible.

rhatdan · 2015-02-01T20:57:54Z

Agreed we should make seccomp patch handle both models. Hopefully libseccomp makes it easy to do this.

pcmoore · 2015-02-01T21:01:06Z

@rhatdan It's easy, you just pick the default action (allow for a blacklist, kill for a whitelist) when you call seccomp_init().
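As a rough model of that choice, the sketch below picks the default action from the mode and applies the opposite action to the listed syscalls. It is plain Go with no libseccomp calls; the `Filter`, `Mode`, and `Action` types are hypothetical. Returning an errno for blacklisted calls is an assumption taken from the EPERM guess earlier in this thread, not confirmed behaviour of the patch.

```go
package main

import "fmt"

// Action models what happens to a syscall; the names are illustrative only.
type Action int

const (
	ActAllow Action = iota // let the syscall through
	ActKill                // kill the calling process
	ActErrno               // fail the call with an errno such as EPERM (assumed)
)

func (a Action) String() string {
	return [...]string{"allow", "kill", "errno"}[a]
}

// Mode selects how the syscall list is interpreted.
type Mode int

const (
	Blacklist Mode = iota
	Whitelist
)

// Filter is a hypothetical in-memory stand-in for a seccomp filter.
type Filter struct {
	Default Action
	Rules   map[string]Action
}

// NewFilter chooses the default action from the mode: allow for a blacklist,
// kill for a whitelist. Listed syscalls get the opposite treatment.
func NewFilter(mode Mode, syscalls []string) Filter {
	f := Filter{Default: ActAllow, Rules: make(map[string]Action)}
	listed := ActErrno
	if mode == Whitelist {
		f.Default = ActKill
		listed = ActAllow
	}
	for _, s := range syscalls {
		f.Rules[s] = listed
	}
	return f
}

// Decide returns the action for a syscall, falling back to the default.
func (f Filter) Decide(syscall string) Action {
	if a, ok := f.Rules[syscall]; ok {
		return a
	}
	return f.Default
}

func main() {
	bl := NewFilter(Blacklist, []string{"kexec_load"})
	wl := NewFilter(Whitelist, []string{"read", "write", "exit_group"})
	for _, s := range []string{"read", "kexec_load"} {
		fmt.Printf("%-10s blacklist=%v whitelist=%v\n", s, bl.Decide(s), wl.Decide(s))
	}
}
```

In libseccomp itself the same switch is the single argument passed to `seccomp_init()`, so supporting both modes should mostly be a matter of exposing that choice in the container config.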

rhatdan · 2015-02-01T21:02:32Z

Ok, I will have @mheon update the patch.

cyphar · 2015-02-01T23:32:27Z

@rhatdan Surely if we can create a blacklist of syscalls, and we know what syscalls currently exist, then we can create a whitelist of syscalls? I understand that it might be an issue to actually create the whitelist, but when you're dealing with syscall filtering you're going to break some applications anyway. If you have the ability to specify which syscalls to explicitly allow (or disallow), then maintainers can trivially fix their runconfig options (it's just a matter of straceing the binary). I'm just thinking forwards on this one: if (say 15 kernel revisions later) a set of new syscalls are added (and since they are new, there are probably bugs that could lead to security vulnerabilities), then those syscalls increase the attack surface by a wide margin in the interlude between the kernel releasing the new syscalls and us blacklisting them.

After all: {whitelist} = {everything} - {blacklist}, except for the fact that a blacklist allows more things if the set of everything increases, while a whitelist blocks more things if the set of everything increases.
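The set argument can be sketched in a few lines of Go. The syscall lists here are tiny placeholders, and `shiny_new_call` is a made-up name standing in for a syscall added by a later kernel; the helper functions are hypothetical and only illustrate the asymmetry.

```go
package main

import "fmt"

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// allowedUnderBlacklist returns everything the kernel offers except the
// blacklisted syscalls.
func allowedUnderBlacklist(kernel, blacklist []string) []string {
	var out []string
	for _, s := range kernel {
		if !contains(blacklist, s) {
			out = append(out, s)
		}
	}
	return out
}

// allowedUnderWhitelist returns only the whitelisted syscalls that the
// kernel actually offers.
func allowedUnderWhitelist(kernel, whitelist []string) []string {
	var out []string
	for _, s := range kernel {
		if contains(whitelist, s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	kernel := []string{"read", "write", "kexec_load"}
	blacklist := []string{"kexec_load"}

	// {whitelist} = {everything} - {blacklist}, computed for today's kernel.
	whitelist := allowedUnderBlacklist(kernel, blacklist)

	// A later kernel adds a syscall (hypothetical name).
	newer := append(append([]string{}, kernel...), "shiny_new_call")

	fmt.Println("blacklist policy allows:", allowedUnderBlacklist(newer, blacklist))
	fmt.Println("whitelist policy allows:", allowedUnderWhitelist(newer, whitelist))
}
```

The two policies are equivalent on the kernel they were written for and diverge only when the set of syscalls grows: the blacklist silently admits the new call, while the whitelist keeps it out until someone adds it.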

pcmoore · 2015-02-02T01:08:44Z

@cyphar It is important not to underestimate the difficulty of creating, and maintaining, a syscall whitelist for an application.

cyphar · 2015-02-02T06:09:29Z

@pcmoore For some reason, my edit to that comment isn't being shown. But yeah, I understand that maintaining a syscall whitelist would not be an easy task. As long as we support both modes, people who have the time and resources to maintain a syscall whitelist can do so.

rhatdan · 2015-02-02T13:13:41Z

@cyphar I would argue that your statement, "then maintainers can trivially fix their runconfig options (it's just a matter of straceing the binary)", is the problem. This is not the way Docker works. Docker does not support an alternate runconfig per image; it is one size fits all, although this is something I wish to fix in the future. When someone writes the apache image, they do not know what applications the user will run on top of the image. So how do they define a whitelist of which syscalls can or cannot be run within the container?

When the end user or admin of the container image runs the app and it fails (and it will), they will not diagnose the problem by looking in some strange log file like /var/log/audit/audit.log to realize the "foobar" syscall is blocked, then change their docker run command to include --security-opt seccomp:allow:foobar to get their app to run, only to find out that they also needed the ABC syscall. Lather, rinse, repeat. What they will do is say docker sucks, then run docker run --privileged and call it a day.

Theoretically, if we had a way to allow the image to specify the command line to be used when the container is run, and the developer of the image did enough testing, then a whitelist approach might be possible.

rhatdan · 2015-02-17T16:59:26Z

Replacing this pull request with #384

rhatdan force-pushed the seccomp branch from 1220ca8 to be96531 Compare November 14, 2014 16:08

rhatdan changed the title ~~Introduce support for syscall filtering in containers #237~~ Introduce support for syscall filtering in containers Nov 14, 2014

rhatdan force-pushed the seccomp branch from be96531 to 1943e00 Compare November 14, 2014 16:12

mrunalp mentioned this pull request Nov 15, 2014

[RFC] Introduce support for syscall filtering in containers #237

Closed

crosbymichael reviewed Nov 19, 2014

rhatdan force-pushed the seccomp branch from 1943e00 to 6f66c9e Compare November 19, 2014 14:35

rhatdan force-pushed the seccomp branch from 6f66c9e to 6d1be31 Compare December 2, 2014 19:16

rhatdan force-pushed the seccomp branch from 6d1be31 to 91c3149 Compare January 7, 2015 21:41

mheon mentioned this pull request Feb 17, 2015

Introduce support for syscall filtering in containers #237 #384

Closed

rhatdan closed this Feb 17, 2015
