This repository has been archived by the owner on Dec 13, 2018. It is now read-only.

Introduce support for syscall filtering in containers #263

Closed
wants to merge 1 commit into from

rhatdan
Contributor

@rhatdan rhatdan commented Nov 14, 2014

This PR introduces the ability to filter system calls on a per-container basis on Linux, using libseccomp to support multiple architectures.

This adds another layer of security between containers and the kernel. System calls which are unnecessary in a container or problematic from a security perspective can be restricted to prevent their use. Most of the truly problematic syscalls are already restricted by dropping capabilities; this adds an additional, finer-grained layer of protection.

There's a similar feature present in LXC already, with the significant difference that LXC uses a whitelist of system calls, whereas these patches use a blacklist. The blacklist approach ensures no difference in functionality to clients not explicitly aware of seccomp support (the restricted syscalls list in the container config is left empty, and the seccomp init function exits without taking action).
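The "empty list means no action" behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the trimmed-down `Config`, the `applyFunc` hook, and `initSeccomp` are hypothetical names, and the real patch calls into the libseccomp bindings instead of a callback.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down stand-in for the container config.
type Config struct {
	// RestrictSyscalls lists syscall names to deny; empty means "no filtering".
	RestrictSyscalls []string
}

// applyFunc stands in for whatever actually installs the filter (for
// example, a call into the libseccomp bindings). It is an assumption here.
type applyFunc func(denied []string) error

// initSeccomp reports whether a filter was installed. With an empty list it
// returns early without touching the process, so clients that are unaware
// of seccomp see no change in behaviour.
func initSeccomp(cfg Config, apply applyFunc) (bool, error) {
	if len(cfg.RestrictSyscalls) == 0 {
		return false, nil
	}
	if err := apply(cfg.RestrictSyscalls); err != nil {
		return false, fmt.Errorf("installing seccomp filter: %w", err)
	}
	return true, nil
}

func main() {
	record := func(denied []string) error {
		fmt.Println("would deny:", denied)
		return nil
	}

	applied, _ := initSeccomp(Config{}, record)
	fmt.Println("empty config installed a filter:", applied)

	applied, _ = initSeccomp(Config{RestrictSyscalls: []string{"kexec_load"}}, record)
	fmt.Println("non-empty config installed a filter:", applied)
}
```

The early return is what keeps existing configs working unchanged; anything stricter has to be opted into explicitly.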

This PR adds a vendored library dependency (Go bindings for libseccomp) and a build dependency on libseccomp >= v2.1. The actual changes to libcontainer are fairly minimal; most of the delta is in the libseccomp bindings.

Presently missing: integration tests, documentation

Docker-DCO-1.1-Signed-off-by: Matt Heon [email protected] (github: mheon)
Docker-DCO-1.1-Signed-off-by: Dan Walsh [email protected] (github: rhatdan)

@rhatdan
Contributor Author

rhatdan commented Nov 14, 2014

Replaces [RFC] Introduce support for syscall filtering in containers #237

Matt is back at school for the semester, so I want to drive this one home.

I added a "seccomp" compilation option, so the code can still be built on platforms that do not support seccomp.

I also added a test.go, but I can't seem to get it to run.

I hacked up the Dockerfile to pass in the --tag seccomp call, but this is far from ideal.

@rhatdan changed the title from "Introduce support for syscall filtering in containers #237" to "Introduce support for syscall filtering in containers" on Nov 14, 2014
@@ -68,6 +68,9 @@ type Config struct {
 	// RestrictSys will remount /proc/sys, /sys, and mask over sysrq-trigger as well as /proc/irq and
 	// /proc/bus
 	RestrictSys bool `json:"restrict_sys,omitempty"`
+
+	// Syscalls which will be restricted on container start
+	RestrictSyscalls []string `json:"restrict_syscalls,omitempty"`
Contributor

Do you think we should go ahead and create a struct for the values here? We will probably want to do things like prevent certain flags from being passed to syscalls such as clone.

Contributor Author

OK, I switched to a struct that takes either an Architecture or a Syscall, with optional Args.
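For readers without the updated diff at hand, a struct of that general shape might look like the sketch below. All type names, field names, and JSON keys are guesses made for illustration, not the definitions from the patch. The argument condition is deliberately simplistic, and the value 0x10000000 is the Linux `CLONE_NEWUSER` flag, used only as an example of restricting a flag to clone.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SyscallArg is a hypothetical condition on one syscall argument.
type SyscallArg struct {
	Index uint   `json:"index"` // which argument (0-based)
	Op    string `json:"op"`    // illustrative operator name, e.g. "has_bits"
	Value uint64 `json:"value"` // value the operator compares against
}

// SeccompRule names either an architecture or a syscall (never both),
// with optional argument conditions for syscall rules.
type SeccompRule struct {
	Architecture string       `json:"architecture,omitempty"`
	Syscall      string       `json:"syscall,omitempty"`
	Args         []SyscallArg `json:"args,omitempty"`
}

// Validate enforces the "either an Architecture or a Syscall" constraint.
func (r SeccompRule) Validate() error {
	switch {
	case r.Architecture == "" && r.Syscall == "":
		return errors.New("rule must name an architecture or a syscall")
	case r.Architecture != "" && r.Syscall != "":
		return errors.New("rule must not name both an architecture and a syscall")
	case r.Architecture != "" && len(r.Args) > 0:
		return errors.New("args only apply to syscall rules")
	}
	return nil
}

func main() {
	raw := `[
	  {"architecture": "x86_64"},
	  {"syscall": "clone", "args": [{"index": 0, "op": "has_bits", "value": 268435456}]}
	]`

	var rules []SeccompRule
	if err := json.Unmarshal([]byte(raw), &rules); err != nil {
		panic(err)
	}
	for _, r := range rules {
		fmt.Printf("%+v valid=%v\n", r, r.Validate() == nil)
	}
}
```

Keeping validation on the rule itself means a malformed config can be rejected before any filter is loaded.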

@crosbymichael
Contributor

The import path is giving me a 404: http://sourceforge.net/seccomp

I cannot go get it

@mrunalp
Contributor

mrunalp commented Nov 19, 2014

Same here. Also, can't find the project on sourceforge.net.

@rhatdan
Contributor Author

rhatdan commented Nov 19, 2014

I was just sort of making this part up. The library is actually at:

http://sourceforge.net/projects/libseccomp

Should I just specify this as sourceforge.net/projects/libseccomp?

@mrunalp
Contributor

mrunalp commented Nov 19, 2014

@rhatdan That looks like the repo for the actual libseccomp code, not the go wrapper library. I guess the requirement would be to put it anywhere such that doing a go get on the URL works with it. Putting it on github might be the easiest. I am not sure if go get works with sourceforge.net URLs or not.

@rhatdan
Contributor Author

rhatdan commented Nov 19, 2014

@mrunalp The problem with that is the libseccomp maintainer Paul Moore wants to control the go code. I will ping him to join this conversation, and see if we can get it in a proper place.

@mrunalp
Contributor

mrunalp commented Nov 19, 2014

@rhatdan Ahh, okay. SGTM.

@pcmoore

pcmoore commented Nov 19, 2014

The golang bindings currently live inside the sf.net git repo; they are in the "working-golang" branch and tagged "go1" so that they can be fetched with the following command line:

go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

@mrunalp
Contributor

mrunalp commented Nov 19, 2014

@pcmoore When I try that go get, it prompts me for sf.net username/password. Is there a way to do an anonymous checkout? Also, I am not sure if there is a requirement for vendored code to be in master branch or not. @crosbymichael would know.

@pcmoore

pcmoore commented Nov 19, 2014

@mrunalp I don't know what to say, it works for me without any credentials:

ssh-add -l
The agent has no identities.
go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang
echo $?
0

@mrunalp @crosbymichael As far as the git branch goes, the bindings live in the working-golang branch as opposed to the master branch because they are still a work in progress and I'm not yet comfortable enough with the API to "release" the bindings. Once the bindings' API is stable I'll merge the working-golang branch into master.

@mrunalp
Contributor

mrunalp commented Nov 20, 2014

@pcmoore I tried it on my laptop and go get worked. It did not work from inside my development vm, though. I guess that should be okay.

@crosbymichael
Contributor

So in the meantime, what are we supposed to do with this PR?

Also, the convention in Go is that you can go get the code and that the URL matches the import path. This is not the case here and can cause a lot of confusion.

@rhatdan
Contributor Author

rhatdan commented Nov 20, 2014

ssh-add -l
2048 a0:41:29:44:1a:99:58:16:bc:01:eb:0b:f7:0b:9b:61 [email protected] (RSA)
4096 f1:72:89:1f:fa:a8:1c:82:2f:57:ea:8d:5b:6f:8d:87 dwalsh@redsox (RSA)
2048 10:23:12:10:74:e9:9e:54:fc:52:7d:8a:49:90:4d:51 [email protected] (RSA)
$ go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

It just hangs for me.

@rhatdan
Contributor Author

rhatdan commented Nov 20, 2014

I will change the path to whatever is agreed upon, but I would like to have comments on the "Struct".

Also Paul, I need to be able to handle the Syscall + Param calls. I think we will need go bindings for this.

@pcmoore

pcmoore commented Nov 20, 2014

@rhatdan Regarding the hang, it does take some time for me, likely due to "go get" building the library?

time go get git.code.sf.net/p/libseccomp/libseccomp.git/src/golang

real 0m45.908s
user 0m0.499s
sys 0m0.173s

@rhatdan Regarding bindings, we'll want to have golang bindings for everything that libseccomp supports, see the Python bindings. I'm just stuck dealing with the steaming pile that is audit at the moment, I likely won't have time to work on this for a bit.

@cyphar
Contributor

cyphar commented Dec 12, 2014

+1 for syscall filtering. We need all the security improvements and lock-downs we can get. :P



@jandre

jandre commented Jan 30, 2015

Hi, is there any reason why this uses a blacklist-only approach? It's kind of weird. Should the user not be able to specify whether they want to run in blacklist or whitelist mode?

@rhatdan
Contributor Author

rhatdan commented Jan 31, 2015

@jandre We could support both, but the whitelist would be a lot harder to put together.
I could see a mode where we drop all syscalls and then add them back in.

@jandre

jandre commented Jan 31, 2015

Putting a whitelist together is certainly not an easy thing (and it will vary from app to app), but I imagine you could put seccomp into a "permit, but log unexpected behavior" mode (similar to how you would train AppArmor or SELinux profiles by putting them in complain mode). There's a quick demo of this here: http://outflux.net/teach-seccomp/ (see syscall-reporter). Then you could create tools that read the log file and build a profile automagically (e.g., something like the AppArmor easyprof tool).
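As a rough illustration of the "read the log file and build a profile" step, the sketch below collapses a list of logged syscall names into a sorted, de-duplicated whitelist. The input format (one syscall name per line, with `#` comment lines) is an assumption made up for this example; it is not the output format of syscall-reporter or of any audit tool, and `whitelistFromLines` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// whitelistFromLines takes log lines that each hold one syscall name
// (an assumed format) and returns the distinct names in sorted order.
// Blank lines and lines starting with '#' are ignored.
func whitelistFromLines(lines []string) []string {
	seen := make(map[string]struct{})
	for _, line := range lines {
		name := strings.TrimSpace(line)
		if name == "" || strings.HasPrefix(name, "#") {
			continue
		}
		seen[name] = struct{}{}
	}

	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Stand-in for the contents of a training-run log file.
	log := "read\nwrite\nopenat\nread\n\n# end of run 1\nexit_group\n"
	for _, name := range whitelistFromLines(strings.Split(log, "\n")) {
		fmt.Println(name)
	}
}
```

A profile built this way only covers the code paths exercised during the training run, so it is a starting point for a whitelist rather than a complete one.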

Anyway, I think having it by default in blacklist mode is a good idea, but allowing the user to put it in whitelist mode gives a lot more flexibility. If you were leveraging containers as a true 'sandboxed' compute cluster, this would be a great step toward making that happen securely in the future.

@rhatdan
Contributor Author

rhatdan commented Jan 31, 2015

Yes I agree, although it is difficult to figure out when you have done enough testing before putting it in non-testing mode. SELinux and AppArmor are a little bit simpler in that I think the access is a little easier to understand.

I don't think the patch currently allows us to specify what happens when a process is not allowed a syscall; I think this version will return EPERM.

@mheon do you know?

@cyphar
Contributor

cyphar commented Feb 1, 2015

I have a feeling that the default set of enabled syscalls should be a whitelist, and people can add/remove from that set (just as with capabilities). We've been burned by blacklists before, let's not do it again.

@pcmoore

pcmoore commented Feb 1, 2015

@jandre Unfortunately, the permissive/reporting mode for seccomp isn't really the same as what exists with SELinux/AppArmor/etc. There are a number of limitations; the most significant is that the individual container applications will likely need to be made aware of seccomp and ensure they don't overwrite the necessary signal handlers.

There is currently no good permissive/reporting mode for seccomp filters.

@rhatdan
Contributor Author

rhatdan commented Feb 1, 2015

@cyphar It is not that easy, and really syscall filtering is not that similar to capabilities, SELinux, or AppArmor, which are all about whitelisting. All syscall filtering does is reduce the attack surface of the kernel. There are going to be huge holes that cannot be closed, like ioctl.

The bottom line with this technology is that if we choose a blacklist, people will use it. We might even be able to slowly grow the blacklist. If we choose a whitelist, we will break lots of apps, and people will not take the time to figure out what to add, so they will run in --privileged mode, turning off ALL security including SELinux, Capabilities, Seccomp, UserNamespace...

The other tools I have looked at that use seccomp (systemd-nspawn, qemu, and another whose name I cannot remember) use a blacklist.

If you can really define the application that you are going to run, you can really lock it down with tools like SELinux/AppArmor and Seccomp, but when you have a general-purpose tool like Docker containers, I would argue it is almost impossible to successfully define a limited whitelist.

@jandre

jandre commented Feb 1, 2015

I 100% agree the default should be a blacklist (or maybe just have seccomp filtering disabled by default), but what I don't understand (and this is just simply a flag in libseccomp init) is why the user can't toggle between blacklist mode and whitelist mode, should they need to. You are making it seem like we have to do one or the other. lxc containers allow you to specify this in the first line of their config. See: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v2-blacklist.conf vs: https://github.com/lxc/lxc/blob/master/doc/examples/seccomp-v1.conf

I think in the general use case, you have to be as permissive as possible in order to avoid breaking existing applications. But in order to enable more sandbox-like use cases for Docker containers for those who wish to configure it, you will want the ability to throttle this as tightly as possible.

@rhatdan
Contributor Author

rhatdan commented Feb 1, 2015

Agreed, we should make the seccomp patch handle both models. Hopefully libseccomp makes it easy to do this.

@pcmoore

pcmoore commented Feb 1, 2015

@rhatdan It's easy, you just pick the default action (allow for a blacklist, kill for a whitelist) when you call seccomp_init().
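To make the "pick the default action" point concrete, here is a toy, pure-Go model of a filter, not a use of libseccomp itself. In the C library the choice would be made by passing something like `SCMP_ACT_ALLOW` or `SCMP_ACT_KILL` to `seccomp_init()`; the `Filter`, `NewBlacklist`, and `NewWhitelist` names below are invented for this sketch, and a single `Deny` action stands in for whatever the real filter would do (kill the process or return an errno).

```go
package main

import "fmt"

// Action is what the modelled filter does with a syscall.
type Action int

const (
	Allow Action = iota
	Deny
)

func (a Action) String() string {
	if a == Allow {
		return "allow"
	}
	return "deny"
}

// Filter is a toy model: one default action plus per-syscall exceptions.
type Filter struct {
	Default    Action
	Exceptions map[string]Action
}

func newFilter(def, exception Action, names []string) Filter {
	f := Filter{Default: def, Exceptions: make(map[string]Action, len(names))}
	for _, n := range names {
		f.Exceptions[n] = exception
	}
	return f
}

// NewBlacklist allows everything by default and denies the listed syscalls.
func NewBlacklist(denied ...string) Filter { return newFilter(Allow, Deny, denied) }

// NewWhitelist denies everything by default and allows the listed syscalls.
func NewWhitelist(allowed ...string) Filter { return newFilter(Deny, Allow, allowed) }

// Decide returns the exception for a syscall if one exists, else the default.
func (f Filter) Decide(syscall string) Action {
	if a, ok := f.Exceptions[syscall]; ok {
		return a
	}
	return f.Default
}

func main() {
	bl := NewBlacklist("kexec_load")
	wl := NewWhitelist("read", "write")
	for _, name := range []string{"read", "kexec_load", "ptrace"} {
		fmt.Printf("%-10s blacklist=%v whitelist=%v\n", name, bl.Decide(name), wl.Decide(name))
	}
}
```

The two modes share the same rule list mechanics; only the default and the meaning of a listed syscall are flipped, which is why supporting both is a small change.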

@rhatdan
Contributor Author

rhatdan commented Feb 1, 2015

Ok, I will have @mheon update the patch.

@cyphar
Contributor

cyphar commented Feb 1, 2015

@rhatdan Surely if we can create a blacklist of syscalls, and we know what syscalls currently exist, then we can create a whitelist of syscalls? I understand that it might be an issue to actually create the whitelist, but when you're dealing with syscall filtering you're going to break some applications anyway. If you have the ability to specify which syscalls to explicitly allow (or disallow), then maintainers can trivially fix their runconfig options (it's just a matter of straceing the binary). I'm just thinking ahead on this one: if (say, 15 kernel revisions later) a set of new syscalls is added (and since they are new, there are probably bugs that could lead to security vulnerabilities), then those syscalls increase the attack surface by a wide margin in the interval between the kernel releasing the new syscalls and us blacklisting them.

After all: {whitelist} = {everything} - {blacklist} (except for the fact that a blacklist allows more things if the set of everything increases, while a whitelist blocks more things if the set of everything increases).
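The set relation above, and the way the two lists diverge once the kernel gains a syscall, can be checked with a small sketch. The syscall names and the "newcall" addition are illustrative only, and the helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

func toSet(names []string) map[string]bool {
	s := make(map[string]bool, len(names))
	for _, n := range names {
		s[n] = true
	}
	return s
}

// difference returns, sorted, the elements of a that are not in b.
func difference(a, b []string) []string {
	exclude := toSet(b)
	var out []string
	for _, n := range a {
		if !exclude[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// intersection returns, sorted, the elements of a that are also in b.
func intersection(a, b []string) []string {
	include := toSet(b)
	var out []string
	for _, n := range a {
		if include[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	everything := []string{"read", "write", "kexec_load"}
	blacklist := []string{"kexec_load"}

	// {whitelist} = {everything} - {blacklist}, computed at one point in time.
	whitelist := difference(everything, blacklist)
	fmt.Println("derived whitelist:", whitelist)

	// A later kernel adds a syscall that neither list mentions.
	later := append(append([]string{}, everything...), "newcall")
	fmt.Println("allowed under blacklist:", difference(later, blacklist))
	fmt.Println("allowed under whitelist:", intersection(later, whitelist))
}
```

Under the blacklist the unlisted syscall is reachable until someone adds it to the list; under the whitelist it stays blocked until someone allows it.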

@pcmoore

pcmoore commented Feb 2, 2015

@cyphar It is important not to underestimate the difficulty of creating, and maintaining, a syscall whitelist for an application.

@cyphar
Contributor

cyphar commented Feb 2, 2015

@pcmoore For some reason, my edit to that comment isn't being shown. But yeah, I understand that maintaining a syscall whitelist would not be an easy task. As long as we support both modes, people who have the time and resources to maintain a syscall whitelist can do so.

@rhatdan
Contributor Author

rhatdan commented Feb 2, 2015

@cyphar, I would argue that your statement, "then maintainers can trivially fix their runconfig options (it's just a matter of straceing the binary)", is the problem. This is not the way Docker works. Docker does not support an alternate runconfig per image; it is one size fits all, although this is something I wish to fix in the future. When someone writes the apache image, they do not know what applications the user will run on top of the image. So how do they define a whitelist of what syscalls can or cannot be run within the container?

When the end user or admin of the container image runs the app and it fails, and it will, they will not diagnose the problem by looking in some strange log file like /var/log/audit/audit.log to realize the "foobar" syscall is blocked, then change their docker run command to include --security-opt seccomp:allow:foobar to get their app to run, only to find out that they also needed the ABC syscall. Lather, rinse, repeat. What they will do is say docker sucks, run docker run --privileged, and call it a day.

Theoretically, if we had a way to allow the image to specify the command line to be used when the container is run, and the developer of the image did enough testing, then a whitelist approach might be possible.

@rhatdan
Contributor Author

rhatdan commented Feb 17, 2015

Replacing this pull request with #384

@rhatdan rhatdan closed this Feb 17, 2015

6 participants