Skip to content

Commit

Permalink
config: Make capabilities, noNewPrivileges, and rlimits Linux-only (a…
Browse files Browse the repository at this point in the history
…gain)

Roll back the genericization from 718f9f3 (minor narrative cleanup
regarding config compatibility, 2017-01-30, #673).  Lifting the
restriction there seems to have been motivated by "Solaris supports
capabilities", but that was before the split into a capabilities
object which happened in eb114f0 (Add ambient and bounding capability
support, 2017-02-02, #675).  It's not clear if Solaris supports
ambient caps, or what Solaris API rlimits or noNewPrivileges were
punting to [1].  And John Howard has recently confirmed that Windows
does not support capabilities and is unlikely to do so in the future
[2].  John's statement didn't directly address rlimits or
noNewPrivileges, but we can always restore any of these properties to
the Solaris/Windows platforms if/when we get docs about which API
we're punting to on those platforms.

Also add some backticks, remove the hyphens in "OPTIONAL) - the",
standardize lines I touch to use "the process" [3], and use four-space
indents here to keep Pandoc happy (see 7795661 (runtime.md: Fix
sub-bullet indentation, 2016-06-08, #495).

[1]: #673 (comment)
[2]: #810 (comment)
[3]: #809 (comment)

Signed-off-by: W. Trevor King <[email protected]>
  • Loading branch information
wking committed May 18, 2017
1 parent 3036273 commit 550db9f
Showing 1 changed file with 21 additions and 18 deletions.
39 changes: 21 additions & 18 deletions config.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,35 +130,38 @@ For Solaris, the mount entry corresponds to the 'fs' resource in the [zonecfg(1M
* **`env`** (array of strings, OPTIONAL) with the same semantics as [IEEE Std 1003.1-2001's `environ`][ieee-1003.1-2001-xbd-c8.1].
* **`args`** (array of strings, REQUIRED) with similar semantics to [IEEE Std 1003.1-2001 `execvp`'s *argv*][ieee-1003.1-2001-xsh-exec].
This specification extends the IEEE standard in that at least one entry is REQUIRED, and that entry is used with the same semantics as `execvp`'s *file*.
* **`capabilities`** (object, OPTIONAL) is an object containing arrays that specifies the sets of capabilities for the process(es) inside the container. Valid values are platform-specific. For example, valid values for Linux are defined in the [capabilities(7)][capabilities.7] man page, such as `CAP_CHOWN`. Any value which cannot be mapped to a relevant kernel interface MUST cause an error.
capabilities contains the following properties:
* **`effective`** (array of strings, OPTIONAL) - the `effective` field is an array of effective capabilities that are kept for the process.
* **`bounding`** (array of strings, OPTIONAL) - the `bounding` field is an array of bounding capabilities that are kept for the process.
* **`inheritable`** (array of strings, OPTIONAL) - the `inheritable` field is an array of inheritable capabilities that are kept for the process.
* **`permitted`** (array of strings, OPTIONAL) - the `permitted` field is an array of permitted capabilities that are kept for the process.
* **`ambient`** (array of strings, OPTIONAL) - the `ambient` field is an array of ambient capabilities that are kept for the process.
* **`rlimits`** (array of objects, OPTIONAL) allows setting resource limits for a process inside the container.
Each entry has the following structure:

* **`type`** (string, REQUIRED) - the platform resource being limited, for example on Linux as defined in the [setrlimit(2)][setrlimit.2] man page.
* **`soft`** (uint64, REQUIRED) - the value of the limit enforced for the corresponding resource.
* **`hard`** (uint64, REQUIRED) - the ceiling for the soft limit that could be set by an unprivileged process. Only a privileged process (e.g. under Linux: one with the CAP_SYS_RESOURCE capability) can raise a hard limit.

If `rlimits` contains duplicated entries with same `type`, the runtime MUST error out.

* **`noNewPrivileges`** (bool, OPTIONAL) setting `noNewPrivileges` to true prevents the processes in the container from gaining additional privileges.
As an example, the ['no_new_privs'][no-new-privs] article in the kernel documentation has information on how this is achieved using a prctl system call on Linux.

For Linux-based systems the process structure supports the following process-specific fields.

* **`apparmorProfile`** (string, OPTIONAL) specifies the name of the AppArmor profile to be applied to processes in the container.
For more information about AppArmor, see [AppArmor documentation][apparmor].
* **`capabilities`** (object, OPTIONAL) is an object containing arrays that specifies the sets of capabilities for the process.
Valid values are defined in the [capabilities(7)][capabilities.7] man page, such as `CAP_CHOWN`.
Any value which cannot be mapped to a relevant kernel interface MUST cause an error.
`capabilities` contains the following properties:

* **`effective`** (array of strings, OPTIONAL) the `effective` field is an array of effective capabilities that are kept for the process.
* **`bounding`** (array of strings, OPTIONAL) the `bounding` field is an array of bounding capabilities that are kept for the process.
* **`inheritable`** (array of strings, OPTIONAL) the `inheritable` field is an array of inheritable capabilities that are kept for the process.
* **`permitted`** (array of strings, OPTIONAL) the `permitted` field is an array of permitted capabilities that are kept for the process.
* **`ambient`** (array of strings, OPTIONAL) the `ambient` field is an array of ambient capabilities that are kept for the process.
* **`noNewPrivileges`** (bool, OPTIONAL) setting `noNewPrivileges` to true prevents the process from gaining additional privileges.
As an example, the [`no_new_privs`][no-new-privs] article in the kernel documentation has information on how this is achieved using a `prctl` system call on Linux.
* **`oomScoreAdj`** *(int, OPTIONAL)* adjusts the oom-killer score in `[pid]/oom_score_adj` for the container process's `[pid]` in a [proc pseudo-filesystem][procfs].
If `oomScoreAdj` is set, the runtime MUST set `oom_score_adj` to the given value.
If `oomScoreAdj` is not set, the runtime MUST NOT change the value of `oom_score_adj`.

This is a per-process setting, where as [`disableOOMKiller`](config-linux.md#disable-out-of-memory-killer) is scoped for a memory cgroup.
For more information on how these two settings work together, see [the memory cgroup documentation section 10. OOM Contol][cgroup-v1-memory_2].
* **`rlimits`** (array of objects, OPTIONAL) allows setting resource limits for the process.
Each entry has the following structure:

* **`type`** (string, REQUIRED) the platform resource being limited as defined in the [`setrlimit(2)`][setrlimit.2] man page.
* **`soft`** (uint64, REQUIRED) the value of the limit enforced for the corresponding resource.
* **`hard`** (uint64, REQUIRED) the ceiling for the soft limit that could be set by an unprivileged process.
Only a privileged process (e.g. one with the `CAP_SYS_RESOURCE` capability) can raise a hard limit.

If `rlimits` contains duplicated entries with same `type`, the runtime MUST error out.
* **`selinuxLabel`** (string, OPTIONAL) specifies the SELinux label to be applied to the processes in the container.
For more information about SELinux, see [SELinux documentation][selinux].

Expand Down

0 comments on commit 550db9f

Please sign in to comment.