-
Notifications
You must be signed in to change notification settings - Fork 554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CPU affinity to executed processes #1253
Conversation
8179d10
to
cb918d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@opencontainers/runtime-spec-maintainers PTAL (I'm going to merge this this week if there are no other comments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
config.md
Outdated
on after the transition to container's cgroup. The format is either the | ||
same as for `initial`, or a special value `cgroup` which means container's | ||
CPU affinity, as defined by [cpu.cpus property](./config.md#configLinuxCPUs)). | ||
If omitted or empty, the `initial` setting is kept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be the opposite? I mean "if ommitted, cpu.cpus will be in effect"?
The initial
seems to allow to set affinity to cpu cores that are outside of the set which is specified by cpu.cpus
.
This also trigger the question of validation of the values between cpuAffinity
and cpu.cpus
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two questions here.
-
What is the default if
final
is not set. You are probably right thatcpu.cpus
makes more sense. UPDATED. -
Whether to allow affinity that's outside of container's cpu.cpus. While it seems theoretically possible for the
initial
cpuset, I haven't checked that it works, and I don't want to add any specific constraints to the spec (or to runc/crun).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL @kad
@kolyshkin please, give us some time, we would like to do few experiments with @klihub, as we also observed corner cases scenarios with DPDK-style apps in the past, but we need some time to replicate that environment and see behaviour in case of invalid arguments between requests via cgroups or Regarding PR in general:
|
@kad thanks for reviewing this!
I think you are misunderstanding. These two are for
See, they (
Again, this is about I think it should be up to implementation what to do it case of errors. IOW the spec should not be way too rigid. Maybe some implementation could have a flag saying "ignore this" or "ignore error from that" or "if this won't work, fall back to that" etc. If the spec is too rigid, such hacks render an implementation incompatible, which I guess is not good. |
I guess I am merging this later today. @kad if you have something, please open a followup PR. |
We also need to add a flag into runtime features. |
I couldn't find a good place to add such a flag, so maybe the upper-level runtime (cri-o, containerd) can check the |
if the feature piece isn't sanctified in the runtime spec that's fine, as long as runtimes that support it agree on the name of the feature |
sorry for delay, missed notification from previous reply :(
well, then it is even worse with naming, I'd say. Description says "specifies CPU affinity used to execute the process." but doesn't mention that it is not applicable for normal container entry point process. Furthermore, the "exec" is not really part of the runtime spec since 2016 or so. Yes, runtimes are implementing it, and it can be used, but it is not covered by spec officially. So, any fields in the spec is originally referring to standard operations create/start/stop/status/... if we want this field only be respected on exec, we have two choices: a) explicitly call it with exec prefix in name or something similar, b) having field still named more generically as now, but inside
it is not really a problem to have empty struct, effect should be the same as it is not specified at all. One thing which I'm asking about 'initial' be optional, is because sched_cpuaffinity is reset once process is attached to cgroup with cpuset. So, practically So, my line of thinking is like this:
with both of those variants being optional, it would allow to do affinity only for "infrastructure" operations before entering into container, only after entering container, or both. And if we do it linked with other container lifecycle points like createRuntime, createRuntime, etc... it will give more flexibility to set affinity for all scenarios.
well,
Regarding implementation: if sched_cpuaffinity() is called with disjoint set of cores compared to cpuset.cpus, it would return EINVAL. for the rest, it is AND masked with cpuset.cpus.... So, in worst case scenarios the requested affinity might not be in place at all (if error is ignored) or failure (if error not ignored), to "best" case where affinity would be set to subset from what was actually requested.... |
Also, s/in/on/g. Signed-off-by: Kir Kolyshkin <[email protected]>
@kad thanks, changed like this: diff --git a/config.md b/config.md
index eab6eec..b9b5573 100644
--- a/config.md
+++ b/config.md
@@ -340,9 +340,10 @@ For Linux-based systems, the `process` object supports the following process-spe
* **`class`** (string, REQUIRED) specifies the I/O scheduling class. Possible values are `IOPRIO_CLASS_RT`, `IOPRIO_CLASS_BE`, and `IOPRIO_CLASS_IDLE`.
* **`priority`** (int, REQUIRED) specifies the priority level within the class. The value should be an integer ranging from 0 (highest) to 7 (lowest).
-* **`cpuAffinity`** (object, OPTIONAL) specifies CPU affinity used to execute the process.
+* **`execCPUAffinity`** (object, OPTIONAL) specifies CPU affinity used to execute the process.
+ This setting is not applicable to the container's init process.
The following properties are available:
- * **`initial`** (string, REQUIRED) is a list of CPUs a runtime parent
+ * **`initial`** (string, OPTIONAL) is a list of CPUs a runtime parent
process to be run on initially, before the transition to container's
cgroup. This is a a comma-separated list, with dashes to represent
ranges. For example, `0-3,7` represents CPUs 0,1,2,3, and 7.
@@ -427,7 +427,7 @@ _Note: symbolic name for uid and gid, such as uname and gname respectively, are
"soft": 1024
}
],
- "cpuAffinity": {
+ "execCPUAffinity": {
"initial": "7",
"final": "0-3,7"
}
diff --git a/schema/config-schema.json b/schema/config-schema.json
index 39cc71e..cb74342 100644
--- a/schema/config-schema.json
+++ b/schema/config-schema.json
@@ -221,11 +221,8 @@
}
}
},
- "cpuAffinity": {
+ "execCPUAffinity": {
"type": "object",
- "required": [
- "initial"
- ],
"properties": {
"initial": {
"type": "string",
diff --git a/specs-go/config.go b/specs-go/config.go
index 00bf946..290b6dc 100644
--- a/specs-go/config.go
+++ b/specs-go/config.go
@@ -94,8 +94,8 @@ type Process struct {
SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
// IOPriority contains the I/O priority settings for the cgroup.
IOPriority *LinuxIOPriority `json:"ioPriority,omitempty" platform:"linux"`
- // CPUAffinity specifies CPU affinity for the process.
- CPUAffinity *CPUAffinity `json:"cpuAffinity,omitempty" platform:"linux"`
+ // ExecCPUAffinity specifies CPU affinity for exec processes.
+ ExecCPUAffinity *CPUAffinity `json:"execCPUAffinity,omitempty" platform:"linux"`
}
// LinuxCapabilities specifies the list of allowed capabilities that are kept for a process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for update. looks better now. I'm still not convinced about names initial
and final
, maybe something more descriptive can be done like execRuntime
/ execContainer
for consistency with hooks?
But for that I'll let others to chime in.
* **`final`** (string, OPTIONAL) is a list of CPUs the process will be run | ||
on after the transition to container's cgroup. The format is the same as | ||
for `initial`. If omitted or empty, the container's default CPU affinity, | ||
as defined by [cpu.cpus property](./config.md#configLinuxCPUs)), is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the tests, if you don't call sched_setaffinity()
, the kernel itself will reset it to cgroups cpuset.cpu, so, maybe here statement can be simplified, that if not specified, after transition into container namespace, default affinity depends on kernel implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is nicer to have a reference to another way of setting cpu affinity here. If removed, it still says the same thing:
on after the transition to container's cgroup. The format is the same as
- for `initial`. If omitted or empty, the container's default CPU affinity,
- as defined by [cpu.cpus property](./config.md#configLinuxCPUs)), is used.
+ for `initial`. If omitted or empty, the container's default CPU affinity
+ applies.
so I think it's better to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, ok for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the tests, if you don't call sched_setaffinity(), the kernel itself will reset it to cgroups cpuset.cpu
@kad It seems your tests are wrong. Mine are showing this only happens when the initial (pre cgroup-move) affinity is not a subset of cgroup's affinity.
So maybe we need to change the spec to say something like
If omitted or empty, no attempt to change the CPU affinity of the process after moving it to container's cgroup is made, and its final affinity is determined by the Linux kernel.
I'd rather not go into more details here, because otherwise I'll be documenting the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my tests, it is not the case. (-p is "pause" where I'm attaching in another terminal process to cgroup with cpuset.cpus=0-10):
r640-2:/home/akanevsk # ~akanevsk/getaffinity -g -s "0-5" -g -p -g
get_affinity: 0-87
set_affinity: affinity set to 0-5
get_affinity: 0-5
get_affinity: 0-10
r640-2:/home/akanevsk # ~akanevsk/getaffinity -g -s "50-60" -g -p -g
get_affinity: 0-87
set_affinity: affinity set to 50-60
get_affinity: 50-60
get_affinity: 0-10
This allows to set initial and final CPU affinity for a process being run in a container, which is needed to solve the issue described in [1]. [1] opencontainers/runc#3922 Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
This is to include - opencontainers/runtime-spec#1261 - opencontainers/runtime-spec#1253 Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
This is to include - opencontainers/runtime-spec#1261 - opencontainers/runtime-spec#1253 Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 Add some tests (alas it's impossible to test initial CPU affinity without adding debug logging to nsexec). Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is when applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which applied to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <[email protected]>
This is to include - opencontainers/runtime-spec#1261 - opencontainers/runtime-spec#1253 Signed-off-by: Kir Kolyshkin <[email protected]>
As per - opencontainers/runtime-spec#1253 - opencontainers/runtime-spec#1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is when applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which applied to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <[email protected]>
This allows to set initial and final CPU affinity for a process being run in a container, which is needed to solve the issue described in opencontainers/runc#3922.
This is an alternative to #1252.