When --cpuset-cpus argument is used, processes inspecting CPU configuration in the container see all cores #20770

Closed
benjamincburns opened this issue Feb 29, 2016 · 31 comments


@benjamincburns

Output of docker version:

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 16:16:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 16:16:33 2016
 OS/Arch:      linux/amd64

Output of docker info:

sudo docker info
Containers: 66
 Running: 55
 Paused: 0
 Stopped: 11
Images: 110
Server Version: 1.10.2
Storage Driver: devicemapper
 Pool Name: docker-253:0-73188844-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 5.769 GB
 Data Space Total: 107.4 GB
 Data Space Available: 22.45 GB
 Metadata Space Used: 13.09 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.134 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.10.0-229.14.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 251.6 GiB
Name: [redacted]
ID: [redacted]
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Provide additional environment details (AWS, VirtualBox, physical, etc.):
Physical machine

List the steps to reproduce the issue:

  1. Run something like docker run -it --cpuset-cpus=0 centos:centos7
  2. In the container's console, run grep processor /proc/cpuinfo | wc -l

Describe the results you received:
Output: 32

Describe the results you expected:
Output: 1

Provide additional info you think is important:

Per the title, it appears that docker 1.10.2 isn't respecting the --cpuset-cpus argument. We have a number of containers for applications whose thread pools are sized based on the number of cores available. Since updating to 1.10.2 (from various versions starting somewhere in 1.3.x), the thread counts on our docker hosts are through the roof. [Edit: this wasn't actually linked to the update; rather, we'd deployed a few new containers which ran on mono at around the same time. This is still an issue, however.]

OS version info:

user@host ~ $ cat /etc/*release*
CentOS Linux release 7.1.1503 (Core) 
Derived from Red Hat Enterprise Linux 7.1 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.1.1503 (Core) 
CentOS Linux release 7.1.1503 (Core) 
cpe:/o:centos:centos:7
@benjamincburns
Author

On the surface this issue looks similar to what's described in Ubuntu bug ID 1435571, though I can see how this behaviour might manifest from some other root cause. In that case it may have been a kernel bug, as they've fixed it with these two kernel patches.

Knowing very little about cgroups myself, I'd also wonder if CentOS7 issue 9078 isn't related.

Either way, I raised the issue here on the chance that either this is an issue specific to docker and not the host OS, or that docker could be improved by including a workaround for it.

@thaJeztah
Member

@benjamincburns can you try running the check-config.sh script? It's possible this is not supported or enabled in your kernel; https://github.com/docker/docker/blob/master/contrib/check-config.sh

@benjamincburns
Author

Thanks @thaJeztah.

Before seeing your comment I fired up a fresh install of CentOS 7 and made sure it was up to date. I then installed docker according to the official installation instructions. This issue does not occur in that configuration.

I will run the check-config script in both locations and compare the output.

If it turns out that this feature isn't supported by the kernel, I'd suggest converting this script into runtime checks within docker itself, so that the docker CLI can fail with an appropriate error message when trying to create a container that would use unsupported kernel features.

@benjamincburns
Author

I have run the check-config.sh script on the test VM (where things work properly), and on my actual docker host. Full output for the known-good machine is at local-vm-check-config-output.txt.

Their diff:

user@hostname:~$ diff -u docker-host-check-config-output.txt local-vm-check-config-output.txt 
--- docker-host-check-config-output.txt 2016-03-01 15:01:08.238722606 +1300
+++ local-vm-check-config-output.txt    2016-03-01 15:01:26.494242760 +1300
@@ -1,5 +1,5 @@
 warning: /proc/config.gz does not exist, searching other paths for kernel config ...
-info: reading kernel config from /boot/config-3.10.0-229.14.1.el7.x86_64 ...
+info: reading kernel config from /boot/config-3.10.0-327.10.1.el7.x86_64 ...

 Generally Necessary:
 - cgroup hierarchy: properly mounted [/sys/fs/cgroup]

Note of course that the last line is not a diff deletion; the leading hyphen is part of the script's output.

I'll see if I can't review patches which have been applied between 3.10.0-229.14.1 and 3.10.0-327.10.1.

@benjamincburns
Author

Actually, I think the patch review is unnecessary, as this issue occurs on a different docker host in our prod environment which is already running 3.10.0-327.10.1, and the latest userspace, CentOS 7.2.1511. To avoid (or inadvertently create) confusion, I refer to this host as host-with-latest-userspace-and-kernel below.

Copy & pasted repro output, modified slightly to change hostname:

user@host-with-latest-userspace-and-kernel ~ $ uname -a
Linux host-with-latest-userspace-and-kernel 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
user@host-with-latest-userspace-and-kernel ~ $ docker run -it --cpuset-cpus=0 centos:centos7
[root@82cac19350b2 /]# grep processor /proc/cpuinfo | wc -l
12

The output of check-config.sh run on this host is identical to that from my test VM.

This also suggests that the exact CentOS version may not matter much, as both my test VM and host-with-latest-userspace-and-kernel are CentOS 7.2.1511, while the machine upon which I originally reported is CentOS 7.1.1503.

Just for completeness, below you will find the same info requested in the issue template, but for host-with-latest-userspace-and-kernel:

Output of docker version:

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 16:16:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 16:16:33 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 48
 Running: 44
 Paused: 0
 Stopped: 4
Images: 9
Server Version: 1.10.2
Storage Driver: devicemapper
 Pool Name: docker-253:3-134434010-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 4.418 GB
 Data Space Total: 107.4 GB
 Data Space Available: 10.34 GB
 Metadata Space Used: 9.925 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.138 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.39 GiB
Name: redacted
ID: redacted
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

And for good measure, OS release specifics:

user@host-with-latest-userspace-and-kernel:~ $ cat /etc/*release*
CentOS Linux release 7.2.1511 (Core) 
Derived from Red Hat Enterprise Linux 7.2 (Source)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.2.1511 (Core) 
CentOS Linux release 7.2.1511 (Core) 
cpe:/o:centos:centos:7

@benjamincburns
Author

To see if I could spot a pattern of some sort, I've tested for the presence of this issue on the 10 docker hosts to which I have access. The only machine on which I have not observed it is the clean VM I set up specifically to test this issue. Below are the configurations of the machines in question (hosts discussed above are included).

Except for the test VM, which is excluded from the machine counts in the table below, all machines tested are bare metal.

Number of Machines | Docker Version                        | OS     | OS Version | Kernel Version
1                  | 1.10.1, build 9e83765                 | Ubuntu | 15.10      | 4.2.0-25-generic
1                  | 1.9.1, build a34a1d5                  | Ubuntu | 15.10      | 4.2.0-30-generic
1                  | 1.7.1, build 3043001/1.7.1            | CentOS | 7.1.1503   | 3.10.0-229.11.1.el7.x86_64
4                  | 1.8.2-el7.centos, build a01dc02/1.8.2 | CentOS | 7.2.1511   | 3.10.0-327.3.1.el7.x86_64
1                  | 1.10.2, build c3959b1                 | CentOS | 7.2.1511   | 3.10.0-327.10.1.el7.x86_64
2                  | 1.10.2, build c3959b1                 | CentOS | 7.1.1503   | 3.10.0-229.14.1.el7.x86_64

On the off chance that there's some difference in behaviour between --cpuset and --cpuset-cpus, I also tested --cpuset on one of the 4 machines running the el7 build of Docker 1.8.2. No change in behaviour.

@benjamincburns
Author

Argh... forget everything I said about the test VM working correctly. It turns out I'd forgotten that I'd only provisioned one vCPU for the VM. Now that I've switched it to 4 vCPUs, the problem occurs there, too.

@benjamincburns
Author

I see that the proper value is being set in cpuset.cpus on my test VM, leading me full circle back to thinking it's a kernel issue.

[bburns@localhost ~]$ cat /sys/fs/cgroup/cpuset/docker/e047d1596aac8375c6cf711c3c241c44d2404a5203e79f36469709e131ddee49/cpuset.cpus
0

And after using --cpuset-cpus=0,1 I see:

[bburns@localhost ~]$ cat /sys/fs/cgroup/cpuset/docker/731bf72f01f8c3305f3bbca1a1af4b5bc5fb8b0b752e78720528abc1c773fe2f/cpuset.cpus
0-1

I don't fully understand the patches I linked in my first comment, but I have verified that nothing like them has been applied to the CentOS kernel. In fact, there is no effective_cpus member in the cpuset struct in kernel 3.10.0.

@benjamincburns
Author

So it's looking like --cpuset-cpus does assign processor affinity correctly; however, code which inspects the machine configuration still thinks it has access to the full core count of the machine.

To determine this I created two containers, one with --cpuset-cpus=0 and the other with no --cpuset-cpus argument. In the container console I then backgrounded 4 bash while true loops, and checked process affinity with ps -o pid,cpuid,comm. On the container which had the --cpuset-cpus=0 arg, all cpuid values were 0, while on the other container multiple cpuid values were listed.

Question: Is solving this issue in scope for docker, or is this a kernel-level problem?

Console session:

user@host ~ $ sudo docker run -it --cpuset-cpus=0 --cpuset-mems=0 centos:centos7
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[1] 14
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[2] 15
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[3] 16
[root@f887dac642a6 /]# while true; do echo blah; done > /dev/null &
[4] 17
[root@f887dac642a6 /]# ps -o pid,cpuid,comm
  PID CPUID COMMAND
    1     0 bash
   14     0 bash
   15     0 bash
   16     0 bash
   17     0 bash
   18     0 ps
[root@f887dac642a6 /]# exit

user@host:~$ docker run -it centos:centos7
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null & 
[1] 14
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[2] 15
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[3] 16
[root@9612d2e4c7dd /]# while true; do echo blah; done > /dev/null &
[4] 17
[root@9612d2e4c7dd /]# ps -o pid,cpuid,comm
  PID CPUID COMMAND
    1     0 bash
   16     0 bash
   17     1 bash
   18     2 bash
   19     3 bash
   20     2 ps
[root@9612d2e4c7dd /]# exit
exit

@benjamincburns
Author

From the Ubuntu bug report in my first comment, it looks like docker can work around this issue by creating its cgroup with cpuset.clone_children set to 0.

@benjamincburns
Author

Whoops, didn't mean to close.

@benjamincburns benjamincburns reopened this Mar 1, 2016
@thaJeztah
Member

hm, interesting, let me ping @LK4D4 and @anusha-ragunathan, perhaps they have some thoughts on that

@benjamincburns benjamincburns changed the title --cpuset-cpus argument appears to be ignored on 1.10.2 under CentOS 7.1.1503 When --cpuset-cpus argument is used, processes inspecting CPU configuration in the container see all cores Mar 1, 2016
@benjamincburns
Author

Eh, that might be a red herring. I've tried doing this manually to no effect. Also it appears that cgroup.clone_children is only defaulting to 1 on my Ubuntu boxes. On my CentOS hosts /sys/fs/cgroup/cpuset/docker/cgroup.clone_children was already set to 0.

@thaJeztah
Member

What do you get inside the container? i.e.

docker run --rm --cpuset-cpus=0,1 ubuntu sh -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"

@benjamincburns
Author

That command works correctly, which is good news, as for the applications we control we can inspect this file ourselves. However, for applications running in VMs like mono, this will present some pain. It'd be much simpler overall if the process didn't need to be aware that it was running within a cgroup.

@benjamincburns
Author

To add a bit of supporting info to my last statement, I quickly grepped mono's source and found that on systems with a proper glibc, mono detects the core count via sysconf(_SC_NPROCESSORS_ONLN). So I wrote a quick and dirty C program to call this and print the result, copied it into a container started with --cpuset-cpus=0, and it returns the core count of the full machine. (A minimal sketch of that check appears after the list below.)

This can be seen in the mono source at

  • libgc/pthread_support.c
  • mono/io-layer/system.c
  • mono/profiler/proflog.c
  • mono/utils/mono-proclib.c
  • support/map.c
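
A minimal sketch of that check (not the exact program used; assumes glibc):

/* nprocs_onln.c -- prints what sysconf reports as the online CPU count.
   Inside a container started with --cpuset-cpus=0 this still prints the
   host's full core count, because the value is derived from host-wide
   information rather than from the cpuset cgroup. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 0) {
        perror("sysconf");
        return 1;
    }
    printf("%ld\n", n);
    return 0;
}

Compiled with something like gcc -o nprocs_onln nprocs_onln.c and run inside the --cpuset-cpus=0 container, this prints the full host core count rather than 1.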

@thaJeztah
Member

This sounds similar to #20688; there's also a nice article describing the situation: http://fabiokung.com/2014/03/13/memory-inside-linux-containers/

@benjamincburns
Author

Yes, it certainly does. Digging into the mono source a bit further, it's also parsing /proc/stat in places.

I'll likely open an issue with mono to make the VM cgroup-aware; however, I agree with @thechile's last comment on #20688 that the container community ought to be working with kernel maintainers to sort out a solution to this problem.

Linus has a pretty famous rule that the kernel shouldn't break userspace. I'd think that the container shouldn't break userspace, either. You might argue that it's not the container, it's cgroups, but if the choice to use cgroups forces containerized processes to become cgroup aware, then from the perspective of the user it's the same result.

It's pain enough for native processes where I control thread pooling and resource allocation, but when you've got a full platform stack that you're trying to drop into a container it gets quite expensive quite quickly.

@benjamincburns
Author

I've raised a mono issue in the hope that they'll pick it up and at least work around this problem. That said, I'd rather not need to also raise issues for go, python, ruby, java, and so on.

@qlyoung

qlyoung commented Jan 18, 2020

@benjamincburns how did you end up working around this? As of Linux 5.1 this still occurs, which is a real pain when doing CPU pinning; inside the container you can still see all the CPUs, but only the ones assigned with --cpuset-cpus can be pinned to, and the rest will error on the syscall.

@benjamincburns
Author

benjamincburns commented Jan 18, 2020

@benjamincburns how did you end up working around this?

@qlyoung as far as I can remember, we didn't.

@jdmarshall

So what's the situation with this issue? I have some code that is deciding how many processes to fork based on CPU count and it's getting the wrong number of processors.

@qlyoung

qlyoung commented Mar 5, 2020

@jdmarshall based on some additional research, it seems the appropriate fix for this will ultimately be, as with all things, a kernel namespace for whatever this resource class is. If you want to know which CPUs you can actually get, you can loop through each "available" core and try to bind to it with sched_setaffinity. If it works then it's available; if not, it's not available to the container. I did this for AFL, if you want an example; the patch is here. So maybe for your case fork off N = # CPUs processes, try sched_setaffinity in each of them, and simply exit if it fails; then you should be left with the appropriate number of processes.
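
A rough sketch of that probing loop in C (assumes a single-threaded process and glibc; restores the original mask when done):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t orig;
    /* Remember the starting affinity so it can be restored after probing. */
    if (sched_getaffinity(0, sizeof(orig), &orig) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    long online = sysconf(_SC_NPROCESSORS_ONLN);  /* host-wide count */
    int usable = 0;

    for (long cpu = 0; cpu < online; cpu++) {
        cpu_set_t probe;
        CPU_ZERO(&probe);
        CPU_SET(cpu, &probe);
        /* Binding only succeeds for CPUs the container's cpuset allows;
           the rest fail (EINVAL). */
        if (sched_setaffinity(0, sizeof(probe), &probe) == 0)
            usable++;
    }

    sched_setaffinity(0, sizeof(orig), &orig);  /* restore the original mask */
    printf("usable CPUs: %d of %ld visible\n", usable, online);
    return 0;
}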

Brendan Gregg touches on this a bit in this talk https://www.youtube.com/watch?v=bK9A5ODIgac, although it's in the context of perf events iirc.

@dreamdevil00

@DoDoENT
Copy link

DoDoENT commented May 23, 2023

On Intel machines I can get around this by reading /sys/fs/cgroup/cpu/cpu.cfs_quota_us and /sys/fs/cgroup/cpu/cpu.cfs_period_us and dividing the quota by the period to get the number of CPU cores the container is allowed to use.

However, this does not seem to work on Aarch64 Linux machines: the /sys/fs/cgroup/cpu folder doesn't even exist. I've found that /sys/fs/cgroup/cpu.max contains two values separated by whitespace which resemble cfs_quota_us and cfs_period_us on Intel.

Any idea why there is this discrepancy between Intel and Aarch64?
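
For what it's worth, the discrepancy is most likely cgroup v1 vs. the unified cgroup v2 hierarchy rather than Intel vs. Aarch64: cpu.max is the cgroup v2 counterpart of cfs_quota_us/cfs_period_us. A sketch that handles both layouts (assuming the container sees its own limits under /sys/fs/cgroup, as in the paths above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the CPU count implied by the CFS quota, or -1 if no limit is set
   or the files can't be read. */
static double cfs_cpu_limit(void)
{
    /* cgroup v2: a single file containing "max 100000" or "<quota> <period>". */
    FILE *f = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (f) {
        char quota[32];
        long period = 0;
        int ok = fscanf(f, "%31s %ld", quota, &period) == 2;
        fclose(f);
        if (!ok || strcmp(quota, "max") == 0 || period <= 0)
            return -1;
        return (double)atol(quota) / (double)period;
    }

    /* cgroup v1: separate quota and period files; quota of -1 means no limit. */
    long quota = -1, period = 100000;
    f = fopen("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &quota) != 1)
        quota = -1;
    fclose(f);
    f = fopen("/sys/fs/cgroup/cpu/cpu.cfs_period_us", "r");
    if (f) {
        fscanf(f, "%ld", &period);
        fclose(f);
    }
    if (quota <= 0 || period <= 0)
        return -1;
    return (double)quota / (double)period;
}

int main(void)
{
    double limit = cfs_cpu_limit();
    if (limit < 0)
        printf("no CFS quota set\n");
    else
        printf("CFS quota allows ~%.2f CPUs\n", limit);
    return 0;
}

Note that this reflects a CFS quota (docker run --cpus) rather than a cpuset (--cpuset-cpus), which is limited via a different controller.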

@jdmarshall

I've been using 'nproc' on Linux to get better behavior, and 'sysctl -n hw.logicalcpu' on OS X. I found this somewhere on Stack Overflow.

Since I only really need this data at startup, I just eat the child-process overhead. I think standard library authors are getting wise to this though; I believe Node introduced a fix for this in the previous major version.
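
nproc reports the number of processing units available to the current process, which the cpuset restriction clamps. For code that can't shell out, roughly the same affinity-aware count can be read directly from the scheduler (a sketch, assuming glibc):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    /* The affinity mask of a process in the container is clamped to the CPUs
       in its cpuset cgroup, so this matches --cpuset-cpus, unlike
       sysconf(_SC_NPROCESSORS_ONLN). */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("%d\n", CPU_COUNT(&mask));
    return 0;
}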

@felipecrs

@benjamincburns
Author

@thaJeztah how are we feeling about this issue these days?

Initially I'd hoped that there would be some way that docker could be made to work with legacy software written before cgroups existed, as well as software written to erroneously assume that it can make use of all cores on the host.

Ultimately it would seem that the path to achieving this goal is rooted in how cgroup restrictions are exposed by the kernel to the userspace processes that are subject to them. As a result, I'm not sure that there's anything for the container engine to do here. I'm also no longer sure that the goal as I just stated it is even desirable, let alone achievable. That is, there's a distinct difference between "the set of CPUs available to the host" and "the set of CPUs that a process can access," and that's true in a wide variety of scenarios that have nothing to do with containerization.

With that in mind, I think this is a discussion for the kernel mailing lists, if it's even a discussion worth having. Unfortunately I don't really have the time or motivation right now to champion that conversation, but I'd encourage anyone who finds this issue to be important to take it up there.

In the meantime, I think it's probably best to close this issue. @thaJeztah if you or any other maintainers feel otherwise, please feel free to reopen.

@benjamincburns benjamincburns closed this as not planned Won't fix, can't repro, duplicate, stale Aug 19, 2024
@felipecrs

PS:

@benjamincburns
Author

Thanks for the additional context, @felipecrs. I wish that lxcfs had existed (or that I'd been aware of its existence) back when I was having this issue in 2016!

Just for clarity, are you advocating for this issue to remain open, in light of the tooling you posted?

I just worry that making this behaviour a default in moby could be problematic. For example, I think it's not uncommon for k8s clusters to set affinity for privileged cluster management & host monitoring jobs to a set of reserved CPUs that aren't used for other workloads (guarantees liveness, minimises the impact of monitoring on latency-sensitive workloads, etc).

If it were something that wasn't on by default, but could be optionally set on a container-by-container basis, that could still add utility, however.

@felipecrs

I would love for this feature to be baked into docker, rather than having to rely on external tools that are (very) convoluted to set up.

Being able to specify it on a container-by-container basis would be ideal, e.g. something like docker run --mask-procfs.

Then, making it the default would be a whole different conversation that could start once such a feature exists. My limited, personal gut feeling is that it would be nicer as the default behavior, but I don't want to argue about it.

Just for clarity, are you advocating for this issue to remain open, in light of the tooling you posted?

To be honest I'm not advocating for this issue to remain open as I have zero hope that Docker would ever implement it.
