Docker Driver Fails With Upper Limit of 262144 CPU Shares #7731
Comments
@herter4171 You won't find https://github.com/opencontainers/runc/blob/master/libcontainer/container_linux.go#L349 |
@shishir-a412ed, thanks for pointing me in the right direction. |
@herter4171 just a heads up, that value is the maximum |
Hey @tgross, thanks for the heads-up. This was the first time that troubleshooting led me to a Torvalds repo, and I knew to abandon all hope without being well-versed in operating systems. Hats off to you and your team for the CPU burst capability. Running with 262144 shares on a |
Reopening, as the root bug here is Nomad's 1:1 mapping of MHz to shares. I think Nomad can even change that in a way that fixes this bug and preserves backward-compatible behavior. A 10:1 or 128:1 or similar mapping should preserve the relative CPU share weights while keeping within the valid value range. This problem is going to get more common as high-core-count machines are used more.
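To make the ratio idea concrete, here is a minimal Go sketch, not Nomad's actual code. The function name `mhzToShares` and its `ratio` parameter are hypothetical; the bounds of 2 and 262144 are the limits the Linux kernel enforces on `cpu.shares` under cgroup v1. It divides the requested MHz by a fixed ratio and clamps the result, so relative weights between tasks are roughly preserved while staying inside the valid range.

```go
package main

import "fmt"

const (
	minShares = 2      // kernel lower bound for cpu.shares (cgroup v1)
	maxShares = 262144 // kernel upper bound for cpu.shares (1 << 18)
)

// mhzToShares is a hypothetical helper: it scales a MHz reservation down by
// ratio (for example 10 for a 10:1 mapping) and clamps the result to the
// range the kernel accepts.
func mhzToShares(mhz, ratio int64) int64 {
	if ratio < 1 {
		ratio = 1 // fall back to the 1:1 mapping described in the thread
	}
	shares := mhz / ratio
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return shares
}

func main() {
	for _, mhz := range []int64{500, 4000, 300000} {
		fmt.Printf("%d MHz -> 1:1 gives %d shares, 10:1 gives %d shares\n",
			mhz, mhzToShares(mhz, 1), mhzToShares(mhz, 10))
	}
}
```

With a 1:1 mapping, a 300000 MHz request can only be represented by clamping to 262144, which distorts its weight relative to smaller tasks. With 10:1 it maps to 30000 shares and keeps its proportion to a 4000 MHz task (400 shares).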
As brought up previously: #4899 (comment) https://bugs.openjdk.java.net/browse/JDK-8146115
By this logic it seems the openjdk community believes there will never be more than 256 (262144/1024) cores in a machine, or that they're willing to propose a kernel patch when the time comes 😂
I'm worried that your proposed fix is going to just add another layer of broken to this lasagna of madness. It seems to me the better solution would be to go all in on how cpu-shares are relative to other processes running on the machine in the context of the magic_number 1024, or go all in on CFS quotas like k8s has. As far as backwards compatibility is concerned, why not just implement a new resource constraint?
Nomad version
Nomad v0.11.0 (5f8fe0a)
Operating system and Environment details
Amazon Linux 2 with a fixed head node and an auto-scaling group, with scaling driven by Nomad state using a custom cloud metric.
Issue
This came up in the course of troubleshooting issue #7681, and while my intent isn't to issue-spam you guys, I think this is a separate problem that is actively holding back some of my work, unlike the former.
Anyway, I'm experiencing a Docker driver failure due to an apparent upper limit on CPU shares. I have tested this on c5.18xlarge instances with the following result.
I have also tested this on m5a.24xlarge instances with the following identical result.
I can't even find process_linux.go in the source, so I'm really at a loss here. Any help is greatly appreciated.
Reproduction steps
Submit a job that has more than 262144 CPU shares allocated on a large enough instance to have the job placed, and the Docker driver should fail in the manner I've described.