
num of worker_processes set to max num of cores of cluster node with cgroups-v2 #11518

Open
figaw opened this issue Jun 30, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-priority triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@figaw

figaw commented Jun 30, 2024

Am I holding it wrong?

I'm reading a comment here that says to set worker_processes to no more than 24; mine is automatically set to 128, which causes weird things to happen.

#3574 (comment)

The problem goes away when I set worker_processes in the helm chart:

controller:
  config:
    worker-processes: 24  

Where is this documented? I've tried to search around for comments on ulimits and ingress-nginx, but I'm not finding a lot.

What happened:

From the logs of the ingress-nginx-controller I'm reading:

2024/06/29 20:31:34 [alert] 42#42: socketpair() failed while spawning "worker process" (24: No file descriptors available)
2024/06/29 20:31:34 [alert] 42#42: socketpair() failed while spawning "worker process" (24: No file descriptors available)
2024/06/29 20:31:34 [alert] 42#42: socketpair() failed while spawning "worker process" (24: No file descriptors available)
2024/06/29 20:31:34 [alert] 42#42: socketpair() failed while spawning "worker process" (24: No file descriptors available)

This all went away when I configured worker_processes 24 in the helm chart.

Maybe this is related to #7107?

What you expected to happen:

NGINX automagically configures a proper number of worker processes.
I expect this has something to do with the 128 cores.

When I run ulimit inside the container, I get quite low values,

ingress-nginx-private-controller-f56b88476-b8tpq:/etc/nginx$ ulimit -Hn
524288
ingress-nginx-private-controller-f56b88476-b8tpq:/etc/nginx$ ulimit -Sn
1024

Despite having configured the host,

$ cat /etc/security/limits.conf
# /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
$ ulimit -Hn
65535
$ ulimit -Sn
65535

And also having configured containerd:

$ cat /etc/containerd/config.toml | grep runc.options -A 20
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            .....
            Ulimits = [
              { Name = "nofile", Hard = 65535, Soft = 65535 }
            ]

I also tried using an initContainer with the helm chart, to no avail:

  extraInitContainers:
    - name: init-myservice
      image: busybox
      command: ["sh", "-c", "ulimit -n 65535"]

I'm "pretty sure" all of the machines in our cluster will have at least 24 cores, so this is "probably" not a problem to configure statically.

NGINX Ingress controller version (exec ...):

NGINX Ingress controller
Release: v1.10.1
Build: 4fb5aac
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.25.3


Kubernetes version (use kubectl version):

Client Version: v1.26.0
Kustomize Version: v4.5.7
Server Version: v1.29.0

Environment:

  • Cloud provider or hardware configuration:

Bare metal, super micro, AMD EPYC 7763 64-Core Processor, 256G RAM

  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Kernel (e.g. uname -a):
    Linux b-w-3 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.5", GitCommit:"59755ff595fa4526236b0cc03aa2242d941a5171", GitTreeState:"clean", BuildDate:"2024-05-14T10:44:51Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}
  • Basic cluster related info:
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
b-w-1   Ready    control-plane   16h   v1.29.5   172.17.90.1   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.13
b-w-2   Ready    control-plane   16h   v1.29.5   172.17.90.3   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.13
b-w-3   Ready    control-plane   16h   v1.29.5   172.17.90.5   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.13
b-w-4   Ready    <none>          16h   v1.29.5   172.17.90.7   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.13
  • How was the ingress-nginx-controller installed:
    • If helm was used then please show output of helm ls -A | grep -i ingress
$ helm ls -A | grep -i ingress
ingress-nginx-private   ingress-nginx-private   6               2024-06-30 11:24:48.843691613 +0200 CEST        deployed        ingress-nginx-4.10.1            1.10.1
@figaw figaw added the kind/bug Categorizes issue or PR as related to a bug. label Jun 30, 2024
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jun 30, 2024
@longwuyuan
Contributor

duplicate #9665
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 30, 2024
@longwuyuan
Contributor

/retitle num of worker_processes set to max num of cores of cluster node with cgroups-v2

@k8s-ci-robot k8s-ci-robot changed the title "No file descriptors available" on machine with a high number (128) of cores num of worker_processes set to max num of cores of cluster node with cgroups-v2 Jun 30, 2024
@strongjz
Member

We need to update our support for cgroups v2. To my knowledge this is the package that figures out the number of CPUs, and it hasn't been updated in 6 years:

https://github.com/kubernetes/ingress-nginx/blame/125ffd47b132fa7d18c4aa81501736ff89cc0676/pkg/util/runtime/cpu_linux.go#L30
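
For context, here is a minimal, hypothetical Go sketch of what cgroup-v2-aware CPU detection could look like (this is not the controller's actual implementation): on the unified hierarchy the container's CPU quota is exposed in /sys/fs/cgroup/cpu.max as "<quota> <period>" in microseconds (or "max <period>" when unlimited), so a worker count can be derived from that quota instead of falling back to every core on the node.

package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// numCPUFromCgroupV2 is a sketch: it reads the cgroup v2 cpu.max file,
// which holds "<quota> <period>" in microseconds, or "max <period>"
// when the container has no CPU limit.
func numCPUFromCgroupV2() int {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		// Not on cgroup v2 (or file unreadable): fall back to the Go runtime's view.
		return runtime.NumCPU()
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		// No CPU limit set for this container.
		return runtime.NumCPU()
	}
	quota, qerr := strconv.ParseFloat(fields[0], 64)
	period, perr := strconv.ParseFloat(fields[1], 64)
	if qerr != nil || perr != nil || period <= 0 {
		return runtime.NumCPU()
	}
	// A Kubernetes CPU limit of "2" yields quota/period == 2, so a pod
	// limited to 2 CPUs would get 2 workers instead of one per node core.
	return int(math.Ceil(quota / period))
}

func main() {
	fmt.Println("worker_processes:", numCPUFromCgroupV2())
}

Note that when no CPU limit is set, a sketch like this still falls back to all cores (128 on the node above), which matches the behaviour described in this issue.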


This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any questions or want this prioritized, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 15, 2024
@domainname

domainname commented Oct 18, 2024

Hi @strongjz, is there any plan to fix this bug for cgroup v2?

@fullykubed

@domainname @figaw @strongjz

Also ran into this issue.

I saw a related PR (#11778) that seems to address this issue, and it says the change was included in the v1.11.3 release.

After upgrading to that version and adding a CPU limit to the pods, I saw that worker_processes was being set to the CPU limit of the pod, not the number of CPU cores on the node.

I believe this issue is resolved.

@shayrybak

Same here. I ran into the same issue; version 1.11.3 sets the number of workers based on the CPU limit of the pod, which fixed the issue.
