[BUG] salt-master port not closed cause a "Too many open files" error #63040

Closed
2 of 9 tasks
Denislb opened this issue Nov 8, 2022 · 15 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior memory-leak needs-triage Transport

Comments

@Denislb

Denislb commented Nov 8, 2022

Description
salt-master does not properly close the port on which it receives connections from minions. This causes a "Too many open files" error, after which we cannot contact any minion.
I don't see any errors on the master or the minion (even in debug mode), and I don't know how to reproduce this bug.

On the master, if I check the open sockets:

$> netstat -laptune | grep 4506 | grep .135 | wc -l
655

$> netstat -laptune | grep .135
tcp        0      0 X.X.X.X:4506        X.X.X.135:48130    ESTABLISHED 1002       46738      2084/python3
tcp        0      0 X.X.X.X:4506        X.X.X.135:48254    ESTABLISHED 1002       46016      2084/python3
tcp        0      0 X.X.X.X:4506        X.X.X.135:48496    ESTABLISHED 1002       51677      2084/python3
tcp        0      0 X.X.X.X:4506        X.X.X.135:45780    ESTABLISHED 1002       76243      2084/python3
and more...

On the minion side:

$> netstat -laptune | grep 4506 | wc -l
1

$> netstat -laptune | grep 4506
tcp        0      0 X.X.X.135:45780    X.X.X.X:4506       ESTABLISHED 0          1021351    86014/python3
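
The per-peer count above can also be produced in one pass over the `netstat` output. The following is a minimal sketch and not part of the original report: it assumes lines in the column layout shown above (protocol, queues, local address, remote address, state), and the function name `count_established_by_peer` is hypothetical.

```python
from collections import Counter


def count_established_by_peer(netstat_lines, port=4506):
    """Count ESTABLISHED connections to local `port`, grouped by remote IP.

    Assumes netstat-style columns: proto, recv-q, send-q, local, remote, state.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # rpartition splits on the last colon, so addresses that
        # themselves contain colons are still handled.
        if local.rpartition(":")[2] != str(port):
            continue
        counts[remote.rpartition(":")[0]] += 1
    return counts


if __name__ == "__main__":
    import subprocess

    # Assumption: `netstat -tn` is available on the host being inspected.
    output = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    ).stdout
    for peer, n in count_established_by_peer(output.splitlines()).most_common(10):
        print(f"{peer}\t{n}")
```

A peer whose count on the master is far higher than the single connection visible on the minion side is a likely candidate for the leak described here.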

Setup
Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Expected behavior
The open port should be closed properly after execution.

Versions Report

salt --versions-report
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 18.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 5.4.223-1.el7.elrepo.x86_64
        system: Linux
       version: CentOS Linux 7 Core
@Denislb Denislb added Bug broken, incorrect, or confusing behavior needs-triage labels Nov 8, 2022
@OrangeDog
Contributor

Are you doing anything to trigger this, or does an idle system just keep leaking sockets?

@Denislb
Author

Denislb commented Nov 8, 2022

I'm still trying to figure out how that happens...
I added some monitoring to check when it's happening and if a specific state or reactor is responsible.

Right now I can tell you that in ~12h we went from ~1k open sockets to ~20k.

image
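
As a rough illustration of this kind of monitoring (not the monitoring actually used in this report), the sketch below counts a process's open file descriptors through `/proc/<pid>/fd` and reports them next to the soft `RLIMIT_NOFILE` limit. It assumes Linux, and reading another user's process requires suitable permissions. The function names `count_open_fds` and `fd_usage` are hypothetical.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Return the number of open file descriptors for `pid` (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit).

    Note: the limit is the calling process's RLIMIT_NOFILE. The limit of
    another process is listed in /proc/<pid>/limits instead.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return count_open_fds(pid), soft


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"open fds: {used} / soft limit: {limit}")
```

Sampling this value periodically for the salt-master PID and plotting it would show how quickly the count approaches the limit at which "Too many open files" is raised.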

@OrangeDog
Contributor

Is it also leaking processes?
Similar things seem to happen here: #62706

@Denislb
Author

Denislb commented Nov 8, 2022

No, I don't have leaking processes. The process count is stable and does not increase.
It seems that the salt-master completes a job (or maybe something else happens) but does not close the port afterwards.

@Denislb
Author

Denislb commented Nov 29, 2022

Hello,

After some investigation, it seems that we do have a leaking process after all.
As shown below, EventPublisher and ReqServer MWorkerQueue have a huge leak.

image
image

@Denislb
Author

Denislb commented Nov 29, 2022

FYI
When ReqServer MWorkerQueue is at 100% CPU, the process memory starts to increase and keeps increasing for ~2 days.

image
image

@Denislb
Author

Denislb commented Nov 30, 2022

Hello,

As you can see below, the MWorkerQueue process uses 100% CPU and the workers do nothing :(
image

Do you know where I can add more logging to help debug this?

@dmurphy18
Contributor

This is largely fixed in Salt 3006.3. There is still a minor issue that is being investigated, but the main problem has been resolved.

@Denislb
Author

Denislb commented Nov 22, 2023

Hello,

Thanks. I will do some tests.

@bschoening

@Denislb @dmurphy18 Has anyone checked whether pyflakes and pylint run without detecting anything related to open file handles?

@dmurphy18
Contributor

@bschoening As part of the Salt process, there is a pre-commit step which runs various tests against the intended code, and pre-commit has to be clean before you can even successfully get through the 'git commit' step.
The pre-commit file is https://github.com/saltstack/salt/blob/master/.pre-commit-config.yaml

Static analysis is done by Bandit, pylint, black, etc.

An upcoming release includes changes that write warnings to the log files when some of the remaining rare cases of file handles not being closed are detected. However, the majority of the unclosed file handle issues were resolved by the 3006.3 release of Salt.
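
For context, Python itself can report unclosed handles through `ResourceWarning`. The sketch below is a generic illustration of that mechanism and is not Salt's logging change, which is not shown in this thread. It assumes CPython, where a socket that is dropped without `close()` emits the warning when it is finalized. The helper name `unclosed_socket_warnings` is hypothetical.

```python
import gc
import socket
import warnings


def unclosed_socket_warnings():
    """Create a socket, drop it without close(), and return the warnings seen."""
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default, so enable it explicitly.
        warnings.simplefilter("always", ResourceWarning)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        del sock      # no close(): the finalizer emits ResourceWarning
        gc.collect()  # force finalization if it has not happened yet
    return [w for w in caught if issubclass(w.category, ResourceWarning)]


if __name__ == "__main__":
    for w in unclosed_socket_warnings():
        print(f"{w.category.__name__}: {w.message}")
```

Outside a test harness, running the interpreter with `-W default::ResourceWarning` makes these warnings visible, which is one way static analysis can be complemented by runtime detection.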

@dmurphy18
Contributor

@Denislb The issue has been fixed in the Salt 3006.3 release, and I recommend upgrading to the latest release, which is 3006.4.

Salt 3005 reaches End-Of-Life in February 2024, two and a half months from now, hence the suggestion to upgrade. If the answer is satisfactory, please consider closing this issue, or provide additional information.

@dmurphy18 dmurphy18 self-assigned this Nov 29, 2023
@dmurphy18
Contributor

Here are the PRs for the attempted fix for the majority of the 'too many open files' errors: see #65508, #65061, #65247 and #65559, which will help to identify places that may have been missed and what is causing them.

@dmurphy18
Contributor

@Denislb Can you test this against the latest Salt 3006.8 and close the issue if it is resolved? A number of changes have been made to deal with the "too many open files" issues seen in earlier versions of 3006.x.

@dmurphy18
Contributor

@Denislb Closing this due to lack of response; please feel free to reopen if more information is provided.
Note that the latest release is Salt 3006.9.

4 participants