[BUG] v3004 leaks fds/pipes causing "Too many open files" crash
#61521
Looking at the coredump with zmq symbols from Debian's debuginfod, there are a lot of threads that are hung on:
It makes some sense because the singletons were based on the IOLoop being passed into the transport. Often we won't pass a loop into the init method, meaning we don't end up with a singleton. This was one of the factors which led us away from the singleton approach. In the future we're aiming to properly manage our network connections without the need for any kind of magic (e.g. singletons).
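For illustration, here is a minimal, hypothetical sketch of that failure mode; none of these names come from Salt's codebase, but it shows how keying a "singleton" cache on the io_loop stops deduplicating anything once callers omit the loop:

import asyncio

# Hypothetical sketch: a transport cache keyed on (address, io_loop).
_instances = {}

class FakeTransport:
    def __init__(self, addr, io_loop):
        self.addr = addr
        self.io_loop = io_loop  # each instance would open its own sockets

def get_transport(addr, io_loop=None):
    if io_loop is None:
        # No loop supplied: a fresh loop means a fresh cache key, so every
        # call builds a brand-new transport (and new file descriptors).
        io_loop = asyncio.new_event_loop()
    key = (addr, id(io_loop))
    if key not in _instances:
        _instances[key] = FakeTransport(addr, io_loop)
    return _instances[key]

# Two calls without an explicit loop yield two distinct transports:
assert get_transport("tcp://master:4506") is not get_transport("tcp://master:4506")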
One thing I don't understand is why this affects
I backported both #61450 and #61468 onto v3004 and the leak still persists. I don't even know how to go about debugging this any more.
Just a heads up, this may not be a v3004 issue in particular. We saw the same or very similar failures with
While we were able to resolve this by simply upping open file limits compared to where they were on 2019.2.8, the only box it has occurred on is a heavily active master-of-masters that is also a minion of itself. Like frebib, we hit this very quickly using
We did see this crop up once more when processing many events and doubled the limits again. Unfortunately, I don't have stack traces saved from that time, but I can tell you that they looked just like the ones above, including:
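As an aside on the file-limit workaround mentioned above: the per-process limit can be inspected and raised from Python with the standard resource module. This is a generic sketch with a placeholder limit of 65536, not a Salt-specific fix:

import resource

# Inspect the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

# Raise the soft limit toward the hard limit; an unprivileged process
# cannot raise the soft limit above the hard limit.
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"raised soft limit to {target}")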
I feel like I'm getting close to this one. Here are some of my findings:
Apparently a
I don't really understand the io_loop code, nor the tear-down of the related assets. Calling out to @dwoz for possible suggestions on what in the IPCClient may be causing a file descriptor leak. At a guess, the data is never being read out of the socket, so zmq holds it open until a reader comes along and empties the buffer?
Scratch that. The open file handles are TCP sockets to the master:
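For anyone wanting to confirm that locally, here is a minimal sketch (not from the Salt codebase) that filters the process's TCP connections down to those pointing at the master; it assumes the default publish/return ports 4505/4506 and the same psutil dependency used in the diff below:

import os
import psutil

proc = psutil.Process(os.getpid())

# Count only the TCP sockets whose remote end looks like the salt master.
# 4505/4506 are the default publish/return ports; adjust if yours differ.
master_conns = [
    c for c in proc.connections(kind="tcp")
    if c.raddr and c.raddr.port in (4505, 4506)
]
print(f"{len(master_conns)} TCP connections to the master")
for c in master_conns:
    print(f"  fd={c.fd} {c.laddr} -> {c.raddr} status={c.status}")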
Strangely that number only accounts for ~10% of the open fds according to
Generated with this diff (a sketch after the diff breaks the fd count down by type, which may explain the gap):
diff --git salt/modules/event.py salt/modules/event.py
index 4b655713e3a..873a06e3b34 100644
--- salt/modules/event.py
+++ salt/modules/event.py
@@ -108,14 +108,29 @@ def fire(data, tag):
         salt '*' event.fire '{"data":"my event data"}' 'tag'
     """
+    import os, psutil
+    def procstuff():
+        try:
+            pid = os.getpid()
+            proc = psutil.Process(pid=pid)
+            files = proc.connections(kind="all")
+            log.debug(f"Process {pid}: open files {len(files)}: {files}")
+        except Exception as ex:
+            log.exception(f"pid shiz failed: {ex}")
+
     try:
+        procstuff()
         with salt.utils.event.get_event(
             "minion",  # was __opts__['id']
             sock_dir=__opts__["sock_dir"],
             opts=__opts__,
             listen=False,
         ) as event:
-            return event.fire_event(data, tag)
+            procstuff()
+            fired = event.fire_event(data, tag)
+            procstuff()
+            procstuff()
+            return fired
     except Exception:  # pylint: disable=broad-except
         exc_type, exc_value, exc_traceback = sys.exc_info()
         lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
diff --git salt/utils/schedule.py salt/utils/schedule.py
index 5f461a47a6c..56db117baf6 100644
--- salt/utils/schedule.py
+++ salt/utils/schedule.py
@@ -948,8 +948,6 @@ class Schedule:
"""
- log.trace("==== evaluating schedule now %s =====", now)
-
jids = []
loop_interval = self.opts["loop_interval"]
if not isinstance(loop_interval, datetime.timedelta): |
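One possible explanation for the ~10% discrepancy noted before the diff: proc.connections(kind="all") only reports sockets, so pipes and other descriptor types never show up in that count. A minimal sketch, assuming a Linux host with /proc available, that groups every open descriptor by the kind of object it points at:

import collections
import os

def fd_summary(pid=None):
    """Group a process's open file descriptors by the target they point at."""
    pid = pid or os.getpid()
    counts = collections.Counter()
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("anon_inode:"):
            counts["anon_inode"] += 1
        else:
            counts["file/other"] += 1
    return counts

print(fd_summary())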
@frebib Can you confirm this is no longer an issue in
@dwoz This is a pretty tricky thing to replicate, so it might take us a while to test it. That, combined with the need to rebase our patchset on top of
Is there a particular patch/PR that was merged that you think might have fixed the issue?
@frebib Are you able to provide the output of
@frebib Also, if you are not able to upgrade, is it possible to get a working GDB installation on your current setup? One where you can get this
@frebib Is this still an issue in 3006.5?
We're still working on rolling out 3006.x. We've run into a few issues/regressions, but once we get those sorted, we'll validate this and get back to you. Sorry for the delay.
It seems that on 3006.5 this is no longer an issue. I don't really see any more than ~24 fds open most of the time.
Description
Several nodes running salt-minion 3004 all seem to hit this crash before highstate completes. We've seen the crash both in cmd.run and also in zmq code.
coredumpctl seems to think it's zeromq holding open the sockets. There are a lot of these in the coredump backtrace. This was observed doing a full highstate on an unconfigured machine with several thousand states.
I've also noticed that v3004 has started dumping a handful of
[ERROR ] Unable to connect pusher: Stream is closed
errors into the highstate, which didn't happen in 2019.2.8. I wonder if this could be related? Possibly points at 4cf62fb?
Setup
(Please provide relevant configs and/or SLS files; be sure to remove sensitive info.)
There is no general set-up of Salt.
Steps to Reproduce the behavior
salt-call state.highstate
seems to trigger it. It's unclear as of yet whether the persistent minion process is also affected.
Expected behavior
No leaked fds/pipes.
Additional context
We came from 2019.2.8 running on python3 that didn't exhibit this behaviour (as far as I know).
I can also reproduce this on the master branch tip too.