supervisorctl takes too long to execute command #131
Comments
It's happening on 3.0a12 and 3.0a8 |
Here are some straces. This is on a Debian squeeze box. Note that we don't see the problem on CentOS 5.
supervisord strace:
=> Appears immediately after the stop command is issued; no other mention of process 1494
Child process strace:
=> Appears immediately after the stop command is issued
=> The process is marked as 'STOPPED' and no longer exists almost immediately after the command is issued
The stop command
Note: sometimes the start command takes a long time to complete too:
Note: supervisord.log shows the process stopped almost immediately after the command is executed: |
I'm seeing the same problem on one server (but not on others):
Yes, more than one hour. supervisor.log:
This is a huge problem for us, since it basically makes an automated deployment impossible. If you need more information, just let me know. |
I'm also experiencing this issue using 3.0a12 and Python 2.7.2 on Ubuntu 11.10. |
Does anyone have a solution or work-around for this? We're seeing it with supervisor 3.0a12 on a variety of platforms (Centos 6.2, 6.3, and Amazon Linux 2012.03). |
@dcrosta in my case I reported above, it turned out that one task was restarting all the time, and blocking supervisord. I'm not sure why supervisor didn't back off restarting the task, my guess is that it took slightly longer to start up and die than the "it's alive!"-threshold. But I didn't have the time to investigate it further. |
@piquadrat Hmm, I don't think we have that issue on my end. In some cases, there's just one long-lived process managed by supervisord (though admittedly that process takes sometimes up to 10s to quit on its own after receiving SIGTERM). If someone can point me in the right direction, I'd be happy to help gather more data about what's going on. |
+1 for this issue on CentOS 6 |
Same problem on CentOS 6 + gunicorn. Supervisorctl successfully starts/stops the task and then hangs
logs
|
Has anyone found a solution to this problem? I'm facing similar issues with supervisor version 3.0a8 running on Ubuntu 12.04. Installed supervisor using |
Me too. Any way out of this? |
Okay, in my case it seems that one of the processes I am watching was constantly printing out messages. |
Suffering from this as well on Ubuntu 12.04 with 3.0a8 and 3.0b1, sometimes up to 5 minutes to restart a process. |
Any progress on this bug? We're seeing it on CentOS 5 and 6, supervisor 3.0a12. |
Same here. We thought we had originally fixed it by fixing some bogus things with a child process, but that just turned out to be a side effect. The quick fix for this issue is to restart supervisord; after we do that it starts responding normally again, and it can take up to a month or two before it starts to exhibit this behaviour again. I'm still convinced this is somehow triggered by a child process doing something weird, but we haven't been able to pinpoint it yet. We do have a strace to accompany the issue:
As you can see, and as is also clear from @gilles's backtrace, we're just waiting, to infinity and beyond. This times out after approximately 15 minutes. This all happens for us on supervisor 3.0b1 and, oddly enough, only on the machines that supervise nodejs processes; the ruby-only clusters are completely unaffected in our case (or the bug just hasn't shown up yet). |
+1 (Same problem occurs sometimes on Solaris) |
Any improvement on this issue? |
No improvement though we don't seem to hit the issue anymore at all, oddly enough. Fortunately for us everything running under supervisord is redundant so we can fairly carelessly restart a supervisord if it grid locks again, but that's really not a very good solution. I'm currently looking at Mozilla's circus as a potential way out. They also seem to have Python 3 compatibility on their list. |
I had some pretty bad CPU usage issues with circus. It was a while ago and I didn't have time to investigate the issue. I started using Upstart and was really happy with that decision. |
I would like to fix this, but I haven't experienced it myself yet. It would help a lot if someone could find a way to reproduce it. |
Anytime I've seen this issue it's been with multiple process groups writing a lot to stdout/err. I haven't tested this though. |
I think that if the subprocesses write a lot to stdout, supervisord gets stuck reading that output and leaves the callback in the queue unattended. I need a mechanism where, after a code upload from Jenkins, Jenkins triggers a supervisorctl restart of a group, so supervisorctl can't be allowed to hang forever. So I resorted to a blocking, synchronous implementation of the stopAllProcesses, stopProcessGroup, startProcessGroups and stopAllProcesses methods in rpcinterface. They now execute almost instantly. I know it's an unclean approach, but I am thinking of using it as a tradeoff. Suggestions are welcome. Please have a look at the diff |
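For anyone who wants to test the "lots of stdout" theory, here is a minimal sketch of a chatty child process in Python 3. The function name `flood` and the line count and width are arbitrary assumptions for illustration, and whether this actually triggers the slowdown on a given supervisor version is not something this thread confirms.

```python
import sys


def flood(stream, lines=100000, width=200):
    """Write `lines` lines of `width` characters to `stream`.

    Returns the number of characters written, so the volume is easy to check.
    """
    payload = "x" * width + "\n"
    written = 0
    for _ in range(lines):
        stream.write(payload)
        written += len(payload)
    stream.flush()
    return written

# Hypothetical usage inside a script run under supervisord:
#     while True:
#         flood(sys.stdout)
```

Running several such programs in different process groups and then timing `supervisorctl stop` on an unrelated program would be one way to check the theory.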
Hi all, you may want to try the http://lxyu.github.io/supervisor-quick/ plugin; it may ease your pain as a temporary workaround. I wrote it because of the same problem, and it works very well for me. |
@lxyu Thanks a bunch for the plugin - it makes using supervisor much less painful. |
@lxyu Thanks. But I still have problems with other commands. |
I have the same issue with the |
Same problem on Amazon AMI 2014.03 (supervisor v2.1) |
Workaround (tested on Amazon 2014.03): use
In my case, |
@hgdeoro try supervisor-quick, it'll ease your problem. |
@mcdonc The issue is just as present in 3 and that's the version this thread was started with so upgrading won't really solve the issue, unfortunately :(. |
There's not just one issue described in this thread, however (by my count there are at least three, which are related but not identical), so upgrading is probably a wise idea anyway. |
I should also note that it would be great if someone could help us reproduce "this" (whichever subproblem described in this thread that is happening to you) on our own systems. |
@mcdonc I've just uploaded a Dockerfile to https://github.com/hgdeoro/test-supervisord/, with some screenshots and instructions to reproduce. The same is happening on CentOS 6.5 and Amazon AMI servers. The Dockerfile creates a CentOS 6 container and uses supervisor 2.1 from EPEL |
Awesome! |
Work to make start and stop of multiple processes much faster has been merged into master, and will make its debut in a 4.0 release. |
@mcdonc Hello, is there a 4.0 release date? We'd very much like to use fast restart. |
... or any chance to hotfix 3.x branch? Maybe patch for 3.x will be easier and faster to release? |
No, sorry. The master is currently about 3-4X slower (it uses more CPU when logging) than the 3.X branch and will need to be fixed before we make an official release; not sure how long it will take. In the meantime, you can of course use a checkout. |
The changes for this issue (50d1857, d948fc5, 11ffa51, 5366309) were released in Supervisor 3.2.0. |
CentOS 6.7 ships with supervisord 2.1, which is now 9 years old and unsupported. This wouldn't necessarily be a problem, except that supervisorctl frequently hangs when trying to start or stop a service[1]. Install the latest version with pip instead. [1]: Supervisor/supervisor#131
The previously used version, Ubuntu 15.10, has an incredibly old version of supervisord that had an issue where subprocesses generating lots of output (hello `--vmodule=raft=5`) created substantial delays in starting and stopping other processes: Supervisor/supervisor#131 Among other things, 16.04 has a newer version of supervisord.
This prevents tasks from hanging indefinitely if supervisorctl hangs, too. See e.g. Supervisor/supervisor#131
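A generic way to get that kind of protection in a deployment script is to put a hard timeout around the `supervisorctl` call. This is only a sketch in Python 3, not what the commit referenced above does; the helper name `run_with_timeout`, the 120-second limit and the program name `myapp` are all assumptions.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60):
    """Run `cmd` and return its exit code, or None if it did not finish in time.

    subprocess.run kills the child process when the timeout expires
    and then raises TimeoutExpired, which is caught here.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode

# Hypothetical usage in a deploy step:
#     code = run_with_timeout(["supervisorctl", "restart", "myapp"], timeout_s=120)
#     if code is None:
#         print("supervisorctl hung; giving up on this step")
```

This only stops the client from blocking the deployment; it does nothing about whatever is keeping supervisord itself busy.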
Seems like this is a never ending bug. I came across this recently and seems like nothing is working for me. Does anybody have the experience of working with Upstart? |
Stop all your jobs, and then start them one by one. You'll find which job causes the problem |
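The one-by-one approach can be scripted against supervisord's XML-RPC interface. Below is a rough Python 3 sketch; `stopAllProcesses`, `getAllProcessInfo` and `startProcess` are methods of the `supervisor` XML-RPC namespace, while the helper name `time_individual_starts` and the URL in the usage comment are assumptions (the URL presumes an `[inet_http_server]` section listening on port 9001).

```python
import time


def time_individual_starts(rpc):
    """Stop everything, then start each process alone and time the call.

    `rpc` is the `supervisor` namespace of an XML-RPC proxy.
    Returns (name, seconds) pairs, slowest first.
    """
    rpc.stopAllProcesses()
    timings = []
    for info in rpc.getAllProcessInfo():
        full_name = "%s:%s" % (info["group"], info["name"])
        began = time.monotonic()
        rpc.startProcess(full_name)
        timings.append((full_name, time.monotonic() - began))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)

# Hypothetical usage:
#     from xmlrpc.client import ServerProxy
#     proxy = ServerProxy("http://localhost:9001/RPC2")
#     for name, seconds in time_individual_starts(proxy.supervisor):
#         print("%-40s %.1fs" % (name, seconds))
```

Note that `startProcess` raises an XML-RPC fault if a process fails to start, and that every call here will itself block if supervisord is already in the hung state described in this thread.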
Supervisor now takes 3-4 minutes to start at system boot (tested on Ubuntu 20-22), which makes rebooting a server nerve-racking! |
Situation: The server suddenly shut down and rebooted; after that, the supervisor service did not work properly because of a bad config (some config entries pointed to a previously deleted directory). After restarting the service, any
My workaround that probably fixed it:
|
[root@ip-10-245-174-225] ~ $ supervisorctl stop XXX
^CTraceback (most recent call last):
  File "/usr/local/bin/supervisorctl", line 9, in <module>
    load_entry_point('supervisor==3.0a12', 'console_scripts', 'supervisorctl')()
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 1114, in main
    c.onecmd(" ".join(options.args))
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 144, in onecmd
    return do_func(arg)
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 732, in do_stop
    result = supervisor.stopProcess(name)
  File "/usr/lib/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/local/lib/python2.6/dist-packages/supervisor/xmlrpc.py", line 463, in request
    r = self.connection.getresponse()
  File "/usr/lib/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.6/socket.py", line 427, in readline
    data = recv(1)
KeyboardInterrupt
supervisorctl status shows 'STOPPED' and the process no longer exists.
The log file shows:
2012-06-19 14:31:08,722 INFO stopped: XXX (terminated by SIGTERM)
I can turn on more debugging if you need