supervisorctl takes too long to execute command #131

gilles · 2012-06-19T18:32:01Z

[root@ip-10-245-174-225] ~ $ supervisorctl stop XXX
^CTraceback (most recent call last):
  File "/usr/local/bin/supervisorctl", line 9, in <module>
    load_entry_point('supervisor==3.0a12', 'console_scripts', 'supervisorctl')()
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 1114, in main
    c.onecmd(" ".join(options.args))
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 144, in onecmd
    return do_func(arg)
  File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 732, in do_stop
    result = supervisor.stopProcess(name)
  File "/usr/lib/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/local/lib/python2.6/dist-packages/supervisor/xmlrpc.py", line 463, in request
    r = self.connection.getresponse()
  File "/usr/lib/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.6/socket.py", line 427, in readline
    data = recv(1)
KeyboardInterrupt

supervisorctl status shows 'STOPPED' and the process no longer exists.
The log file shows:
2012-06-19 14:31:08,722 INFO stopped: XXX (terminated by SIGTERM)

I can turn on more debugging if you need it.

gilles · 2012-06-19T18:32:32Z

It's happening on 3.0a12 and 3.0a8

gilles · 2012-06-28T16:59:28Z

Here are some strace outputs. This is on a Debian squeeze box. Note that we don't see the problem on CentOS 5.

supervisord strace: strace -p 1467 2>&1 | grep SIG
kill(1494, SIGTERM) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 1494

=> Appears immediately after the stop command is issued, no other mention of process 1494

Child process strace: strace -p 1494
accept(296, {sa_family=AF_INET6, sin6_port=htons(49644), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 15
futex(0x21a01f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, {sa_family=AF_INET6, sin6_port=htons(49676), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 83
futex(0x1610370, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, {sa_family=AF_INET6, sin6_port=htons(49715), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 15
futex(0x1ea9a10, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x21ad150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, 0x7fff2be93050, [28]) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
Process 1494 detached

=> Appears immediately after the stop command is issued

=> The process is marked as 'STOPPED' and no longer exists almost immediately after the command is issued

The stop command supervisorctl stop service:service_X appears to hang but just takes a long time to complete:
date; supervisorctl stop service:service_X; date
Thu Jun 28 13:07:36 EDT 2012
service:service_X: stopped
Thu Jun 28 13:18:10 EDT 2012

Note: sometimes the start command takes a long time to complete too:
date; supervisorctl start service:service_X; date
Thu Jun 28 13:23:02 EDT 2012
service:service_X: started
Thu Jun 28 13:23:29 EDT 2012

Note: supervisord.log shows the process stopped almost immediately after the command is executed:
2012-06-28 13:07:36,390 INFO stopped: service_X (terminated by SIGTERM)
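
For anyone who wants to collect the same measurement repeatedly, the `date; ...; date` timing above can be scripted. This is only a minimal sketch, assuming Python 3 is available on the box; `time_command` is a hypothetical helper name, not part of supervisor.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds)."""
    started = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
    completed = subprocess.run(argv)
    return completed.returncode, time.monotonic() - started
```

For example, `time_command(["supervisorctl", "stop", "service:service_X"])` returns the exit code together with the number of seconds the command took, which can then be compared with the timestamp of the `stopped:` line in supervisord.log.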

beniwohli · 2012-07-11T17:35:15Z

I'm seeing the same problem on one server (but not on others):

# date; supervisorctl stop project_main_gunicorn; date
Wed Jul 11 17:50:07 CEST 2012
project_main_gunicorn: stopped
Wed Jul 11 18:52:14 CEST 2012

Yes, more than one hour.

supervisor.log:

2012-07-11 17:50:08,468 INFO stopped: project_main_gunicorn (exit status 0)

This is a huge problem for us, since it basically makes an automated deployment impossible. If you need more information, just let me know.

thieman · 2012-09-13T18:41:01Z

I'm also experiencing this issue using 3.0a12 and Python 2.7.2 on Ubuntu 11.10.

dcrosta · 2012-10-03T16:28:58Z

Does anyone have a solution or work-around for this? We're seeing it with supervisor 3.0a12 on a variety of platforms (Centos 6.2, 6.3, and Amazon Linux 2012.03).

beniwohli · 2012-10-04T08:12:55Z

@dcrosta in the case I reported above, it turned out that one task was restarting all the time and blocking supervisord. I'm not sure why supervisor didn't back off restarting the task; my guess is that it took slightly longer to start up and die than the "it's alive!" threshold. But I didn't have the time to investigate it further.

dcrosta · 2012-10-04T11:50:00Z

@piquadrat Hmm, I don't think we have that issue on my end. In some cases, there's just one long-lived process managed by supervisord (though admittedly that process sometimes takes up to 10s to quit on its own after receiving SIGTERM).

If someone can point me in the right direction, I'd be happy to help gather more data about what's going on.

caioariede · 2012-11-23T13:37:02Z

+1 for this issue on CentOS 6

RealJTG · 2013-01-09T16:12:10Z

Same problem on CentOS 6 + gunicorn. supervisorctl successfully starts/stops the task and then hangs.

supervisorctl stop mysite
Zzzzzzz

^CTraceback (most recent call last):
  File "/usr/bin/supervisorctl", line 6, in <module>
    main()
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 598, in main
    c.onecmd(" ".join(options.args))
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 86, in onecmd
    return func(arg)
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 433, in do_stop
    result = supervisor.stopProcess(processname)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.6/site-packages/supervisor/options.py", line 1308, in request
    errcode, errmsg, headers = h.getreply()
  File "/usr/lib64/python2.6/httplib.py", line 1064, in getreply
    response = self._conn.getresponse()
  File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib64/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/usr/lib64/python2.6/socket.py", line 433, in readline
    data = recv(1)
KeyboardInterrupt

logs

2013-01-09 18:04:23,044 DEBG XML-RPC method called: supervisor.getVersion()
2013-01-09 18:04:23,044 DEBG XML-RPC method supervisor.getVersion() returned successfully
2013-01-09 18:04:23,044 INFO localhost:0 - - [09/Jan/2013:16:04:23 +0200] "POST /RPC2 HTTP/1.0" 200 254
2013-01-09 18:04:23,045 DEBG XML-RPC method called: supervisor.stopProcess()
2013-01-09 18:04:23,045 DEBG XML-RPC method supervisor.stopProcess() returned successfully
2013-01-09 18:04:23,045 DEBG killing mysite (pid 6320) with signal SIGTERM
2013-01-09 18:04:23,376 INFO stopped: mysite (exit status 0)
2013-01-09 18:04:23,377 INFO received SIGCLD indicating a child quit

tsharju · 2013-03-04T05:41:08Z

Has anyone found a solution to this problem? I'm facing similar issues with supervisor version 3.0a8 running on Ubuntu 12.04, installed using apt. The problem is exactly the same as described above: I run supervisorctl stop myapp and the application stops, but supervisorctl just hangs. Any help is greatly appreciated.

tobsch · 2013-04-11T19:22:39Z

Me too. Any way out of this?

tobsch · 2013-04-11T19:58:42Z

Okay, in my case it seems that one of the processes I am watching was constantly printing out messages.
This seems to keep supervisor too busy to return the RPC call!?

philipcristiano · 2013-04-18T20:36:43Z

Suffering from this as well on Ubuntu 12.04 with 3.0a8 and 3.0b1; it sometimes takes up to 5 minutes to restart a process.

amcmanus · 2013-05-28T19:19:40Z

Any progress on this bug? We're seeing it on CentOS 5 and 6, supervisor 3.0a12.

daenney · 2013-06-05T11:58:42Z

Same here. We originally thought we had fixed it by fixing some bogus things with a child process, but that just turned out to be a side effect.

The quick fix for this issue is to restart supervisord. After we do that, it starts responding normally again, and it can take up to a month or two before it starts to exhibit this behaviour again. I'm still convinced this is somehow triggered by a child process doing something weird, but we haven't been able to pinpoint it yet.

We do have a strace to accompany the issue:

recvfrom(3, "<?xml version='1.0'?>\n<methodRes"..., 129, 0, NULL, NULL) = 129
stat("/usr/lib/pymodules/python2.6/supervisor/rpcinterface", 0x7fffec5d61a0) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterfacemodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.py", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=30019, ...}) = 0
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.pyc", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=27233, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7f05911000
read(5, "\321\362\r\n(\2135Pc\0\0\0\0\0\0\0\0\31\0\0\0@\0\0\0s\221\1\0\0d\0"..., 4096) = 4096
fstat(5, {st_mode=S_IFREG|0644, st_size=27233, ...}) = 0
read(5, "dules/python2.6/supervisor/rpcin"..., 20480) = 20480
read(5, "RB\0\0\0(\3\0\0\0R\24\0\0\0t\4\0\0\0typeR&\0\0\0(\0\0"..., 4096) = 2657
read(5, "", 4096)                       = 0
close(5)                                = 0
munmap(0x7f7f05911000, 4096)            = 0
stat("/usr/lib/pymodules/python2.6/supervisor/events", 0x7fffec5d2c00) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/events.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/eventsmodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/events.py", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=6224, ...}) = 0
open("/usr/lib/pymodules/python2.6/supervisor/events.pyc", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=11881, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7f05911000
read(6, "\321\362\r\n(\2135Pc\0\0\0\0\0\0\0\0\4\0\0\0@\0\0\0s\237\2\0\0d\0"..., 4096) = 4096
fstat(6, {st_mode=S_IFREG|0644, st_size=11881, ...}) = 0
read(6, "or/events.pyR\23\0\0\0H\0\0\0s\4\0\0\0\0\1\t\1c\1"..., 4096) = 4096
read(6, "on2.6/supervisor/events.pyR-\0\0\0\217"..., 4096) = 3689
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0x7f7f05911000, 4096)            = 0
close(5)                                = 0
close(4)                                = 0
close(3)                                = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_FILE, path="/var/run/supervisor.sock"}, 26) = 0
sendto(3, "POST /RPC2 HTTP/1.1\r\nHost: local"..., 186, 0, NULL, 0) = 186
sendto(3, "<?xml version='1.0'?>\n<methodCal"..., 177, 0, NULL, 0) = 177
recvfrom(3,

As you can see, and as is also clear from @gilles' backtrace, we're just waiting, to infinity and beyond. This times out after approximately 15 minutes.

This all happens for us on supervisor 3.0b1 and, oddly enough, only on the machines that supervise nodejs processes; the ruby-only clusters are completely unaffected in our case (or the bug just hasn't shown up yet).
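
The blocking `recvfrom` in the strace above is the client waiting for an XML-RPC reply with no deadline. Scripts that call the RPC interface directly could bound that wait themselves. The sketch below is an illustration only, not how supervisorctl works: it assumes Python 3 and that supervisord exposes its HTTP interface over TCP through an `[inet_http_server]` section (the strace above uses a Unix socket instead). `TimeoutTransport` and `stop_process` are hypothetical names.

```python
import socket
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose HTTP connection gives up after `timeout` seconds."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self._timeout  # applied when the socket is opened
        return conn


def stop_process(url, name, timeout):
    """Return True if supervisord answered in time, False if the reply timed out."""
    proxy = xmlrpc.client.ServerProxy(url, transport=TimeoutTransport(timeout))
    try:
        proxy.supervisor.stopProcess(name)
    except socket.timeout:
        return False
    return True
```

With an assumed URL such as `http://127.0.0.1:9001/RPC2`, `stop_process(url, "mysite", 30)` returns `False` after 30 seconds instead of waiting for the roughly 15-minute timeout described above. A `False` result does not mean the process is still running; as the logs in this thread show, the process is usually stopped long before the reply arrives.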

palmkevin · 2013-06-10T11:16:58Z

+1 (Same problem occurs sometimes on Solaris)

kalyanceg · 2013-10-15T13:51:13Z

Any improvement on this issue?
How do folks facing this issue run supervisord in a production environment?

daenney · 2013-10-15T14:22:18Z

No improvement, though we don't seem to hit the issue anymore at all, oddly enough.

Fortunately for us, everything running under supervisord is redundant, so we can fairly carelessly restart a supervisord if it gridlocks again, but that's really not a very good solution.

I'm currently looking at Mozilla's circus as a potential way out. They also seem to have Python 3 compatibility on their list.

tsharju · 2013-10-15T15:44:06Z

I had some pretty bad CPU usage issues with circus. It was a while ago and I didn't have time to investigate the issue. I started using Upstart and was really happy with that decision.

mnaberez · 2013-10-15T16:19:14Z

I would like to fix this, but I haven't experienced it myself yet. It would help a lot if someone could find a way to reproduce it.

philipcristiano · 2013-10-16T14:41:34Z

Anytime I've seen this issue it's been with multiple process groups writing a lot to stdout/err. I haven't tested this though.

kalyanceg · 2013-10-17T08:43:54Z

I think that if the subprocesses write a lot to stdout, supervisord gets stuck reading that output and leaves the callback in the queue unattended. After a code upload, Jenkins needs to trigger a supervisorctl restart of the group, so supervisorctl can't hang forever. I therefore resorted to a blocking, synchronous approach for the stopAllProcesses, stopProcessGroup, startProcessGroups and stopAllProcesses methods in rpcinterface. With that change they execute almost instantly. I know it's an unclean way, but I am thinking of using it as a tradeoff. Suggestions are welcome. Please have a look at the diff:
kalyanceg@935760a#diff-1
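
Several comments above suspect children that write heavily to stdout/stderr. For anyone trying to build a reproduction, a deliberately noisy child could look like the sketch below. It is an untested hypothesis, not a confirmed trigger for this bug, and `spam` is a hypothetical helper name.

```python
def spam(stream, lines, width=200):
    """Write `lines` lines of `width` characters to `stream`; return characters written."""
    payload = "x" * width + "\n"
    written = 0
    for _ in range(lines):
        written += stream.write(payload)
    stream.flush()
    return written
```

Calling `spam(sys.stdout, 1000000)` from a small script (after `import sys`) and running several copies of that script as supervised programs would give supervisord a steady stream of output to read while `supervisorctl stop` is timed.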

lxyu · 2014-03-20T08:55:41Z

Hi all, you may want to try the http://lxyu.github.io/supervisor-quick/ plugin; it may ease your pain as a temporary workaround.

I wrote it because of the same problem, and it works for me very well.

alicee · 2014-05-07T17:44:43Z

@lxyu Thanks a bunch for the plugin - it makes using supervisor much less painful.

wesdu · 2014-06-20T04:07:23Z

@lxyu Thanks. But I still have problems with other commands.

alexeiz · 2014-06-30T20:49:10Z

I have the same issue with the start command. When I start a group of processes and any (or all) of them enters the FATAL state, supervisorctl tries to restart the whole group over and over again (as is apparent from running the maintail command), and the start command never finishes. Starting the processes individually finishes very quickly. I believe it's the same issue, because it's caused by the callback stack never getting exhausted.

hgdeoro · 2014-07-29T02:49:38Z

Same problem on Amazon AMI 2014.03 (supervisor v2.1)

hgdeoro · 2014-07-29T03:05:54Z

Workaround (tested on Amazon 2014.03): use timeout.

$ timeout 10s supervisorctl start nginx

In my case, supervisorctl works fine; the only problem is that the process never returns, and sending it to the background leaves those supervisorctl processes running, which I didn't like. Running supervisorctl with timeout is an ugly hack, but it works...
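
The same `timeout` idea can be applied from a Python deployment script. A minimal sketch, assuming Python 3; `run_with_timeout` is a hypothetical helper name.

```python
import subprocess


def run_with_timeout(argv, seconds):
    """Run argv; return its exit code, or None if it was killed after `seconds`."""
    try:
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind
        return None
```

`run_with_timeout(["supervisorctl", "start", "nginx"], 10)` mirrors the shell command above. A `None` result only means supervisorctl did not return in time, so the caller should still check `supervisorctl status` afterwards.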

lxyu · 2014-07-29T03:08:40Z

@hgdeoro try supervisor-quick, it'll ease your problem.

daenney · 2014-07-30T19:19:55Z

@mcdonc The issue is just as present in 3 and that's the version this thread was started with so upgrading won't really solve the issue, unfortunately :(.

mcdonc · 2014-07-30T20:37:35Z

There's not just one issue described in this thread, however (by my count there are at least three, which are related but not identical), so upgrading is probably a wise idea anyway.

mcdonc · 2014-07-30T20:39:29Z

I should also note that it would be great if someone could help us reproduce "this" (whichever subproblem described in this thread that is happening to you) on our own systems.

hgdeoro · 2014-07-30T22:36:25Z

@mcdonc I've just uploaded a Dockerfile to https://github.com/hgdeoro/test-supervisord/, with some screenshots and instructions to reproduce. The same is happening on CentOS 6.5 and Amazon AMI servers. The Dockerfile creates a CentOS 6 container and uses supervisor 2.1 from EPEL.

mcdonc · 2014-07-30T22:42:11Z

Awesome!

mcdonc · 2014-08-09T22:40:58Z

Work to make start and stop of multiple processes much faster has been merged into master, and will make its debut in a 4.0 release.

wooparadog · 2014-08-29T08:18:48Z

@mcdonc Hello, is there a 4.0 release date? We'd very much like to use fast restart.

marcinn · 2014-08-29T09:15:51Z

... or any chance to hotfix 3.x branch? Maybe patch for 3.x will be easier and faster to release?

mcdonc · 2014-08-29T16:05:35Z

No, sorry. The master is currently about 3-4X slower (it uses more CPU when logging) than the 3.X branch and will need to be fixed before we make an official release; not sure how long it will take. In the meantime, you can of course use a checkout.

mnaberez · 2015-12-21T21:05:04Z

The changes for this issue (50d1857, d948fc5, 11ffa51, 5366309) were released in Supervisor 3.2.0.

CentOS 6.7 ships with supervisord 2.1, which is now 9 years old and unsupported. This wouldn't necessarily be a problem, except that supervisorctl frequently hangs when trying to start or stop a service[1]. Install the latest version with pip instead. [1]: Supervisor/supervisor#131

The previously used version, Ubuntu 15.10, has an incredibly old version of supervisord that had an issue where subprocesses generating lots of output (hello `--vmodule=raft=5`) created substantial delays in starting and stopping other processes: Supervisor/supervisor#131 Among other things, 16.04 has a newer version of supervisord.

hnrindani · 2017-09-25T06:33:37Z

This seems to be a never-ending bug. I came across it recently and nothing seems to work for me. Does anybody have experience working with Upstart?

verseal · 2019-05-15T17:49:10Z

Stop all your jobs, and then start them one by one. You'll find which job causes the problem.

saaiful · 2022-11-16T13:52:30Z

Supervisor now takes 3-4 minutes to start when the system boots (tested on Ubuntu 20-22), which makes rebooting a server nerve-racking!

wtfz · 2023-10-23T17:00:41Z

Situation: the server suddenly shut down and rebooted. After that, the supervisor service was not working properly due to a bad config (some config entries pointed to a previously deleted directory). After restarting the service, any supervisorctl command takes too long to execute and sometimes hangs.

My workaround that probably fixed it:

Reinstall supervisor service
Remove all supervisor.sock files in /var/run directory
Move all config files in /etc/supervisor/conf.d to another directory (I added them back later one by one to check for bad config)
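
Checking for config entries that point at deleted directories can be partly automated. The sketch below is a rough aid, not a supervisor feature. It assumes Python 3 and that the files are plain INI sections using the `directory=` option. `missing_directories` is a hypothetical helper name. Values that rely on supervisor's `%(...)s` expansions are not expanded, so they would be reported as missing.

```python
import configparser
import glob
import os


def missing_directories(pattern):
    """Return (file, section, directory) for every `directory=` that does not exist."""
    problems = []
    for path in sorted(glob.glob(pattern)):
        # RawConfigParser: no interpolation, so %(...)s values do not raise errors
        parser = configparser.RawConfigParser(
            strict=False, inline_comment_prefixes=(";",)
        )
        parser.read(path)
        for section in parser.sections():
            directory = parser.get(section, "directory", fallback=None)
            if directory and not os.path.isdir(directory):
                problems.append((path, section, directory))
    return problems
```

For example, `missing_directories("/etc/supervisor/conf.d/*.conf")` lists the sections worth inspecting before the files are added back one by one.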

lxyu mentioned this issue Mar 20, 2014

[share] supervisor-quick plugin #404

Closed

mcdonc mentioned this issue Aug 3, 2014

Lower select timeout #263

Closed

mcdonc added a commit that referenced this issue Aug 4, 2014

work towards faster start and stop (#131)

50d1857

mcdonc added a commit that referenced this issue Aug 9, 2014

add changenote wrt #131

b4a6b2e

mcdonc closed this as completed Aug 9, 2014

wooparadog mentioned this issue Sep 3, 2014

Change timeout, thread joining. lxyu/supervisor-quick#4

Merged

mnaberez mentioned this issue Sep 7, 2014

stop service cost too much time #374

Closed

mcdonc added a commit that referenced this issue Sep 8, 2014

work towards faster start and stop (#131)

c5d5cad

Conflicts: supervisor/rpcinterface.py

mcdonc added a commit that referenced this issue Sep 8, 2014

add changenote wrt #131

2730480

mnaberez mentioned this issue Nov 2, 2014

supervisor is still marked as non-support to Python 3 on https://python3wos.appspot.com #510

Closed

smarnach mentioned this issue Dec 6, 2015

Work around a supervisord bug causing the provisioning to hang intermittently. openedx-unsupported/configuration#2557

Merged

mnaberez mentioned this issue Feb 18, 2016

unicode decoding error while reading from stderr / stdout #638

Closed

jaraco pushed a commit to yougov/vr.server that referenced this issue Oct 4, 2016

Add a timeout to supervisorctl

ea56020

This prevents tasks from hanging indefinitely if supervisorctl hangs, too. See e.g. Supervisor/supervisor#131
