Restart handler mechanism appears to have problems #48

Open · nuwang opened this issue Dec 11, 2017 · 14 comments
@nuwang
Member

nuwang commented Dec 11, 2017

Problem:

The restart handler mechanism does not appear to work consistently, nor does it interact well with other roles.

Background:

See discussion here: #47

Issues:

There are several attendant issues that also need to be solved:

  1. The build should fail when test tool installation fails during testing; currently it does not.
  2. The restart handler used to be named "restart galaxy", which is the same handler name used in the galaxy role: https://github.com/galaxyproject/ansible-galaxy/blob/6544db9b17cc4db6cfbaf6940545fb57edc37cdf/handlers/main.yml#L5. This can cause the wrong handler to be called when the galaxy and galaxy-tools roles are used together. We could keep the same handler name by using a "generic topic" (the listen keyword introduced in Ansible 2.2; see the sketch after this list), or make the handler name unique, which is what was done in PR: Make sure ansible-galaxy-tools version of the restart handler is called #47. However, the latter seems to have caused a build break for unknown reasons.
  3. Supervisor seems to have trouble restarting Galaxy, as shown in: Make sure ansible-galaxy-tools version of the restart handler is called #47 (comment). Since a restart command should succeed whether Galaxy is running or not, there may also be a supervisor config issue in the extras playbook.
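For illustration, a minimal sketch of the "generic topic" approach (the handler and topic names here are illustrative, not taken from either role):

```yaml
# roles/galaxy-tools/handlers/main.yml (sketch)
# Each role keeps a uniquely named handler but subscribes it to a shared
# topic via `listen` (available since Ansible 2.2).
- name: restart galaxy (galaxy-tools)
  supervisorctl: name="galaxy:" state=restarted
  listen: restart galaxy services
```

Tasks in either role would then use `notify: restart galaxy services`, and every handler listening on that topic runs, so the two roles no longer shadow each other's handlers.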
@mvdbeek
Member

mvdbeek commented Dec 11, 2017

Just wanted to add that since 17.09 (and also before, but we fixed some additional issues in 17.09), Galaxy restarts are not necessary after installing tools or data managers, in case that helps.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@mvdbeek Thanks - it would be great to remove the restarter functionality from this role - it does seem very out of place. I had a chat with Enis about this, and he suggested that we mark it as deprecated and leave it in for another release, because 17.09 is still fairly new.

Having said that, I think the issue is that all this time the restarter generally did not get called unless you used the playbook in isolation. The reason is that the ansible-galaxy role also has a handler with the same name, which was effectively hiding the handler defined in ansible-galaxy-tools. This can be seen in the GalaxyKickStart playbook. This is the Travis build log prior to merging PR #47:
https://travis-ci.org/ARTbio/GalaxyKickStart/jobs/314131564#L1627.
As the log shows, the restarter that actually gets called is the stubbed-out restarter in ansible-galaxy here: https://github.com/galaxyproject/ansible-galaxy/blob/master/handlers/main.yml#L4

PR #47 fixed this, and the correct handler now gets called. However, that caused a downstream build failure in GalaxyKickStart here: https://travis-ci.org/ARTbio/GalaxyKickStart/jobs/312871378#L1844

So I think the underlying reason is that supervisorctl restart galaxy doesn't work as intended. All the other restarters (docker and run.sh) work, because they are tested in this role and elsewhere. So fixing the supervisor process definition downstream seems to be the correct fix for that. Does that sound right to you?

@mvdbeek
Member

mvdbeek commented Dec 12, 2017

Sure, that's fine. You may also want to look at the API version to decide whether or not to restart.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@mvdbeek Sounds good. We were also thinking it would be good not to lose this functionality entirely, and instead to move it to the ansible-galaxy role, where it's currently awaiting an implementation. All three restart mechanisms are fairly generic, so they should be quite usable at that level. We can also use a topic listener to allow downstream roles to intercept the notification should they wish to do so.

@drosofff
Member

@mvdbeek @nuwang Sounds good to me too. Two things, though: first, isn't the galaxy-tools role generally the most downstream one? Second, why the most recent handler is failing is far from clear to me: what is puzzling is that there is a supervisorctl stop-then-start task, for instance here https://github.com/ARTbio/GalaxyKickStart/blob/master/roles/galaxy.movedata/tasks/export.yml, and I think we can say it works.

The only difference I see, considering that a handler is also a sequence of tasks, is that @nuwang used state=restarted, whereas the file linked above has a state=stopped task followed by another task with state=started.

I'm not sure, but I seem to remember noticing that state=restarted in the supervisorctl Ansible module was quite slow.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@drosofff Yes, the handler here uses restarted instead of a stop followed by a start:

```yaml
supervisorctl: name="galaxy:" state=restarted
```

However, since a restart is in effect the same thing, the probable cause is that Galaxy has not fully stopped before the restart tries to start it again. In the GalaxyKickStart playbook, you have several intermediary actions between the stop and the start, which probably gives Galaxy enough time to shut down.

I didn't author the original restart handler, so I've never checked this, but it's most likely a misconfiguration in the supervisor config. You may need to manually check whether supervisorctl restart galaxy works correctly, to isolate that as a cause, as I suggested earlier; one way to test the timing hypothesis is sketched below.
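If timing is indeed the issue, an explicit stop/wait/start sequence in the handler would be one way to confirm it (a sketch only; the pause length is a guess, and a wait_for on the uwsgi port would be more precise):

```yaml
- name: stop galaxy
  supervisorctl: name="galaxy:" state=stopped

# Give uwsgi time to release its socket/port before starting again.
- name: wait for galaxy to shut down
  pause:
    seconds: 10

- name: start galaxy
  supervisorctl: name="galaxy:" state=started
```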

@drosofff
Member

@nuwang Yeah, actually I forgot to reply to this: you cannot imagine the number of times I have already run supervisorctl restart galaxy since I started working on GKS, and in multiple situations :-)

It's true that sometimes uwsgi is not properly killed by supervisorctl stop or restart, which causes a port conflict. However, this is sporadic (I would say 1 in 20 or 1 in 30 attempts).

I noticed that in this latest situation it fails 100% of the time (I haven't properly tested the recent PR yet, sorry).

That said, if we can improve supervisor/conf.d/galaxy.conf or related config files, I would be more than happy! I'll give it a shot when I have time, which unfortunately is not the thing I have the most of these days.

Cheers

@nuwang
Member Author

nuwang commented Dec 12, 2017

@drosofff Thanks for testing that. It would be great if we can get supervisor's control of uwsgi fixed - please keep us posted on that!

cheers!

@drosofff
Member

@nuwang @mvdbeek does adding the following task at the beginning of restart_galaxy_supervisor.yml look acceptable? I know it's patchy, but it may do the job (only partially tested):

```yaml
- name: Ensure hanging uwsgi processes are killed before restarting galaxy
  command: killall uwsgi
  ignore_errors: yes
```

@nuwang
Member Author

nuwang commented Dec 14, 2017

@drosofff A killall on uwsgi may not work, because this role could be run on a machine that has other uwsgi processes running, and processes other than Galaxy's could get killed. There are some good suggestions in this post about changes to the supervisor config that sound promising, e.g. the --master parameter to uwsgi and stopsignal=QUIT in supervisor.conf; something like the sketch below could apply the latter.
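For instance, the stopsignal change could be applied with Ansible's ini_file module (a sketch only; the config path and program section name are assumptions about the galaxy-extras setup, not verified against it):

```yaml
# Sketch: make supervisor send SIGQUIT so uwsgi shuts down gracefully.
# The path and section name are assumptions; adjust to the actual config.
- name: Use QUIT as the stop signal for the galaxy program
  ini_file:
    dest: /etc/supervisor/conf.d/galaxy.conf
    section: "program:galaxy"
    option: stopsignal
    value: QUIT
```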

I assume you're using the supervisor config in galaxy-extras?

As a temporary measure, I've added a flag in PR #50 that allows control over whether or not the restart handler runs, and defaulted it to no, so that the original behaviour is restored (roughly the pattern sketched below).
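A minimal sketch of that gating pattern (the variable name here is illustrative, not necessarily the one PR #50 uses):

```yaml
# Illustrative only; PR #50 may use a different variable name.
- name: restart galaxy (galaxy-tools)
  supervisorctl: name="galaxy:" state=restarted
  when: galaxy_tools_restart_galaxy | default(false) | bool
```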

@drosofff
Member

drosofff commented Dec 14, 2017

Hi @nuwang Thanks for the suggestion, I'll test it asap. Yes, we are using galaxy-extras, so I guess any change of that sort should be proposed against that role, right?

On another note, I can confirm that currently the handler triggers as many supervisorctl restarts as there are installed tools: two times here https://travis-ci.org/ARTbio/ansible-galaxy-tools/builds/316153504#L702

and in a separate test outside Travis, I got 20 restarts with twenty tools! I know this is not the expected behaviour, but it does happen!

@nuwang
Member Author

nuwang commented Dec 14, 2017

@drosofff Yes, I think a PR against galaxy-extras would be great.

WRT the handler being triggered unnecessarily, I'm seeing that in the old build logs too, and it looks like this bug in Ansible, which was resolved in Ansible 2.3.0: ansible/ansible#18178

Which version of Ansible are you using? I've also updated the Ansible version used by Travis, so this may have resolved the problem, but it's no longer immediately apparent because the restart handlers don't run automatically anymore.
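For context, the pattern in question looks roughly like this (task and variable names are illustrative, not this role's actual code); a handler notified from a looped task should be queued only once per play, but under the bug above it could fire once per loop item on Ansible 2.2:

```yaml
# tasks/main.yml (illustrative): one notify per loop item should still
# result in a single handler run at the end of the play.
- name: install tools
  command: echo "installing {{ item }}"   # stand-in for the real install step
  with_items: "{{ tool_list }}"           # hypothetical variable
  notify: restart galaxy

# handlers/main.yml (illustrative)
- name: restart galaxy
  supervisorctl: name="galaxy:" state=restarted
```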

@drosofff
Member

We use ansible==2.2, but I don't think it would be a problem to update to 2.3.0 or even 2.4, apart from the warnings about syntax changes. I tried a couple of times and it worked. We would just need to lint all the roles at some point.

@drosofff
Member

I can confirm that switching to ansible==2.3.0 fixes the behaviour of the handler.
That said, there is no reason to put the notification for this handler in the looping task rather than in the looped task.
Separately, I have just tested stopsignal=QUIT extensively: it is clearly better (I tested about 30 supervisorctl restarts with QUIT or INT) and enough to cope with uwsgi.
