Restart handler mechanism appears to have problems #48

Open · nuwang opened this issue Dec 11, 2017 · 14 comments
@nuwang
Member

nuwang commented Dec 11, 2017

Problem:

The restart handler mechanism does not appear to work consistently, nor does it interact well with other roles.

Background:

See discussion here: #47

Issues:

There are several attendant issues that also need to be solved:

  1. The build should fail when test tool installation fails during testing; currently it does not.
  2. The restart handler used to be named "restart galaxy", which is the same handler name used in the galaxy role: https://github.com/galaxyproject/ansible-galaxy/blob/6544db9b17cc4db6cfbaf6940545fb57edc37cdf/handlers/main.yml#L5. This can cause the wrong handler to be called when the galaxy and galaxy-tools roles are used together. We could keep the same handler name by using a "generic topic" (the listen keyword introduced in Ansible 2.2; see the sketch after this list), or make the handler name unique, which is what was done in PR: Make sure ansible-galaxy-tools version of the restart handler is called #47. However, the latter seems to have caused a build break for unknown reasons.
  3. Supervisor seems to have trouble restarting Galaxy, as shown in: Make sure ansible-galaxy-tools version of the restart handler is called #47 (comment). Since a restart command should succeed whether Galaxy is running or not, there may also be a supervisor config issue in the extras playbook.
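For illustration, a minimal sketch of the "generic topic" approach (the handler and topic names here are illustrative, not taken from either role):

```yaml
# roles/galaxy-tools/handlers/main.yml (sketch)
# Each role keeps a uniquely named handler but subscribes it to a shared
# topic via `listen` (available since Ansible 2.2).
- name: restart galaxy (galaxy-tools)
  supervisorctl: name="galaxy:" state=restarted
  listen: restart galaxy services
```

Tasks in either role would then use `notify: restart galaxy services`, and every handler listening on that topic runs, so the two roles no longer shadow each other's handlers.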
@mvdbeek
Member

mvdbeek commented Dec 11, 2017

Just wanted to add that since 17.09 (and also before, but we fixed some additional issues in 17.09), Galaxy restarts are not necessary after installing tools or data managers, in case that helps.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@mvdbeek Thanks - it would be great to remove the restarter functionality from this role - it does seem very out of place. I had a chat with Enis about this, and he suggested that we mark it as deprecated and leave it in for another release, because 17.09 is still fairly new.

Having said that, I think the issue is that all this time the restarter generally did not get called unless you used the playbook in isolation. The reason is that the ansible-galaxy role also has a handler with the same name, which was effectively hiding the handler defined in ansible-galaxy-tools. This can be seen in the GalaxyKickStart playbook. This is the Travis build log prior to merging PR #47:
https://travis-ci.org/ARTbio/GalaxyKickStart/jobs/314131564#L1627.
As the log shows, the restarter that actually gets called is the stubbed-out restarter in ansible-galaxy here: https://github.com/galaxyproject/ansible-galaxy/blob/master/handlers/main.yml#L4

PR #47 fixed this, and the correct handler now gets called. However, that caused a downstream build failure in GalaxyKickStart here: https://travis-ci.org/ARTbio/GalaxyKickStart/jobs/312871378#L1844

So I think the underlying reason is that supervisorctl restart galaxy doesn't work as intended. All the other restarters (docker and run.sh) work, because they are tested in this role and elsewhere. So fixing the supervisor process definition downstream seems to be the correct fix for that. Does that sound right to you?

@mvdbeek
Member

mvdbeek commented Dec 12, 2017

Sure, that's fine. You may also want to look at the API version to decide whether or not to restart.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@mvdbeek Sounds good. We were also thinking it would be good not to lose this functionality entirely, and instead to move it to the ansible-galaxy role, where it's currently awaiting an implementation. All three restart mechanisms are fairly generic, so they should be quite usable at that level. We can also use a topic listener to allow downstream roles to intercept the notification should they wish to do so.

@drosofff
Member

@mvdbeek @nuwang Sounds good to me too. Two things, though: first, isn't the galaxy-tools role generally the most downstream one? Second, why the most recent handler is failing is far from clear to me: what is puzzling is that there is a supervisorctl stop-then-start task, for instance here https://github.com/ARTbio/GalaxyKickStart/blob/master/roles/galaxy.movedata/tasks/export.yml, and I think we can say it works.

The only difference I see, considering that a handler is also a sequence of tasks, is that @nuwang used state=restarted, whereas the file linked above has a state=stopped task followed by another task with state=started.

I'm not sure, but I seem to remember noticing that state=restarted in the supervisorctl Ansible module was quite slow.

@nuwang
Member Author

nuwang commented Dec 12, 2017

@drosofff Yes, the handler here uses restarted instead of a stop followed by a start:

```yaml
supervisorctl: name="galaxy:" state=restarted
```

However, since a restart is in effect the same thing, the probable cause is that Galaxy has not fully stopped before the restart tries to start it again. In the GalaxyKickStart playbook, you have several intermediary actions between the stop and the start, which probably gives Galaxy enough time to shut down.

I didn't author the original restart handler, so I've never checked this, but it's most likely a misconfiguration in the supervisor config. You may need to manually check whether supervisorctl restart galaxy works correctly, to isolate that as a cause, as I suggested earlier; one way to test the timing hypothesis is sketched below.
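If timing is indeed the issue, an explicit stop/wait/start sequence in the handler would be one way to confirm it (a sketch only; the pause length is a guess, and a wait_for on the uwsgi port would be more precise):

```yaml
- name: stop galaxy
  supervisorctl: name="galaxy:" state=stopped

# Give uwsgi time to release its socket/port before starting again.
- name: wait for galaxy to shut down
  pause:
    seconds: 10

- name: start galaxy
  supervisorctl: name="galaxy:" state=started
```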

@drosofff
Member

@nuwang Yeah, actually I forgot to reply to this: you cannot imagine the number of times I have already run supervisorctl restart galaxy since I started working on GKS, and in multiple situations :-)

It's true that sometimes uwsgi is not properly killed by supervisorctl stop or restart, which causes a port conflict. However, this is sporadic (I would say 1 in 20 or 1 in 30 attempts).

I noticed that in this latest situation it fails 100% of the time (I haven't properly tested the recent PR yet, sorry).

That said, if we can improve supervisor/conf.d/galaxy.conf or related config files, I would be more than happy! I'll give it a shot when I have time, which unfortunately is not the thing I have the most of these days.

Cheers

@nuwang
Member Author

nuwang commented Dec 12, 2017

@drosofff Thanks for testing that. It would be great if we can get supervisor's control of uwsgi fixed - please keep us posted on that!

cheers!

@drosofff
Member

@nuwang @mvdbeek does adding the following task at the beginning of restart_galaxy_supervisor.yml look acceptable? I know it's patchy, but it may do the job (only partially tested):

```yaml
- name: Ensure hanging uwsgi processes are killed before restarting galaxy
  command: killall uwsgi
  ignore_errors: yes
```

@nuwang
Member Author

nuwang commented Dec 14, 2017

@drosofff A killall on uwsgi may not work, because this role could be run on a machine that has other uwsgi processes running, and processes other than Galaxy's could get killed. There are some good suggestions in this post about changes to the supervisor config that sound promising, e.g. the --master parameter to uwsgi and stopsignal=QUIT in supervisor.conf; something like the sketch below could apply the latter.
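For instance, the stopsignal change could be applied with Ansible's ini_file module (a sketch only; the config path and program section name are assumptions about the galaxy-extras setup, not verified against it):

```yaml
# Sketch: make supervisor send SIGQUIT so uwsgi shuts down gracefully.
# The path and section name are assumptions; adjust to the actual config.
- name: Use QUIT as the stop signal for the galaxy program
  ini_file:
    dest: /etc/supervisor/conf.d/galaxy.conf
    section: "program:galaxy"
    option: stopsignal
    value: QUIT
```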

I assume you're using the supervisor config in galaxy-extras?

As a temporary measure, I've added a flag in PR #50 that allows control over whether or not the restart handler runs, and defaulted it to no, so that the original behaviour is restored (roughly the pattern sketched below).
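A minimal sketch of that gating pattern (the variable name here is illustrative, not necessarily the one PR #50 uses):

```yaml
# Illustrative only; PR #50 may use a different variable name.
- name: restart galaxy (galaxy-tools)
  supervisorctl: name="galaxy:" state=restarted
  when: galaxy_tools_restart_galaxy | default(false) | bool
```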

@drosofff
Member

drosofff commented Dec 14, 2017

Hi @nuwang Thanks for the suggestion, I'll test it asap. Yes, we are using galaxy-extras, so I guess any change of that sort should be proposed against that role, right?

On another note, I can confirm that currently the handler triggers as many supervisorctl restarts as there are installed tools: two times here https://travis-ci.org/ARTbio/ansible-galaxy-tools/builds/316153504#L702

and in a separate test outside Travis, I got 20 restarts with twenty tools! I know this is not the expected behaviour, but it does happen!

@nuwang
Member Author

nuwang commented Dec 14, 2017

@drosofff Yes, I think a PR against galaxy-extras would be great.

WRT the handler being triggered unnecessarily, I'm seeing that in the old build logs too, and it looks like this bug in Ansible, which was resolved in Ansible 2.3.0: ansible/ansible#18178

Which version of Ansible are you using? I've also updated the Ansible version used by Travis, so this may have resolved the problem, but it's no longer immediately apparent because the restart handlers don't run automatically anymore.
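For context, the pattern in question looks roughly like this (task and variable names are illustrative, not this role's actual code); a handler notified from a looped task should be queued only once per play, but under the bug above it could fire once per loop item on Ansible 2.2:

```yaml
# tasks/main.yml (illustrative): one notify per loop item should still
# result in a single handler run at the end of the play.
- name: install tools
  command: echo "installing {{ item }}"   # stand-in for the real install step
  with_items: "{{ tool_list }}"           # hypothetical variable
  notify: restart galaxy

# handlers/main.yml (illustrative)
- name: restart galaxy
  supervisorctl: name="galaxy:" state=restarted
```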

@drosofff
Member

We use ansible==2.2, but I don't think it would be a problem to update to 2.3.0 or even 2.4, apart from the warnings about syntax changes. I tried a couple of times and it worked. We would just need to lint all the roles at some point.

@drosofff
Member

I can confirm that switching to ansible==2.3.0 fixes the behaviour of the handler.
That said, there is no reason to put the notification for this handler in the looping task rather than in the looped task.
Separately, I have just tested stopsignal=QUIT extensively: it is clearly better (I tested about 30 supervisorctl restarts with QUIT or INT) and enough to cope with uwsgi.
