
Error with rejoining node to cluster after lost connection to postgres #4294

Closed
byumov opened this issue Jul 9, 2019 · 9 comments

@byumov

byumov commented Jul 9, 2019

ISSUE TYPE
  • Bug Report
SUMMARY

A node can't rejoin the cluster after losing its network connection to the Postgres database.

ENVIRONMENT
  • AWX version: 3.0.1 (the bug is still present in 6.0.0)
STEPS TO REPRODUCE
  1. Set up an HA AWX cluster with 3+ nodes and an external Postgres database
  2. Shut down the network on one of the nodes and wait 120 seconds (grace_period)
  3. Look at the working nodes' logs; you will see that the node without network was removed from the cluster:
Jul  4 17:16:59 1.awx.node.dc2 dispatcher[207]: 2019-07-04 14:16:59,220 INFO     awx.main.tasks Host 1.awx.node.dc1 Automatically Deprovisioned.

And removed from the Postgres database:

awx=> SELECT id FROM main_instance;
 id
-----
 543
 550
(2 rows)
  4. Restore the network on the node and look at its logs:
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: 2019-07-04 14:20:39,433 ERROR    awx.main.dispatch failed to write inbound message
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: Traceback (most recent call last):
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 388, in write
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: self.cleanup()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 373, in cleanup
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: reaper.reap(excluded_uuids=running_uuids)
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 35, in reap
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: me = instance or Instance.objects.me()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 88, in me
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: raise RuntimeError("No instance found with the current cluster host id")
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: RuntimeError: No instance found with the current cluster host id
EXPECTED RESULTS

After restoring the network connection to the database, the node successfully rejoins the cluster.

ACTUAL RESULTS

The node never rejoins the cluster unless the instance is restarted.

ADDITIONAL INFORMATION

The node can't return to the cluster because it calls the cleanup function from pool.py on each heartbeat.
cleanup calls reaper.reap(), which fails because it can't get the instance id (AWX deleted the node from the database at reproduction step 3):

me = instance or Instance.objects.me()

I created a pull request with a probable fix: #4268
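A minimal sketch of the kind of guard such a fix implies (the class and manager names here are hypothetical stand-ins, not the actual AWX code): treat a missing instance row as "nothing to reap" instead of letting the RuntimeError escape the dispatcher's periodic cleanup.

```python
# Hypothetical stand-ins for AWX internals; only the control flow is the point.
class InstanceNotRegistered(RuntimeError):
    """Raised when this node has no row in main_instance (assumption)."""


class FakeInstanceManager:
    """Mimics Instance.objects: .me() fails if the node was deprovisioned."""

    def __init__(self, registered):
        self.registered = registered

    def me(self):
        if not self.registered:
            raise InstanceNotRegistered(
                "No instance found with the current cluster host id")
        return "this-node"


def reap(manager, instance=None, excluded_uuids=()):
    """Reap jobs for this instance, but tolerate a deprovisioned node."""
    try:
        me = instance or manager.me()
    except InstanceNotRegistered:
        # The node was auto-deprovisioned; skip reaping until it re-registers
        # instead of crashing the heartbeat with an unhandled RuntimeError.
        return None
    # ... the real reaping of waiting|running jobs for `me` would happen here ...
    return me
```

With a guard like this, reaping quietly becomes a no-op while the node is missing from the cluster and resumes once the row exists again.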

@awxbot awxbot added the type:bug label Jul 9, 2019
@YuriGrigorov

YuriGrigorov commented Jul 9, 2019

The problem with reap is that the function expects a registered (working) instance:

def reap(instance=None, status='failed', excluded_uuids=[]):
    '''
    Reap all jobs in waiting|running for this instance.
    '''
    me = instance or Instance.objects.me()

But there may be cases when the instance, for some reason, is automatically deprovisioned. If that happens, the instance can't be provisioned back.

I found two places where reap breaks automatic node provisioning.

reaper.reap(excluded_uuids=running_uuids)

We don't need to call cleanup if the instance is not in the cluster:

# when the cluster heartbeat occurs, clean up internally
if isinstance(body, dict) and 'cluster_node_heartbeat' in body['task']:
    self.cleanup()
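The guard being suggested here could look roughly like this (a standalone sketch with hypothetical names, not the real pool.py): only run the internal cleanup on a heartbeat when this node still has an instance record.

```python
def handle_heartbeat(body, instance_registered, cleanup_calls):
    """Sketch of the dispatcher's heartbeat branch with the proposed guard.

    `instance_registered` and `cleanup_calls` are hypothetical stand-ins for
    an Instance.objects.me() check and for self.cleanup() respectively.
    """
    if isinstance(body, dict) and 'cluster_node_heartbeat' in body.get('task', ''):
        if not instance_registered:
            # Node was auto-deprovisioned: skip cleanup (and its reaper.reap()
            # call) until the instance re-registers with the cluster.
            return False
        cleanup_calls.append('cleanup')
        return True
    return False
```

This way a deprovisioned node's heartbeat no longer dies in reaper.reap(), which is what currently prevents it from ever re-registering.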

The second place where reap may be called when the instance is not in the cluster (in the case of automatic deprovisioning) is here:

so if you try to restart the dispatcher, it will not start.

@megakoresh

@byumov This is happening to me on a single-node setup as well:

2019-07-17 08:47:27,611 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 11, in <module>
    load_entry_point('awx==6.0.0.0', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 140, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 123, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 36, in reap
    me = instance or Instance.objects.me()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

Task container logs are full of this. And while I can navigate the UI, I cannot run any jobs.

@bjmingyang

@megakoresh The root cause is that a backup of a Tower instance does not exclude rabbitmq.py, so a restore on a different Ansible Tower instance restores the original rabbitmq.py, which breaks the RabbitMQ clustering.

Use these commands to fix it:

sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage create_preload_data"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"

systemctl restart awx-cbreceiver
systemctl restart awx-dispatcher
systemctl restart awx-channels-worker
systemctl restart awx-daphne
systemctl restart awx-web

@megakoresh

megakoresh commented Jul 26, 2019

@bjmingyang Root cause is hardcoded hostnames in configuration files, namely settings.py. Changing awx_task_hostname inventory variable changes service discovery hostname, while settings still refer to awx. This breaks the installation. And any solution that involves poking around in a running container is not a solution at all. This must be fixed properly.

@tamirshaul

I'm having the same problem on AWX 6.1.0.
Running on OpenShift, I need to restart the pods in my AWX cluster frequently because they can't rejoin the cluster themselves.

@ryanpetrello
Contributor

ryanpetrello commented Sep 27, 2019

At its core, this issue can be condensed down to a very simple reproduction:

  1. Install single-node AWX (any deployment method).
  2. Once everything is running, delete the Instance from the database (Instance.objects.first().delete()).
  3. Observe tracebacks like this one forever:
...
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

The practical scenario where you'll see this (as described in this issue) is in a k8s/OpenShift deployment with multiple pods (in this environment, settings.AWX_AUTO_DEPROVISION_INSTANCES = True). When a node goes missing (for any number of reasons) for (by default) 120s, the record for that node is removed from the main_instance table:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L444

At a later point in time, when its connectivity is restored, the dispatcher is still running, and so we see the RuntimeError: No instance found with the current cluster host id error. The appropriate change here would be to update the periodic cleanup/reaping process to detect a missing instance record and automatically re-perform auto-registration.
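A rough sketch of that proposed change (all names here are hypothetical toys, not the actual AWX reaper or models): when the periodic cleanup finds no instance record for this host, re-run auto-registration instead of raising forever.

```python
class ClusterRegistry:
    """Toy stand-in for the main_instance table plus provisioning logic."""

    def __init__(self):
        self.hostnames = set()

    def register(self, hostname):
        # Corresponds loosely to instance auto-registration on startup.
        self.hostnames.add(hostname)

    def deprovision(self, hostname):
        # Corresponds to the 120s grace-period auto-deprovision.
        self.hostnames.discard(hostname)

    def me(self, hostname):
        if hostname not in self.hostnames:
            raise RuntimeError("No instance found with the current cluster host id")
        return hostname


def periodic_cleanup(registry, hostname):
    """Heartbeat-time cleanup that self-heals a deprovisioned node."""
    try:
        return registry.me(hostname)
    except RuntimeError:
        # The instance row vanished while we were offline: re-register and
        # retry rather than looping on the same traceback on every heartbeat.
        registry.register(hostname)
        return registry.me(hostname)
```

The key design point is that recovery happens inside the periodic task itself, so no manual awx-manage invocation or service restart is needed once connectivity returns.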

@grahamneville

Thanks for fixing this @ryanpetrello, we've been running into this a lot on our Kubernetes cluster.
Any chance of baking and publishing a new AWX image once your PR gets merged, please?

@ryanpetrello
Contributor

👋 @grahamneville thank @byumov, he figured out what was up and contributed the fix.

We have a few features landing in AWX soon, and we intend to cut a new release at some point after that (which will include this fix).

@Spredzy Spredzy self-assigned this Oct 1, 2019
@Spredzy Spredzy closed this as completed Oct 1, 2019
ryanpetrello pushed a commit to ryanpetrello/awx that referenced this issue May 8, 2020
@AlanCoding
Member

The specific traceback given here should have been fixed with #11955
