
Error with rejoining node to cluster after lost connection to postgres #4294

Closed
byumov opened this issue Jul 9, 2019 · 9 comments

@byumov

byumov commented Jul 9, 2019

ISSUE TYPE
  • Bug Report
SUMMARY

A node can't rejoin the cluster after losing its network connection to the Postgres database.

ENVIRONMENT
  • AWX version: 3.0.1 (the bug is still present in 6.0.0)
STEPS TO REPRODUCE
  1. Set up an HA AWX cluster with 3+ nodes and an external Postgres database
  2. Shut down the network on one of the nodes and wait 120 seconds (grace_period)
  3. Look at the working nodes' logs; you will see that the node without network was removed from the cluster:
Jul  4 17:16:59 1.awx.node.dc2 dispatcher[207]: 2019-07-04 14:16:59,220 INFO     awx.main.tasks Host 1.awx.node.dc1 Automatically Deprovisioned.

And removed from the Postgres database:

awx=> SELECT id FROM main_instance;
 id
-----
 543
 550
(2 rows)
  4. Restore the network on the node and look at its logs:
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: 2019-07-04 14:20:39,433 ERROR    awx.main.dispatch failed to write inbound message
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: Traceback (most recent call last):
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 388, in write
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: self.cleanup()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/pool.py", line 373, in cleanup
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: reaper.reap(excluded_uuids=running_uuids)
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 35, in reap
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: me = instance or Instance.objects.me()
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 88, in me
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: raise RuntimeError("No instance found with the current cluster host id")
Jul  4 17:20:39 1.awx.node.dc1 dispatcher[505]: RuntimeError: No instance found with the current cluster host id
EXPECTED RESULTS

After restoring the network connection to the database, the node successfully rejoins the cluster.

ACTUAL RESULTS

The node never rejoins the cluster unless the instance is restarted.

ADDITIONAL INFORMATION

The node can't return to the cluster because it calls the cleanup function from pool.py on each heartbeat.
cleanup calls reaper.reap(), which fails because it can't get the instance id (AWX deleted the node from the database at reproduction step 3):

me = instance or Instance.objects.me()

I created a pull request with a probable fix: #4268
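A minimal sketch of the kind of guard such a fix implies (the class and manager names here are hypothetical stand-ins, not the actual AWX code): treat a missing instance row as "nothing to reap" instead of letting the RuntimeError escape the dispatcher's periodic cleanup.

```python
# Hypothetical stand-ins for AWX internals; only the control flow is the point.
class InstanceNotRegistered(RuntimeError):
    """Raised when this node has no row in main_instance (assumption)."""


class FakeInstanceManager:
    """Mimics Instance.objects: .me() fails if the node was deprovisioned."""

    def __init__(self, registered):
        self.registered = registered

    def me(self):
        if not self.registered:
            raise InstanceNotRegistered(
                "No instance found with the current cluster host id")
        return "this-node"


def reap(manager, instance=None, excluded_uuids=()):
    """Reap jobs for this instance, but tolerate a deprovisioned node."""
    try:
        me = instance or manager.me()
    except InstanceNotRegistered:
        # The node was auto-deprovisioned; skip reaping until it re-registers
        # instead of crashing the heartbeat with an unhandled RuntimeError.
        return None
    # ... the real reaping of waiting|running jobs for `me` would happen here ...
    return me
```

With a guard like this, reaping quietly becomes a no-op while the node is missing from the cluster and resumes once the row exists again.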

@awxbot awxbot added the type:bug label Jul 9, 2019
@YuriGrigorov

YuriGrigorov commented Jul 9, 2019

The problem with reap is that the function expects a registered (working) instance:

def reap(instance=None, status='failed', excluded_uuids=[]):
    '''
    Reap all jobs in waiting|running for this instance.
    '''
    me = instance or Instance.objects.me()

But there may be cases when the instance, for some reason, is automatically deprovisioned. If that happens, the instance can't be provisioned back.

I found two places where reap breaks automatic node provisioning.

reaper.reap(excluded_uuids=running_uuids)

We don't need to call cleanup if the instance is not in the cluster:

# when the cluster heartbeat occurs, clean up internally
if isinstance(body, dict) and 'cluster_node_heartbeat' in body['task']:
    self.cleanup()
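The guard being suggested here could look roughly like this (a standalone sketch with hypothetical names, not the real pool.py): only run the internal cleanup on a heartbeat when this node still has an instance record.

```python
def handle_heartbeat(body, instance_registered, cleanup_calls):
    """Sketch of the dispatcher's heartbeat branch with the proposed guard.

    `instance_registered` and `cleanup_calls` are hypothetical stand-ins for
    an Instance.objects.me() check and for self.cleanup() respectively.
    """
    if isinstance(body, dict) and 'cluster_node_heartbeat' in body.get('task', ''):
        if not instance_registered:
            # Node was auto-deprovisioned: skip cleanup (and its reaper.reap()
            # call) until the instance re-registers with the cluster.
            return False
        cleanup_calls.append('cleanup')
        return True
    return False
```

This way a deprovisioned node's heartbeat no longer dies in reaper.reap(), which is what currently prevents it from ever re-registering.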

The second place where reap may be called when the instance is not in the cluster (in the case of automatic deprovisioning) is here:

so if you try to restart the dispatcher, it will not start.

@megakoresh

@byumov This is happening to me on a single-node setup as well:

2019-07-17 08:47:27,611 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 11, in <module>
    load_entry_point('awx==6.0.0.0', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 140, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 123, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/dispatch/reaper.py", line 36, in reap
    me = instance or Instance.objects.me()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

Task container logs are full of this. And while I can navigate the UI, I cannot run any jobs.

@bjmingyang

@megakoresh The root cause is that a backup of a Tower instance does not exclude rabbitmq.py, so a restore on a different Ansible Tower instance restores the original rabbitmq.py, which breaks the RabbitMQ clustering.

Use these commands to fix it:

sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage create_preload_data"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage provision_instance --hostname=$(hostname)"
sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage register_queue --queuename=tower --hostnames=$(hostname)"

systemctl restart awx-cbreceiver
systemctl restart awx-dispatcher
systemctl restart awx-channels-worker
systemctl restart awx-daphne
systemctl restart awx-web

@megakoresh

megakoresh commented Jul 26, 2019

@bjmingyang Root cause is hardcoded hostnames in configuration files, namely settings.py. Changing awx_task_hostname inventory variable changes service discovery hostname, while settings still refer to awx. This breaks the installation. And any solution that involves poking around in a running container is not a solution at all. This must be fixed properly.

@tamirshaul

I'm having the same problem on AWX 6.1.0.
Running on OpenShift, I need to restart the pods in my AWX cluster frequently because they can't rejoin the cluster themselves.

@ryanpetrello
Contributor

ryanpetrello commented Sep 27, 2019

At its core, this issue can be condensed down to a very simple reproduction:

  1. Install single-node AWX (any deployment method).
  2. Once everything is running, delete the Instance from the database (Instance.objects.first().delete()).
  3. Observe tracebacks like this one forever:
...
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

The practical scenario where you'll see this (as described in this issue) is in a k8s/OpenShift deployment with multiple pods (in this environment, settings.AWX_AUTO_DEPROVISION_INSTANCES = True). When a node goes missing (for any number of reasons) for (by default) 120s, the record for that node is removed from the main_instance table:

https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L444

At a later point in time, when its connectivity is restored, the dispatcher is still running, and so we see the RuntimeError: No instance found with the current cluster host id error. The appropriate change here would be to update the periodic cleanup/reaping process to detect a missing instance record and automatically re-perform auto-registration.
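A rough sketch of that proposed change (all names here are hypothetical toys, not the actual AWX reaper or models): when the periodic cleanup finds no instance record for this host, re-run auto-registration instead of raising forever.

```python
class ClusterRegistry:
    """Toy stand-in for the main_instance table plus provisioning logic."""

    def __init__(self):
        self.hostnames = set()

    def register(self, hostname):
        # Corresponds loosely to instance auto-registration on startup.
        self.hostnames.add(hostname)

    def deprovision(self, hostname):
        # Corresponds to the 120s grace-period auto-deprovision.
        self.hostnames.discard(hostname)

    def me(self, hostname):
        if hostname not in self.hostnames:
            raise RuntimeError("No instance found with the current cluster host id")
        return hostname


def periodic_cleanup(registry, hostname):
    """Heartbeat-time cleanup that self-heals a deprovisioned node."""
    try:
        return registry.me(hostname)
    except RuntimeError:
        # The instance row vanished while we were offline: re-register and
        # retry rather than looping on the same traceback on every heartbeat.
        registry.register(hostname)
        return registry.me(hostname)
```

The key design point is that recovery happens inside the periodic task itself, so no manual awx-manage invocation or service restart is needed once connectivity returns.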

@grahamneville

Thanks for fixing this @ryanpetrello, we've been running into this a lot on our Kubernetes cluster.
Any chance of baking and publishing a new AWX image once your PR gets merged, please?

@ryanpetrello
Contributor

👋 @grahamneville thank @byumov, he figured out what was up and contributed the fix.

We have a few features landing in AWX soon, and we intend to cut a new release at some point after that (which will include this fix).

@Spredzy Spredzy self-assigned this Oct 1, 2019
@Spredzy Spredzy closed this as completed Oct 1, 2019
ryanpetrello pushed a commit to ryanpetrello/awx that referenced this issue May 8, 2020
@AlanCoding
Member

The specific traceback given here should have been fixed with #11955
