Error with rejoining node to cluster after lost connection to postgres #4294
Comments
The problem of awx/awx/main/dispatch/reaper.py Lines 32 to 36 in 2aa32f6
But there may be cases when the instance, for some reason, is automatically deprovisioned. If that happens then the instance can't be provisioned back. I found two places where Line 377 in 2aa32f6
We don't need to call Lines 390 to 392 in 2aa32f6
The second place where
so if you try to restart dispatcher it will not start. |
@byumov This is happening to me on a single-node setup as well:
Task container logs are full of this. And while I can navigate the UI, I cannot run any jobs. |
@megakoresh The root cause is that a backup of a Tower instance does not exclude rabbitmq.py, so a restore onto a different Ansible Tower instance brings back the original rabbitmq.py, which breaks the rabbitmq clustering. Use these commands to fix it: sudo -u awx scl enable rh-python36 rh-postgresql10 "awx-manage create_preload_data" followed by systemctl restart awx-cbreceiver |
@bjmingyang Root cause is hardcoded hostnames in configuration files, namely |
I'm having the same problem on awx 6.1.0. |
At its core, this issue can be condensed down to a very simple reproduction:
The practical scenario where you'll see this (as described in this issue) is in a k8s/OpenShift deployment with multiple pods (in this environment, https://github.com/ansible/awx/blob/devel/awx/main/tasks.py#L444 At a later point in time, when its connectivity is restored, the dispatcher is still running, and so we see the |
Thanks for fixing this @ryanpetrello, we've been running into this a lot on our kubernetes cluster. |
👋 @grahamneville thank @byumov, he figured out what was up and contributed the fix. We have a few features landing in AWX soon, and we intend to cut a new release at some point after that (which will include this fix). |
The specific traceback given here should have been fixed with #11955 |
ISSUE TYPE
SUMMARY
A node can't rejoin the cluster after losing its network connection to the postgres database.
ENVIRONMENT
But the bug is still present in 6.0.0
STEPS TO REPRODUCE
grace_period
)And from postgres database:
EXPECTED RESULTS
After the network connection to the database is restored, the node successfully rejoins the cluster.
ACTUAL RESULTS
The node never rejoins the cluster unless the instance is restarted.
ADDITIONAL INFORMATION
The node can't return to the cluster because it calls the function cleanup from pool.py on each heartbeat. cleanup calls reaper.reap(), which fails because it can't get the instance id (AWX deletes the node from the database at reproduction step 3).
I created a pull request with a probable fix: #4268
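To make the failure mode described above concrete, here is a minimal, self-contained sketch. It is not AWX code and not the change from #4268: `FakeInstanceTable`, `InstanceNotRegistered`, `cleanup_unguarded` and `cleanup_guarded` are hypothetical names invented for illustration, and the in-memory set stands in for the instance table in postgres. It only shows why a cleanup step that looks up the node's own record on every heartbeat keeps failing once that record has been deleted, and how a guard would let the heartbeat continue.

```python
class InstanceNotRegistered(Exception):
    """Raised when this node has no record in the (toy) instance table."""


class FakeInstanceTable:
    """Hypothetical in-memory stand-in for the table of cluster instances."""

    def __init__(self):
        self.hostnames = set()

    def register(self, hostname):
        self.hostnames.add(hostname)

    def deprovision(self, hostname):
        self.hostnames.discard(hostname)

    def me(self, hostname):
        # Look up this node's own record; fail if it was deleted.
        if hostname not in self.hostnames:
            raise InstanceNotRegistered(hostname)
        return hostname


def reap(table, hostname):
    """Toy reaper: needs the node's own record before it can do anything."""
    me = table.me(hostname)
    return f"reaped orphaned jobs for {me}"


def cleanup_unguarded(table, hostname):
    """Heartbeat cleanup that lets the lookup error propagate."""
    return reap(table, hostname)


def cleanup_guarded(table, hostname):
    """Heartbeat cleanup that skips reaping while the node is unregistered."""
    try:
        return reap(table, hostname)
    except InstanceNotRegistered:
        # Nothing to reap for a node the cluster does not know about;
        # returning lets the heartbeat loop carry on (and re-register later).
        return None
```

In this sketch the guarded variant simply skips the reaping step; whether the real fix skips, re-registers the node, or does something else is a design decision for the actual pull request.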