1 replica suddenly failed with "awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id" #12471

Closed
mick1627 opened this issue Jul 6, 2022 · 5 comments
mick1627 commented Jul 6, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

AWX is deployed using the operator.
The replica count is set to 2 in the AWX CRD.
Everything works fine for 1 or 2 weeks, then suddenly the awx-web container is restarted with the error: awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id, retry in 5s...

Here are the logs of the awx-web container during a restart

AWX version

21.1.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Deploy AWX with several replicas.
After 1 or 2 weeks, one replica has the issue.

Expected results

AWX works with several replicas

Actual results

1 replica suddenly failed with "awx.main.wsbroadcast Unable to return currently active instance: No instance found with the current cluster host id"

Additional information

Custom docker image for awx-ee

@fosterseth
Member

This was possibly fixed by ansible/awx-operator#935

The awx-web container should not have been running the wsbroadcast services to begin with; only the awx-task container should. The error can mostly be ignored.

Which operator version are you using? The latest one may fix the problem.

@shanemcd
Member

shanemcd commented Jul 6, 2022

We made a change here that I think was problematic https://github.com/ansible/awx/pull/11955/files#diff-8e61129abecf62afc8b9c155037a95674c6459c25bb298253b9e0d401e37b086R38-R42

At the time, @AlanCoding and I both agreed this was the wrong place to handle this, but it seems like we reintroduced the bug we were trying to fix in #4294.

After some discussion, we think we'd rather re-register ourselves somewhere near the top of cluster_node_heartbeat instead of reverting the change linked above.
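
The re-registration idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not AWX's actual code: `InstanceRegistry`, `InstanceRecord`, and `register` are hypothetical stand-ins for the cluster's instance table, and this `cluster_node_heartbeat` only shares its name with the real periodic task. The point it shows is that the heartbeat itself checks for a missing record and recreates it before doing anything else, so a node whose record was removed recovers on its next heartbeat.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InstanceRecord:
    """Hypothetical record for one cluster node."""
    hostname: str
    last_seen: float


@dataclass
class InstanceRegistry:
    """Hypothetical in-memory stand-in for the table of cluster instances."""
    records: Dict[str, InstanceRecord] = field(default_factory=dict)

    def get(self, hostname: str) -> Optional[InstanceRecord]:
        return self.records.get(hostname)

    def register(self, hostname: str, now: float) -> InstanceRecord:
        record = InstanceRecord(hostname=hostname, last_seen=now)
        self.records[hostname] = record
        return record


def cluster_node_heartbeat(registry: InstanceRegistry, hostname: str,
                           now: Optional[float] = None) -> bool:
    """Refresh this node's heartbeat, re-registering it first if it is missing.

    Returns True when the node had to be (re-)registered.
    """
    now = time.time() if now is None else now

    # Near the top: make sure our own record exists before anything else runs.
    record = registry.get(hostname)
    reregistered = record is None
    if record is None:
        record = registry.register(hostname, now)

    # Normal heartbeat work: update the last-seen timestamp.
    record.last_seen = now
    return reregistered
```

In this sketch, deleting a node's record and calling the heartbeat again returns `True` and restores the record, which is the recovery behaviour described above.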

@mick1627
Author

mick1627 commented Jul 7, 2022

> this possibly was fixed by ansible/awx-operator#935
>
> awx-web container should not have been running the wsbroadcast services to begin with, on the awx-task container. The error can mostly be ignored.
>
> which operator version are you using? the latest one may fix the problem

I am using operator 0.22.0, which contains ansible/awx-operator#935.

@AlanCoding
Member

@CFSNM this is covered in the Kyndryl hotfix stuff. I changed the version in the project to 2.2.1 to reflect that.

@CFSNM
Contributor

CFSNM commented Aug 23, 2022

Validated from QE side.

CFSNM closed this as completed Aug 23, 2022