Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness/readiness probes to web/task - fixes #414 #1188

Closed
wants to merge 10 commits into from

Conversation

erz4
Copy link
Contributor

@erz4 erz4 commented Jan 16, 2023

SUMMARY

Added liveness & readiness probes to the awx-web container.

fixes #414

ISSUE TYPE
  • New or Enhanced Feature
ADDITIONAL INFORMATION

@erz4 erz4 changed the title Add liveness/readiness probes to web - fixes #414 Add liveness/readiness probes to web/task - fixes #414 Jan 17, 2023
@erz4 erz4 marked this pull request as ready for review January 18, 2023 09:38
@erz4
Copy link
Contributor Author

erz4 commented Jan 18, 2023

Hi @shanemcd how can we advance that PR?

Copy link

@sergeipri sergeipri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tanganellilore
Copy link
Contributor

We can consider aslo a liveness for awx-task container?

A command like this awx-manage run_dispatcher --running | grep '\[\]' return 0 if awx-task work properly on propagation of message and return 1 if there is some issue on comunication beetween task and postgres (for example when there is some connection interruption).

We still discuss it on matrix with @TheRealHaoLiu to also find some solutions to roolback connection with postgre.

@erz4
Copy link
Contributor Author

erz4 commented Feb 1, 2023

@tanganellilore

We can consider aslo a liveness for awx-task container?

A command like this awx-manage run_dispatcher --running | grep '\[\]' return 0 if awx-task work properly on propagation of message and return 1 if there is some issue on comunication beetween task and postgres (for example when there is some connection interruption).

We still discuss it on matrix with @TheRealHaoLiu to also find some solutions to roolback connection with postgre.

added for task, what do you think about the defaults?

@tanganellilore
Copy link
Contributor

I'm not sure about period, because command require some seconds (like 2 or 3) so i think that for the task we can use something like 10/15 seconds.
Let me say, when task container not work, everythings behind UI, will not work, and you can see all tasks in pending (or failing).
With 10 seconds and 3 consecutive failure means that after 35/40 seconds container will be restared in case of disconnection with db, so for me should be fine.
In any case, users can customize these option on operator side.

- /bin/bash
- -c
- |
awx-manage run_dispatcher --running | grep '\[\]'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will result exit code 1 in case of running jobs, as shown below

bash-5.1# awx-manage run_dispatcher --running 
2023-02-01 11:40:24,449 WARNING  [-] awx.main.dispatch checking dispatcher running for awx-8495cbdf8d-ph8gp
['02b0dcb3-3d12-4f37-9ce8-28ffb61bee6e', '9abcb670-4140-4a4f-b7df-5c68017ecbd3', 'aa18e470-14f5-4735-9a62-4d7383b3946e']
bash-5.1# awx-manage run_dispatcher --running | grep '\[\]'
2023-02-01 11:40:29,592 WARNING  [-] awx.main.dispatch checking dispatcher running for awx-8495cbdf8d-ph8gp
bash-5.1# echo $?
1

So the probe will fail.

Maybe we should rely on status command awx-manage run_dispatcher --status

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested "awx-manage run_dispatcher --status" and it seems to work.

returns 1 status when connection to database is refused

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me is not correct. I tried now to simulate a failover.
This is from awx-task log
image

this with command suggested, before/after failover:
image

This before failover with running command:
image

This after failover with running command:
image

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanganellilore can you explain how you stimulated a failover?

i did it by redirecting all traffic to database on pod to localhost using iptables.

Copy link
Contributor

@tanganellilore tanganellilore Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a Ha external postgres with vip address on leader. Simply stop the leader and waiting the election of replica (5/10seconds).
This jeans that connection is really dropped and it's required a new connection from awx-task.
Actually I'm trying to set a reconnection logic on awx, but is not very simple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So every time the election process will accure the awx-task needs to reload?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway, it seems we are not really sure which command to use to set in the liveness probe command.

maybe we need to mix them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can combine both command, like or condition. If one of the is 1, we can count it as broke container

@gundalow
Copy link

To avoid Molecule destroying the environment run:
molecule test --destroy=never

@TheRealHaoLiu
Copy link
Member

@erz4 from the community meeting

  • we would like to see the readiness and liveness probe parameter should be nested under a top level parameter and be hidden
  • give ability to disable readiness and liveness probe

@TheRealHaoLiu
Copy link
Member

i will help troubleshoot the CI failure

@erz4
Copy link
Contributor Author

erz4 commented Feb 16, 2023

@erz4 from the community meeting

  • we would like to see the readiness and liveness probe parameter should be nested under a top level parameter and be hidden
  • give ability to disable readiness and liveness probe

@TheRealHaoLiu
so every probe should have to parameter in the crd

  1. enable/disable - enable by default
  2. parameters for the probe - with default as we already set

@rooftopcellist
Copy link
Member

rooftopcellist commented Mar 1, 2023

@erz4 Re: nesting the variables, currently it shows like this:

    task_liveness_failure_threshold: 3
    task_liveness_initial_delay: 3
    task_liveness_period: 3
    task_liveness_success_threshold: 1
    task_liveness_timeout: 10
    task_privileged: false
    task_readiness_failure_threshold: 3
    task_readiness_initial_delay: 3
    task_readiness_period: 3
    task_readiness_success_threshold: 1
    task_readiness_timeout: 10
    web_liveness_failure_threshold: 3
    web_liveness_initial_delay: 3
    web_liveness_period: 3
    web_liveness_success_threshold: 1
    web_liveness_timeout: 10
    web_readiness_failure_threshold: 3
    web_readiness_initial_delay: 3
    web_readiness_period: 3
    web_readiness_success_threshold: 1
    web_readiness_timeout: 5

We are hoping to nest these variables to declutter the AWX CR a bit.

    task.liveness.failure_threshold: 3
    task.liveness.initial_delay: 3
    task.liveness.period: 3
    task.liveness.success_threshold: 1
    task.liveness.timeout: 10
    task.readiness.failure_threshold: 3
    task.readiness.initial_delay: 3
    task.readiness.period: 3
    task.readiness.success_threshold: 1
    task.readiness.timeout: 10
    web.liveness.failure_threshold: 3
    web.liveness.initial_delay: 3
    web.liveness.period: 3
    web.liveness.success_threshold: 1
    web.liveness.timeout: 10
    web.readiness.failure_threshold: 3
    web.readiness.initial_delay: 3
    web.readiness.period: 3
    web.readiness.success_threshold: 1
    web.readiness.timeout: 5

When testing this out, it fails on the "Apply deployment resources" task, presumably because the probe timed out. The timeout may be too low. The timeout is 10 seconds and the database migrations take much longer than that to run. Probably 60-70 seconds if I had to guess.

@TheRealHaoLiu
Copy link
Member

Hi @erz4 we are prioritizing to get this in next.

due to the recent change to the deployment of awx (web-task-split) the PR need some heavy rebasing and update

would u be able to get to this?

@rooftopcellist
Copy link
Member

There is an open PR actively being worked on here to implement this:

@rooftopcellist
Copy link
Member

This feature has been merged as part of #1674

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add livenessProbe, startupProbe do AWX deployment
8 participants