
Job terminated in error after 4 hours #14457

Closed
4 of 11 tasks
Mrmel94 opened this issue Sep 18, 2023 · 15 comments
Comments

@Mrmel94

Mrmel94 commented Sep 18, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Hi, I have a problem already known to this community: a job that stops with an error after 4 hours of execution.
The main issue is tracked here: #11594

The bug was fixed by this PR: ansible/receptor#683. I read the prerequisites, but it still doesn't work and the problem is still there.

I made my installation with https://github.com/kurokobo/awx-on-k3s and I'm using a custom awx-ee.

My test job is just a sleep for 18000 sec (5 h), and I set the K8S Ansible Runner Keep-Alive Message Interval in the AWX settings to 1800 sec (30 min).
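For reference, a minimal test playbook along those lines could look like this (a sketch; the actual playbook used is not shown in the issue):

```yaml
# Hypothetical reproduction playbook: runs silently for 5 hours,
# which is exactly the kind of idle job that hits the 4-hour timeout.
- name: Long-running sleep test
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Sleep for 18000 seconds (5 hours)
      ansible.builtin.command: sleep 18000
```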

Versions of my components:
K3S server v1.25.12+k3s1

And this is my custom-ee :

---
version: 3

images:
  base_image:
    name: quay.io/rockylinux/rockylinux:9

dependencies:
  ansible_core:
    package_pip: ansible-core==2.15.3
  ansible_runner:
    package_pip: ansible-runner==2.3.4
  galaxy: requirements.yml
  python: requirements.txt
  system: bindep.txt

additional_build_files:
    - src: ansible.cfg
      dest: configs

additional_build_steps:
  append_base:
    - RUN $PYCMD -m pip install -U pip
  append_final:
    - COPY --from=quay.io/ansible/receptor:v1.4.1 /usr/bin/receptor /usr/bin/receptor
    - RUN mkdir -p /var/run/receptor
    - RUN git lfs install --system
  prepend_final:
    - COPY _build/configs/ansible.cfg /etc/ansible/ansible.cfg

options:
    user: root

I also tried k3s v1.25.4 and added the environment variable RECEPTOR_KUBE_SUPPORT_RECONNECT in the file awx-on-k3s/base/awx.yaml:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
{...}
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled

and ran kubectl apply -k base in the awx-on-k3s/ folder, but the bug is still there.
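One possible way to confirm the spec change actually landed (a sketch; it assumes the namespace and the AWX resource are both named awx, as in the awx-on-k3s defaults, and requires a live cluster to run):

```shell
# Check that ee_extra_env made it into the AWX custom resource
kubectl -n awx get awx awx -o jsonpath='{.spec.ee_extra_env}'

# Check the rendered env of the awx-ee container in the awx-task deployment
kubectl -n awx get deployment awx-task -o json \
  | jq '.spec.template.spec.containers[] | select(.name == "awx-ee") | .env'
```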

What am I doing wrong? Have I forgotten something?

AWX version

23.0.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

yes

Ansible version

ansible-core==2.15.3

Operating system

quay.io/rockylinux/rockylinux:9

Web browser

No response

Steps to reproduce

Launch job test with 5 hours sleep

Expected results

Job finishes successfully after the 5-hour sleep

Actual results

Job terminated in error after 4 hours

Additional information

No response

@fosterseth
Member

Just to rule it out, does your job complete if you run a sleep command on the default awx-ee image (quay.io/ansible/awx-ee:latest)?

@Mrmel94
Author

Mrmel94 commented Sep 21, 2023

I just ran the test, and I have the same issue with awx-ee:latest. So I think the variable RECEPTOR_KUBE_SUPPORT_RECONNECT is not enabled by default (I don't know why, since my k3s cluster version (1.25.12) is >= v1.25.4).

But even when I tried to enable this variable, I still have the issue (maybe I didn't put it in the right place?). I don't know how to check the value of RECEPTOR_KUBE_SUPPORT_RECONNECT in my deployment.

@fosterseth
Member

We don't enable it by default, so you have to set it manually.

To check whether the variable is set correctly: RECEPTOR_KUBE_SUPPORT_RECONNECT is passed into the awx-ee container as an environment variable, so you can kubectl exec into the awx-ee container and run "printenv"; you should see this variable set to enabled.
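As a concrete sketch of that check (the kubectl invocation assumes the awx namespace used by awx-on-k3s; the runnable part below demonstrates the same grep filter against sample printenv output, since kubectl exec needs a live cluster):

```shell
# On a live cluster (namespace assumed to be "awx"):
#   kubectl -n awx exec deployment/awx-task -c awx-ee -- printenv \
#     | grep '^RECEPTOR_KUBE_SUPPORT_RECONNECT='
#
# The same filter, demonstrated against sample printenv output:
printenv_sample='HOME=/home/runner
RECEPTOR_KUBE_SUPPORT_RECONNECT=enabled'
printf '%s\n' "$printenv_sample" | grep '^RECEPTOR_KUBE_SUPPORT_RECONNECT='
# → RECEPTOR_KUBE_SUPPORT_RECONNECT=enabled
```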

@fosterseth
Member

Also, there are issues when you have a silent playbook that doesn't emit new messages for long periods of time, e.g. sleep commands. Things tend to time out when idle.

There is a setting in the UI Job Settings called K8S Ansible Runner Keep-Alive Message Interval.

Try setting that to 30.

@Mrmel94
Author

Mrmel94 commented Sep 27, 2023

Thanks for the answer. I just checked, and indeed, when I run printenv in my custom awx-ee, the variable RECEPTOR_KUBE_SUPPORT_RECONNECT is not set (I can't find it).

Even though I set it in the awx-on-k3s/base/awx.yaml file like this:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
{...}
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled

Is this not the way I should do it?
As for the parameter K8S Ansible Runner Keep-Alive Message Interval, it was set to 1800; I just set it to 30 as you said.

@kurokobo
Contributor

@Mrmel94
Hi, thanks for using my guide.
Since RECEPTOR_KUBE_SUPPORT_RECONNECT is not enabled by default as @fosterseth mentioned, you have to explicitly specify that value using ee_extra_env.

Could you provide the output from the following commands?

# Find which image is used for awx-ee container in awx-task pod;
$ kubectl -n awx get deployment/awx-task -o json | jq '.spec.template.spec.containers[] | select(.name == "awx-ee") | .image'
"quay.io/ansible/awx-ee:latest"

# Find created date for awx-ee image
$ sudo $(which k3s) crictl inspecti <YOUR IMAGE FROM ABOVE COMMAND> | jq '.info.imageSpec.created'
"2023-09-14T00:18:41.724658319Z"

# Find RECEPTOR_KUBE_SUPPORT_RECONNECT env vars in awx-ee container in awx-task pod:
# This is specified by `ee_extra_env` in `awx.yaml`
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT
RECEPTOR_KUBE_SUPPORT_RECONNECT=enabled

# Find Receptor version in awx-ee container in awx-task pod:
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
1.4.1+g3a84c22

# Find Ansible Runner version in awx-ee container in awx-task pod:
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version
2.3.4

Then launch your job that contains the long sleep, and dig into the automation job pod.

# Find the pod that has the name `automation-job-*`.
# This is Execution Environment that your playbook launched on.
$ kubectl -n awx get pod
NAME                                               READY   STATUS              RESTARTS   AGE
awx-operator-controller-manager-566b76fc7f-pt56f   2/2     Running             0          7d10h
awx-postgres-13-0                                  1/1     Running             0          7d10h
awx-web-6975d884c5-dhkgw                           3/3     Running             0          7d10h
awx-task-7d5964cf8d-b2rrg                          4/4     Running             0          16m
automation-job-5-kf7h9                             0/1     ContainerCreating   0          2s

# Find which image is used for your EE
$ kubectl -n awx get pod <YOUR POD NAME> -o json | jq '.spec.containers[].image'
"quay.io/ansible/awx-ee:latest"

# Find created date for the image for your EE
$ sudo $(which k3s) crictl inspecti <YOUR IMAGE FROM ABOVE COMMAND> | jq '.info.imageSpec.created'
"2023-09-14T00:18:41.724658319Z"

# Find ANSIBLE_RUNNER_KEEPALIVE_SECONDS in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECONDS
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=1800

# Find Receptor version in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- receptor --version
1.4.1+g3a84c22

# Find Ansible Runner version in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- ansible-runner --version
2.3.4

@Mrmel94
Author

Mrmel94 commented Sep 28, 2023

Hi, thanks for your answer. Here is my output:

# kubectl -n awx get deployment/awx-task -o json | jq '.spec.template.spec.containers[] | select(.name == "awx-ee") | .image'
"quay.io/ansible/awx-ee:latest"

# $(which k3s) crictl inspecti quay.io/ansible/awx-ee:latest | jq '.info.imageSpec.created'
"2023-09-27T12:11:12.082979748Z"

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT
RECEPTOR_KUBE_SUPPORT_RECONNECT=enabled

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
1.2.0+g8b12890

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version
2.2.1

So if I understand correctly, the variable RECEPTOR_KUBE_SUPPORT_RECONNECT is enabled in the awx-ee container in the awx-task pod.
This is the output from launching my sleep job with my custom awx-ee:

# kubectl -n awx get pod automation-job-847803-czqd5 -o json | jq '.spec.containers[].image'
"registry-gitlab.xxxx.io/exploitation/ansible/awx-runner:stable"

# $(which k3s) crictl inspecti registry-gitlab.xxxx.io/exploitation/ansible/awx-runner:stable | jq '.info.imageSpec.created'
"2023-09-13T12:34:11.699195454Z"

# kubectl -n awx exec -it automation-job-847803-czqd5 -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECONDS
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30

# kubectl -n awx exec -it automation-job-847803-czqd5 -- receptor --version
1.4.1

# kubectl -n awx exec -it automation-job-847803-czqd5 -- ansible-runner --version
2.3.4

# kubectl -n awx exec -it automation-job-847803-czqd5 -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT

The variable RECEPTOR_KUBE_SUPPORT_RECONNECT looks like it is not enabled..

@Mrmel94
Author

Mrmel94 commented Sep 28, 2023

I also ran an interesting test: launching my sleep job with the public awx-ee (quay.io/ansible/awx-ee:latest). This is the output:

# kubectl -n awx get pod automation-job-847805-svbd5 -o json | jq '.spec.containers[].image'
"quay.io/ansible/awx-ee:latest"

# $(which k3s) crictl inspecti quay.io/ansible/awx-ee:latest | jq '.info.imageSpec.created'
"2023-09-28T00:19:12.975824118Z"

# kubectl -n awx exec -it automation-job-847805-svbd5 -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECONDS
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30

# kubectl -n awx exec -it automation-job-847805-svbd5 -- receptor --version
1.4.1+g5f5094f

# kubectl -n awx exec -it automation-job-847805-svbd5 -- ansible-runner --version
2.3.4

# kubectl -n awx exec -it automation-job-847805-svbd5 -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT

So we can see that no matter whether I launch my sleep job with the public awx-ee or my custom awx-ee, the variable RECEPTOR_KUBE_SUPPORT_RECONNECT is not enabled..

@kurokobo
Contributor

@Mrmel94
Thanks for the information.

# kubectl -n awx get deployment/awx-task -o json | jq '.spec.template.spec.containers[] | select(.name == "awx-ee") | .image'
"quay.io/ansible/awx-ee:latest"

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
1.2.0+g8b12890

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version
2.2.1

The EE image for your control plane is quite old. This old Receptor can't handle RECEPTOR_KUBE_SUPPORT_RECONNECT.

So, if your K3s host can pull the image from the internet, try deleting the existing (cached) old image and force-pulling the truly latest one.

# Delete cached image
$ sudo $(which k3s) crictl rmi quay.io/ansible/awx-ee:latest

# Shut task pod down and start it again to pull latest image.
# First, find `awx-task` pod, 
$ kubectl -n awx get pod
NAME                                               READY   STATUS    RESTARTS       AGE
...
awx-task-5bcdb6cb46-rxdtv                          4/4     Running   0              23h
...

# Then delete it.
$ kubectl -n awx delete pod awx-task-5bcdb6cb46-rxdtv
pod "awx-task-5bcdb6cb46-rxdtv" deleted

# Wait for the new pod to be created and reach the Running state, and for the Operator to finish reconciling.
$ kubectl -n awx get pod
NAME                                               READY   STATUS    RESTARTS       AGE
...
awx-task-5bcdb6cb46-z9r45                          4/4     Running   0              103s
...

$ kubectl -n awx logs -f deployments/awx-operator-controller-manager
...
PLAY RECAP *********************************************************************
localhost                  : ok=84   changed=0    unreachable=0    failed=0    skipped=78   rescued=0    ignored=1 

After this, check the Receptor and Ansible Runner versions in your control plane again. If they are updated, try your job again.

$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version

@Mrmel94
Author

Mrmel94 commented Sep 28, 2023

OK, I just updated the image, and it looks better now:

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
1.4.1+g5f5094f

# kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version
2.3.4

But my job pod still doesn't have the variable :

# With the image AWX-EE (latest) (quay.io/ansible/awx-ee:latest)
# kubectl -n awx exec -it automation-job-847814-rk5nq -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT


# With my custom AWX-EE (registry-gitlab.xxxx.io/exploitation/ansible/awx-runner:stable)
# kubectl -n awx exec -it automation-job-847813-d7zmj -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT


@kurokobo
Contributor

It's okay; RECEPTOR_KUBE_SUPPORT_RECONNECT is for the Receptor in the control plane only. It's not passed to the automation job pod.

Time to test long sleep :)

@Mrmel94
Author

Mrmel94 commented Sep 28, 2023

Oh, OK, I understand now! I will test it and let you know!

@Mrmel94
Author

Mrmel94 commented Sep 28, 2023

It's working!!! Thank you very much @kurokobo & @fosterseth! One last thing: how can I make sure the EE of my control plane stays up to date?

Because my awx-task pod was deleted and recreated 8 days ago (during my AWX UI upgrade, I think), before I manually deleted the awx-ee:latest image with the command you gave me above @kurokobo, it was still not up to date.

In the AWX UI, in the Execution Environment tab, I just set "Always" for pulling the image (before, nothing was selected).


Is this the right way to make my control plane update its image when the awx-task pod restarts ?

Thanks again

@kurokobo
Contributor

@Mrmel94
That setting is not the imagePullPolicy for awx-ee in awx-task; it applies to whoever specifies Control Plane Execution Environment as the EE for a standard inventory sync or job template. So just changing that configuration does not help you.

You have a few choices to achieve your goal.

  1. My recommendation is to add image_pull_policy: Always to your awx.yaml and redeploy your AWX. This changes the imagePullPolicy for awx-ee in the awx-task pod to Always. See: https://ansible.readthedocs.io/projects/awx-operator/en/latest/user-guide/advanced-configuration/deploying-a-specific-version-of-awx.html
  2. If you don't want choice 1: before restarting awx-task, set Always for your EE in AWX and launch a job that uses that EE. This forces your k3s host to pull the latest image, and the new image will be used for the newly created awx-task pod. You can revert the pull policy for the EE after the restart.
  3. In your awx.yaml, specify control_plane_ee_image: quay.io/ansible/awx-ee:x.y.z instead of latest and redeploy your AWX. This forces a specific version of awx-ee. It makes it easy to see which version is used, but note that when you want to upgrade your AWX, you must specify the new tag and redeploy.
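For option 1, the spec change would look roughly like this (a sketch based on the awx-operator docs linked above; the rest of the spec is elided, and it keeps the ee_extra_env setting discussed earlier in the thread):

```yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # Force a fresh pull of the control-plane EE image on every pod start
  image_pull_policy: Always
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

After editing, redeploy with `kubectl apply -k base` as before.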

@Mrmel94
Author

Mrmel94 commented Oct 2, 2023

Thanks a lot for your help and your time, @kurokobo. I'll go with the first option.
