Job terminated in error after 4 hours #14457
Comments
Just to rule it out: does your job complete if you run a sleep command on the default awx-ee image (quay.io/ansible/awx-ee:latest)? |
I just ran the test and I have the same issue with awx-ee:latest. So I think the variable RECEPTOR_KUBE_SUPPORT_RECONNECT is not enabled by default (I don't know why, because my k3s cluster version (1.25.12) is >= v1.25.4). But even when I tried to enable this variable, I still had the issue (maybe I didn't put it in the right place?). I don't know how to check the value of RECEPTOR_KUBE_SUPPORT_RECONNECT in my deployment. |
We don't enable it by default, so you have to set it manually. To check whether the variable is set correctly, kubectl exec into the awx-ee container and run "printenv"; the output should include this variable and its value. |
Also, there are issues when you have a silent playbook that doesn't emit new messages for long periods of time, e.g. sleep commands. Things tend to time out when idle. There is a setting for this in the UI under Job Settings; try setting that to 30. |
Thanks for the answer. I just checked and indeed, when I ran printenv in my custom awx-ee, the variable is not set, even though I set it in the awx-on-k3s/base/awx.yaml file like this:
Is this not the right way to do it? |
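For reference, a minimal sketch of what such a setting could look like. A later reply in this thread says the variable is specified by `ee_extra_env` in `awx.yaml`; the exact layout below (a YAML list passed as a block string under `spec`) is an assumption based on the AWX Operator's extra-env convention, and `print_ee_extra_env` is a hypothetical helper name, not part of awx-on-k3s.

```shell
# Print a hypothetical `spec` fragment for the AWX resource in base/awx.yaml.
# Assumption: ee_extra_env takes a YAML list written as a block string.
print_ee_extra_env() {
  cat <<'EOF'
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
EOF
}
```

Calling `print_ee_extra_env` only prints the fragment; it would still have to be merged by hand into the existing `spec` of `base/awx.yaml` and applied with `kubectl apply -k base`.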
@Mrmel94 Could you provide the output from the following commands?
# Find which image is used for awx-ee container in awx-task pod
$ kubectl -n awx get deployment/awx-task -o json | jq '.spec.template.spec.containers[] | select(.name == "awx-ee") | .image'
"quay.io/ansible/awx-ee:latest"
# Find created date for awx-ee image
$ sudo $(which k3s) crictl inspecti <YOUR IMAGE FROM ABOVE COMMAND> | jq '.info.imageSpec.created'
"2023-09-14T00:18:41.724658319Z"
# Find RECEPTOR_KUBE_SUPPORT_RECONNECT env vars in awx-ee container in awx-task pod:
# This is specified by `ee_extra_env` in `awx.yaml`
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- env | grep RECEPTOR_KUBE_SUPPORT_RECONNECT
RECEPTOR_KUBE_SUPPORT_RECONNECT=enabled
# Find Receptor version in awx-ee container in awx-task pod:
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
1.4.1+g3a84c22
# Find Ansible Runner version in awx-ee container in awx-task pod:
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version
2.3.4
Then launch your job that contains a long sleep, and dig into the automation job pod.
# Find the pod that has the name `automation-job-*`.
# This is Execution Environment that your playbook launched on.
$ kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
awx-operator-controller-manager-566b76fc7f-pt56f 2/2 Running 0 7d10h
awx-postgres-13-0 1/1 Running 0 7d10h
awx-web-6975d884c5-dhkgw 3/3 Running 0 7d10h
awx-task-7d5964cf8d-b2rrg 4/4 Running 0 16m
automation-job-5-kf7h9 0/1 ContainerCreating 0 2s
# Find which image is used for your EE
$ kubectl -n awx get pod <YOUR POD NAME> -o json | jq '.spec.containers[].image'
"quay.io/ansible/awx-ee:latest"
# Find created date for the image for your EE
$ sudo $(which k3s) crictl inspecti <YOUR IMAGE FROM ABOVE COMMAND> | jq '.info.imageSpec.created'
"2023-09-14T00:18:41.724658319Z"
# Find ANSIBLE_RUNNER_KEEPALIVE_SECONDS in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECONDS
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=1800
# Find Receptor version in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- receptor --version
1.4.1+g3a84c22
# Find Ansible Runner version in your EE
$ kubectl -n awx exec -it <YOUR POD NAME> -- ansible-runner --version
2.3.4 |
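The `env | grep` checks above can be wrapped in a small helper that distinguishes "variable absent" from "variable present" through its exit status. This is only a sketch: `env_value` is a hypothetical name, and it assumes `awk` is available on the machine where you run `kubectl`.

```shell
# Read `env`-style KEY=VALUE lines on stdin and print the value of one variable.
# Returns non-zero when the variable is absent.
env_value() {
  # $1: name of the variable to look up
  awk -F= -v key="$1" '
    $1 == key {
      sub(/^[^=]*=/, "")   # drop "KEY=" and keep the value (which may contain "=")
      sub(/\r$/, "")       # drop the carriage return added when exec runs with -t
      print
      found = 1
      exit
    }
    END { exit found ? 0 : 1 }
  '
}
```

For example, `kubectl -n awx exec deployment/awx-task -c awx-ee -- env | env_value RECEPTOR_KUBE_SUPPORT_RECONNECT` would print the value for the control plane container, and the same pipe against an automation job pod would show whether `ANSIBLE_RUNNER_KEEPALIVE_SECONDS` is set there.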
Hi, thanks for your answer. Here is my output:
So if I understand, the variable
The variable |
I also did an interesting test: I launched my sleep job with the public awx-ee (AWX-EE-latest quay.io/ansible/awx-ee:latest) and this is the output:
So we can see that no matter whether I launch my sleep job with the public awx-ee or my custom awx-ee, the variable |
@Mrmel94
The EE image for your control plane is quite old. This old Receptor can't handle it.
So, if your K3s host can pull the image from the internet, try deleting the existing (cached) old image and forcing a pull of the truly latest one.
# Delete cached image
$ sudo $(which k3s) crictl rmi quay.io/ansible/awx-ee:latest
# Shut task pod down and start it again to pull latest image.
# First, find `awx-task` pod,
$ kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
...
awx-task-5bcdb6cb46-rxdtv 4/4 Running 0 23h
...
# Then delete it.
$ kubectl -n awx delete pod awx-task-5bcdb6cb46-rxdtv
pod "awx-task-5bcdb6cb46-rxdtv" deleted
# Just wait for the new pod to be created and be in Running state, and the Operator finishes reconcile.
$ kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
...
awx-task-5bcdb6cb46-z9r45 4/4 Running 0 103s
...
$ kubectl -n awx logs -f deployments/awx-operator-controller-manager
...
PLAY RECAP *********************************************************************
localhost : ok=84 changed=0 unreachable=0 failed=0 skipped=78 rescued=0 ignored=1
After this, check the Receptor and Ansible Runner versions in your control plane again. If they are updated, try your job again.
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- receptor --version
$ kubectl -n awx exec -it deployment/awx-task -c awx-ee -- ansible-runner --version |
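To make the "are they updated?" check less error-prone, the reported version can be compared against a threshold. The sketch below is an illustration only: `version_at_least` is a hypothetical helper, it relies on `sort -V` from GNU coreutils, and the thread does not state a minimum Receptor version, so the threshold is an argument you supply yourself.

```shell
# Succeed when version $1 is at least version $2.
# Build metadata after "+" (e.g. "1.4.1+g3a84c22") is ignored.
version_at_least() {
  have="${1%%+*}"
  want="${2%%+*}"
  # sort -V orders version strings, so the smaller of the two comes first.
  lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}
```

A possible use is `version_at_least "$(kubectl -n awx exec deployment/awx-task -c awx-ee -- receptor --version)" "$MIN_VERSION" && echo ok`, where `MIN_VERSION` is a placeholder for whatever version you recorded before refreshing the image.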
OK, I just updated the image and it looks better now:
But my job pod still doesn't have the variable:
|
It's okay, RECEPTOR_KUBE_SUPPORT_RECONNECT is for the receptor in the control plane only. It's not passed to the automation job pod. Time to test long sleep :) |
Oh OK, I understand now! I will test it and let you know! |
It's working!!! Thank you very much @kurokobo & @fosterseth! One last thing: how can I be sure the EE of my control plane stays up to date? My awx-task pod was deleted and recreated 8 days ago (during my AWX UI upgrade, I think), before I manually deleted the awx-ee:latest image with the command you gave me above @kurokobo, but the image was still not up to date. In my AWX UI, in the Execution Environment tab, I just set the pull option to "Always" (previously nothing was selected). Is this the right way to make my control plane update its image when the awx-task pod restarts? Thanks again |
@Mrmel94 You have some choices to achieve your goal.
|
Thanks a lot for your help and your time @kurokobo. I'll go with the first option. |
Bug Summary
Hi, I have a problem already known to this community: a job that stops with an error after 4 hours of execution.
The main issue is tracked here: #11594
The bug is fixed by this PR: ansible/receptor#683. I read the prerequisites, but it still doesn't work and the problem is still there.
I did my installation with https://github.com/kurokobo/awx-on-k3s and I'm using a custom awx-ee.
My test job is just a sleep for 18000 sec (5 h), and I set K8S Ansible Runner Keep-Alive Message Interval in the AWX settings to 1800 sec (30 min).
Version of my components:
K3S server v1.25.12+k3s1
And this is my custom-ee:
I also tried k3s v1.25.4 and added the environment variable RECEPTOR_KUBE_SUPPORT_RECONNECT in the file awx-on-k3s/base/awx.yaml:
and ran kubectl apply -k base in the folder awx-on-k3s/, but the bug is still there.
What am I doing wrong? Have I forgotten something?
AWX version
23.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
ansible-core==2.15.3
Operating system
quay.io/rockylinux/rockylinux:9
Web browser
No response
Steps to reproduce
Launch job test with 5 hours sleep
Expected results
Job finishes successfully
Actual results
Job terminated in error after 4 hours
Additional information
No response