Sleep 650 job throwing error within 5 minutes of its launch #13161

deep7861 · 2022-11-06T18:18:15Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

With the default job timeout of 0 and job idle timeout of 0, a task that runs 'sleep 650' makes the job fail without much explanation.
An idle timeout of 0 should mean a 600-second timeout, but this job fails within 5 minutes or so. Here is the stdout:

PLAY [Play to test sleep function] *********************************************

TASK [date] ********************************************************************
changed: [localhost]

TASK [debug] *******************************************************************
ok: [localhost] => {
"date_before.stdout": "Sun Nov 6 17:50:11 UTC 2022"
}

TASK [sleep] *******************************************************************

The API shows this error reason: Job terminated due to error.

AWX version

21.2.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

2.12

Operating system

No response

Web browser

Chrome

Steps to reproduce

Create a simple playbook with a single task, 'command: sleep 650'.
Launch it and watch it error out within 300-350 seconds. Error reason in the API: Job terminated due to error.
Stdout gets stuck at the sleep task, meaning the playbook summary results are never shown.
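
For reference, a minimal sketch of a playbook that would produce the stdout shown in the bug summary. The original playbook was not posted, so the play and task names are taken from that stdout, while `gather_facts: false` and the use of `debug` with `var` are assumptions inferred from the output format:

```yaml
---
# Reconstruction only; the reporter's actual playbook was not shared.
- name: Play to test sleep function
  hosts: localhost
  gather_facts: false   # assumption: the stdout shows no fact-gathering task
  tasks:
    - name: date
      ansible.builtin.command: date
      register: date_before

    - name: debug
      ansible.builtin.debug:
        var: date_before.stdout

    # Produces no output for 650 seconds, longer than the ~5 minutes
    # after which the job was observed to fail.
    - name: sleep
      ansible.builtin.command: sleep 650
```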

Expected results

The task and playbook should run smoothly and fail with an idle timeout error, showing all logs meaningfully.

Actual results

The playbook errors out with no logs.

Additional information

No response

shanemcd · 2022-11-06T19:07:29Z

Hello - where is your Kubernetes running? Azure by chance? See #12530 (comment)

In addition to the suggestions in that comment, we're also currently working on a PR to Receptor that might help with this: ansible/receptor#683

TheRealHaoLiu · 2022-11-07T16:45:07Z

#12530

this might be related

for context #12530 (comment)

deep7861 · 2022-11-10T06:56:12Z

@TheRealHaoLiu @shanemcd Thank you! Yes, I'm running it on AKS, and the issue you mentioned matches the pattern I'm seeing.
I don't see any setting in AKS or AWX that works around it, though. The one approach I understood is to break the pause or sleep into chunks shorter than 5 minutes. Someone here had been using a 15-minute pause and has now split it into 4-minute intervals.
That more or less bypasses the pause issue.

But another strange issue is that the job fails randomly after running for 1.5 hours or so, without any proper error log, and stdout stops updating in a similar fashion.
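
A minimal sketch of that chunking workaround, assuming a 15-minute wait split into loop items of at most 4 minutes. The play name, task name, and chunk sizes are illustrative, not taken from the actual playbook:

```yaml
---
- name: Wait 15 minutes without a silent gap of 5 minutes
  hosts: localhost
  gather_facts: false
  tasks:
    # Each loop item is a separate sleep, so Ansible should report a
    # result as each item finishes instead of staying silent for the
    # full 900 seconds.
    - name: Sleep in chunks shorter than 5 minutes
      ansible.builtin.command: "sleep {{ item }}"
      loop:
        - 240
        - 240
        - 240
        - 180
```

The chunks add up to 900 seconds; what matters is only that no single item runs long enough to hit the 5-minute cutoff.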

fosterseth · 2022-11-23T17:19:09Z

| job would fail randomly after executing for 1.5 hours or so

This sounds like a log rotation issue. It could be mitigated by increasing the max container log size in your k8s configuration. See this comment for how I did it on minikube; you can follow a similar approach:

#12644 (comment)

This should also be addressed by ansible/receptor#683.
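
As a rough illustration of the kind of setting involved, here is a kubelet configuration fragment that raises the container log rotation threshold. The 500Mi value is an arbitrary example, not a figure from the linked comment, and managed Kubernetes services usually expose this through their own node pool options rather than through this file directly:

```yaml
# Sketch of a KubeletConfiguration fragment; the values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Size a container log file may reach before it is rotated.
containerLogMaxSize: 500Mi
# Number of rotated log files kept per container.
containerLogMaxFiles: 5
```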

deep7861 · 2022-11-24T04:37:44Z

Hi @fosterseth, thank you for the suggestion. However, I investigated further and found that log rotation wasn't actually the roadblock for the AWX job; it's still the 5-minute timeout somewhere (per earlier comments, the Konnectivity module in AKS, which I'm still trying to find a workaround for).
I used several methods to reproduce the timeout and to check whether log rotation was also an issue:

1. A simple pause/sleep of more than 5 minutes: the job fails.
2. A playbook task that runs for more than 5 minutes without any output on screen. This could be a simple infinite shell while loop:
   command: 'while true; do ls; sleep 1; done'
3. A job that fills the log file (I have 50MB configured):
   - Use a loop in an Ansible task to keep printing endlessly (range with a really high number).
   - Watch for the container log file to be rotated. At exactly that moment the job continues, BUT kubectl logs -f automation_pod stops showing further logs.
   - AWX doesn't kill the task or stop showing logs at this moment; the only thing that happens is that the 5-minute timeout starts here.
   - The job is killed 5 minutes later.
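
A sketch of what the log-filling test in the third method could look like. The exact playbook was not posted, so the names, the range size, and the padding length are all assumptions chosen only to generate a large volume of output:

```yaml
---
- name: Play to flood job output
  hosts: localhost
  gather_facts: false
  tasks:
    # Every loop item prints one result, so a large range keeps growing
    # the container log until the kubelet rotates it.
    - name: Print a padded line per item
      ansible.builtin.debug:
        msg: "filler line {{ item }} {{ 'x' * 200 }}"
      loop: "{{ range(0, 1000000) | list }}"
```

Running `kubectl logs -f` against the automation pod in a second terminal, as described above, shows the point at which the stream stops after rotation.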

For the logs, I have created a new node pool with a larger container log size, but what do we do about this 5-minute timeout? It's getting really frustrating for users.
Additionally, it seems the timeouts started recently (sometime in October); earlier, the same job would run for more than 15 minutes with no output printed on screen.
We didn't change the AKS or AWX versions during that timeframe.

kurokobo · 2022-11-24T20:28:07Z

@deep7861
The fundamental solution is to wait for ansible/receptor#683 to be merged and become available in awx-ee.
As a dirty hack for a short-term workaround, modifying the entrypoint to force the EE to echo logs periodically (see #12530 (comment)) seems to work.

luckass1 · 2023-03-06T16:30:06Z

I'm running the latest version of AWX, 21.12.0, on OKE and I have the same problem. What is the procedure for implementing the solution proposed in ansible/ansible-runner#1187?

github-actions bot added component:ui needs_triage type:bug community labels Nov 6, 2022

fosterseth self-assigned this Nov 23, 2022

fosterseth removed the needs_triage label Nov 23, 2022

yuliym mentioned this issue Feb 14, 2023

Job failed with just error and without log output #13469

