Sleep 650 job throwing error within 5 minutes of its launch #13161

Open
4 of 9 tasks
deep7861 opened this issue Nov 6, 2022 · 7 comments

Comments

@deep7861

deep7861 commented Nov 6, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

With the default job timeout of 0 and a job idle timeout of 0, a task that runs 'sleep 650' makes the job fail without much explanation.
An idle timeout of 0 should mean a 600-second timeout, but this job fails within about 5 minutes. Here is the stdout:

PLAY [Play to test sleep function] *********************************************

TASK [date] ********************************************************************
changed: [localhost]

TASK [debug] *******************************************************************
ok: [localhost] => {
"date_before.stdout": "Sun Nov 6 17:50:11 UTC 2022"
}

TASK [sleep] *******************************************************************

[image: screenshot, not preserved]

API showing this error reason:

[image: screenshot of the error reason in the API, not preserved]

AWX version

21.2.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

2.12

Operating system

No response

Web browser

Chrome

Steps to reproduce

Create a simple playbook with a single task, 'command: sleep 650'.
Launch it and watch it error out within 300-350 seconds. Error reason in the API: Job terminated due to error.
Stdout gets stuck at the sleep task; that is, it never shows the playbook summary results.
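
A minimal sketch of a playbook matching these steps. This is a reconstruction, not the reporter's actual file: the play name, task names, and the `date_before` variable are taken from the stdout in the bug summary, while the module choices and `gather_facts: false` are assumptions.

```yaml
---
# Hypothetical reconstruction of the reproduction playbook.
- name: Play to test sleep function
  hosts: localhost
  gather_facts: false   # assumption: the stdout shows no fact-gathering task
  tasks:
    - name: date
      ansible.builtin.command: date
      register: date_before

    - name: debug
      ansible.builtin.debug:
        var: date_before.stdout

    - name: sleep
      # Emits no output for 650 seconds, longer than the roughly
      # 5-minute window after which the job is reported to fail.
      ansible.builtin.command: sleep 650
```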

Expected results

The task and playbook should run smoothly and fail with an idle timeout error, showing all logs meaningfully.

Actual results

Playbook erroring out with no logs

Additional information

No response

@shanemcd
Member

shanemcd commented Nov 6, 2022

Hello - where is your Kubernetes running? Azure by chance? See #12530 (comment)

In addition to the suggestions in that comment, we're also currently working on a PR to Receptor that might help with this: ansible/receptor#683

@TheRealHaoLiu
Member

#12530

this might be related

for context #12530 (comment)

@deep7861
Author

@TheRealHaoLiu @shanemcd Thank you!! Yes, I'm running it on AKS! The issue you mentioned matches the pattern I'm seeing.
I don't see any setting on AKS or AWX that works around it, though. The one workaround I understood is to break the pause or sleep into intervals of less than 5 minutes. We had someone pausing for 15 minutes before, and that is now broken into 4-minute intervals.
That sort of bypasses the pause issue.

But another strange issue is that the job fails randomly after executing for 1.5 hours or so, without any proper error log, and stdout stops working in a similar fashion.
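
The split-wait workaround mentioned in this comment could look roughly like the task below. It is only a sketch: the task name, chunk length, and chunk count are illustrative, and it assumes that each completed loop item emits a line of output, so the job is never silent for a full 5 minutes.

```yaml
# Hypothetical task: wait about 16 minutes as four 240-second chunks
# instead of one long, silent sleep.
- name: Wait in chunks shorter than 5 minutes
  ansible.builtin.command: sleep 240
  loop: "{{ range(1, 5) | list }}"   # items 1..4, so 4 x 240 s
  loop_control:
    label: "chunk {{ item }} of 4"
  changed_when: false
```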

@fosterseth fosterseth self-assigned this Nov 23, 2022
@fosterseth
Member

> job would fail randomly after executing for 1.5 hours or so

This sounds like a log rotation issue. It could be mitigated by increasing the max container log size in your k8s configuration; see this comment on how I did it on minikube, and you can follow a similar solution:

#12644 (comment)

This should also be addressed by ansible/receptor#683.
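
For reference, the kubelet exposes the container log size through its `KubeletConfiguration`. The fragment below is a sketch with illustrative values and is not the procedure from the linked comment. How it is applied depends on the cluster, and managed services such as AKS may expose the setting through their own node pool configuration instead.

```yaml
# Illustrative kubelet settings; the values are assumptions, not recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 500Mi   # rotate a container log once it reaches this size
containerLogMaxFiles: 5      # number of rotated log files kept per container
```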

@deep7861
Author

Hi @fosterseth, thank you for the suggestion. However, I investigated further and found that log rotation wasn't actually the roadblock for the AWX job; it's still the 5-minute timeout somewhere (per earlier comments, the Konnectivity module in AKS, which I'm still trying to find a workaround for).
I used multiple methods to replicate the timeout issue and to see whether log rotation was also an issue:

  1. A simple pause/sleep for more than 5 minutes, and the job fails.
  2. A playbook task that runs for more than 5 minutes without any output on screen. This could be a simple infinite shell while loop:
    command: 'while true; do ls; sleep 1; done'
  3. Simulate a job that fills the log file (I have 50MB configured):
    Use a loop in an Ansible task to keep printing endlessly (range with a really high number).
    Watch for the container log file to be rotated. At exactly that moment the job continues, BUT kubectl logs -f automation_pod stops showing further logs.
    AWX doesn't kill the task or stop showing logs at this moment; the only thing that happens is that the 5-minute timeout starts here.
    Then watch the job get killed after 5 minutes.
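
A sketch of the log-filling playbook from method 3. The play name, task name, message text, and the upper bound of the range are all placeholders; the only part taken from the description is the idea of looping over a very large range and printing on every iteration.

```yaml
---
# Hypothetical playbook that prints continuously to grow the container log.
- name: Fill the container log to trigger rotation
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Print one line per iteration
      ansible.builtin.debug:
        msg: "log filler line {{ item }}"
      # Placeholder bound; raise it until the log exceeds the rotation size.
      loop: "{{ range(0, 1000000) | list }}"
```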

For the logs, I have created a new node pool with a larger container log size, but what do we do about this 5-minute timeout? It's getting really frustrating for users.
Additionally, it seems the timeouts started recently (sometime in October); earlier, the same job would run for more than 15 minutes with no output printed on screen.
We didn't change the AKS or AWX versions during that timeframe.

@kurokobo
Contributor

@deep7861
The fundamental solution is to wait for ansible/receptor#683 to be merged and become available in awx-ee.
As a dirty hack for a short-term workaround, modifying the entrypoint to force the EE to echo logs periodically (see #12530 (comment)) seems to work.

@luckass1

luckass1 commented Mar 6, 2023

I have the latest version of AWX, 21.12.0, running in OKE and I have the same problem. What is the procedure for implementing the solution proposed in ansible/ansible-runner#1187?
