
Jobs are killed after 4 hours #14870

Closed
6 of 11 tasks
benapetr opened this issue Feb 13, 2024 · 9 comments

Comments

@benapetr

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

It seems that this bug is back - #11805

Every job that runs longer than 4 hours gets killed by AWX exactly when the 4-hour "limit" is reached.

AWX version

23.7.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Oracle Linux

Web browser

No response

Steps to reproduce

Start a job that runs longer than 4 hours; it gets killed.

Expected results

Jobs don't get killed

Actual results

Job gets killed

Additional information

No response

@benapetr
Author

The jobs end up in Error status with this information:

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

@TheRealHaoLiu
Member

This is due to the kube-apiserver connection time limit and can be fixed by setting:

ee_extra_env: |
  - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
    value: enabled

Please refer to ansible/receptor#683 for further details.
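
For context, a minimal sketch of where this setting lives in the AWX custom resource managed by the AWX Operator (the resource and namespace names below are assumptions, not taken from this thread):

# Illustrative AWX custom resource with the reconnect flag enabled;
# the "awx" name/namespace are placeholders, adjust to your install.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # ee_extra_env is passed through as extra environment variables
  # for the execution environment container
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled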

@benapetr
Author

OK, but why did it start happening only recently? Older versions of AWX didn't have this problem. I will try to add it to the kustomize manifests that install AWX, but I am surprised that the linked receptor issue is merged and marked resolved, yet it still affects AWX.
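
For anyone taking the same kustomize route, a rough sketch of a kustomization.yaml that layers this setting onto an existing AWX custom resource (the file names and the resource name "awx" are assumptions):

# kustomization.yaml -- patches an existing AWX CR defined in awx.yaml (names are illustrative)
resources:
  - awx.yaml
patches:
  - target:
      kind: AWX
      name: awx
    patch: |-
      apiVersion: awx.ansible.com/v1beta1
      kind: AWX
      metadata:
        name: awx
      spec:
        ee_extra_env: |
          - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
            value: enabled

Applying the overlay (for example with kubectl apply -k) should cause the operator to reconcile and roll out pods with the new environment variable.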

@fosterseth
Member

@benapetr the feature landed, but users still need to manually enable the flag in the AWX spec file to apply the fix. Eventually we will be able to default to having this flag enabled, once all users/customers are on the prerequisite k8s version.

OK, but why did it start happening only recently?

Is it possible that your jobs did not run for 4 hours before?

@Commifreak

Commifreak commented Feb 20, 2024

I also observed this "new" behavior after the latest one or two updates of AWX/Operator, and the job also ran for more than 4 hrs before:
[screenshot]

(that was with):

ansible-playbook [core 2.15.9]
  python version = 3.9.18 (main, Jan  4 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
  jinja version = 3.1.3
  libyaml = True

and kube 1.28.3

But don't ask me which AWX version that was. Some recent one.

Strange. I will also try the ee_extra_env.

@benapetr
Author

Yes, exactly. These jobs always ran for over 10 hours with no problems; unfortunately we only run them about once every month or two. Now they suddenly started having problems. We did OS and AWX updates in the meantime, so I can't track down /when/ it started happening, but I know for sure it worked in the past and now it doesn't by default.

The fix mentioned by @TheRealHaoLiu definitely resolves it, though.

@fosterseth
Member

@Commifreak enabling RECEPTOR_KUBE_SUPPORT_RECONNECT is certainly recommended; let us know if it helps your long-running jobs.

@Commifreak

Guess what? Without setting the env var, it's working again!

[screenshot]

What changed in the meantime? I updated (a regular update) to AWX 23.8.1. I don't know if that helped, but I guess it's not bad to set this env var anyway.

@TheRealHaoLiu
Member

Confused Hao is confused... We recently flipped the default behavior for reconnect to true, since we bumped the required kube version.

Also, there were a couple of bugs we fixed that were caused by some receptor refactoring.

Closing this issue...
