Jobs are killed after 4 hours #14870
Comments
The jobs are in error status with this information:
|
this is due to kube apiserver connection time limit and can be fixed by setting
please refer to ansible/receptor#683 for further detail |
OK, but why did it start happening only recently? Older versions of AWX didn't have this problem? I will try to add it to the kustomize manifests that install AWX, but I am surprised that the linked receptor issue is merged and marked resolved, yet it still affects AWX. |
@benapetr the feature landed but users still need to manually enable the flag on the awx spec file to apply the fix. Eventually we will be able to default with this flag enabled, once all users/customers are on the prerequisite k8s version
is it possible that your jobs did not previously run for 4 hours? |
Yes, exactly. These jobs always ran for over 10 hours with no problems; unfortunately we only run them about once every month or two. Now suddenly they started having problems. We did OS updates and AWX updates in the meantime, so I can't track down /when/ it started happening, but I know for sure it worked in the past and now it doesn't by default. The fix mentioned by @TheRealHaoLiu definitely fixes it though. |
@Commifreak enabling RECEPTOR_KUBE_SUPPORT_RECONNECT is certainly recommended, let us know if this helps your long running jobs |
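For readers wondering what "enabling the flag on the awx spec file" might look like, here is a minimal sketch. It assumes the instance is managed by the AWX operator and that the variable is passed to the execution environment container through the operator's `ee_extra_env` field. The resource name `awx` and the value `enabled` are assumptions that this thread does not state; check ansible/receptor#683 and the operator documentation for your version before applying it.

```yaml
# Sketch only: AWX custom resource with receptor reconnect support turned on.
# "awx" is a placeholder name; use the name of your own AWX resource.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # ee_extra_env takes a YAML list, written as a block string.
  # The value "enabled" is an assumption, not taken from this thread.
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

With a kustomize-based install, a fragment like this would typically live in the AWX resource manifest that the kustomization references.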
Hao is confused... We recently flipped the default behavior for reconnect to true, since we bumped the required kube version. Also, there were a couple of bugs we fixed that were caused by some receptor refactoring. Closing this issue... |
Bug Summary
It seems that this bug is back - #11805
Every job that runs longer than 4 hours gets killed by AWX exactly when the 4-hour "limit" is reached.
AWX version
23.7.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
Oracle Linux
Web browser
No response
Steps to reproduce
Start a job that runs longer than 4 hours; it gets killed.
Expected results
Jobs don't get killed
Actual results
Job gets killed
Additional information
No response