I understand that AWX is open source software provided for free and that I might not receive a timely response.
Bug Summary
Hi,
we are currently facing issues when running a large number (160-180) of jobs in parallel.
We are using container groups to start the jobs.
The AWX version we use is 21.10.2.
We implemented the fix for RECEPTOR_KUBE_SUPPORT_RECONNECT on K8S 1.23.14 to avoid problems with failing jobs.
Now we occasionally receive rate-limiting errors that put jobs into an Error state:
ERROR 2023/02/08 09:46:15 [p7FzHay4] Error reading from pod awx/automation-job-8243-zbln6: http2: server sent GOAWAY and closed the connection; LastStreamID=5867, ErrCode=NO_ERROR, debug=""
DEBUG 2023/02/08 13:27:09 [9yVZ0vL2] Detected EOF for pod awx/automation-job-9786-9k2zw. Will retry 5 more times. Error: EOF
WARNING 2023/02/08 13:27:10 [9yVZ0vL2] Error opening log stream for pod awx/automation-job-9786-9k2zw. Will retry 5 more times.
Error: Get "https://<K8S>:10250/containerLogs/awx/automation-job-9786-9k2zw/worker?follow=true&sinceTime=2023-02-08T13%3A27%3A09Z&timestamps=true": dial tcp <K8S>:9443: connect: connection refused
My questions are:
Is there a way to influence the log-gathering behavior of the control pods (requests per second/minute to the K8S API) to lower the rate of log pulling?
Is there another way of pulling the logs besides the K8S API?
Can we deactivate log pulling completely and just fetch a final state from the pod? (We log directly to Elasticsearch from the worker via a plugin.)
AWX version
21.10.2
Select the relevant components
UI
API
Docs
Collection
CLI
Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Start a large number of jobs in parallel that produce output and run for 5-30 minutes.
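The reproduction step above could be scripted against the AWX REST API, for example. This is a minimal sketch: `AWX_URL`, `TEMPLATE_ID`, and the token are placeholders you would substitute, and `requests` is a third-party dependency.

```python
import concurrent.futures

import requests  # third-party HTTP client (pip install requests)


def launch_url(base_url, template_id):
    """Build the AWX job-template launch endpoint URL."""
    return f"{base_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"


def launch_jobs(base_url, template_id, token, count=180):
    """Fire `count` launch requests in parallel and return the new job IDs."""
    headers = {"Authorization": f"Bearer {token}"}

    def launch_one(_):
        r = requests.post(launch_url(base_url, template_id),
                          headers=headers, timeout=30)
        r.raise_for_status()
        return r.json()["id"]

    with concurrent.futures.ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(launch_one, range(count)))
```

With ~180 jobs launched this way, each producing output for 5-30 minutes, the control pods start streaming logs for all of them at once, which is the load pattern that triggers the errors above.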
Expected results
Logs can be fetched without hitting any rate limit.
Actual results
The rate limit kicks in and jobs are marked with an Error state.
Additional information
No response
Let's imagine we have 180 jobs running in AWX at the same time.
If at this point in time a network hiccup, log file rotation, or something else disconnects the receptor processes from the K8S API, all receptor processes would reconnect at the same time, according to the receptor code from this PR: ansible/receptor#683
This means connections built up over time, for example from jobs started every 5 minutes, would not produce those errors; the simultaneous reconnect of all processes may be the root cause.
As a possible solution, we could randomize the number of seconds receptor waits before reconnecting to the K8S API. For example, every client process picks a random delay between 1 and 10 seconds, distributing the reconnect attempts across that time frame.
To make it even more flexible, we could introduce a second environment variable next to RECEPTOR_KUBE_SUPPORT_RECONNECT that holds an optional maximum wait time for the client processes.
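The randomized reconnect delay could look roughly like this. It is a sketch in Python for illustration (receptor itself is written in Go), and the RECEPTOR_KUBE_RECONNECT_MAX_JITTER variable name is hypothetical, not an existing receptor setting.

```python
import os
import random


def reconnect_jitter(default_max=10):
    """Pick a random whole-second delay in [1, max] before reconnecting.

    The maximum comes from a hypothetical RECEPTOR_KUBE_RECONNECT_MAX_JITTER
    environment variable, falling back to `default_max` when unset or invalid.
    """
    raw = os.environ.get("RECEPTOR_KUBE_RECONNECT_MAX_JITTER", "")
    try:
        max_seconds = int(raw) if raw else default_max
    except ValueError:
        max_seconds = default_max
    if max_seconds < 1:
        max_seconds = default_max
    return random.randint(1, max_seconds)
```

Each worker would sleep `reconnect_jitter()` seconds before retrying the log stream, so a simultaneous disconnect of 180 processes turns into reconnect attempts spread over the whole window instead of one burst against the K8S API.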
Maybe @TheRealHaoLiu could have a look and check whether I am pointing in the right direction.