
K8S rate limit hits when count of parallel jobs is high #13550

Closed
4 of 9 tasks
Cl0udius opened this issue Feb 10, 2023 · 1 comment

Comments

@Cl0udius

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi,

we are currently facing issues when running a high number (160-180) of jobs in parallel.
We use container groups to start the jobs.
The AWX version we use here is 21.10.2.
We enabled the RECEPTOR_KUBE_SUPPORT_RECONNECT fix on K8S 1.23.14 to avoid running into problems with failing jobs.

Now we are receiving rate-limiting errors from time to time, which lead to an Error state for the job:

ERROR 2023/02/08 09:46:15 [p7FzHay4] Error reading from pod awx/automation-job-8243-zbln6: http2: server sent GOAWAY and closed the connection; LastStreamID=5867, ErrCode=NO_ERROR, debug=""

DEBUG 2023/02/08 13:27:09 [9yVZ0vL2] Detected EOF for pod awx/automation-job-9786-9k2zw. Will retry 5 more times. Error: EOF 

WARNING 2023/02/08 13:27:10 [9yVZ0vL2] Error opening log stream for pod awx/automation-job-9786-9k2zw. Will retry 5 more times. 

Error: Get "https://<K8S>:10250/containerLogs/awx/automation-job-9786-9k2zw/worker?follow=true&sinceTime=2023-02-08T13%3A27%3A09Z&timestamps=true": dial tcp <K8S>:9443: connect: connection refused

My questions are:

  • Is there a way to influence the log-gathering behavior of the control pods (requests per second/minute to the K8S API) to lower the rate of pulling logs? (See the sketch after this list for the kind of client-side limit I mean.)
  • Is there another way of pulling the logs besides the K8S API?
  • Can we deactivate the pulling of the logs completely and just fetch some final state from the pod? (In the case where we log directly to Elasticsearch from the worker via a plugin.)
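
For the first question, this is roughly the kind of client-side limit I mean: a minimal client-go sketch, not AWX or receptor code; whether receptor/AWX exposes comparable QPS/Burst knobs is exactly what I am asking.

```go
// Minimal sketch (assumption, not the AWX/receptor code path): client-go lets
// you cap the request rate of a Kubernetes client via QPS and Burst on the
// rest.Config before building the clientset.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}

	// Lower the client-side rate limit: at most 5 requests/second,
	// allowing bursts of up to 10 requests.
	cfg.QPS = 5
	cfg.Burst = 10

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = clientset // e.g. clientset.CoreV1().Pods("awx").GetLogs(...)
}
```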

AWX version

21.10.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Start a massive number of jobs in parallel that produce output and run for 5-30 minutes.

Expected results

Logs can be fetched without hitting any rate limit.

Actual results

The rate limit kicks in and jobs get marked with an error state.

Additional information

No response

@Cl0udius
Author

Hi. This is just a guess from my side:

Let's imagine we have 180 jobs running in AWX at the same time.
If at this point in time a network hiccup, log file rotation, or something else disconnects the receptor processes from the K8S API, all receptor processes would reconnect at the same time, according to the receptor code from this PR: ansible/receptor#683

This means connections that were built up over time (for example, from jobs started every 5 minutes) would not produce those errors, but maybe the reconnect of all of them at once is the root cause.

So, as a possible solution, we could randomize the number of seconds the receptor waits before reconnecting to the K8S API. For example, every client process picks a random delay between 1 and 10 seconds to spread the reconnect attempts across that time frame.

This could theoretically be done, for example, here and in the other places where we sleep: https://github.com/ansible/receptor/blob/9473ee061c2708e230d230662d07d049f58b04b6/pkg/workceptor/kubernetes.go#L609

To make it even more flexible, we could introduce a second environment variable next to RECEPTOR_KUBE_SUPPORT_RECONNECT, which holds an optional value specifying the maximum wait time for the client processes.
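
Roughly what I have in mind, as a minimal sketch (the variable name RECEPTOR_KUBE_RECONNECT_MAX_JITTER is just a placeholder I made up; only RECEPTOR_KUBE_SUPPORT_RECONNECT exists today):

```go
// Sketch of a jittered reconnect delay, not actual receptor code.
package main

import (
	"log"
	"math/rand"
	"os"
	"strconv"
	"time"
)

// reconnectJitter returns a random delay between 1s and a configurable upper
// bound, so that many receptor workers do not hit the K8S API at the same instant.
func reconnectJitter() time.Duration {
	maxSeconds := 10 // default upper bound proposed above
	// RECEPTOR_KUBE_RECONNECT_MAX_JITTER is a hypothetical env var for the max wait.
	if v := os.Getenv("RECEPTOR_KUBE_RECONNECT_MAX_JITTER"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			maxSeconds = n
		}
	}
	return time.Duration(1+rand.Intn(maxSeconds)) * time.Second
}

func main() {
	delay := reconnectJitter()
	log.Printf("waiting %s before reopening the pod log stream", delay)
	time.Sleep(delay)
	// ...reconnect to the K8S API / reopen the log stream here...
}
```

With something like this, 180 workers that lose their log streams at the same moment would spread their reconnects over up to 10 seconds instead of hitting the API all at once.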

Maybe @TheRealHaoLiu could have a look here and check if I am pointing in the right direction.

BR
