
K8S rate limit hits when count of parallel jobs is high #13550

Closed
4 of 9 tasks
Cl0udius opened this issue Feb 10, 2023 · 1 comment

Comments

@Cl0udius

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi,

we are currently facing issues when running a high number (160-180) of jobs in parallel.
We use container groups to start the jobs.
The AWX version we use here is 21.10.2.
We enabled the RECEPTOR_KUBE_SUPPORT_RECONNECT fix on K8S 1.23.14 to avoid running into problems with failing jobs.

Now we are receiving rate-limiting errors from time to time, which lead to an Error state for the job:

ERROR 2023/02/08 09:46:15 [p7FzHay4] Error reading from pod awx/automation-job-8243-zbln6: http2: server sent GOAWAY and closed the connection; LastStreamID=5867, ErrCode=NO_ERROR, debug=""

DEBUG 2023/02/08 13:27:09 [9yVZ0vL2] Detected EOF for pod awx/automation-job-9786-9k2zw. Will retry 5 more times. Error: EOF 

WARNING 2023/02/08 13:27:10 [9yVZ0vL2] Error opening log stream for pod awx/automation-job-9786-9k2zw. Will retry 5 more times. 

Error: Get "https://<K8S>:10250/containerLogs/awx/automation-job-9786-9k2zw/worker?follow=true&sinceTime=2023-02-08T13%3A27%3A09Z&timestamps=true": dial tcp <K8S>:9443: connect: connection refused

My questions are:

  • Is there a way to influence the log-gathering behavior of the control pods (requests per second/minute to the K8S API) to lower the rate of pulling logs? (See the sketch after this list for the kind of client-side limit I mean.)
  • Is there another way of pulling the logs besides the K8S API?
  • Can we deactivate the pulling of the logs completely and just fetch some final state from the pod? (In the case where we log directly to Elasticsearch from the worker via a plugin.)
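
For the first question, this is roughly the kind of client-side limit I mean: a minimal client-go sketch, not AWX or receptor code; whether receptor/AWX exposes comparable QPS/Burst knobs is exactly what I am asking.

```go
// Minimal sketch (assumption, not the AWX/receptor code path): client-go lets
// you cap the request rate of a Kubernetes client via QPS and Burst on the
// rest.Config before building the clientset.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes we run inside the cluster
	if err != nil {
		log.Fatal(err)
	}

	// Lower the client-side rate limit: at most 5 requests/second,
	// allowing bursts of up to 10 requests.
	cfg.QPS = 5
	cfg.Burst = 10

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = clientset // e.g. clientset.CoreV1().Pods("awx").GetLogs(...)
}
```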

AWX version

21.10.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Start a massive number of jobs in parallel that produce output and run for 5-30 minutes.

Expected results

Logs can be fetched without hitting any rate limit.

Actual results

The rate limit kicks in and jobs get marked with an error state.

Additional information

No response

@Cl0udius
Author

Hi. This is just a guess from my side:

Let's imagine we have 180 jobs running in AWX at the same time.
If at this point in time a network hiccup, log file rotation, or something else disconnects the receptor processes from the K8S API, all receptor processes would reconnect at the same time, according to the receptor code from this PR: ansible/receptor#683

This means connections that were built up over time (for example, from jobs started every 5 minutes) would not produce those errors, but maybe the reconnect of all of them at once is the root cause.

So, as a possible solution, we could randomize the number of seconds the receptor waits before reconnecting to the K8S API. For example, every client process picks a random delay between 1 and 10 seconds to spread the reconnect attempts across that time frame.

This could theoretically be done, for example, here and in the other places where we sleep: https://github.com/ansible/receptor/blob/9473ee061c2708e230d230662d07d049f58b04b6/pkg/workceptor/kubernetes.go#L609

To make it even more flexible, we could introduce a second environment variable next to RECEPTOR_KUBE_SUPPORT_RECONNECT, which holds an optional value specifying the maximum wait time for the client processes.
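
Roughly what I have in mind, as a minimal sketch (the variable name RECEPTOR_KUBE_RECONNECT_MAX_JITTER is just a placeholder I made up; only RECEPTOR_KUBE_SUPPORT_RECONNECT exists today):

```go
// Sketch of a jittered reconnect delay, not actual receptor code.
package main

import (
	"log"
	"math/rand"
	"os"
	"strconv"
	"time"
)

// reconnectJitter returns a random delay between 1s and a configurable upper
// bound, so that many receptor workers do not hit the K8S API at the same instant.
func reconnectJitter() time.Duration {
	maxSeconds := 10 // default upper bound proposed above
	// RECEPTOR_KUBE_RECONNECT_MAX_JITTER is a hypothetical env var for the max wait.
	if v := os.Getenv("RECEPTOR_KUBE_RECONNECT_MAX_JITTER"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			maxSeconds = n
		}
	}
	return time.Duration(1+rand.Intn(maxSeconds)) * time.Second
}

func main() {
	delay := reconnectJitter()
	log.Printf("waiting %s before reopening the pod log stream", delay)
	time.Sleep(delay)
	// ...reconnect to the K8S API / reopen the log stream here...
}
```

With something like this, 180 workers that lose their log streams at the same moment would spread their reconnects over up to 10 seconds instead of hitting the API all at once.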

Maybe @TheRealHaoLiu could have a look here and check if I am pointing in the right direction.

BR
