Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't re-patch pods that are already controlled by current worker #26778

Merged
merged 1 commit into from
Oct 18, 2022

Conversation

hterik
Copy link
Contributor

@hterik hterik commented Sep 29, 2022

After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch-operation involves a remote API-call which can be be very slow. In the meantime the scheduler can not do anything else.

By ignoring the pods that already have the expected label, the list query-result will be shorter and the number of patch-queries much less.

We had an unlucky moment in our environment, where each patch-operation started taking 100ms each, with 200 pods in flight it accumulates into 20 seconds of blocked scheduler.

After the scheduler has launched many pods, it keeps trying to
re-adopt them by patching every pod. Each patch-operation
involves a remote API-call which can be be very slow.
In the meantime the scheduler can not do anything else.

By ignoring the pods that already have the expected label,
the list query-result will be shorter and the number of
patch-queries much less.

We had an unlucky moment in our environment, where each
patch-operation started taking 100ms each, with 200 pods in
flight it accumulates into 20 seconds of blocked scheduler.
@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:Scheduler including HA (high availability) scheduler labels Sep 29, 2022
@eladkal eladkal added this to the Airflow 2.4.2 milestone Oct 16, 2022
@ephraimbuddy ephraimbuddy merged commit 27ec562 into apache:main Oct 18, 2022
ephraimbuddy pushed a commit that referenced this pull request Oct 18, 2022
…6778)

After the scheduler has launched many pods, it keeps trying to
re-adopt them by patching every pod. Each patch-operation
involves a remote API-call which can be be very slow.
In the meantime the scheduler can not do anything else.

By ignoring the pods that already have the expected label,
the list query-result will be shorter and the number of
patch-queries much less.

We had an unlucky moment in our environment, where each
patch-operation started taking 100ms each, with 200 pods in
flight it accumulates into 20 seconds of blocked scheduler.

(cherry picked from commit 27ec562)
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Oct 18, 2022
ephraimbuddy pushed a commit that referenced this pull request Oct 18, 2022
…6778)

After the scheduler has launched many pods, it keeps trying to
re-adopt them by patching every pod. Each patch-operation
involves a remote API-call which can be be very slow.
In the meantime the scheduler can not do anything else.

By ignoring the pods that already have the expected label,
the list query-result will be shorter and the number of
patch-queries much less.

We had an unlucky moment in our environment, where each
patch-operation started taking 100ms each, with 200 pods in
flight it accumulates into 20 seconds of blocked scheduler.

(cherry picked from commit 27ec562)
@droppoint droppoint mentioned this pull request Nov 21, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes provider related issues type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants