
Pod Scheduling Readiness #3521

Closed · 12 tasks done
Huang-Wei opened this issue Sep 16, 2022 · 102 comments

Labels: sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.), stage/stable (Denotes an issue tracking an enhancement targeted for Stable/GA status)

@Huang-Wei (Member) commented Sep 16, 2022

Enhancement Description

Alpha

  1. kind/kep, sig/scheduling: approved, lgtm (ahg-g, wojtek-t)
  2. kind/api-change, kind/feature; sig/apps, sig/scheduling, sig/testing: api-review, approved, lgtm (ahg-g, smarterclayton)
  3. kind/api-change, kind/feature; sig/api-machinery, sig/apps, sig/instrumentation, sig/scheduling, sig/testing: api-review, approved, lgtm (ahg-g, liggitt, logicalhan)
  4. kind/api-change, kind/feature; sig/api-machinery, sig/apps, sig/instrumentation, sig/scheduling, sig/testing: approved, lgtm (ahg-g, aojea, logicalhan)
  5. kind/bug, sig/scheduling: approved, lgtm (ahg-g)
  6. sig/docs, sig/scheduling, language/en: approved, lgtm (krol3)

Beta

  1. kind/kep, sig/scheduling: approved, lgtm (ahg-g, wojtek-t)
  2. 10 of 10 tasks done (lifecycle/stale, needs-triage, sig/scheduling) (lianghao208)
  3. sig/docs, language/en: approved, lgtm (tengqm)

Stable

  1. kind/kep, sig/scheduling: approved, lgtm (ahg-g, wojtek-t)
  2. kind/api-change, kind/feature; sig/api-machinery, sig/apps, sig/node, sig/scheduling, sig/testing: approved, lgtm (ahg-g)
  3. sig/docs, language/en: approved, lgtm (Shubham82)

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 16, 2022
@Huang-Wei (Member Author):

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 16, 2022
@ahg-g (Member) commented Sep 19, 2022

/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Sep 19, 2022
@ahg-g (Member) commented Sep 19, 2022

/milestone v1.26

@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Sep 19, 2022
@rhockenbury rhockenbury added tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels Sep 20, 2022
@Atharva-Shinde (Contributor):

Hey @kerthcet 👋, 1.26 Enhancements team here!

Just checking in as we approach Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022.

This enhancement is targeting stage alpha for 1.26 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP file using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable
  • KEP has an updated detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to:

  • Change the status in kep.yaml to implementable (I've also left a review comment)
  • Include the updated PR for this KEP in the issue description and get it merged before Enhancements Freeze to make this enhancement eligible for the 1.26 release.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well.
Thank you :)

@Huang-Wei (Member Author):

Thanks @Atharva-Shinde.

@Atharva-Shinde (Contributor):

Hello @Huang-Wei 👋, just a quick check-in again, as we approach the 1.26 Enhancements freeze.

Please plan to get PR #3522 merged before Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow.

For the record, the current status of the enhancement is marked at-risk :)

@Huang-Wei (Member Author):

Thanks for the reminder. It's 99% done at the moment; just some final comments are waiting for the approver's +1.

@Huang-Wei Huang-Wei changed the title Pod Schedule Readiness Pod Scheduling Readiness Oct 6, 2022
@rhockenbury:

With #3522 merged, we have this marked as tracked for v1.26.

@alculquicondor (Member):

this label can be modified

Yes, it can, but you might be breaking a few first-party and third-party controllers that assume that this label matches the nodeName, or at least that it is unique. The label is documented as well-known, so it should be treated with care: https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetesiohostname

@drewhagen drewhagen moved this from Tracked for Code Freeze to Tracked for Doc Freeze in 1.30 Enhancements Tracking Apr 1, 2024
@fabiand commented Apr 5, 2024

A good example of how .spec.nodeName means "scheduling done": set nodeName together with a resource request the node cannot satisfy; since the scheduler is bypassed, no fit check happens, and the pod fails with


status:
  phase: Failed
  message: 'Pod was rejected: Node didn''t have enough resource: cpu, requested: 400000000, used: 400038007, capacity: 159500'
  reason: OutOfcpu
  containerStatuses:
    - name: nginx
      state:
        terminated:
          exitCode: 137
          reason: ContainerStatusUnknown
          message: The container could not be located when the pod was terminated
      image: 'nginx:1.14.2'
      started: false

To some extent, .spec.nodeName could be considered a scheduling gate as well - a special one: only once it is set is the pod scheduled. However, due to its special meaning in the coordination with the kubelet, we cannot change its semantics.

I do not know the current state, but I do wonder - if we are not already doing this today - whether a Pod with both scheduling gates and spec.nodeName should be rejected at admission time.
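
For illustration, a minimal sketch (node name and numbers are hypothetical, not from the comment above) of a Pod manifest that reproduces this behavior: nodeName skips the scheduler, and the kubelet rejects the pod with OutOfcpu.

apiVersion: v1
kind: Pod
metadata:
  name: nodename-outofcpu-demo       # hypothetical name
spec:
  nodeName: worker-1                 # hypothetical node; setting this bypasses the scheduler entirely
  containers:
    - name: nginx
      image: nginx:1.14.2
      resources:
        requests:
          cpu: 400k                  # far beyond the node's capacity, so the kubelet rejects the pod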

@vladikr commented Apr 5, 2024

I do not know the current state, but I do wonder - if we are not already doing this today - whether a Pod with both scheduling gates and spec.nodeName should be rejected at admission time.

@fabiand Yes, it will be rejected at admission.
However, Application-Aware Quota (AAQ) adds a scheduling gate to pods at admission time as well.
Therefore, pods created with .spec.nodeName will get rejected in the namespaces where AAQ operates.
One example is kubectl debug node, which explicitly sets .spec.nodeName in the pod spec.
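
For context, a minimal sketch (gate and node names are hypothetical) of the combination under discussion; as noted above, API server validation rejects a Pod that sets both spec.schedulingGates and spec.nodeName.

apiVersion: v1
kind: Pod
metadata:
  name: gated-with-nodename          # hypothetical name
spec:
  schedulingGates:
    - name: example.com/quota-gate   # hypothetical gate, e.g. injected by an admission webhook such as AAQ
  nodeName: worker-1                 # hypothetical pre-assigned node; combined with the gate, validation fails
  containers:
    - name: debug
      image: busybox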

@fabiand commented Apr 5, 2024

I share that it's a general problem, but due to the special handling of .spec.nodeName I do not see how we can resolve it in

  1. a backwards-compatible manner
  2. a way that keeps the user's spec

I do fear that - in your example - kubectl debug or oc debug would have to change and use affinity instead.

The core problem is that kubelet starts to react once nodeName is set.

Was it considered to change kubelet to only start acting once nodeName is set and schedulingGates is empty?

@alculquicondor (Member):

Was it considered to change kubelet to only start acting once nodeName is set and schedulingGates is empty?

According to the version skew policy, the change would have to be in the kubelet for 3 versions before we can relax the validation in the apiserver.

I guess that could be backwards compatible if we start in 1.31 and we allow scheduling_gates + nodeName in 1.34.
@liggitt wdyt?

One example is with kubectl debug node which explicitly sets .spec.nodeName in the pod spec.

IMO, that falls under the special case where it might make sense to skip the scheduler or an external quota system. You probably wouldn't even want to set requests in a debug pod.

@fabiand commented Apr 5, 2024

probably wouldn't even want to set requests in a debug pod.

FWIW - I do wonder if debug pods should actually be Guaranteed. I had a couple of cases where debug pods (as BestEffort) got killed quickly on busy nodes.

@liggitt (Member) commented Apr 5, 2024

I guess that could be backwards compatible if we start in 1.31 and we allow scheduling_gates + nodeName in 1.34.
@liggitt wdyt?

that seems like making the problem and confusion around use of spec.nodeName worse to me... I don't see a compelling reason to do that

@iholder101 (Contributor):

@alculquicondor

IMO, that falls under the special case where it might make sense to skip the scheduler or an external quota system.

TBH, I'm still trying to understand how skipping the scheduler is ever helpful (when you're not using a custom scheduler).
Can you please elaborate on how skipping the scheduler helps in this case? In other words, if we changed kubectl debug to use node affinity instead of nodeName, what would the implications be?

You probably wouldn't even want to set requests in a debug pod.

While this might be correct, the question to me is who makes the decision. Granting a user a knob to skip quota mechanisms feels to me like allowing a Linux user to bypass permission checks when writing to a file. In both cases the whole idea is to restrict users and force them to comply with a certain policy. Handing the user a way to bypass such mechanisms seems entirely contradictory to me, and de facto it makes external quota mechanisms impractical.

@liggitt

that seems like making the problem and confusion around use of spec.nodeName worse to me... I don't see a compelling reason to do that

Are you open to discussion on that?
IMHO this behavior is consistent with scheduling gates' intent to (as the KEP states) force creating pods in a "not-ready to schedule" state. I understand that technically the pod is not scheduled at all when nodeName is set (although I think we all agree it's a surprising behavior that's kept for backward compatibility); however, having the kubelet wait for the scheduling gates to be removed before running the pod, with some kind of "not ready" condition, sounds very intuitive and consistent to me.

This way we can avoid breaking backward compatibility, support external quota mechanisms, and extend scheduling gates in a consistent manner, which IMHO makes the exceptional nodeName case less exceptional.
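
For reference, a minimal sketch (gate name is hypothetical) of how scheduling gates behave today: a Pod created with a non-empty spec.schedulingGates is not considered for scheduling until every gate has been removed, typically by the controller that added it.

apiVersion: v1
kind: Pod
metadata:
  name: gated-pod                     # hypothetical name
spec:
  schedulingGates:
    - name: example.com/quota-check   # hypothetical gate; the pod stays unscheduled until this entry is removed
  containers:
    - name: app
      image: nginx:1.14.2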

@liggitt (Member) commented Apr 8, 2024

having the kubelet wait for the scheduling gates to be removed before running the pod, with some kind of "not ready" condition, sounds very intuitive and consistent to me.

Not to me... expecting pods which are already assigned to a node to run through scheduling gate removal phases (which couldn't do anything to affect the selected node) seems more confusing than the current behavior which simply forbids that combination. I don't think we should relax validation there and increase confusion.

@fabiand commented Apr 8, 2024

I (sadly) concur - if nodeName is set, then it is a matter of fact that scheduling has happened. This is the contract we cannot break.

@fabiand commented Apr 10, 2024

Fun fact: I created a Deployment whose pod template had nodeName defined AND a resource request (cpu: 400k) which cannot be met. The Deployment had a replica count of 2, but due to this scheduling contradiction it ended up with 12k failed pods.

IOW: I wonder whether spec.nodeName is troublesome for other controllers in kube as well.
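
A minimal sketch (names are hypothetical) of the kind of Deployment template described above; each pod is rejected by the kubelet, and the ReplicaSet keeps creating replacements, so failed pods pile up.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodename-churn-demo            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nodename-churn-demo
  template:
    metadata:
      labels:
        app: nodename-churn-demo
    spec:
      nodeName: worker-1               # hypothetical node; scheduler bypassed
      containers:
        - name: app
          image: nginx:1.14.2
          resources:
            requests:
              cpu: 400k                # cannot be met, as in the comment above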

@Barakmor1 commented Apr 11, 2024

I do fear that - in your example - kubectl debug or oc debug would have to change and use affinity instead.

The reason kubectl debug or oc debug creates pods using .spec.nodeName instead of affinity is likely that it allows the debug pod to bypass taint constraints.

Although that might be achievable with tolerations as well.
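
For illustration, a sketch (node name is hypothetical) of the affinity-plus-tolerations alternative mentioned above: the pod is pinned to a single node via the well-known kubernetes.io/hostname label and tolerates all taints, while still going through the scheduler, so admission-added scheduling gates can be honored.

apiVersion: v1
kind: Pod
metadata:
  name: debug-via-affinity             # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - worker-1           # hypothetical target node
  tolerations:
    - operator: Exists                 # tolerate all taints, roughly matching what nodeName bypasses
  containers:
    - name: debug
      image: busybox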

@sreeram-venkitesh (Member) commented May 14, 2024

Hey folks 👋, 1.31 Enhancements Lead here.

Since this KEP graduated to GA in the v1.30 release, please update the status field in the kep.yaml file here to implemented. Once everything related to the KEP is completed and the status has been updated, we can close this issue.

/remove-label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label May 14, 2024
@Huang-Wei (Member Author):

@sreeram-venkitesh the status field of 3521-pod-scheduling-readiness/kep.yaml has been updated to implementable. I think we can close this issue as completed now?

@sreeram-venkitesh (Member):

@Huang-Wei Sorry, my mistake! It should be implemented since it has graduated to GA.

@Huang-Wei (Member Author):

@sreeram-venkitesh feel free to close this issue.

@sreeram-venkitesh (Member):

@Huang-Wei Sorry for the confusion, we still need to update the status field to implemented in the kep.yaml file.

@Huang-Wei (Member Author):

Sure, #4686 is gonna update it.

@sreeram-venkitesh (Member):

Ah I missed it, thanks for linking it! 🙌🏼
