Topology aware scheduler plugin in kube-scheduler #2044

Closed
swatisehgal opened this issue Oct 1, 2020 · 28 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team

Comments

@swatisehgal
Contributor

swatisehgal commented Oct 1, 2020

Enhancement Description

  • One-line enhancement description (can be used as a release note):
    Scheduler plugin that runs a simplified version of topology manager logic in kube-scheduler to enable topology aware pod placement.
  • Kubernetes Enhancement Proposal: Simplified version of topology manager in kube-scheduler #1858
  • Discussion Link:
    SIG Scheduling Meeting 20200702: Recording, Slides
    SIG Node Meeting 20200811: Recording, Slides
  • Primary contacts (assignee):
    Alexey Perevalov (@AlexeyPerevalov)
  • Responsible SIGs: sig-node and sig-scheduling
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (1.21)
    • Beta release target (x.y)
    • Stable release target (x.y)
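
To illustrate what "topology aware pod placement" could mean at filter time, here is a minimal sketch, assuming a single-NUMA-node style of alignment in which a node is only feasible if one of its NUMA zones can satisfy the whole request. The function names, the dict layout for per-zone resources, and the `example.com/device` resource name are all hypothetical and chosen for illustration; this is not the plugin's actual code.

```python
from typing import Dict, List

# Assumption: resources are plain name -> quantity maps (hypothetical layout).
Resources = Dict[str, int]


def fits_single_zone(request: Resources, zones: List[Resources]) -> bool:
    """Return True if at least one NUMA zone can satisfy every requested resource."""
    return any(
        all(zone.get(name, 0) >= qty for name, qty in request.items())
        for zone in zones
    )


def filter_nodes(request: Resources, nodes: Dict[str, List[Resources]]) -> List[str]:
    """Keep only the nodes where the request fits entirely inside one zone."""
    return [name for name, zones in nodes.items() if fits_single_zone(request, zones)]


# Hypothetical per-zone free resources for two nodes.
nodes = {
    # node-a has 8 CPUs and 1 device in total, but never together in one zone.
    "node-a": [{"cpu": 2, "example.com/device": 1}, {"cpu": 6, "example.com/device": 0}],
    "node-b": [{"cpu": 4, "example.com/device": 1}, {"cpu": 4, "example.com/device": 1}],
}
request = {"cpu": 4, "example.com/device": 1}
feasible = filter_nodes(request, nodes)
```

In this example `node-a` is rejected even though its node-level totals would satisfy the request, which is the kind of case a scheduler that only sees node-level totals cannot detect.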
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 1, 2020
@swatisehgal
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 1, 2020
@swatisehgal
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Oct 1, 2020
@kikisdeliveryservice
Member

@swatisehgal each KEP needs a separate issue.

@swatisehgal swatisehgal changed the title Topology Aware Scheduling in kubernetes Topology aware scheduler plugin in kube-scheduler Oct 2, 2020
@swatisehgal
Contributor Author

@swatisehgal each KEP needs a separate issue.

@kikisdeliveryservice Updated this issue and created another issue #2051

@kikisdeliveryservice
Member

Great! so the underlying PR from an enhancements reqs perspective is good and just needs 1 change to the dir structure.

@kikisdeliveryservice kikisdeliveryservice added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Oct 2, 2020
@kikisdeliveryservice kikisdeliveryservice added this to the v1.20 milestone Oct 2, 2020
@kikisdeliveryservice
Member

Marking this as 1.20 for now, since that's what the underlying enhancement has. If that's in error, just LMK!

@swatisehgal
Contributor Author

swatisehgal commented Oct 5, 2020

@kikisdeliveryservice As per conversation with Derek last week, we are not able to target this for 1.20. So 1.21 is correct.

@kikisdeliveryservice kikisdeliveryservice added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Oct 5, 2020
@kikisdeliveryservice kikisdeliveryservice removed this from the v1.20 milestone Oct 5, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@swatisehgal
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 5, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@swatisehgal
Contributor Author

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@swatisehgal: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jun 10, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 8, 2021
@catblade

@swatisehgal what can we do to help on this? target was 1.21?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2021
@catblade

@swatisehgal ping

@swatisehgal
Contributor Author

swatisehgal commented Oct 26, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 26, 2021
@swatisehgal
Contributor Author

swatisehgal commented Oct 26, 2021

@catblade We have been focusing on enabling the out-of-tree solution for Topology aware Scheduling. The two main components are:

  1. Topology aware Scheduler plugin
    Node Resource Topology is now part of the https://github.com/kubernetes-sigs/scheduler-plugins repo.

  2. nfd-topology-updater in Node feature Discovery
    The PR "Introducing NFD Topology Updater exposing Resource hardware Topology info through CRs" (kubernetes-sigs/node-feature-discovery#525) was merged recently, but a lot of work remains. Please refer to the comment here for information on further work that is either in progress or still needs to be done in NFD.

Some of the items we can use help with are:

  1. Watcher implementation (K8s side and/or NFD topology-updater side). Here is some context on this:
  • We have enhanced the Pod Resource API with “List” and “GetAllocatableResources” endpoints in order to account for allocated resources, but the API needs further improvement because both endpoints require the monitoring application to poll the kubelet. If the monitoring loop is too slow, the scheduler is likely to get stale information; on the other hand, if the application polls very frequently, it adds extra load to the kubelet (and to the system in general). To overcome this limitation, we need to add a Watch endpoint that reports a stream of events to the monitoring application, both when resource allocation changes (when pods are created or deleted) and when resource availability changes (when device plugins are added or removed). Here is our initial POC on this. We were looking for this to be part of the Kubernetes 1.24 release. Please refer to the link here on Kubernetes release timelines.

  • Alternatively, we can obtain notification events for CRI runtime. Please refer to the discussion about this here.

  • The monitoring application, which in our case is the nfd-topology-updater in NFD, would also need modifications to be able to update the CRs on every pod creation/deletion event, as opposed to the current timer-based approach.

  2. Topology aware scheduling testing at scale. This work ties in to the value proposition of Topology aware Scheduling, and we are looking to gain insight into how this solution performs in a large-scale cluster.
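
To make the polling versus Watch trade-off from the Watcher item concrete, here is a small simulation sketch. It does not call the real kubelet API: the event timeline, the `apply` and `simulate` helpers, and the notion of a "tick" are all hypothetical and only model the behaviour described above (a List-style poller that fetches the full state on a timer, and a Watch-style subscriber that receives each change as it happens).

```python
from typing import Dict, Tuple

# Hypothetical event timeline: tick -> (action, pod name, exclusive CPUs).
EVENTS: Dict[int, Tuple[str, str, int]] = {
    2: ("ADDED", "pod-a", 4),
    7: ("DELETED", "pod-a", 0),
}


def apply(view: Dict[str, int], event: Tuple[str, str, int]) -> None:
    """Apply one allocation event to a view of per-pod CPU allocations."""
    action, pod, cpus = event
    if action == "ADDED":
        view[pod] = cpus
    else:
        view.pop(pod, None)


def simulate(poll_interval: int, ticks: int = 10) -> Dict[str, int]:
    """Compare a polling monitor with an event-driven one over the same timeline."""
    kubelet: Dict[str, int] = {}   # ground truth held by the simulated kubelet
    polled: Dict[str, int] = {}    # what a List-style poller last saw
    watched: Dict[str, int] = {}   # what a Watch-style subscriber has built up
    stats = {"list_calls": 0, "poll_stale_ticks": 0,
             "watch_events": 0, "watch_stale_ticks": 0}

    for tick in range(ticks):
        event = EVENTS.get(tick)
        if event is not None:
            apply(kubelet, event)
            # Watch path: the change is pushed to the subscriber immediately.
            apply(watched, event)
            stats["watch_events"] += 1
        if tick % poll_interval == 0:
            # Poll path: a full listing is fetched whether or not anything changed.
            polled = dict(kubelet)
            stats["list_calls"] += 1
        # Count the ticks during which each monitor disagrees with the kubelet.
        stats["poll_stale_ticks"] += int(polled != kubelet)
        stats["watch_stale_ticks"] += int(watched != kubelet)
    return stats


slow = simulate(poll_interval=5)
fast = simulate(poll_interval=1)
```

In this toy timeline the slow poller makes only 2 list calls but holds a stale view for 6 of the 10 ticks, the fast poller is never stale but makes 10 list calls, and the event-driven view handles 2 events and is never stale.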

@catblade

@swatisehgal I must be missing some links re 1) and first bullet on items we can help on.

@catblade

catblade commented Nov 3, 2021

@swatisehgal re-ping for links above.

@swatisehgal
Contributor Author

@catblade Updated the comment above with the relevant links. Please refer to some additional links below:

Issue: #2043
Initial Enhancement proposal: #1884

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

6 participants