add kep-162 Colocated Placement #168
base: main

@@ -0,0 +1,268 @@
# KEP-162: Colocated placement support

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [LeaderWorkerSet API](#leaderworkerset-api)
  - [SubGroup Policy Support](#subgroup-policy-support)
  - [Test Plan](#test-plan)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

This KEP proposes adding `spec.leaderWorkerTemplate.topologyPlacementPolicy` to support both exclusive and colocated placement of multiple pod groups.

## Motivation

Fine-grained topology scheduling for leaders and workers is necessary.
In some cases, being in the same topology domain means lower communication costs, so a leader and its workers should be placed in the same domain, especially when the workers are spread across multiple hosts for distributed inference. Currently, enabling [Exclusive placement](https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement) schedules a pod group onto a single topology domain, but each domain can then hold only one pod group. Since this restriction is not necessary in all cases, we should offer finer-grained scheduling ability.

This KEP adds support in LeaderWorkerSet for colocated topology placement: each pod group is still kept within a single topology domain, but the domain is not claimed exclusively, so multiple groups may share it.

### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

- Add a new way of topology scheduling that allows multiple pod groups to land on the same domain
- Include exclusive placement in the new field and maintain backward compatibility

### Non-Goals

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

#### Story 1

Each pod group (leader and workers) is colocated in one domain, i.e., more than one group could land on the same domain.

> Review: Can you think of a real case? This one is somewhat theoretical.
>
> Reply: Based on my understanding, assuming a nodepool is a topology domain, each nodepool contains many nodes with the same GPU (easy to scale). So it makes sense to naturally schedule more than one group to a nodepool.

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

For now, exclusive placement is enabled by setting an annotation. We want to unify exclusive and colocated topology scheduling into a single field. We can maintain backward compatibility, but we have to validate that the topology key set in the annotation matches the one set in the new field.
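
A minimal sketch of that cross-check, assuming the existing `leaderworkerset.sigs.k8s.io/exclusive-topology` annotation is still honored; the helper name and error wording below are illustrative only and not part of the current lws code base:

```go
package validation

import "fmt"

// exclusiveTopologyAnnotation is the annotation used to enable exclusive
// placement today.
const exclusiveTopologyAnnotation = "leaderworkerset.sigs.k8s.io/exclusive-topology"

// validateTopologyKeyConsistency (illustrative, not existing lws code) rejects
// specs where the legacy annotation and the new field request different
// topology keys.
func validateTopologyKeyConsistency(annotations map[string]string, fieldTopologyKey *string) error {
	annotationKey, hasAnnotation := annotations[exclusiveTopologyAnnotation]
	if !hasAnnotation || fieldTopologyKey == nil {
		// Only one of the two APIs is in use; nothing to cross-check.
		return nil
	}
	if annotationKey != *fieldTopologyKey {
		return fmt.Errorf("annotation %s=%q conflicts with topologyPlacementPolicy.topologyKey=%q",
			exclusiveTopologyAnnotation, annotationKey, *fieldTopologyKey)
	}
	return nil
}
```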

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

Similar to the implementation of exclusive placement, we only need to drop the podAntiAffinity part in the pod webhook, as sketched below. Exclusive and colocated placement are contradictory: if we added this feature as another annotation, enabling both annotations at once would be seriously ambiguous. So we think they should be unified into one field, and it is necessary to change the API spec.
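
A rough sketch of that webhook logic (not the actual lws implementation; the group-key label and function name are placeholders used only to illustrate the idea):

```go
package webhooks

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// groupKeyLabel is a placeholder for the label that uniquely identifies a pod group.
const groupKeyLabel = "leaderworkerset.sigs.k8s.io/group-key"

// applyPlacementAffinity keeps all pods that share groupKey inside one
// topology domain. For Exclusive placement an anti-affinity term additionally
// keeps other groups out of that domain; for Colocated placement that term is
// simply omitted, so several groups may share the domain.
func applyPlacementAffinity(pod *corev1.Pod, groupKey, topologyKey string, exclusive bool) {
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}

	// Attract pods of the same group into the same topology domain.
	pod.Spec.Affinity.PodAffinity = &corev1.PodAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      groupKeyLabel,
					Operator: metav1.LabelSelectorOpIn,
					Values:   []string{groupKey},
				}},
			},
			TopologyKey: topologyKey,
		}},
	}

	if !exclusive {
		// Colocated: no anti-affinity, other groups may land on the same domain.
		return
	}

	// Exclusive: repel pods that carry a different group key.
	pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      groupKeyLabel,
					Operator: metav1.LabelSelectorOpNotIn,
					Values:   []string{groupKey},
				}},
			},
			TopologyKey: topologyKey,
		}},
	}
}
```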

### LeaderWorkerSet API

```go
// TopologyPlacementPolicyType defines how pod groups are placed relative to a
// topology domain.
type TopologyPlacementPolicyType string

const (
	// ExclusiveTopologyPlacementPolicyType places each pod group in a single
	// topology domain and keeps other groups out of that domain.
	ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
	// ColocatedTopologyPlacementPolicyType places each pod group in a single
	// topology domain but allows other groups to share it.
	ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
	// NoneTopologyPlacementPolicyType applies no topology constraint.
	NoneTopologyPlacementPolicyType TopologyPlacementPolicyType = "None"
)

type LeaderWorkerTemplate struct {
	// +optional
	TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy,omitempty"`
}

type TopologyPlacementPolicy struct {
	// +kubebuilder:default=None
	// +kubebuilder:validation:Enum={Exclusive,Colocated,None}
	Type TopologyPlacementPolicyType `json:"type"`

	// TopologyKey is nil when Type is None and must be set otherwise.
	// +optional
	TopologyKey *string `json:"topologyKey,omitempty"`
}
```

Review discussion on this API:

> On the `Colocated` policy type: How will we implement this? If we implement it using pod affinity (i.e., the same as exclusive placement, but removing the anti-affinity term), we will run into race conditions: consider the case where two groups try to schedule on the same domain at the same time but there is capacity for only one; there is a possibility that both will partially schedule on the domain and neither will schedule fully.
>
> Reply: That's a tricky case I didn't consider before. It seems like there's already an issue when scheduling multiple pod groups with limited resources. Refer to #167.
>
> Reply: I understand that a deadlock with the plugin only happens when the […]
>
> Reply: I guess this is more related to scheduling rather than the workload itself? This is an existing problem for all distributed workloads, unless we have a queueing system ourselves just like […]

> On `TopologyPlacementPolicy` in `LeaderWorkerTemplate`: Should this be a slice, in case we append more than one policy to the lws, with the policies ANDed?

> On the struct name: Maybe `GroupPlacementPolicy`? — Sounds good.

> On the `Type` field: Should we add a constraint mode like prefer or required? But this is not that vital; we can add it later.

> On `TopologyKey`: Why a pointer? — If the policy is None, topologyKey should be nil; otherwise, it should be explicitly declared. So I think a pointer is more suitable.
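
To make the pointer semantics from the last point explicit, here is a validation sketch of the discussed invariant (nil topologyKey only for `None`); the helper is illustrative and not part of the proposal itself:

```go
package validation

import "fmt"

// validatePlacementPolicy (illustrative only) enforces the invariant discussed
// above: topologyKey must be nil for the None policy and must be set for the
// Exclusive and Colocated policies.
func validatePlacementPolicy(policyType string, topologyKey *string) error {
	switch policyType {
	case "None":
		if topologyKey != nil {
			return fmt.Errorf("topologyKey must not be set when type is None")
		}
	case "Exclusive", "Colocated":
		if topologyKey == nil || *topologyKey == "" {
			return fmt.Errorf("topologyKey is required when type is %s", policyType)
		}
	default:
		return fmt.Errorf("unknown topology placement policy type %q", policyType)
	}
	return nil
}
```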

### SubGroup Policy Support

Colocated placement can support the subgroup policy as well.
Compared with [exclusive support for subgroup policy](https://github.com/kubernetes-sigs/lws/assets/86417275/ff9fc93d-c738-4c09-abc8-50a7b16d49df), the workflow is almost identical, as follows:

> Review: The subgroup feature will only have one leader pod for all the pods in the same topology.
>
> Reply: For the subgroup policy, it seems like it just needs to ensure that the leader and workers are in the same topology, so I think colocated placement should meet its requirements.

![](./workflow.jpg)

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Unit tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

Unit tests will cover all newly introduced functions and add cases for modified existing functions.

##### Integration tests

<!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->

- Integration tests of the pod webhook should cover:
  - pod affinity/anti-affinity is set correctly when the topology placement policy type is Exclusive (a colocated counterpart is sketched after this list)
- Integration tests of the pod controller should cover:
  - the pod nodeSelector is injected properly when the topology placement policy type is Colocated
  - waiting until the pod is scheduled when the topology placement policy type is Exclusive or Colocated
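
As a unit-level companion to the webhook case, a sketch of the assertion for the colocated policy, reusing the illustrative `applyPlacementAffinity` helper from the Design Details section (again, not the real lws webhook entry point; the topology key is just the example value from the exclusive placement docs):

```go
package webhooks

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// TestColocatedOmitsAntiAffinity checks that, with the Colocated policy, the
// illustrative webhook helper injects pod affinity but no pod anti-affinity.
func TestColocatedOmitsAntiAffinity(t *testing.T) {
	pod := &corev1.Pod{}
	applyPlacementAffinity(pod, "group-0", "cloud.google.com/gke-nodepool", false /* exclusive */)

	if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAffinity == nil {
		t.Fatalf("expected pod affinity to be injected for colocated placement")
	}
	if pod.Spec.Affinity.PodAntiAffinity != nil {
		t.Errorf("expected no pod anti-affinity for colocated placement, got %+v",
			pod.Spec.Affinity.PodAntiAffinity)
	}
}
```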

##### e2e tests

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

When topologyPlacementPolicy is set, an lws deployment will have correct pod placement for each type (Exclusive, None, and Colocated). It should also work well with other features enabled, such as the subgroup policy, failure handling, and rolling updates.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

We could also add this feature as a new annotation, "leaderworkerset.sigs.k8s.io/colocated-topology". But this would conflict with the exclusive-topology annotation when both are enabled at the same time, so adding a new field to the API spec is recommended.

@@ -0,0 +1,28 @@
title: Colocation Placement Support
kep-number: 162
authors:
  - "vie-serendipity"
status: provisional
creation-date: 2024-07-01
reviewers:
  - "@kerthcet"
  - "@liurupeng"
  - "@ahg-g"
approvers:
  - "@kerthcet"
  - "@liurupeng"
  - "@ahg-g"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.3.0"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v0.5.0"
  beta: "v0.5.0"
  stable: "v0.5.0"

> Review: For one pod group (one multi-host inference replica), we need to colocate the pods, but for different pod groups, why do we need to put them in the same domain, since different replicas won't communicate with each other?
>
> Reply: I might have confused you with my expression; I mean allowing all leaders to schedule freely across any topology, not restricting one pod group per topology domain as exclusive placement does.