add kep-162 Colocated Placement #168

Open · wants to merge 1 commit into base: main
268 changes: 268 additions & 0 deletions keps/162-colocated-placement/README.md
@@ -0,0 +1,268 @@
# KEP-162: Colocated placement support

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [LeaderWorkerSet API](#leaderworkerset-api)
- [SubGroup Policy Support](#subgroup-policy-support)
- [Test Plan](#test-plan)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

<!--
This section is incredibly important for producing high-quality, user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins, in order to
avoid requiring implementors to split their attention between writing release
notes and implementing the feature itself. KEP editors and SIG Docs
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.

A good summary is probably at least a paragraph in length.

Both in this section and below, follow the guidelines of the [documentation
style guide]. In particular, wrap lines to a reasonable length, to make it
easier for reviewers to cite specific portions, and to minimize diff churn on
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

This KEP proposes adding `spec.leaderWorkerTemplate.topologyPlacementPolicy` to support both exclusive and colocated topology placement of pod groups.

## Motivation

Fine-grained topology scheduling for leaders and workers is often necessary.
Being in the same topology domain usually means lower communication cost, so a leader and its workers should land in the same domain, especially when the workers are spread across multiple hosts for distributed inference. Currently, enabling [Exclusive placement](https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement) schedules a pod group onto a single topology domain, but each domain can then host only one pod group. Since this restriction is not necessary in all cases, we should offer a finer-grained scheduling option.

This KEP adds support in LeaderWorkerSet for colocated topology placement of each pod group.

### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->

- Add a new mode of topology scheduling that allows multiple pod groups to land on the same domain
  > **@liurupeng (Collaborator), Jul 8, 2024:** For one pod group (one multi-host inference replica) we need to colocate the pods, but for different pod groups, why do we need to put them in the same domain, since different replicas won't communicate with each other?
  >
  > **Author:** I may have confused you with my wording. The intent is to allow all leaders to schedule freely across any topology, rather than restricting each topology domain to a single pod group the way exclusive placement does.

- Include exclusive placement in the new field and maintain backward compatibility

### Non-Goals

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?.
The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

#### Story 1
Each pod group (leader and its workers) is colocated within one topology domain, while more than one group may land on the same domain.
> **Contributor:** Can you think of a real case? This one is somewhat theoretical.
>
> **Author:** Based on my understanding, assuming a node pool is a topology domain, each node pool contains many nodes with the same GPU type (easy to scale), so it is natural to schedule more than one group onto a node pool. And it is reasonable to schedule a group of pods to a single node pool, since the nodes have the same GPU and therefore equal inference speed.


### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

For now, exclusive placement is enabled by setting an annotation. We want to unify exclusive and colocated topology scheduling into one field. Backward compatibility can be maintained, but we have to validate that the topology key in the annotation matches the one in the new field.
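
To make that compatibility check concrete, here is a minimal validation sketch. It assumes the existing `leaderworkerset.sigs.k8s.io/exclusive-topology` annotation and the proposed `topologyPlacementPolicy` field; the function name, import path, and error messages are illustrative, not a committed implementation.

```go
package webhooks

import (
    "fmt"

    v1 "sigs.k8s.io/lws/api/leaderworkerset/v1" // assumed import path for the LWS API types
)

// validateTopologyPlacement is an illustrative sketch, not the final webhook code.
// Annotation-only (legacy) and field-only configurations pass through unchanged;
// when both are present, the annotation is only compatible with the Exclusive type
// and must carry the same topology key as the field.
func validateTopologyPlacement(lws *v1.LeaderWorkerSet) error {
    annotationKey, hasAnnotation := lws.Annotations["leaderworkerset.sigs.k8s.io/exclusive-topology"]
    policy := lws.Spec.LeaderWorkerTemplate.TopologyPlacementPolicy // proposed field, not yet in the API

    if !hasAnnotation || policy.Type == v1.NoneTopologyPlacementPolicyType {
        return nil
    }
    if policy.Type != v1.ExclusiveTopologyPlacementPolicyType {
        return fmt.Errorf("exclusive-topology annotation conflicts with topologyPlacementPolicy type %q", policy.Type)
    }
    if policy.TopologyKey == nil || *policy.TopologyKey != annotationKey {
        return fmt.Errorf("topologyPlacementPolicy.topologyKey must match the exclusive-topology annotation value %q", annotationKey)
    }
    return nil
}
```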

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

The implementation is similar to exclusive placement; we only need to omit the podAntiAffinity part in the pod webhook. Exclusive and colocated placement are mutually exclusive: if we also exposed this feature through annotations and both were enabled at once, the meaning would be seriously ambiguous. We therefore unify them into a single field, which requires an API change.
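
As a rough sketch of that difference (not the existing webhook code), the helper below builds the affinity the pod webhook could inject, assuming a group-identifying label such as `leaderworkerset.sigs.k8s.io/group-key`: both policies get the pod-affinity term that pulls a group onto one domain, and only `Exclusive` adds the anti-affinity term that keeps other groups out.

```go
package webhooks

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setTopologyAffinity is an illustrative helper, not the existing implementation;
// the group-key label name is an assumption for the sketch.
func setTopologyAffinity(pod *corev1.Pod, topologyKey, groupKey string, exclusive bool) {
    if pod.Spec.Affinity == nil {
        pod.Spec.Affinity = &corev1.Affinity{}
    }

    // Both Colocated and Exclusive pull the pods of the same group onto one domain.
    pod.Spec.Affinity.PodAffinity = &corev1.PodAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
            TopologyKey: topologyKey,
            LabelSelector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"leaderworkerset.sigs.k8s.io/group-key": groupKey},
            },
        }},
    }

    if !exclusive {
        return // Colocated: other groups may share the domain, so no anti-affinity.
    }

    // Exclusive only: repel pods that belong to a *different* group from this domain.
    pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{
        RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
            TopologyKey: topologyKey,
            LabelSelector: &metav1.LabelSelector{
                MatchExpressions: []metav1.LabelSelectorRequirement{{
                    Key:      "leaderworkerset.sigs.k8s.io/group-key",
                    Operator: metav1.LabelSelectorOpExists,
                }, {
                    Key:      "leaderworkerset.sigs.k8s.io/group-key",
                    Operator: metav1.LabelSelectorOpNotIn,
                    Values:   []string{groupKey},
                }},
            },
        }},
    }
}
```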

### LeaderWorkerSet API

```go
type TopologyPlacementPolicyType string

const (
    ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
    ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
    NoneTopologyPlacementPolicyType      TopologyPlacementPolicyType = "None"
)

type LeaderWorkerTemplate struct {
    // +optional
    TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy,omitempty"`
}

type TopologyPlacementPolicy struct {
    // Type of topology placement: Exclusive, Colocated, or None.
    // +kubebuilder:default=None
    // +kubebuilder:validation:Enum=Exclusive;Colocated;None
    Type TopologyPlacementPolicyType `json:"type"`

    // TopologyKey is the node label that defines the topology domain.
    // It must be set when Type is Exclusive or Colocated, and left nil when Type is None.
    TopologyKey *string `json:"topologyKey,omitempty"`
}
```

Review discussion on this API:

> **Contributor** (on `ColocatedTopologyPlacementPolicyType`): How will we implement this? If we implement it using pod affinity (i.e., the same as exclusive placement but with the anti-affinity term removed), we will run into race conditions: consider two groups trying to schedule on the same domain at the same time when there is capacity for only one; both may partially schedule on the domain and neither will schedule fully.
>
> **Author:** That's a tricky case I hadn't considered. It seems there is already an issue when scheduling multiple pod groups with limited resources; refer to #167, where the co-scheduling plugin conflicts with the LeaderReady startup policy. In this situation I think we have to rely on the autoscaler or on adding more resources manually.
>
> **Author:** As I understand it, the deadlock with the plugin only happens when the LeaderReady startup policy is adopted. We could do gang scheduling with the pod group declared in the plugin, which would fix the problem you mentioned. It's worth noting that the colocated placement policy and the LeaderReady startup policy can't be set together, and we have to validate that in the webhook. But I'm not sure whether relying on an out-of-tree scheduler plugin is appropriate.
>
> **Contributor:** I guess this is more related to scheduling than to the workload itself? It is an existing problem for all distributed workloads, unless we build a queueing system of our own, like WaitForPodsReady in Kueue.

> **Contributor** (on the `TopologyPlacementPolicy` field): Should this be a slice, in case we later append more than one policy to the lws, with the policies ANDed?

> **Contributor** (on the struct name): Maybe `GroupPlacementPolicy`?
>
> **Author:** Sounds good.

> **Contributor** (on `Type`): Should we add a constraint mode like preferred or required? It's not vital; we can add it later.

> **Contributor** (on `TopologyKey`): Why a pointer?
>
> **Author:** If the policy is `None`, `topologyKey` should be nil; otherwise, it should be explicitly declared. So I think a pointer is more suitable.
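
Following the pointer discussion above, the field-level validation could look roughly like this (an illustrative sketch using the types proposed above; the function name and error wording are not part of the proposal):

```go
import "fmt"

// validateTopologyPlacementPolicy sketches the invariant discussed above:
// topologyKey stays nil for the None type and must be set for Exclusive or Colocated.
func validateTopologyPlacementPolicy(policy TopologyPlacementPolicy) error {
    switch policy.Type {
    case NoneTopologyPlacementPolicyType:
        if policy.TopologyKey != nil {
            return fmt.Errorf("topologyKey must be unset when type is None")
        }
    case ExclusiveTopologyPlacementPolicyType, ColocatedTopologyPlacementPolicyType:
        if policy.TopologyKey == nil || *policy.TopologyKey == "" {
            return fmt.Errorf("topologyKey is required when type is %q", policy.Type)
        }
    default:
        return fmt.Errorf("unsupported topology placement policy type %q", policy.Type)
    }
    return nil
}
```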

### SubGroup Policy Support
Colocated placement can support the subgroup policy as well.
Compared with [exclusive support for subgroup policy](https://github.com/kubernetes-sigs/lws/assets/86417275/ff9fc93d-c738-4c09-abc8-50a7b16d49df), the workflow is almost identical, as follows:

![](./workflow.jpg)

> **@liurupeng (Collaborator):** The subgroup feature will only have one leader pod for all the pods in the same topology.
>
> **Author:** For the subgroup policy, it seems it only needs to ensure that the leader and its workers are in the same topology, so colocated placement should meet its requirements.

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
The goal is to ensure that we don't accept enhancements with inadequate testing.

All code is expected to have adequate tests (eventually with coverage
expectations). Please adhere to the [Kubernetes testing guidelines][testing-guidelines]
when drafting this test plan.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Unit tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>
The data can be easily read from:
https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

Unit tests will cover all introduced functions and complement test cases for modified existing functions.
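
For example, a unit test for the illustrative `setTopologyAffinity` helper sketched under Design Details could assert that the colocated path omits the anti-affinity term (both the helper and this test are illustrative, not existing code):

```go
package webhooks

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
)

func TestColocatedOmitsAntiAffinity(t *testing.T) {
    pod := &corev1.Pod{}

    // false => Colocated: only the pod-affinity term should be injected.
    setTopologyAffinity(pod, "cloud.google.com/gke-nodepool", "group-a", false)

    if pod.Spec.Affinity == nil || pod.Spec.Affinity.PodAffinity == nil {
        t.Fatalf("expected pod affinity to be injected for colocated placement")
    }
    if pod.Spec.Affinity.PodAntiAffinity != nil {
        t.Fatalf("expected no pod anti-affinity for colocated placement")
    }
}
```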

##### Integration tests

<!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->

- Integration tests of the pod webhook should cover:
  - pod affinity/anti-affinity is set correctly when the topology placement policy type is `Exclusive`
- Integration tests of the pod controller should cover:
  - the pod nodeSelector is injected properly when the topology placement policy type is `Colocated`
  - waiting until the pod is scheduled when the topology placement policy type is `Exclusive` or `Colocated`

##### e2e tests

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.

For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

When the topology placement policy type is set (`Exclusive`, `Colocated`, or `None`), an lws deployment should end up with the correct pod placement. It should also work well when other features are enabled, such as the subgroup policy, failure handling, and rolling updates.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->


## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

We could also expose this feature through a new annotation, `leaderworkerset.sigs.k8s.io/colocated-topology`. However, that annotation would conflict with the exclusive-topology annotation when both are enabled at the same time, so adding a new field to the API spec is recommended.
28 changes: 28 additions & 0 deletions keps/162-colocated-placement/kep.yaml
@@ -0,0 +1,28 @@
title: Colocated Placement Support
kep-number: 162
authors:
- "vie-serendipity"
status: provisional
creation-date: 2024-07-01
reviewers:
- "@kerthcet"
- "@liurupeng"
- "@ahg-g"
approvers:
- "@kerthcet"
- "@liurupeng"
- "@ahg-g"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.3.0"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v0.5.0"
beta: "v0.5.0"
stable: "v0.5.0"
Binary file added keps/162-colocated-placement/workflow.jpg