diff --git a/keps/2724-topology-aware-schedling/README.md b/keps/2724-topology-aware-schedling/README.md
new file mode 100644
index 0000000000..08483e4c2b
--- /dev/null
+++ b/keps/2724-topology-aware-schedling/README.md
@@ -0,0 +1,569 @@
# KEP-2724: Topology Aware Scheduling


- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
    - [Integration support](#integration-support)
      - [JobSet](#jobset)
    - [Support for the "auto" mode](#support-for-the-auto-mode)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Non-exclusive use of nodes](#non-exclusive-use-of-nodes)
    - [Node topology changes](#node-topology-changes)
    - [Race condition when accounting for DaemonSet pods](#race-condition-when-accounting-for-daemonset-pods)
- [Design Details](#design-details)
  - [Hierarchy representation](#hierarchy-representation)
  - [Admin-facing API](#admin-facing-api)
  - [User-facing API](#user-facing-api)
  - [Validation](#validation)
  - [Internal APIs](#internal-apis)
  - [Computing the assignment](#computing-the-assignment)
  - [Enforcing the assignment](#enforcing-the-assignment)
  - [Support for ReplicatedJobs in JobSet](#support-for-replicatedjobs-in-jobset)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha (MVP):](#alpha-mvp)
    - [Beta](#beta)
    - [Stable](#stable)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Account usage by watching DaemonSet pods](#account-usage-by-watching-daemonset-pods)
  - [Use label for workload](#use-label-for-workload)

## Summary

This KEP introduces a mechanism, called Topology Aware Scheduling (TAS), to
facilitate Job scheduling in Kueue by leveraging the information about the
hierarchical organization of a datacenter.

First, we observe that data centers have organizational units (like racks and
blocks). VMs running within the same organizational unit have better network
bandwidth than VMs in different units. Second, the units form a hierarchical
structure: there are multiple nodes within a rack, and there are multiple racks
within a block. We say that nodes placed in different racks are more distant
than nodes placed within the same rack. Similarly, nodes placed in different
blocks are more distant than two nodes within the same block.

Based on these observations we propose a convention to expose the information
about the node placement in the datacenter hierarchy using node labels.

We also propose a set of APIs for Kueue administrators and users to utilize this
information in order to optimize the network throughput between the pods.

## Motivation

It is common for AI / ML workloads to require a significant amount of pod-to-pod
communication to make progress. Thus, it is important, for both runtime and
overall cost, to make sure there is a high-throughput network connection between
the running pods.

For example, some workloads may run twice as slow if a pod is placed on a node
which belongs to a different block. However, currently the end user of Kueue has
no way to "require" that all pods of their workload run within the same block.
### Goals

- allow a user to express that a workload can schedule only if all of its pods
  can be placed on nodes which are close together (within the same level of
  hierarchy)
- allow a user to express that a workload prefers to schedule all of its pods
  within the same level of hierarchy, but automatically relax the constraint if
  this is not possible.

### Non-Goals

- support MultiKueue on the management cluster
- support Cluster Autoscaler (ProvisioningRequest in particular)

The above features might be pursued in the future in follow-up KEPs.

## Proposal

* Introduce a convention to represent the topological hierarchy as node labels
* Provide an API to configure the list of node labels representing levels in
  the hierarchy per Resource Flavor
* Introduce a set of Job annotations which allow users to describe the TAS
  requirements

### User Stories

#### Story 1

As an ML researcher I run AI training workloads which require exchanging huge
amounts of data between pods. The workloads take twice as long and are twice as
expensive when there are pods running in different datacenter racks. I don't
want to start my workload if the pods cannot be placed within the same rack.

#### Story 2

As an ML researcher I run AI training workloads which require communicating
moderate amounts of data between pods. I would like to optimize the runtime of
the workloads by placing the pods as close to one another as possible, ideally
within a single rack, but I'm OK with running the workload on nodes scattered
across the datacenter if placing all pods within the same rack is not possible
at the given time.

### Notes/Constraints/Caveats (Optional)

#### Integration support

In the Alpha iteration we aim to support Job & JobSet. If other integrations
follow naturally, we may support them in Alpha as well.

##### JobSet

One complication we noticed for JobSet is that the proposed design assumes
injecting the dedicated scheduling gate, called `kueue.x-k8s.io/topology`, into
the PodTemplate. However, JobSet's `spec.replicatedJobs` field used to be
immutable, even while the JobSet was suspended. This has been relaxed in JobSet
0.6 (see [PR](https://github.com/kubernetes-sigs/jobset/pull/623)).

#### Support for the "auto" mode

We are considering introducing an "auto" mode where the user does not need to
specify the level by its node label, and the lowest level is selected
automatically.

One approach is to accept "auto" as a specific value for the "prefer" and
"require" annotations. However, we are going to defer this to Beta or GA based
on users' feedback.

### Risks and Mitigations

#### Non-exclusive use of nodes

The TAS feature assumes exclusive use of the nodes, that is, all pods running on
the nodes come from workloads assigned to the same TAS Resource Flavor. This can
be achieved by admins by using a distinct label corresponding to a node group
meant for TAS.

In order to reduce the risk of such misconfiguration we require every TAS
Resource Flavor to have at least one node label.

#### Node topology changes

A node can be removed during the runtime of a workload, and the replacement for
a pod running on this node may not be able to find a node matching the
TopologyAssignment. In that case the workload may not be able to progress.

First, note that this issue also exists for regular workloads, but for TAS
workloads, given the additional constraints, it might be harder to find a
replacement node.

In order to mitigate this risk we propose to extend the waitForPodsReady
mechanism with a new timeout, called the replacement timeout, which defines the
time allowed for all pods to become ready again. The time is computed since the
last transition to `PodsReady=false`; more details in the dedicated KEP PR. This
mechanism will also be helpful for regular workloads.

A more involved approach would be to recompute the TopologyAssignment. However,
until now we don't modify a workload's admission while the workload is
scheduled, so this would require extra investigation and effort. We will
consider this before graduation to GA, if feasible, based on users' feedback.
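For illustration, the replacement timeout could be surfaced next to the existing
waitForPodsReady settings in the Kueue `Configuration`. This is only a sketch:
the `replacementTimeout` field name is hypothetical, and the exact API will be
defined in the dedicated KEP PR:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  # Existing timeout, counted since the workload is admitted.
  timeout: 10m
  # Hypothetical field: time allowed to get back to PodsReady=true, counted
  # since the last transition to PodsReady=false (e.g. after a node is removed).
  replacementTimeout: 5m
```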
#### Race condition when accounting for DaemonSet pods

There is a risk that workloads are scheduled before the DaemonSet pods are
accounted for by TAS.

We consider this risk not critical for Alpha, as TAS workloads are expected to
mostly consume accelerators (like TPUs and GPUs), so they are not expected to
compete for resources with DaemonSet pods.

One way to mitigate the risk is to implement tracking for DaemonSets
(see [Account usage by watching DaemonSet pods](#account-usage-by-watching-daemonset-pods)).
We will re-evaluate the need for this in Beta or GA based on users' feedback.

## Design Details

### Hierarchy representation

We propose a model for representing the hierarchy of nodes within a datacenter
by using node labels. We assume the node labels are set up by a cloud provider,
or set up manually by administrators of on-premise clusters.

Additionally, we assume that every node used for TAS has a set of labels which
uniquely identifies its location in the tree structure. We do not assume global
uniqueness of labels on each level, i.e. there could be two nodes with the same
"rack" label, but in different "blocks".

For example, this is a representation of the datacenter hierarchy:

| node   | cloud.provider.com/topology-block | cloud.provider.com/topology-rack |
|:------:|:---------------------------------:|:--------------------------------:|
| node-1 | block-1                           | rack-1                           |
| node-2 | block-1                           | rack-2                           |
| node-3 | block-2                           | rack-1                           |
| node-4 | block-2                           | rack-3                           |

Note that there is a pair of nodes, node-1 and node-3, with the same value of
the "cloud.provider.com/topology-rack" label, but in different blocks.
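For instance, node-1 from the table above could look as follows. The
`cloud.provider.com/node-group` label matches the example TAS ResourceFlavor
shown in the next section; all label names here are just the proposed
convention, not fixed well-known labels:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    cloud.provider.com/node-group: tas
    cloud.provider.com/topology-block: block-1
    cloud.provider.com/topology-rack: rack-1
```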
### Admin-facing API

```golang
// ResourceFlavorSpec defines the desired state of the ResourceFlavor
type ResourceFlavorSpec struct {
  ...

  // TopologyName indicates the name of the topology for the ResourceFlavor.
  // When specified, it enables scraping of the topology information from the
  // nodes matching the ResourceFlavor node labels.
  TopologyName *string
}

// TopologySpec defines the desired state of Topology
type TopologySpec struct {
  // Levels defines the levels of topology.
  Levels []TopologyLevel
}

// TopologyLevel defines a single level of the topology
type TopologyLevel struct {
  // NodeLabel indicates the name of the node label for a specific topology
  // level. Examples:
  // - cloud.provider.com/topology-block
  // - cloud.provider.com/topology-rack
  NodeLabel string
}
```

Example TAS Resource Flavor & Topology configuration:

```yaml
kind: ResourceFlavor
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    "cloud.provider.com/node-group": tas
  topologyName: default
---
kind: Topology
metadata:
  name: "default"
spec:
  levels:
  - nodeLabel: cloud.provider.com/topology-block
  - nodeLabel: cloud.provider.com/topology-rack
```

### User-facing API

The user will need to point the workload to the ClusterQueue with the TAS
ResourceFlavor, and add one of the annotations:

```golang
const (

  // This annotation indicates that a workload requires Topology Aware
  // Scheduling, and running all pods on nodes closely connected within the
  // same level of hierarchy is a strong requirement for scheduling the
  // workload.
  RequireTopologyAnnotation = "kueue.x-k8s.io/require-topology"

  // This annotation indicates that a workload requires Topology Aware
  // Scheduling, but running all pods within the same topology level is a
  // preference rather than a requirement. There is a distinguished value
  // "auto" which means that the lowest hierarchy level should be used.
  PreferTopologyAnnotation = "kueue.x-k8s.io/prefer-topology"
)
```
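For example, a Job that must land entirely within a single rack could look as
follows. This is an illustrative sketch: the queue name is made up, and the
annotation is assumed to be set on the Job object as proposed above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tas-training
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue  # LocalQueue backed by the TAS ClusterQueue
  annotations:
    kueue.x-k8s.io/require-topology: cloud.provider.com/topology-rack
spec:
  parallelism: 8
  completions: 8
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # illustrative image
        resources:
          requests:
            cpu: "4"
```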
### Validation

We introduce the following validations:
- the value of `kueue.x-k8s.io/require-topology` is one of the node labels
  specified in the topology `Levels`
- the value of `kueue.x-k8s.io/prefer-topology` is one of the node labels
  specified in the topology `Levels`, or "auto"
- the ResourceFlavor enabling TAS (i.e. specifying `topologyName`) needs to
  have at least one node label

### Internal APIs

We extend the `Workload` structure to reflect the topology request at the
Job level.

```golang
type WorkloadSpec struct {
  ...

  // TopologyRequest defines the topology requested for the corresponding Job.
  TopologyRequest *TopologyRequest
}

type TopologyRequest struct {
  // Policy defines the policy used for TAS. Possible values are:
  // - Prefer set when `kueue.x-k8s.io/prefer-topology` annotation is set on the Job
  // - Require set when `kueue.x-k8s.io/require-topology` annotation is set on the Job
  Policy TopologyRequestPolicy

  // Level indicated by the `kueue.x-k8s.io/prefer-topology` or
  // `kueue.x-k8s.io/require-topology` annotation
  Level string
}
```

We extend the `PodSetAssignment` structure to keep track of the number of pods
assigned at each topology level to a specific subset of nodes.

```golang
type PodSetAssignment struct {
  ...

  // TopologyAssignment indicates the resources assigned per topology level
  TopologyAssignment *TopologyAssignment
}

type TopologyAssignment struct {
  // Slices contains the list of assignments split into slices
  Slices []TopologyAssignmentSlice
}

type TopologyAssignmentSlice struct {
  // NodeLabels constitutes the nodeSelector for a given slice of pods. It
  // defines values for all labels configured in the Topology.Levels.
  NodeLabels map[string]string

  // Count indicates the number of pods in a given TopologyAssignmentSlice
  Count int
}
```

Kueue uses the `kueue.x-k8s.io/topology` scheduling gate to delay the
`nodeSelector` assignment, because different pods in the same PodSet may have
different values:

```golang
const (
  // TopologySchedulingGate is used to delay the topology assignment for pods
  // until all the pods of the PodSet are created.
  TopologySchedulingGate = "kueue.x-k8s.io/topology"

  // WorkloadAnnotation indicates the name of the workload assigned.
  WorkloadAnnotation = "kueue.x-k8s.io/workload"

  // PodSetLabel indicates the name of the PodSet in the workload.
  PodSetLabel = "kueue.x-k8s.io/podset"
)
```

### Computing the assignment

The extended PodSet assignment is set when admitting the workload. In order to
compute the assignment, Kueue caches the information about the node allocatable
capacity and the current usage by other workloads in this resource flavor.

The information about the allocatable capacity is scraped from nodes based on
the `status.allocatable` field.

The cached information about the used resources is updated whenever a workload
is admitted, suspended, resumed or finished. Additionally, we scrape the usage
information from all non-TAS pods bound to the TAS nodes - this is done to
account for Pods which don't belong to Workloads managed by Kueue (such as
DaemonSet pods).

The available capacity for TAS nodes is then computed as the difference between
the allocatable space and the current usage.

For a given PodSet, Kueue:
- when `require-topology` is used, tries to find any value of the level label
  which can accommodate all the pods. If there is no such value, then the
  workload keeps waiting in the queue.
- when `prefer-topology` is used, uses heuristics to minimize the number of
  values of the label used. One approach is to greedily choose the value which
  can accommodate the most pods.
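Below is a minimal sketch of that greedy heuristic, assuming for simplicity that
the free capacity is already expressed as "number of pods that still fit" per
topology domain (the real implementation works on the cached per-node
allocatable resources and the full resource requests):

```golang
// assignGreedy distributes podCount pods over topology domains (values of the
// level label), repeatedly picking the domain which can accommodate the most
// pods. It returns the number of pods assigned per domain, or nil if the pods
// don't fit anywhere.
func assignGreedy(freePodsPerDomain map[string]int32, podCount int32) map[string]int32 {
  assignment := map[string]int32{}
  remaining := podCount
  for remaining > 0 {
    // Pick the domain with the largest remaining free capacity.
    best, bestFree := "", int32(0)
    for domain, free := range freePodsPerDomain {
      if free > bestFree {
        best, bestFree = domain, free
      }
    }
    if bestFree == 0 {
      // Not enough capacity in the flavor; the workload keeps waiting.
      return nil
    }
    count := remaining
    if bestFree < count {
      count = bestFree
    }
    assignment[best] += count
    freePodsPerDomain[best] -= count
    remaining -= count
  }
  return assignment
}
```

For `require-topology` the analogous check is simpler: a single domain with
enough free capacity for all pods must exist, otherwise the workload is not
admitted.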
### Enforcing the assignment

When a workload whose PodSet assignments are non-homogeneous (not all Pods
landing in the same topology domain) is about to start, we modify the
corresponding PodTemplates in the Job object to inject the
`kueue.x-k8s.io/topology` scheduling gate. For homogeneous PodSet assignments it
is enough to inject the corresponding nodeSelector in the PodTemplates.

Then, there is a new component, called `TopologyUngater`, which is a Workload
reconciler that lists all pods for a given TAS PodSet and ensures that the
expected number of pods is un-gated for each assigned topology domain.

Along with the scheduling gate, the `kueue.x-k8s.io/workload` annotation and the
`kueue.x-k8s.io/podset` label are added to each pod template to facilitate a
quick lookup (List request) of all pods corresponding to the workload (and, more
specifically, to a PodSetAssignment) by the `TopologyUngater`.

The TopologyUngater watches for pod events which trigger the reconciliation
loop. The reconciliations are batched into 1s periods to make sure that multiple
pods created around the same time don't trigger too many reconciliations.

In order to prevent ungating more pods than expected for a PodSet we consider
using the expectations mechanism. An expectation is set when we are about to
ungate a Pod. The expectation is fulfilled when the Pod is observed as ungated
or the ungating request fails. We hold off further ungating while there are
pending ungatings within the PodSet.
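To illustrate the mechanics, a freshly created Pod of such a PodSet would carry
roughly the following metadata and gate (all names are illustrative). The
`TopologyUngater` removes the gate and, at the same time, sets the node selector
of the topology domain chosen for this Pod (e.g.
`cloud.provider.com/topology-block: block-1`,
`cloud.provider.com/topology-rack: rack-1`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tas-training-0-x7zvq                      # illustrative Pod name
  annotations:
    kueue.x-k8s.io/workload: job-tas-training-a1b2c  # illustrative Workload name
  labels:
    kueue.x-k8s.io/podset: main
spec:
  schedulingGates:
  - name: kueue.x-k8s.io/topology
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest
```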
### Support for ReplicatedJobs in JobSet

Currently, in the PodSet we just keep the number of Pods created for a given
PodTemplate. However, in case of JobSet the pods are naturally split between
instances of replicated Jobs.

Thus, without any extra work for JobSet, if a JobSet has more than one
ReplicatedJob its Pods could be assigned by TAS to different topologies, making
the assignment suboptimal.

In order to improve the accuracy of the support for ReplicatedJobs we do the
following API adjustments:

```golang
type PodSet struct {
  ...

  // ReplicatedJobCount indicates the number of replicated Jobs spanned by
  // the PodSet. Each replicated Job is assumed to have the same number of Pods
  // with the same template.
  // Default: 1
  ReplicatedJobCount *int32

  // ReplicatedJobKeyLabel specifies the name of the label which indicates
  // the specific Job instance among ReplicatedJobs.
  ReplicatedJobKeyLabel *string
}
```

In the case of JobSet, `ReplicatedJobKeyLabel` could be either
`jobset.sigs.k8s.io/job-index` or `jobset.sigs.k8s.io/replicatedjob-name` as
both will uniquely identify a specific Job among the set of ReplicatedJobs.

```golang
type PodSetAssignment struct {

  // ReplicatedJobKey indicates the value of the Pod label which is used to
  // specify a specific Job among a set of ReplicatedJobs.
  ReplicatedJobKey *string
}
```

Then, we compute the assignments for the groups of pods corresponding to
ReplicatedJobs, rather than for the entire PodSet. Finally, the
`TopologyUngater` ungates pods per ReplicatedJob assignment.
### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes
necessary to implement this enhancement.

##### Prerequisite testing updates

#### Unit Tests

We add unit tests for the following scenarios:
- a new node is added making workload admission possible
- the PodSet assignment is computed successfully for `require-topology`
- the PodSet assignment cannot be computed for `require-topology`
- the PodSet assignment is computed successfully for `prefer-topology` across
  multiple values
- the PodSet assignment cannot be computed for `prefer-topology`
- the schedulingGate is added to the pod template

#### Integration tests

We are going to add integration tests to make sure the implementation is well
covered. In particular, the following scenarios need coverage:
- adding new nodes
- PodSet assignment for `prefer-topology` and `require-topology`
- Job's pod template has the "topology" scheduling gate injected
- un-gating pods with the "topology" scheduling gate
- the Workload can be suspended and unsuspended

### Graduation Criteria

#### Alpha (MVP):

- support for all built-in integrations: Job and JobSet
- support single-level hierarchy
- support TAS with minimal interaction with other features (no cohorts, no
  preemption, no reclaimable pods)

The following validations are introduced for MVP, but will likely be relaxed in
the future:
- ClusterQueue is marked inactive if it contains a TAS ResourceFlavor and
  belongs to a cohort
- ClusterQueue is marked inactive if it contains a TAS ResourceFlavor and
  enables preemptions
- ClusterQueue is marked inactive if it contains a TAS ResourceFlavor and uses
  the MultiKueue admission check
- ClusterQueue is marked inactive if it contains a TAS ResourceFlavor and uses
  the ProvisioningRequest admission check

#### Beta

- support for all other integrations
- support multi-level hierarchy
- support the replacement timeout in WaitForPodsReady
- re-evaluate accounting for used resources by watching DaemonSets
- re-evaluate the need for the "auto" mode which does not require a user to
  specify the hierarchy level, but just selects the lowest one
- support for ReplicatedJobs in JobSet

#### Stable

Consider the following improvements and implement if feasible:
- put pods with consecutive indexes of an IndexedJob on close nodes
- support the following features: reclaimable pods, cohorts, preemption within
  a cluster queue
- support re-computing the TopologyAssignment while the workload is running
- perform full scheduling simulation rather than just capacity counting
  (including pod affinities and anti-affinities)

## Implementation History

- 2024-07-25 - PR merged [Allow mutating schedulingGates when the Jobset is suspended](https://github.com/kubernetes-sigs/jobset/pull/623)
- 2024-07-30 - this KEP


## Drawbacks

Tracking of nodes by Kueue will increase its code complexity and memory
consumption.

## Alternatives

### Account usage by watching DaemonSet pods

The idea is to account for DaemonSet pods by tracking the DaemonSets rather than
their Pods. This could prevent the
[race condition](#race-condition-when-accounting-for-daemonset-pods) risk.

**Reasons for discarding/deferring**

When accounting for the DaemonSets we would need to verify that their selectors
match the node in question.

This is a complication which may not be necessary for MVP since the workloads
are mostly expected to consume accelerators (GPUs and TPUs).

We will consider this for Beta or GA based on users' feedback.

### Use label for workload

We considered using the `kueue.x-k8s.io/workload` label rather than an
annotation.

**Reasons for discarding/deferring**

Currently workloads can have names which exceed the maximal length for a label
value. So, we would need to shorten the maximal workload name by reducing the
const `maxPrefixLength` in `workload_names.go`. However, we can also facilitate
fast lookups based on labels if we index the workloads.

diff --git a/keps/2724-topology-aware-schedling/kep.yaml b/keps/2724-topology-aware-schedling/kep.yaml
new file mode 100644
index 0000000000..1a0faa5b82
--- /dev/null
+++ b/keps/2724-topology-aware-schedling/kep.yaml
@@ -0,0 +1,36 @@
title: Topology Aware Scheduling
kep-number: 2724
authors:
  - "@mimowo"
  - "@mwielgus"
status: provisional
creation-date: 2024-07-30
reviewers:
  - "@pbundyra"
  - "@alculquicondor"
approvers:
  - "@alculquicondor"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.9"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v0.9"
  beta: "v0.10"
  stable: "v0.11"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: TopologyAwareScheduling
disable-supported: true

# The following PRR answers are required at beta release
#metrics:
# - my_feature_metric