diff --git a/keps/2724-topology-aware-schedling/README.md b/keps/2724-topology-aware-schedling/README.md index 08483e4c2b..5a20749350 100644 --- a/keps/2724-topology-aware-schedling/README.md +++ b/keps/2724-topology-aware-schedling/README.md @@ -1,4 +1,4 @@ -# KEP-NNNN: Your short, descriptive title +# KEP-2724: Topology Aware Scheduling - [Summary](#summary) @@ -9,10 +9,13 @@ - [User Stories](#user-stories) - [Story 1](#story-1) - [Story 2](#story-2) + - [Story 3](#story-3) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Integration support](#integration-support) + - [Job](#job) - [JobSet](#jobset) - [Support for the "auto" mode](#support-for-the-auto-mode) + - [PodSetAssignment is per lowest-topology level](#podsetassignment-is-per-lowest-topology-level) - [Risks and Mitigations](#risks-and-mitigations) - [Non-exclusive use of nodes](#non-exclusive-use-of-nodes) - [Node topology changes](#node-topology-changes) @@ -25,13 +28,12 @@ - [Internal APIs](#internal-apis) - [Computing the assignment](#computing-the-assignment) - [Enforcing the assignment](#enforcing-the-assignment) - - [Support for ReplicatedJobs in JobSet](#support-for-replicatedjobs-in-jobset) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit Tests](#unit-tests) - [Integration tests](#integration-tests) - [Graduation Criteria](#graduation-criteria) - - [Alpha (MVP):](#alpha-mvp) + - [Alpha:](#alpha) - [Beta](#beta) - [Stable](#stable) - [Implementation History](#implementation-history) @@ -39,13 +41,15 @@ - [Alternatives](#alternatives) - [Account usage by watching DaemonSet pods](#account-usage-by-watching-daemonset-pods) - [Use label for workload](#use-label-for-workload) + - [Implement it in ClusterAutoscaler or kube-scheduler](#implement-it-in-clusterautoscaler-or-kube-scheduler) + - [Support for ReplicatedJobs in JobSet](#support-for-replicatedjobs-in-jobset) ## Summary This KEP introduces a mechanism, called Topology Aware Scheduling (TAS), to facilitate Job scheduling in Kueue by leveraging the information about the -hierarchical organization of a datacenter. +hierarchical organization of a data center. First, we start by observing that data centers have organizational units (like racks and blocks). VMs running within the same organizational unit have better @@ -56,7 +60,7 @@ more distant than nodes placed within the same rack. Similarly, nodes placed in different blocks are more distant than two nodes within the same block. Based on these observations we propose a convention to expose the information -about the node placement in a datacenter hierarchy using node labels. +about the node placement in a data center hierarchy using node labels. We also propose a set of APIs for Kueue administrators and users to utilize this information in order to optimize the network throughput between the pods. @@ -70,7 +74,7 @@ running pods. For example, some workloads may run twice slower if there is a pod placed on a node which belongs to a different block. However, currently the end user of -Kueue has no way to "require" that all pods of its workload run within the same +Kueue has no way to "required" that all pods of its workload run within the same block. ### Goals @@ -102,19 +106,37 @@ The above features might be pursued in the future in follow up KEPs. #### Story 1 As an ML researcher I run AI training workloads which require exchanging huge -amounts of data between pods. 
The workloads take twice longer and are twice
-more expensive when there are pods running in different datacenter racks. I
-don't want to start my workload if the pods cannot be placed within the same
-rack.
+amounts of data between the worker Pods. I use k8s Jobs.
+
+A) The workloads take twice as long and are twice as expensive when there are
+pods running in different data center racks. I don't want to start my workload
+if the pods cannot be placed within the same rack.
+
+B) I would like to optimize the runtime of the workloads by placing the pods as
+close to one another as possible, ideally within a single rack, but I'm ok
+running the workload on nodes scattered across the data center if placing all
+pods within the same rack is not possible at the given time.

#### Story 2

-As an ML researcher I run AI training workloads which require communicating
-moderate amounts of data between pods. I would like to optimize the runtime of
-the workloads, by placing the pods as close to another as possible, ideally
-within a single rack, but I'm ok running the workload on nodes scattered across
-datacenter if placing all pods within the same hierarchy rack is not possible at
-the given time.
+Similar to [Story 1](#story-1), but I would like to run my Workloads using
+JobSet with multiple instances of the worker ReplicatedJob
+([`replicas>0`](https://github.com/kubernetes-sigs/jobset/blob/d7c1dadd55906dec1d1275bcb7b73f08d1fa509f/api/jobset/v1alpha2/jobset_types.go#L227)).
+
+To achieve optimal performance I need to ensure that every ReplicatedJob
+instance is contained within a "rack".
+
+Note: not planned for [Alpha](#alpha), to be evaluated for [Beta](#beta).
+
+#### Story 3
+
+Similar to [Story 2](#story-2), but I use multi-template Jobs (JobSet, MPIJob,
+TFJob) and Pods belonging to different templates also need to exchange sizable
+amounts of data, impacting the execution time. I would like to be able to
+indicate that (1) all worker Pods are contained within a "rack", and (2) all
+Pods in the workload are contained within a "block".
+
+Note: not planned for [Alpha](#alpha), to be evaluated for [Beta](#beta).

### Notes/Constraints/Caveats (Optional)

@@ -123,23 +145,114 @@ the given time.

#### Integration support

In the alpha iteration we aim to support Job & JobSet. If other integrations
follow naturally we may as well support them in alpha.

+##### Job
+
+**Example**:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: tas-example-job
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue
+spec:
+  parallelism: 10
+  completions: 10
+  template:
+    metadata:
+      annotations:
+        kueue.x-k8s.io/podset-required-topology: cloud.provider.com/topology-rack
+    spec:
+      containers:
+      - name: worker
+        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+        args: ["300s"]
+        resources:
+          requests:
+            cpu: "4"
+      restartPolicy: Never
+```
+
+In this example we indicate that all Pods created by the Job should be contained
+within the same "rack".
+
##### JobSet

One complication we noticed for JobSet is that the proposed design assumes
injecting the dedicated scheduling gate, called `kueue.x-k8s.io/topology` into
-the PodTemplate. However, currently JobSet's `spec.replicatedJob` field is
-immutable, even if the JobSet is suspended. This has been relaxed in JobSet 0.6
+the PodTemplate. This required a JobSet fix, which was released in JobSet 0.6
(see [PR](https://github.com/kubernetes-sigs/jobset/pull/623)).
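+
+For illustration, a minimal sketch of how a Pod created for a TAS workload could
+look while it is gated (this is what Kueue injects, not something the user
+writes). Only the gate name `kueue.x-k8s.io/topology` comes from this proposal;
+the image is reused from the examples in this KEP and the remaining fields are
+assumed:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  generateName: tas-example-job-
+spec:
+  schedulingGates:
+  # injected by Kueue so that kube-scheduler does not bind the Pod yet
+  - name: kueue.x-k8s.io/topology
+  containers:
+  - name: worker
+    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+    args: ["300s"]
+  restartPolicy: Never
+```
+
+The gate is later removed by the `TopologyUngater` once the topology assignment
+is computed (see [Enforcing the assignment](#enforcing-the-assignment)).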
+**Example**:
+
+```yaml
+apiVersion: jobset.x-k8s.io/v1alpha2
+kind: JobSet
+metadata:
+  name: tas-example-jobset
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue
+spec:
+  replicatedJobs:
+  - name: leader
+    replicas: 1
+    template:
+      spec:
+        completions: 1
+        parallelism: 1
+        template:
+          metadata:
+            annotations:
+              kueue.x-k8s.io/podset-required-topology: cloud.provider.com/topology-rack
+          spec:
+            containers:
+            - name: leader
+              image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+              args: ["300s"]
+  - name: workers
+    replicas: 2
+    template:
+      spec:
+        completions: 2
+        parallelism: 2
+        template:
+          metadata:
+            annotations:
+              kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-rack
+          spec:
+            containers:
+            - name: worker
+              image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+              args: ["100s"]
+```
+
+In this example we say that the PodSet corresponding to the leader requires
+the "rack" topology domain (there is only one such Pod, so the requirement
+is satisfied trivially). The PodSet corresponding to the workers only
+prefers the "rack" topology domain, so it may fall back to "block".
+
#### Support for the "auto" mode

We are considering introducing the "auto" mode where the user does not need to
specify the level by node label value, but the lowest level is selected
automatically.

-One approach is to accept "auto" as a specific value for the "prefer" and
-"require" labels. However, we are going to defer this for Beta or GA based on
-the users' feedback.
+One approach is to accept "auto" as a specific value for the "preferred" and
+"required" annotations. However, we are going to defer this to Beta or GA based
+on the users' feedback.
+
+#### PodSetAssignment is per lowest-topology level
+
+Assigning the lowest topology level prevents the following race condition.
+
+For example, there is a topology with one block, "block1", and three racks:
+"rack1", "rack2", "rack3". Now, a user sends "workload1", which requires the
+"block" topology, and it fits within two racks. Let's assume Kueue only
+assigned a nodeSelector matching "block1" on admission. Then, kube-scheduler
+binds the Pods to "rack1" and "rack2". Meanwhile, "workload2" is admitted,
+requiring "rack". It gets assigned to "rack1", but cannot be scheduled since
+the nodes of "rack1" are already in use.

### Risks and Mitigations
@@ -166,7 +279,8 @@ replacement node.
In order to mitigate this risk we propose to extend the waitForPodsReady
mechanism with a new timeout, called replacement timeout, which defines the
timeout for all pods to be ready again. The time is computed since the last
-transition to the PodsReady=false, more details in the KEP PR. This mechanism
+transition to `PodsReady=false`; more details in the
+[KEP PR](https://github.com/kubernetes-sigs/kueue/pull/2737). This mechanism
will also be helpful for regular workloads.

A more involving approach would be to recompute the TopologyAdmission, however,
@@ -192,7 +306,7 @@ We will re-evaluate the need for Beta or GA based on the users' feedback.

### Hierarchy representation

-We propose the model for representing the hierarchy of nodes within a datacenter
+We propose the model for representing the hierarchy of nodes within a data center
by using node labels. We assume the node labels are set up by a cloud provider,
or set up manually by administrators of on-premise clusters.
@@ -223,22 +337,34 @@ type ResourceFlavorSpec struct {
	// TopologyName indicates the name of the topology for the ResourceFlavor.
// When specified, it enables scraping of the topology information from the // nodes matching to the Resource Flavor node labels. - TopologyName *string + // + // +optional + TopologyName *string `json:"topologyName,omitempty"` } // TopologySpec defines the desired state of Topology type TopologySpec struct { - // Levels defines the levels of topology. - Levels []TopologyLevel + // Levels defines the levels of topology. + // + // +listType=atomic + // +kubebuilder:validation:MinItems=1 + // +kubebuilder:validation:MaxItems=5 + Levels []TopologyLevel `json:"levels,omitempty"` } // TopologyLevel defines the desired state of TopologyLevel type TopologyLevel struct { - // NodeLabel indicates the name of the node label for a specific topology - // level. Examples: + // NodeLabel indicates the name of the node label for a specific topology + // level. Examples: // - cloud.provider.com/topology-block // - cloud.provider.com/topology-rack - NodeLabel string + // + // +required + // +kubebuilder:validation:Required + // +kubebuilder:validation:MinLength=1 + // +kubebuilder:validation:MaxLength=316 + // +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$` + NodeLabel string `json:"nodeLabel,omitempty"` } ``` @@ -252,7 +378,7 @@ spec: nodeLabels: "cloud.provider.com/node-group: tas" topologyName: default -—k +--- kind: Topology metadata: name: "default" @@ -265,55 +391,68 @@ spec: ### User-facing API The user will need to point the workload to the ClusterQueue with the TAS -ResourceFlavor, and add one of the annotations: +ResourceFlavor. Additionally, the user specifies the annotations at the +PodTemplate level: ```golang const ( - // This annotation indicates that a workload requires Topology Aware Scheduling, + // This annotation indicates that a PodSet requires Topology Aware Scheduling, // and running all pods on nodes closely connected within the same level of // hierarchy is a strong requirement for scheduling the workload. - RequireTopologyAnnotation = "kueue.x-k8s.io/require-topology" + PodSetRequiredTopologyAnnotation = "kueue.x-k8s.io/podset-required-topology" - // This annotation indicates that a workload requires Topology Aware Scheduling, + // This annotation indicates that a PodSet requires Topology Aware Scheduling, // but running all pods without the same topology level is a preference rather - // than requirement. There is a distinguished value "auto" which means that - // the lowest hierarchy level should be used. - PreferTopologyAnnotation = "kueue.x-k8s.io/prefer-topology" + // than requirement. + // + // The levels are evaluated one-by-one going up from the level indicated by + // the annotation. If the PodSet cannot fit within a given topology domain + // then the next topology level up is checked. If the PodSet cannot fit + // at the highest topology level, then it gets admitted as distributed + // among multiple topology domains. 
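+	//
+	// Example (as in the JobSet example above; illustrative only):
+	//   kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-rack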
+ PodSetPreferredTopologyAnnotation = "kueue.x-k8s.io/podset-preferred-topology" ) ``` ### Validation -We introduce the following validations: -- the value of `kueue.x-k8s.io/require-topology` is one of the labels specified - in the `LevelLabels` structure -- the value of `kueue.x-k8s.io/prefer-topology` is one of the labels specified - in the `LevelLabels` structure, or "auto" -- the ResourceFlavor defining`TopologyAwareScheduling` needs to have at least +The validations at the creation time: +- the ResourceFlavor defining `TopologyAwareScheduling` needs to have at least one node label +We introduce the following validations (post-creation time, workload violating +the rules is deactivated): +- the value of `kueue.x-k8s.io/podset-required-topology` is one of the labels + specified in the topology node labels +- the value of `kueue.x-k8s.io/podset-preferred-topology` is one of the labels + specified in the topology node labels +- if "podset-required-topology" or "podset-preferred-topology" is specified for + one PodTemplate, then one of them is specified for every PodTemplate in the + Job spec. + ### Internal APIs We extend the `Workload` structure to reflect the topology request at the Job level. ```golang -type WorkloadSpec struct { +type PodSet struct { ... - // TopologyRequest defines the topology requested for the corresponding Job. - TopologyRequest *TopologyRequest + // TopologyRequest defines the topology requested for the corresponding PodSet. + // +optional + TopologyRequest *PodSetTopologyRequest `json:"topologyRequest,omitempty"` } -type TopologyRequest struct { +type PodSetTopologyRequest struct { // Policy defines the policy used for TAS. Possible values are: - // - Prefer set when `kueue.x-k8s.io/prefer-topology` annotation is set on the Job - // - Require set when `kueue.x-k8s.io/require-topology` annotation is set on the Job - Policy TopologyRequestPolicy + // - Preferred set when `kueue.x-k8s.io/podset-preferred-topology` annotation is set on the Job + // - Required set when `kueue.x-k8s.io/podset-required-topology` annotation is set on the Job + Policy TopologyRequestPolicy `json:"policy"` - // Level indicated by the `kueue.x-k8s.io/prefer-topology` or `kueue.x-k8s.io/require-topology ` - // annotation - Level string + // Level indicated by the `kueue.x-k8s.io/podset-preferred-topology` or + // `kueue.x-k8s.io/podset-required-topology` annotation + Level string `json:"level"` } ``` @@ -325,21 +464,28 @@ type PodSetAssignment struct { ... // TopologyAssignment indicates the resources assigned per topology level - TopologyAssignment *TopologyAssignment + // +optional + TopologyAssignment *TopologyAssignment `json:"topologyAssignment,omitempty"` } type TopologyAssignment struct { - // Slices contains the list of assignments split into slices - Slices []TopologyAssignmentSlice + // Groups contains the list of assignments split into groups corresponding + // to the same topology domain at the lowest level of the hierarchy. + // +required + // +listType=atomic + // +kubebuilder:validation:MinItems=1 + Groups []TopologyAssignmentGroup `json:"groups"` } -type TopologyAssignmentSlice struct { +type TopologyAssignmentGroup struct { // NodeLabels constitutes the nodeSelector for a given slice of pods. It // defines values for all labels configured in the Topology.Levels. 
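+	//
+	// Illustrative example (assuming the "block" and "rack" levels used in
+	// this KEP; the values are hypothetical):
+	//   cloud.provider.com/topology-block: block-1
+	//   cloud.provider.com/topology-rack: rack-2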
- NodeLabels map[string]string + // +kubebuilder:validation:MinItems=1 + NodeLabels map[string]string `json:"nodeLabels"` - // Count indicates the number of pods in a given TopologyAssignmentSlice - Count int + // Count indicates the number of pods in a given TopologyAssignmentGroup. + // +required + Count int32 `json:"count"` } ``` @@ -361,9 +507,14 @@ const ( ) ``` +The above API does not support [Story 2](#story-2). We will defer the support +for [Beta](#beta). The initial approach for the design is left in the +[Support for ReplicatedJobs in JobSet](#support-for-replicatedjobs-in-jobset) +section. + ### Computing the assignment -The extended pod assignment is set on admitting the workload. In order to +The extended PodSet assignment is set on admitting the workload. In order to compute the assignment Kueue caches the information about the node allocatable capacity and the current usage by other workloads in this resource flavor. @@ -380,20 +531,21 @@ The available capacity for TAS nodes is then computed as a difference between the allocatable space and current usage. For a given PodSet Kueue: -- when the `require-topology` is used, then Kueue tries to find any value of the +- when the `podset-required-topology` is used, then Kueue tries to find any value of the level label which can accommodate all the pods. If there is no such value, then the workload keeps waiting in the queue. -- when the `prefer-topology` is used, then Kueue tries to use heuristics to - minimize the number of values of the label used. One approach is to greedily - choose the value which can accommodate the most pods. +- when the `podset-preferred-topology` is used, then Kueue tries to find a level + at which the PodSet fully fits within a topology domain corresponding to the + level. Kueue starts the search from the specified level, but if the PodSet + does not fit, then it tries higher levels in the hierarchy. ### Enforcing the assignment -When the workload has the PodSet assignments which are non-homogenous (all Pods -landing in the same topology) and is about to start we modify the corresponding -PodTemplates in the Job object to inject the `kueue.x-k8s.io/topology` -scheduling gate. For homogenous PodSet assignments it is enough to inject the -corresponding nodeSelector in the PodTemplates. +When the workload has the PodSet assignments and is about to start we modify the +corresponding PodTemplates in the Job object to inject the +`kueue.x-k8s.io/topology` scheduling gate. Note that the assignment always +includes the lowest level of the topology (see +[PodSetAssignment is per lowest-topology level](#podsetassignment-is-per-lowest-topology-level)). Then, there is a new component, called `TopologyUngater`, which is a Workload reconciler which lists all pods for a given TAS PodSet, and ensures that the @@ -410,55 +562,10 @@ created pods at the similar time don't trigger too many reconciliations. In order to prevent ungating more pods as expected for a PodSet we consider to use the expectations mechanism. The expectations are set for when we are about -to ungate a Pod. The expectation is fulfield if the Pod is observed as ungated +to ungate a Pod. The expectation is fulfilled if the Pod is observed as ungated or the ungating request fails. We hold ungating if there are pending ungatings within the PodSet. -### Support for ReplicatedJobs in JobSet - -Currently, in the PodSet we just keep the number of Pods created for a given -PodTemplate. 
However, in case of JobSet the pods are naturally split between -instances of replicated Jobs. - -Thus, without any extra work for JobSet, if JobSet has more than one -ReplicatedJob its Pods could thus be assigned by TAS to different topologies, -making the assignment suboptimal. - -In order to improve accuracy of the support for ReplicatedJobs we do the -following API adjustments: - -```golang -type PodSet struct { - ... - - // ReplicatedJobCount indicates the number of replicated Jobs being span by - // the PodSet. Each replicated Job is assumed to have the same number of Pods - // with the same template. - // Default: 1 - ReplicatedJobCount *int32 - - // ReplicatedJobKeyLabel specifies the name of the label which indicates - // the specific Job instance among ReplicatedJobs. - ReplicatedJobKeyLabel *string -} -``` - -In case of JobSet `ReplicatedJobKeyLabel` could be either -`jobset.sigs.k8s.io/job-index` or `jobset.sigs.k8s.io/replicatedjob-name` as -both will uniquely identify a specific Job among the set of ReplicatedJob. - -```golang -type PodSetAssignment struct { - - // ReplicatedJobKey indicates the value of the Pod label which is used to - // specify a specific Job among a set of ReplicatedJobs. - ReplicatedJobKey *string -} -``` -Then, when the compute the assignments per pods sets corresponding to -ReplicatedJobs, rather than the entire PodSet. Finally, the `PodSetUngater` -ungates pods per replicatedJob assignment. - ### Test Plan [x] I/we understand the owners of the involved components may require updates to @@ -471,11 +578,11 @@ necessary to implement this enhancement. We add integration tests for the following scenarios: - a new node is added making workload admission possible -- the PodSet assignment is computed successfully for `require-topology` -- the PodSet assignment cannot be computed for `require-topology` -- the PodSet assignment is computed successfully for `prefer-topology` across - multiple values -- the PodSet assignment cannot be computed for `prefer-topology` +- the PodSet assignment is computed successfully for `podset-required-topology` +- the PodSet assignment cannot be computed for `podset-required-topology` +- the PodSet assignment is computed successfully for `podset-preferred-topology` + across multiple values +- the PodSet assignment cannot be computed for `podset-preferred-topology` - the schedulingGate is added to the pod template #### Integration tests @@ -483,14 +590,14 @@ We add integration tests for the following scenarios: We are going to add the integration tests to make sure the implementation is well covered. 
In particular, the following scenarios need coverage:
- adding new nodes
-- PodSet assignment for `prefer-topology` and `require-topology`
+- PodSet assignment for `podset-preferred-topology` and `podset-required-topology`
- Job's pod template has the "topology" scheduling gate injected
- un-gate pods by with the "topology" scheduling gate
- the Workload can be suspended and unsuspended

### Graduation Criteria

-#### Alpha (MVP):
+#### Alpha:

- support for all built-in integrations: Job and JobSet
- support single-level hierarchy
@@ -515,7 +622,10 @@ The new validations which are for MVP, but likely will be relaxed in the future:
- re-evaluate accounting for used resources by watching DaemonSets
- re-evaluate the need for the "auto" mode which does not require a user to
  specify the hierarchy name, but just selects the lowest one
-- support for ReplicatedJobs in JobSet
+- re-evaluate the need to support "preferred/required" preferences at the
+  ReplicatedJob level (see [Story 2](#story-2))
+- re-evaluate the need to support "preferred/required" preferences at the
+  Workload level (see [Story 3](#story-3))

#### Stable
@@ -567,3 +677,89 @@ Currently workloads can have names which exceed the maximal length for a label.
So, we would need to shorten the maximal workload name by reducing the const
`maxPrefixLength` in `workload_names.go`. However, we can also facilitate fast
lookups based on labels if we index the workloads.
+
+### Implement it in ClusterAutoscaler or kube-scheduler
+
+This feature requires tracking nodes in order to reconstruct the capacity of
+all topology domains. This is a novelty in Kueue, which so far has left
+features requiring knowledge about Nodes to kube-scheduler or the Cluster
+Autoscaler.
+
+**Reasons for discarding/deferring**
+
+kube-scheduler:
+- the scheduler does not have the ability to do what `TopologyUngater` does; it
+  operates at a lower level of abstraction
+- the topology information is not represented in core k8s, and the scheduler
+  wouldn't read the information from CRDs
+- kube-scheduler doesn't do PodGroup-level scheduling or preemption, which is
+  required
+
+ClusterAutoscaler:
+- isn't focused on scheduling or preemption; its responsibility is to provision
+  new Nodes
+- operates at the level of individual Pods rather than Jobs or PodGroups
+
+Also, neither of the components has a concept of quota, which we want to be
+supported with this feature. So, if Kueue admits a Job with "required: rack"
+just based on quota, the Pods might be created, but the components wouldn't
+be able to schedule the Pods if there is not enough capacity in the topology
+domain.
+
+### Support for ReplicatedJobs in JobSet
+
+This is needed to support [Story 2](#story-2).
+
+Currently, in the PodSet we just keep the number of Pods created for a given
+PodTemplate. So, without any extra work for JobSet, if a JobSet has more than
+one ReplicatedJob, its Pods could be assigned by TAS to different topology
+domains, making the assignment suboptimal.
+
+Below is a "backup" of the API discussed to support assignments per instance
+of a ReplicatedJob:
+
+```golang
+type PodSet struct {
+	...
+
+	// ReplicatedJobCount indicates the number of replicated Jobs spanned by
+	// the PodSet. Each replicated Job is assumed to have the same number of
+	// Pods with the same template.
+	// Default: 1
+	ReplicatedJobCount *int32
+
+	// ReplicatedJobKeyLabel specifies the name of the label which indicates
+	// the specific Job instance among ReplicatedJobs.
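+	// For JobSet this could be, for example, jobset.sigs.k8s.io/job-index
+	// (see the note below this struct).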
+	ReplicatedJobKeyLabel *string
+
+	// ReplicatedJobLabelValue specifies the value of the label indicated by
+	// ReplicatedJobKeyLabel for a specific replicated Job.
+	ReplicatedJobLabelValue *string
+}
+```
+
+In the case of JobSet, `ReplicatedJobKeyLabel` could be either
+`jobset.sigs.k8s.io/job-index` or `jobset.sigs.k8s.io/replicatedjob-name`, as
+both uniquely identify a specific Job among the set of ReplicatedJobs.
+
+```golang
+type PodSetAssignment struct {
+
+	// ReplicatedJobKey indicates the key of the Pod label which is used to
+	// specify a specific Job among a set of ReplicatedJobs.
+	ReplicatedJobKey *string
+}
+```
+
+We then compute the assignments per ReplicatedJob instance, rather than for the
+entire PodSet. The computed assignments are represented as instances of
+`PodSetAssignment` with a specific `TopologyAssignment`.
+
+Finally, the `TopologyUngater` ungates pods per ReplicatedJob assignment.
+
+**Reasons for discarding/deferring**
+
+Supporting the ReplicatedJob level requires more APIs and refactoring of how
+Kueue operates. This could significantly delay the delivery of the first
+version of the feature. As explained in this
+[comment](https://github.com/kubernetes-sigs/kueue/pull/2725#discussion_r1758513294),
+we prefer to start with the simpler PodSet-level approach.
diff --git a/keps/2724-topology-aware-schedling/kep.yaml b/keps/2724-topology-aware-schedling/kep.yaml
index 1a0faa5b82..5c810fcd8d 100644
--- a/keps/2724-topology-aware-schedling/kep.yaml
+++ b/keps/2724-topology-aware-schedling/kep.yaml
@@ -8,8 +8,10 @@ creation-date: 2023-07-30
reviewers:
  - "@pbundyra"
  - "@alculquicondor"
+  - "@tenzen-y"
approvers:
  - "@alculquicondor"
+  - "@tenzen-y"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
@@ -22,8 +24,8 @@ latest-milestone: "v0.9"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v0.9"
-  beta: "v0.10"
-  stable: "v0.11"
+  beta:
+  stable:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled