diff --git a/site/content/en/docs/concepts/_index.md b/site/content/en/docs/concepts/_index.md
index ecb71bfec4..0bf7f4f78b 100644
--- a/site/content/en/docs/concepts/_index.md
+++ b/site/content/en/docs/concepts/_index.md
@@ -37,7 +37,7 @@ Kueue. Sometimes referred to as _job_.
 ### [Workload Priority Class](/docs/concepts/workload_priority_class)
 
 `WorkloadPriorityClass` defines a priority class for a workload,
-independently from [pod priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/).
+independently from [pod priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/). This priority value from a `WorkloadPriorityClass` is only used for managing the queueing and preemption of [Workloads](#workload).
 
 ### [Admission Check](/docs/concepts/admission_check)
@@ -46,6 +46,11 @@ A mechanism allowing internal or external components to influence the timing of
 
 ![Components](/images/queueing-components.svg)
 
+### [Topology Aware Scheduling](/docs/concepts/topology_aware_scheduling)
+
+A mechanism that allows scheduling Workloads while optimizing Pod placement
+for network throughput between the Pods.
+
 ## Glossary
 
 ### Quota Reservation
diff --git a/site/content/en/docs/concepts/topology_aware_scheduling.md b/site/content/en/docs/concepts/topology_aware_scheduling.md
new file mode 100644
index 0000000000..e35c3f2971
--- /dev/null
+++ b/site/content/en/docs/concepts/topology_aware_scheduling.md
@@ -0,0 +1,120 @@
+---
+title: "Topology Aware Scheduling"
+date: 2024-04-11
+weight: 6
+description: >
+  Allows scheduling of Pods based on the topology of nodes in a data center.
+---
+
+{{< feature-state state="alpha" for_version="v0.9" >}}
+
+It is common for AI/ML workloads to require a significant amount of pod-to-pod
+communication. Therefore, the network bandwidth between the running Pods
+translates into both the execution time and the cost of running such
+workloads. The available bandwidth between the Pods depends on the placement,
+within the data center, of the Nodes running the Pods.
+
+Data centers have a hierarchical structure of their organizational units, like
+racks and blocks: there are multiple nodes within a rack, and multiple racks
+within a block. Pods running within the same organizational unit have better
+network bandwidth than Pods in different units. We say that nodes placed in
+different racks are more distant than nodes placed within the same rack.
+Similarly, nodes placed in different blocks are more distant than two nodes
+within the same block.
+
+This feature (called Topology Aware Scheduling, or TAS for short) introduces a
+convention to represent the
+[hierarchical node topology information](#node-topology-information), and a set
+of APIs for Kueue administrators and users to utilize the information to
+optimize Pod placement.
+
+### Node topology information
+
+We propose a lightweight model for representing the hierarchy of nodes within a
+data center by using node labels. In this model the node labels are set up by a
+cloud provider, or set up manually by administrators of on-premise clusters.
+
+Additionally, we assume that every node used for TAS has a set of labels which
+uniquely identifies its location in the tree structure. We do not assume global
+uniqueness of labels on each level, i.e. there could be two nodes with the same
+"rack" label, but in different "blocks".
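+
+For instance, under these assumptions, the labels of a single Node could look
+as follows (a hypothetical sketch; the label keys follow the convention used
+in the examples on this page):
+
+```yaml
+apiVersion: v1
+kind: Node
+metadata:
+  name: node-1
+  labels:
+    cloud.provider.com/topology-block: "block-1"
+    cloud.provider.com/topology-rack: "rack-1"
+    kubernetes.io/hostname: "node-1"
+```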
+
+For example, this is a representation of the data center hierarchy:
+
+|  node  | cloud.provider.com/topology-block | cloud.provider.com/topology-rack |
+|:------:|:---------------------------------:|:--------------------------------:|
+| node-1 |              block-1              |              rack-1              |
+| node-2 |              block-1              |              rack-2              |
+| node-3 |              block-2              |              rack-1              |
+| node-4 |              block-2              |              rack-3              |
+
+Note that there is a pair of nodes, node-1 and node-3, with the same value of
+the "cloud.provider.com/topology-rack" label, but in different blocks.
+
+### Capacity calculation
+
+For each PodSet, TAS determines the current free capacity for each topology
+domain (like a given rack) by:
+- including the Node allocatable capacity (based on the `.status.allocatable`
+  field) of only ready Nodes (with the `Ready=True` condition),
+- subtracting the usage coming from all other admitted TAS workloads,
+- subtracting the usage coming from all other non-TAS Pods (owned mainly by
+  DaemonSets, but also including static Pods, Deployments, etc.).
+
+### Admin-facing APIs
+
+As an admin, in order to enable the feature you need to:
+1. ensure the `TopologyAwareScheduling` feature gate is enabled
+2. create at least one instance of the `Topology` API
+3. reference the `Topology` API from a dedicated ResourceFlavor by the
+   `.spec.topologyName` field
+
+#### Example
+
+{{< include "examples/tas/sample-queues.yaml" "yaml" >}}
+
+### User-facing APIs
+
+Once TAS is configured and ready to be used, you can create Jobs with the
+following annotations set at the PodTemplate level:
+- `kueue.x-k8s.io/podset-preferred-topology` - indicates that a PodSet requires
+  Topology Aware Scheduling, but scheduling all pods of the PodSet on nodes
+  within the same topology domain is a preference rather than a requirement.
+  The levels are evaluated one-by-one going up from the level indicated by
+  the annotation. If the PodSet cannot fit within a given topology domain,
+  then the next topology level up is considered. If the PodSet cannot fit
+  at the highest topology level, then it gets admitted as distributed
+  among multiple topology domains.
+- `kueue.x-k8s.io/podset-required-topology` - indicates that a PodSet requires
+  Topology Aware Scheduling, and requires scheduling all pods on nodes within
+  the same topology domain corresponding to the topology level indicated by
+  the annotation value (e.g. within a rack or within a block); see the sketch
+  after this list.
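+
+As an illustration of the required variant, here is a hypothetical Job (a
+sketch modeled on the preferred example below, assuming the same
+`tas-user-queue` LocalQueue and the rack-level label from the examples on
+this page):
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  generateName: tas-sample-required
+  labels:
+    kueue.x-k8s.io/queue-name: tas-user-queue
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  template:
+    metadata:
+      annotations:
+        # Require all 10 Pods to land on nodes within a single rack;
+        # the Workload is not admitted until some rack can fit all of them.
+        kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
+    spec:
+      containers:
+      - name: dummy-job
+        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+        args: ["300s"]
+        resources:
+          requests:
+            cpu: "1"
+            memory: "200Mi"
+      restartPolicy: Never
+```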
+
+#### Example
+
+Here is an example Job a user might submit to use TAS. It assumes there exists
+a LocalQueue named `tas-user-queue` which references the ClusterQueue pointing
+to a TAS ResourceFlavor.
+
+{{< include "examples/tas/sample-job-preferred.yaml" "yaml" >}}
+
+### Limitations
+
+Currently, there are multiple limitations for the compatibility of the feature
+with other features. In particular, a ClusterQueue referencing a TAS Resource
+Flavor (with the `.spec.topologyName` field) is marked as inactive in the
+following scenarios:
+- the CQ is in a cohort (`.spec.cohort` is set)
+- the CQ is using [preemption](/docs/concepts/preemption)
+- the CQ is using [MultiKueue](/docs/concepts/multikueue) or
+  [ProvisioningRequest](/docs/admission-check-controllers/provisioning/)
+  admission checks
+
+Support for these usage scenarios is planned in future releases of Kueue.
+
+## Drawbacks
+
+When the feature is enabled, Kueue keeps track of all Pods and all Nodes in
+the system, which results in larger memory requirements for Kueue.
+Additionally, Kueue will take longer to schedule the workloads as it needs to
+take the topology information into account.
diff --git a/site/static/examples/tas/sample-job-preferred.yaml b/site/static/examples/tas/sample-job-preferred.yaml
new file mode 100644
index 0000000000..3a0674bfce
--- /dev/null
+++ b/site/static/examples/tas/sample-job-preferred.yaml
@@ -0,0 +1,24 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  generateName: tas-sample-preferred
+  labels:
+    kueue.x-k8s.io/queue-name: tas-user-queue
+spec:
+  parallelism: 40
+  completions: 40
+  completionMode: Indexed
+  template:
+    metadata:
+      annotations:
+        kueue.x-k8s.io/podset-preferred-topology: "cloud.provider.com/topology-block"
+    spec:
+      containers:
+      - name: dummy-job
+        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
+        args: ["300s"]
+        resources:
+          requests:
+            cpu: "1"
+            memory: "200Mi"
+      restartPolicy: Never
\ No newline at end of file
diff --git a/site/static/examples/tas/sample-queues.yaml b/site/static/examples/tas/sample-queues.yaml
new file mode 100644
index 0000000000..56c1ae68db
--- /dev/null
+++ b/site/static/examples/tas/sample-queues.yaml
@@ -0,0 +1,34 @@
+apiVersion: kueue.x-k8s.io/v1alpha1
+kind: Topology
+metadata:
+  name: "default"
+spec:
+  levels:
+  - nodeLabel: "cloud.provider.com/topology-block"
+  - nodeLabel: "cloud.provider.com/topology-rack"
+  - nodeLabel: "kubernetes.io/hostname"
+---
+kind: ResourceFlavor
+apiVersion: kueue.x-k8s.io/v1beta1
+metadata:
+  name: "tas-flavor"
+spec:
+  nodeLabels:
+    cloud.provider.com/node-group: "tas-node-group"
+  topologyName: "default"
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: "tas-cluster-queue"
+spec:
+  namespaceSelector: {} # match all.
+  resourceGroups:
+  - coveredResources: ["cpu", "memory"]
+    flavors:
+    - name: "tas-flavor"
+      resources:
+      - name: "cpu"
+        nominalQuota: 100
+      - name: "memory"
+        nominalQuota: 100Gi
\ No newline at end of file
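+---
+# Note: the doc above assumes a LocalQueue named "tas-user-queue", which is
+# not among this PR's files. A minimal sketch of such a LocalQueue, pointing
+# at the ClusterQueue defined above, could look like this (illustrative only):
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  namespace: "default"
+  name: "tas-user-queue"
+spec:
+  clusterQueue: "tas-cluster-queue"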