TAS: Add documentation page for concepts #3432

Merged
merged 2 commits into from
Nov 5, 2024
7 changes: 6 additions & 1 deletion site/content/en/docs/concepts/_index.md
@@ -37,7 +37,7 @@ Kueue. Sometimes referred to as _job_.
### [Workload Priority Class](/docs/concepts/workload_priority_class)

`WorkloadPriorityClass` defines a priority class for a workload,
independently from [pod priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/).
This priority value from a `WorkloadPriorityClass` is only used for managing the queueing and preemption of [Workloads](#workload).

### [Admission Check](/docs/concepts/admission_check)
@@ -46,6 +46,11 @@ A mechanism allowing internal or external components to influence the timing of

![Components](/images/queueing-components.svg)

### [Topology Aware Scheduling](/docs/concepts/topology_aware_scheduling)

A mechanism for scheduling Workloads that optimizes Pod placement for
network throughput between the Pods.

## Glossary

### Quota Reservation
120 changes: 120 additions & 0 deletions site/content/en/docs/concepts/topology_aware_scheduling.md
@@ -0,0 +1,120 @@
---
title: "Topology Aware Scheduling"
date: 2024-04-11
weight: 6
description: >
Allows scheduling of Pods based on the topology of nodes in a data center.
---

{{< feature-state state="alpha" for_version="v0.9" >}}

It is common for AI/ML workloads to require a significant amount of pod-to-pod
communication. Therefore the network bandwidth between the running Pods
directly affects the workload execution time, and the cost of running
such workloads. The available bandwidth between the Pods depends on how the
Nodes running those Pods are placed in the data center.

Data centers have a hierarchical structure of organizational units, such as
racks and blocks: there are multiple nodes within a rack, and multiple racks
within a block. Pods running within the same organizational unit have better
network bandwidth than Pods in different units. We say that nodes placed in
different racks are more distant than nodes placed within the same rack.
Similarly, nodes placed in different blocks are more distant than nodes
within the same block.

This feature (called Topology Aware Scheduling, or TAS for short) introduces a
convention to represent the
[hierarchical node topology information](#node-topology-information), and a set
of APIs for Kueue administrators and users to utilize this information
to optimize Pod placement.

### Node topology information

We propose a lightweight model for representing the hierarchy of nodes within a
data center by using node labels. In this model, the node labels are set up by a
cloud provider, or set up manually by administrators of on-premise clusters.

Additionally, we assume that every node used for TAS has a set of labels
which uniquely identifies its location in the tree structure. We do not assume
global uniqueness of labels on each level, i.e. there could be two nodes with
the same "rack" label, but in different "blocks".

For example, this is a representation of the data center hierarchy:

| node | cloud.provider.com/topology-block | cloud.provider.com/topology-rack |
|:------:|:----------------------------------:|:--------------------------------:|
| node-1 | block-1 | rack-1 |
| node-2 | block-1 | rack-2 |
| node-3 | block-2 | rack-1 |
| node-4 | block-2 | rack-3 |

Note that there is a pair of nodes, node-1 and node-3, with the same value of
the "cloud.provider.com/topology-rack" label, but in different blocks.

### Capacity calculation

For each PodSet, TAS determines the current free capacity for each topology
domain (like a given rack) by:
- including the Node allocatable capacity (based on the `.status.allocatable` field)
  of only ready Nodes (with the `Ready=True` condition),
- subtracting the usage coming from all other admitted TAS workloads,
- subtracting the usage coming from all other non-TAS Pods (owned mainly by
  DaemonSets, but also including static Pods, Deployments, etc.).
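
For illustration (hypothetical numbers): if a rack has two ready nodes with 8
allocatable CPUs each, other admitted TAS workloads use 4 CPUs in that rack,
and non-TAS Pods (for example, DaemonSets) use 1 CPU on each node, then the
free TAS capacity of the rack is 2 * 8 - 4 - 2 * 1 = 10 CPUs.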

### Admin-facing APIs

As an admin, in order to enable the feature you need to:
1. ensure the `TopologyAwareScheduling` feature gate is enabled (a sketch of one
   way to do this is shown below)
2. create at least one instance of the `Topology` API
3. reference the `Topology` API from a dedicated ResourceFlavor via the
   `.spec.topologyName` field
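
How the feature gate is enabled depends on how Kueue is installed. The
following is only a sketch, assuming a default-style installation with a
`kueue-controller-manager` Deployment in the `kueue-system` namespace, where
feature gates are passed via the manager container's `--feature-gates` flag:

```yaml
# Sketch (assumed installation defaults): enable the TopologyAwareScheduling
# feature gate by adding the flag to the controller container arguments.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kueue-controller-manager  # assumption: default Deployment name
  namespace: kueue-system         # assumption: default namespace
spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        - "--feature-gates=TopologyAwareScheduling=true"
        # ...keep the remaining arguments from your installation
```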

#### Example

{{< include "examples/tas/sample-queues.yaml" "yaml" >}}

### User-facing APIs

Once TAS is configured and ready to be used, you can create Jobs with the
following annotations set at the PodTemplate level:
- `kueue.x-k8s.io/podset-preferred-topology` - indicates that a PodSet requires
  Topology Aware Scheduling, but scheduling all pods on nodes
  within the same topology domain is a preference rather than a requirement.
  The levels are evaluated one-by-one going up from the level indicated by
  the annotation. If the PodSet cannot fit within a given topology domain
  then the next topology level up is considered. If the PodSet cannot fit
  at the highest topology level, then it gets admitted as distributed
  among multiple topology domains.
- `kueue.x-k8s.io/podset-required-topology` - indicates that a PodSet
  requires Topology Aware Scheduling, and requires scheduling all pods on nodes
  within the same topology domain corresponding to the topology level
  indicated by the annotation value (e.g. within a rack or within a block).

#### Example

Here is an example Job a user might submit to use TAS. It assumes there exists
a LocalQueue named `tas-user-queue` which references the ClusterQueue pointing
to a TAS ResourceFlavor.

{{< include "examples/tas/sample-job-preferred.yaml" "yaml" >}}

### Limitations

Currently, there are multiple limitations on how the feature interacts
with other features. In particular, a ClusterQueue referencing a TAS
ResourceFlavor (with the `.spec.topologyName` field) is marked as inactive in the
following scenarios:
- the ClusterQueue is in a cohort (`.spec.cohort` is set)
- the ClusterQueue is using [preemption](preemption.md)
- the ClusterQueue is using [MultiKueue](multikueue.md) or
  [ProvisioningRequest](/docs/admission-check-controllers/provisioning/) admission checks

Support for these scenarios is planned for future releases of Kueue.

## Drawbacks

When the feature is enabled, Kueue starts to keep track of all Pods and all Nodes
in the system, which results in larger memory requirements for Kueue.
Additionally, Kueue takes longer to schedule the workloads as it needs to
take the topology information into account.
24 changes: 24 additions & 0 deletions site/static/examples/tas/sample-job-preferred.yaml
@@ -0,0 +1,24 @@
apiVersion: batch/v1
kind: Job
metadata:
  generateName: tas-sample-preferred
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: 40
  completions: 40
  completionMode: Indexed
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-preferred-topology: "cloud.provider.com/topology-block"
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["300s"]
        resources:
          requests:
            cpu: "1"
            memory: "200Mi"
      restartPolicy: Never
34 changes: 34 additions & 0 deletions site/static/examples/tas/sample-queues.yaml
@@ -0,0 +1,34 @@
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "default"
spec:
  levels:
  - nodeLabel: "cloud.provider.com/topology-block"
  - nodeLabel: "cloud.provider.com/topology-rack"
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    # The label key is an assumption; use the key your TAS node pool actually carries.
    cloud.provider.com/node-group: "tas-node-group"
  topologyName: "default"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "tas-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 100Gi