Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Topology Manager proposal. #1680

Merged
merged 8 commits into from
Jan 8, 2019
322 changes: 322 additions & 0 deletions contributors/design-proposals/node/topology-manager.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,322 @@
# Node Topology Manager

_Authors:_

* @ConnorDoyle - Connor Doyle <[email protected]>
* @balajismaniam - Balaji Subramaniam <[email protected]>
* @lmdaly - Louise M. Daly <[email protected]>

**Contents:**

* [Overview](#overview)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [User Stories](#user-stories)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Proposed Changes](#proposed-changes)
* [New Component: Topology Manager](#new-component-topology-manager)
* [Computing Preferred Affinity](#computing-preferred-affinity)
* [New Interfaces](#new-interfaces)
* [Changes to Existing Components](#changes-to-existing-components)
* [Graduation Criteria](#graduation-criteria)
* [alpha (target v1.11)](#alpha-target-v1.11)
* [beta](#beta)
* [GA (stable)](#ga-stable)
* [Challenges](#challenges)
* [Limitations](#limitations)
* [Alternatives](#alternatives)
* [Reference](#reference)

# Overview

An increasing number of systems leverage a combination of CPUs and
hardware accelerators to support latency-critical execution and
high-throughput parallel computation. These include workloads in fields
such as telecommunications, scientific computing, machine learning,
financial services and data analytics. Such hybrid systems comprise a
high performance environment.

In order to extract the best performance, optimizations related to CPU
isolation and memory and device locality are required. However, in
Kubernetes, these optimizations are handled by a disjoint set of
components.

This proposal provides a mechanism to coordinate fine-grained hardware
resource assignments for different components in Kubernetes.

# Motivation

Multiple components in the Kubelet make decisions about system
topology-related assignments:

- CPU manager
- The CPU manager makes decisions about the set of CPUs a container is
allowed to run on. The only implemented policy as of v1.8 is the static
one, which does not change assignments for the lifetime of a container.
- Device manager
- The device manager makes concrete device assignments to satisfy
container resource requirements. Generally devices are attached to one
peripheral interconnect. If the device manager and the CPU manager are
misaligned, all communication between the CPU and the device can incur
an additional hop over the processor interconnect fabric.
- Container Network Interface (CNI)
- NICs including SR-IOV Virtual Functions have affinity to one socket,
with measurable performance ramifications.

*Related Issues:*

- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
- [Discover nodes with NUMA architecture][nfd-issue-84]
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems. Correct
behavior requires that the kernel receive accurate topology information from
the underlying hardware (typically via the SLIT table). See section 5.2.16
and 5.2.17 of the
[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more
information.

## Goals

- Arbitrate preferred socket affinity for containers based on input from
CPU manager and Device Manager.
- Provide an internal interface and pattern to integrate additional
topology-aware Kubelet components.

## Non-Goals

- _Inter-device connectivity:_ Decide device assignments based on direct
device interconnects. This issue can be separated from socket
locality. Inter-device topology can be considered entirely within the
scope of the Device Manager, after which it can emit possible
socket affinities. The policy to reach that decision can start simple
and iterate to include support for arbitrary inter-device graphs.
- _HugePages:_ This proposal assumes that pre-allocated HugePages are
spread among the available memory nodes in the system. We further assume
the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local memory
node.
- _CNI:_ Changing the Container Networking Interface is out of scope for
this proposal. However, this design should be extensible enough to
accommodate network interface locality if the CNI adds support in the
future. This limitation is potentially mitigated by the possibility to
use the device plugin API as a stopgap solution for specialized
networking requirements.

## User Stories

*Story 1: Fast virtualized network functions*

A user asks for a "fast network" and automatically gets all the various
pieces coordinated (hugepages, cpusets, network device) co-located on a
socket.

*Story 2: Accelerated neural network training*

A user asks for an accelerator device and some number of exclusive CPUs
in order to get the best training performance, due to socket-alignment of
the assigned CPUs and devices.

# Proposal

*Main idea: Two phase topology coherence protocol*

Topology affinity is tracked at the container level, similar to devices and
CPU affinity. At pod admission time, a new component called the Topology
Manager collects possible configurations from the Device Manager and the
CPU Manager. The Topology Manager acts as an oracle for local alignment by
those same components when they make concrete resource allocations. We
expect the consulted components to use the inferred QoS class of each
pod in order to prioritize the importance of fulfilling optimal locality.

## Proposed Changes

### New Component: Topology Manager

This proposal is focused on a new component in the Kubelet called the
Topology Manager. The Topology Manager implements the pod admit handler
interface and participates in Kubelet pod admission. When the `Admit()`
function is called, the Topology Manager collects topology hints from other
Kubelet components.

If the hints are not compatible, the Topology Manager may choose to
reject the pod. Behavior in this case depends on a new Kubelet configuration
value to choose the topology policy. The Topology Manager supports two
modes: `strict` and `preferred` (default). In `strict` mode, the pod is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a Kubelet configuration value, I think it would make more sense to have the pod determine whether it wants to either have "strict" NUMA-alignment or fail to schedule, or whether it prefers to prioritize being able to run at all over ensuring optimal performance.

After all, it's the workload that determines how performance-sensitive it is.

rejected if alignment cannot be satisfied. The Topology Manager could
use `softAdmitHandler` to keep the pod in `Pending` state.

The Topology Manager component will be disabled behind a feature gate until
graduation from alpha to beta.

#### Computing Preferred Affinity

A topology hint indicates a preference for some well-known local resources.
Initially, the only supported reference resource is a mask of CPU socket IDs.
After collecting hints from all providers, the Topology Manager chooses some
mask that is present in all lists. Here is a sketch:

1. Apply a partial order on each list: number of bits set in the
mask, ascending. This biases the result to be more precise if
possible.
1. Iterate over the permutations of preference lists and compute
bitwise-and over the masks in each permutation.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.

The behavior when a match does not exist is configurable, as described
above.

#### New Interfaces

```go
package topologymanager

// TopologyManager helps to coordinate local resource alignment
// within the Kubelet.
type Manager interface {
lifecycle.PodAdmitHandler
Store
AddHintProvider(HintProvider)
RemovePod(podName string)
}

// SocketMask is a bitmask-like type denoting a subset of available sockets.
type SocketMask struct{} // TBD

// TopologyHints encodes locality to local resources.
type TopologyHints struct {
Sockets []SocketMask
}

// HintStore manages state related to the Topology Manager.
type Store interface {
// GetAffinity returns the preferred affinity for the supplied
// pod and container.
GetAffinity(podName string, containerName string) TopologyHints
}

// HintProvider is implemented by Kubelet components that make
// topology-related resource assignments. The Topology Manager consults each
// hint provider at pod admission time.
type HintProvider interface {
// Returns hints if this hint provider has a preference; otherwise
// returns `_, false` to indicate "don't care".
GetTopologyHints(pod v1.Pod, containerName string) (TopologyHints, bool)
}
```

_Listing: Topology Manager and related interfaces (sketch)._

![topology-manager-components](https://user-images.githubusercontent.com/379372/47447523-8efd2b00-d772-11e8-924d-eea5a5e00037.png)

_Figure: Topology Manager components._

![topology-manager-instantiation](https://user-images.githubusercontent.com/379372/47447526-945a7580-d772-11e8-9761-5213d745e852.png)

_Figure: Topology Manager instantiation and inclusion in pod admit lifecycle._

### Changes to Existing Components

1. Kubelet consults Topology Manager for pod admission (discussed above.)
1. Add two implementations of Topology Manager interface and a feature gate.
1. As much Topology Manager functionality as possible is stubbed when the
feature gate is disabled.
1. Add a functional Topology Manager that queries hint providers in order
to compute a preferred socket mask for each container.
1. Add `GetTopologyHints()` method to CPU Manager.
1. CPU Manager static policy calls `GetAffinity()` method of
Topology Manager when deciding CPU affinity.
1. Add `GetTopologyHints()` method to Device Manager.
1. Add Socket ID to Device structure in the device plugin
interface. Plugins should be able to determine the socket
when enumerating supported devices. See the protocol diff below.
1. Device Manager calls `GetAffinity()` method of Topology Manager when
deciding device allocation.

```diff
diff --git a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
index efbd72c133..f86a1a5512 100644
--- a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
+++ b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
@@ -73,6 +73,10 @@ message ListAndWatchResponse {
repeated Device devices = 1;
}

+message TopologyInfo {
+ optional int32 socketID = 1 [default = -1];
+}
+
/* E.g:
* struct Device {
* ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
@@ -85,6 +89,8 @@ message Device {
string ID = 1;
// Health of the device, can be healthy or unhealthy, see constants.go
string health = 2;
+ // Topology details of the device (optional.)
+ optional TopologyInfo topology = 3;
}
```

_Listing: Amended device plugin gRPC protocol._

![topology-manager-wiring](https://user-images.githubusercontent.com/379372/47447533-9a505680-d772-11e8-95ca-ef9a8290a46a.png)

_Figure: Topology Manager hint provider registration._

![topology-manager-hints](https://user-images.githubusercontent.com/379372/47447543-a0463780-d772-11e8-8412-8bf4a0571513.png)

_Figure: Topology Manager fetches affinity from hint providers._

# Graduation Criteria

## Phase 1: Alpha (target v1.13)

* Feature gate is disabled by default.
* Alpha-level documentation.
* Unit test coverage.
* CPU Manager allocation policy takes topology hints into account.
* Device plugin interface includes socket ID.
* Device Manager allocation policy takes topology hints into account.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some way to introspect decisions may be via structured logs?


## Phase 2: Beta (later versions)

* Feature gate is enabled by default.
* Alpha-level documentation.
* Node e2e tests.
* Support hugepages alignment.
* User feedback.

## GA (stable)

* *TBD*

# Challenges

* Testing the Topology Manager in a continuous integration environment
depends on cloud infrastructure to expose multi-node topologies
to guest virtual machines.
* Implementing the `GetHints()` interface may prove challenging.

# Limitations

* *TBD*

# Alternatives

* [AutoNUMA][numa-challenges]: This kernel feature affects memory
allocation and thread scheduling, but does not address device locality.

# References

* *TBD*

[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
[proposal-affinity]: https://github.com/kubernetes/community/pull/171
[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078