resource-topology-exporter

Resource Topology exporter for Topology Aware Scheduler

This is the resource topology exporter, which enables NUMA-aware scheduling. It is a standalone daemon that runs on each node in the cluster as a daemonset. It collects the resources allocated to running pods along with their associated topology (NUMA nodes) and reports the available resources, at NUMA node granularity, through a CRD instance created per node, so that the scheduler can use it to make a NUMA-aware placement decision.

Background

Currently, the scheduler cannot correctly account for available resources and their associated topology information. The Topology Manager is responsible for identifying the NUMA nodes on which resources are allocated, but the scheduler is unaware of per-NUMA resource allocation.

A KEP is currently in progress to expose per-NUMA node resource information to the scheduler through a CRD.

CRD

The available resources of the node, together with their topology, are stored in a CRD. The format of the topology is described in this document.

// NodeResourceTopology describes the resources available on a node, grouped by topology zone
type NodeResourceTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	TopologyPolicy []string `json:"topologyPolicies"`
	Zones          ZoneMap  `json:"zones"`
}

// Zone describes one topology zone (for example a NUMA node) and its resources
type Zone struct {
	Type       string          `json:"type"`
	Parent     string          `json:"parent,omitempty"`
	Costs      map[string]int  `json:"costs,omitempty"`
	Attributes map[string]int  `json:"attributes,omitempty"`
	Resources  ResourceInfoMap `json:"resources,omitempty"`
}

type ZoneMap map[string]Zone
type ResourceInfoMap map[string]ResourceInfo

type ResourceInfo struct {
	Allocatable string `json:"allocatable"`
	Capacity    string `json:"capacity"`
}

Design based on Pod Resource API

The kubelet exposes a gRPC endpoint at /var/lib/kubelet/pod-resources/kubelet.sock that reports the assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager and returns a single PodResourcesResponse, which lets monitoring applications poll for the resources allocated to pods and containers on the node. This makes the PodResource API a reasonable way of obtaining allocated resource information.

However, the PodResource API currently exposes only devices as container resources, without topology information. We are proposing a KEP to enhance it to expose CPU information along with device topology information. To use the pod-resource-api source in Resource Topology Exporter, you will need a patched version of the kubelet that implements the changes proposed in the aforementioned KEPs:

  1. https://github.com/kubernetes/kubernetes/pull/93243/files
  2. https://github.com/fromanirh/kubernetes/tree/podresources-get-available-devices

A Kubernetes branch with both of these features, which was used for testing, is available here

This will no longer be needed once the KEP and the PR are merged.

Furthermore, changes are being proposed (KEP) to enhance the PodResource API with a Watch() endpoint, so that monitoring applications can be notified when resources are allocated, released or updated. This would let Resource Topology Exporter become event-based instead of relying on its current polling mechanism.

Installation

  1. You can use the following environment variables to configure the exporter image name:
    • REPOOWNER: name of the repository to which the image will be pushed (example: quay.io/$REPOOWNER/...)
    • IMAGENAME: name of the image to build
    • IMAGETAG: name of the image tag to use
  2. To deploy the exporter run:
make push
make config
make deploy

The Makefile provides other targets:

  • build: Build the exporter Go code
  • gofmt: Format the code
  • push: Push the docker image to a registry
  • images: Build the docker image

Configuration

RTE has grown a large number of options to address different use cases and flows, accruing features while assisting the evolution of the NUMA-aware scheduler. Up to and including version 0.16, the daemon's command line flags were the main (and only) way to configure its behavior.

Starting with version 0.17, the daemon supports a configuration file, including configlets (a config directory). See issue 111 for the design of this feature.

Workload requesting devices

To test that the exporter works, deploy the test deployment, which requests resources:

make deploy-pod

Limitations

  • RTE assumes the devices are not created dynamically.
  • Due to the limitations of CRI as of September 2020, we now rely on the PodResource API to obtain resource information. Details can be found in the Alternative Approach section below. CRI support is available in release v0.1, after which it is to be deprecated in this repository.

Alternative Approach

Design based on CRI

This daemon can also gather resource information using the Container Runtime Interface (CRI).

The ContainerStatusResponse returned by the ContainerStatus RPC contains an info field, which the container runtime uses to capture ContainerInfo.

message ContainerStatusResponse {
      ContainerStatus status = 1;
      map<string, string> info = 2;
}

Containerd has been used as the container runtime in the initial investigation. The internal container object info here

The daemonset is responsible for the following:

  • Parsing the info field to obtain container resource information
  • Identifying NUMA nodes of the allocated resources
  • Identifying total number of resources allocated on a NUMA node basis
  • Detecting Node resource capacity on a NUMA node basis
  • Updating the per-node CRD instance with the resources available on each NUMA node, which the scheduler then consults
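
The accounting part of these steps can be sketched as follows: sum what is allocated on each NUMA node, then subtract it from the per-NUMA capacity to get what is still available. The types Allocation and PerNUMA and the functions sumAllocated and available are hypothetical names for this illustration, and the quantities are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Allocation is a hypothetical record: the amount of one resource that a
// container holds on one NUMA node.
type Allocation struct {
	NUMANode int
	Resource string
	Amount   int64
}

// PerNUMA maps NUMA node ID -> resource name -> quantity.
type PerNUMA map[int]map[string]int64

// sumAllocated totals the allocations per NUMA node and resource.
func sumAllocated(allocs []Allocation) PerNUMA {
	total := PerNUMA{}
	for _, a := range allocs {
		if total[a.NUMANode] == nil {
			total[a.NUMANode] = map[string]int64{}
		}
		total[a.NUMANode][a.Resource] += a.Amount
	}
	return total
}

// available returns capacity minus allocated for every resource listed in
// capacity, clamped at zero.
func available(capacity, allocated PerNUMA) PerNUMA {
	out := PerNUMA{}
	for node, resources := range capacity {
		out[node] = map[string]int64{}
		for name, capQty := range resources {
			// Reading a missing node or resource yields zero.
			free := capQty - allocated[node][name]
			if free < 0 {
				free = 0
			}
			out[node][name] = free
		}
	}
	return out
}

func main() {
	capacity := PerNUMA{
		0: {"cpu": 8, "example.com/device": 2},
		1: {"cpu": 8, "example.com/device": 2},
	}
	allocs := []Allocation{
		{NUMANode: 0, Resource: "cpu", Amount: 2},
		{NUMANode: 0, Resource: "cpu", Amount: 4},
		{NUMANode: 1, Resource: "example.com/device", Amount: 1},
	}

	avail := available(capacity, sumAllocated(allocs))

	// Print NUMA nodes in a stable order.
	nodes := make([]int, 0, len(avail))
	for n := range avail {
		nodes = append(nodes, n)
	}
	sort.Ints(nodes)
	for _, n := range nodes {
		fmt.Printf("NUMA node %d: %v\n", n, avail[n])
	}
}
```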

Drawbacks

The content of the info field is free-form and not regulated by the API contract, so CRI-compliant container runtimes are not required to put any configuration-specific information there, such as CPU allocation. In the case of containerd, the Linux Container Configuration is added to the info map only depending on the verbosity setting of the container runtime.
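
Because the field is free-form, any parser has to treat missing data as a normal case. The sketch below shows one way to do that. It assumes that the info map carries a JSON document under an "info" key and that this document holds an OCI-style runtime spec under runtimeSpec.linux.resources.cpu. That layout is an assumption about containerd's verbose output, not something the CRI contract guarantees, and containerInfo, cpusetFromInfo and errNoCPUInfo are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// containerInfo models only the fields this sketch reads. The layout is an
// assumption about the runtime's verbose output, not part of the CRI API.
type containerInfo struct {
	RuntimeSpec struct {
		Linux struct {
			Resources struct {
				CPU struct {
					Cpus string `json:"cpus"`
					Mems string `json:"mems"`
				} `json:"cpu"`
			} `json:"resources"`
		} `json:"linux"`
	} `json:"runtimeSpec"`
}

// errNoCPUInfo signals that the runtime did not report a cpuset.
var errNoCPUInfo = errors.New("info map carries no cpuset information")

// cpusetFromInfo extracts the cpuset strings (for example "2-3") from the
// info map of a ContainerStatusResponse.
func cpusetFromInfo(info map[string]string) (cpus, mems string, err error) {
	raw, ok := info["info"] // assumed key
	if !ok {
		return "", "", errNoCPUInfo
	}
	var ci containerInfo
	if err := json.Unmarshal([]byte(raw), &ci); err != nil {
		return "", "", fmt.Errorf("decoding info: %w", err)
	}
	cpu := ci.RuntimeSpec.Linux.Resources.CPU
	if cpu.Cpus == "" {
		return "", "", errNoCPUInfo
	}
	return cpu.Cpus, cpu.Mems, nil
}

func main() {
	// Hand-written sample payload, not captured from a real runtime.
	info := map[string]string{
		"info": `{"runtimeSpec":{"linux":{"resources":{"cpu":{"cpus":"2-3","mems":"0"}}}}}`,
	}
	cpus, mems, err := cpusetFromInfo(info)
	if err != nil {
		fmt.Println("no CPU information:", err)
		return
	}
	fmt.Printf("cpus=%s mems=%s\n", cpus, mems)
}
```

A runtime that omits the field, or runs with lower verbosity, ends up on the errNoCPUInfo path, which is why this approach cannot be relied on across runtimes.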

Work is currently under way in the community, as part of the Vertical Pod Autoscaling feature, to update the ContainerStatus field to report back containerResources (KEP).