diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md
index 36228a6dff2..cfb01d78cb9 100644
--- a/keps/sig-node/compute-device-assignment.md
+++ b/keps/sig-node/compute-device-assignment.md
@@ -60,12 +60,18 @@ In this document we will discuss the motivation and code changes required for in
 
 ## Changes
 
-Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. The GRPC Service returns a single PodResourcesResponse, which is shown in proto below:
+Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
+The GRPC Service exposes two endpoints:
+- `List`, which returns a single ListPodResourcesResponse, enabling monitor applications to poll for the resources allocated to pods and containers on the node.
+- `Watch`, which returns a stream of WatchPodResourcesResponse, enabling monitor applications to be notified of new resource allocations, releases, and allocation updates, using the `action` field in the response.
+
+This is shown in proto below:
 ```protobuf
 // PodResources is a service provided by the kubelet that provides information about the
 // node resources consumed by pods and containers on the node
 service PodResources {
     rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
+    rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
 }
 
 // ListPodResourcesRequest is the request made to the PodResources service
@@ -76,11 +82,27 @@ message ListPodResourcesResponse {
     repeated PodResources pod_resources = 1;
 }
 
+// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
+message WatchPodResourcesRequest {}
+
+enum WatchPodAction {
+    ADDED = 0;
+    DELETED = 1;
+}
+
+// WatchPodResourcesResponse is the response returned by the Watch function
+message WatchPodResourcesResponse {
+    WatchPodAction action = 1;
+    string uid = 2;
+    repeated PodResources pod_resources = 3;
+}
+
 // PodResources contains information about the node resources assigned to a pod
 message PodResources {
     string name = 1;
     string namespace = 2;
     repeated ContainerResources containers = 3;
+    int64 resource_version = 4;
 }
 
 // ContainerResources contains information about the resources assigned to a container
@@ -96,11 +118,34 @@ message ContainerDevices {
 }
 ```
 
+### Consuming the Watch endpoint in client applications
+
+Using the `Watch` endpoint, client applications can be notified of pod resource allocation changes as soon as they happen.
+However, the state of a pod is not sent until its first resource allocation change, which in the worst case is the pod's deletion.
+Client applications that need the complete resource allocation picture thus need to consume both the `List` and `Watch` endpoints.
+
+The `resourceVersion` found in the responses of both APIs allows client applications to identify the most recent information.
+The `resourceVersion` value is updated following the same semantics as the pod `resourceVersion` value, and the implementation
+may use the same value from the corresponding pods.
+To keep the implementation as simple as possible, the kubelet does *not* store any historical list of changes.
+
+To make sure no updates are missed, a client application can:
+1. call the `Watch` endpoint to get a stream of changes.
+2. call the `List` endpoint to get the state of all the pods on the node.
+3. reconcile updates using the `resourceVersion`.
+
+To make resource accounting on the client side as safe and easy as possible, the `Watch` implementation
+will guarantee event delivery ordering such that the capacity invariants are always preserved, and the accounted value
+will be consistent after each received event - not only at steady state.
+Consider the following scenario with 10 devices, all allocated: pod A, with device D1 allocated, gets deleted, then
+pod B starts and gets device D1 again. In this case `Watch` will guarantee that the `DELETED` and `ADDED` events are delivered
+in the correct order.
+
 ### Potential Future Improvements
 
-* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
 * Add identifiers for other resources used by pods to the `PodResources` message.
   * For example, persistent volume location on disk
+* Implement a historical list of changes, allowing client applications to call the `List` and `Watch` endpoints in a more natural order.
 
 ## Alternatives Considered
 
@@ -164,6 +209,7 @@ Beta:
 
 ## Implementation History
 
+- 2020-10-01: KEP extended with Watch API
 - 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node.  [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
 - 2018-10-30: Demo with example gpu monitoring daemonset
 - 2018-11-10: KEP lgtm'd and approved