podresources: add Watch endpoint #1926

Closed · wants to merge 5 commits
12 changes: 9 additions & 3 deletions keps/sig-node/compute-device-assignment.md
@@ -60,18 +60,24 @@ In this document we will discuss the motivation and code changes required for in

## Changes

Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. The GRPC Service returns a single PodResourcesResponse, which is shown in proto below:
Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
The GRPC Service can return:
- a single PodResourcesResponse, enabling monitoring applications to poll for the resources allocated to pods and containers on the node.
- a stream of PodResourcesResponse, enabling monitoring applications to be notified of new resource allocations, releases, or resource allocation updates.
Contributor

Should the stream be of individual pods, rather than the entire list when it changes? We would need to do something to signal deletion... But I would worry that a high rate of pod churn could make this very expensive.

Contributor

It's reasonable to introduce a threshold on the kubelet side, configurable via KubeletConfiguration, so that notifications are not sent too often; one notification will contain a bunch of pod resources (this is already described in this KEP). I think it's worth mentioning in this KEP, but for implementation I think it's step 2.

Contributor Author
@ffromani ffromani Aug 7, 2020

@dashpole the intention is totally to stream only individual pod changes. The API will return:

1. initially, the state of all the pods (exactly the same output as the current List() API)
2. then, only the pods whose resource allocation changed for any reason, including newly created or deleted pods.
   Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.

So the monitoring app observes:

1. a message which contains pod P1, with container C1 (with devices d1, d2) and container C2 (with devices d3)
   (some time passes, the pod gets deleted)
2. a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)

So the monitoring app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".

Does that make sense?

Also, I believe this should be documented. Process-wise, is it ok to add this in the KEP, or should it be added as a comment to the .proto file? Or both?
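To make the P1/C1/C2 example above concrete, here is a minimal sketch of the two streamed messages in protobuf text format. The pod, container, and device names and the `example.com/gpu` resource are purely illustrative, and the field names assume the `PodResources`/`ContainerResources`/`ContainerDevices` layout of the existing v1alpha1 messages; this is not output taken from this PR.

```protobuf
# Hypothetical sketch only: two consecutive ListPodResourcesResponse messages on the
# stream, written in protobuf text format. All names and resources are illustrative.

# Message 1: initial state, pod P1 holds devices d1, d2 (container C1) and d3 (container C2).
pod_resources {
  name: "P1"
  namespace: "default"
  containers {
    name: "C1"
    devices { resource_name: "example.com/gpu" device_ids: "d1" device_ids: "d2" }
  }
  containers {
    name: "C2"
    devices { resource_name: "example.com/gpu" device_ids: "d3" }
  }
}

# Message 2: sent after P1 is deleted; the same pod appears with every device list
# cleared, which the monitoring agent interprets as "release everything held for P1".
pod_resources {
  name: "P1"
  namespace: "default"
  containers { name: "C1" }
  containers { name: "C2" }
}
```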

Contributor Author

Also: yes, we can totally add configurable thresholds; I'll add them to the KEP.

Contributor

Thanks, that addresses my primary concern. I'm not entirely sold on "pod without resources" indicating deletion being the best way to represent it, but as long as we consider some alternatives and still prefer it, it is reasonable.

Contributor Author
@ffromani ffromani Aug 7, 2020

No problem regarding better ways to represent deletions: I'm very open to alternatives.
To elaborate on my rationale, I considered this approach because it required no extra changes to the API: the diff is minimal and the semantics seemed clear enough.

Contributor

> @dashpole the intention is totally to stream only individual pod changes. The API will return:
>
> 1. initially, the state of all the pods (exactly the same output as the current List() API)

I would like to avoid it.

> 2. then, only the pods whose resource allocation changed for any reason, including newly created or deleted pods.
>    Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.

Maybe just add an action field into ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED.
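As a minimal sketch of what that suggestion could look like in proto: the `PodResourcesAction` enum name, the field number, and the wording of the comments below are hypothetical illustrations, not part of this PR or of the kubelet API.

```protobuf
// Hypothetical sketch only (not part of this PR): the suggested action field,
// mirroring the ADDED/UPDATED/DELETED event types of the Kubernetes watch API.
enum PodResourcesAction {
    ADDED = 0;
    UPDATED = 1;
    DELETED = 2;
}

// ListPodResourcesResponse extended with the action that produced this message;
// a List() snapshot would presumably leave the field at its default value.
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
    PodResourcesAction action = 2;
}
```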

Contributor Author

> > @dashpole the intention is totally to stream only individual pod changes. The API will return:
> >
> > 1. initially, the state of all the pods (exactly the same output as the current List() API)
>
> I would like to avoid it.

Which issues do you see with this approach?

> > 2. then, only the pods whose resource allocation changed for any reason, including newly created or deleted pods.
> >    Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.
>
> Maybe just add an action field into ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED.

If we send the complete allocation with each message, then besides the DELETED case we are still discussing, I don't really see the benefit of a separate action field: could you please elaborate on this?
I see some benefit if each message provides the resource allocation delta (changes from the previous message), but I'd like to avoid sending deltas; it seems more robust (and not much less efficient) to send the total allocation each time.

Member

Re serving initial state as part of "watch": kubernetes/kubernetes#13969
This is causing us a bunch of burden and is misleading to users.

I would really prefer those two to be separate calls for list vs watch (stream). There is the question of how you ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in the watch and also wasn't in the list yet).

> Maybe just add an action field into ListPodResourcesResponse with the following possible values: ADDED, UPDATED, DELETED.

That would be consistent with k8s watch, so I would really support this one.

Contributor Author


Ok, all of those are good points, and I was not aware of the issue you mentioned. I will update the KEP.
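To summarize the direction this thread converges on, here is a minimal protobuf sketch of list and watch as separate calls. The `Watch` RPC and the `WatchPodResourcesRequest`/`WatchPodResourcesResponse` names are hypothetical illustrations of the idea under discussion, not the API defined by this PR.

```protobuf
// Hypothetical sketch only: list and watch exposed as separate calls, as preferred above.
service PodResources {
    // List returns a one-shot snapshot of the current allocation.
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
    // Watch streams only subsequent changes, so the initial state is never replayed;
    // each streamed response would carry an ADDED/UPDATED/DELETED action as sketched earlier.
    rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
}

// WatchPodResourcesRequest is the (hypothetical) request made to the Watch call.
message WatchPodResourcesRequest {}
```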


This is shown in proto below:
```protobuf
// PodResources is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResources {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
    rpc ListAndWatch(ListPodResourcesRequest) returns (stream ListPodResourcesResponse) {}
}

// ListPodResourcesRequest is the request made to the PodResources service
message ListPodResourcesRequest {}

// ListPodResourcesResponse is the response returned by List function
// ListPodResourcesResponse is the response returned by List and ListAndWatch functions
message ListPodResourcesResponse {
    repeated PodResources pod_resources = 1;
}
@@ -98,7 +104,6 @@ message ContainerDevices {

### Potential Future Improvements

* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
* Add identifiers for other resources used by pods to the `PodResources` message.
* For example, persistent volume location on disk

@@ -164,6 +169,7 @@ Beta:

## Implementation History

- 2020-08-XX: KEP extended with ListAndWatch function
- 2018-09-11: Final version of KEP (proposing pod-resources endpoint) published and presented to sig-node. [Slides](https://docs.google.com/presentation/u/1/d/1xz-iHs8Ec6PqtZGzsmG1e68aLGCX576j_WRptd2114g/edit?usp=sharing)
- 2018-10-30: Demo with example gpu monitoring daemonset
- 2018-11-10: KEP lgtm'd and approved