podresources: add Watch endpoint #1926
Extend the protocol with a simple implementation of ListAndWatch to enable monitoring agents to be notified of resource allocation changes. Signed-off-by: Francesco Romani <[email protected]>
@dashpole could be interested in reviewing this PR.
This feature would be useful for the topology exporter agent that we intend to introduce as part of the Topology Aware Scheduling work.
Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
The GRPC Service can return:
- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node.
- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates.
Should the stream be of individual pods, rather than the entire list when it changes? We would need to do something to signal deletion... But I would worry that a high rate of pod churn could make this very expensive.
It's reasonable to introduce a threshold on the kubelet side, configurable via KubeletConfiguration, so that notifications are not sent too often; one notification will then contain a bunch of podresources (this is already described in this KEP). I think it is worth mentioning in this KEP, but for the implementation I think it's step 2.
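A minimal sketch of the threshold idea, to make the trade-off concrete. Everything here is hypothetical: `min_interval` stands in for a setting that is not defined anywhere in this thread, and `batch_updates` is an illustrative helper, not kubelet code. Updates that arrive within one interval are coalesced into a single notification carrying the latest state of each pod.

```python
def batch_updates(updates, min_interval):
    """Coalesce (timestamp, pod_key, devices) updates into batched notifications.

    Hypothetical helper: a new batch is started once `min_interval` time units
    have elapsed since the first update of the current batch. Within a batch,
    only the latest device list of each pod is kept.
    """
    batches = []
    pending = {}
    window_start = None
    for timestamp, pod_key, devices in updates:
        if window_start is None:
            window_start = timestamp
        elif timestamp - window_start >= min_interval:
            # The interval elapsed: flush what was collected so far.
            batches.append(pending)
            pending = {}
            window_start = timestamp
        pending[pod_key] = list(devices)
    if pending:
        batches.append(pending)
    return batches
```

With a high rate of pod churn, the number of notifications is then bounded by the interval rather than by the number of individual changes, at the cost of some delay for the client.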
@dashpole the intention is totally to stream only individual pod changes. The API will return:
- initially, the state of all the pods (exactly the same output as the current List() API)
- then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.
So the monitoring app observes
- message which contains pod P1, with container C1 (with devices d1, d2) and container C2 (with devices d3)
(some time passes, pod gets deleted)
- a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)
So the monitor app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".
Makes sense?
Also, I believe this should be documented. Process-wise, is it OK to add this to the KEP, or should it be added as a comment in the .proto file? Or both?
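A minimal sketch of how a monitoring app could consume this encoding, assuming simplified stand-ins for the proto messages (`PodResources`, `ContainerResources` and `apply_message` below are illustrative Python types and helpers, not generated gRPC code). A message whose containers all carry no devices is treated as "release everything for this pod".

```python
from dataclasses import dataclass, field


@dataclass
class ContainerResources:
    name: str
    device_ids: list = field(default_factory=list)


@dataclass
class PodResources:
    name: str
    namespace: str
    containers: list


def apply_message(state, pod):
    """Update the client's view; an all-empty message clears the pod."""
    key = (pod.namespace, pod.name)
    if all(not c.device_ids for c in pod.containers):
        # No container holds any device: interpret as a release/deletion.
        state.pop(key, None)
    else:
        state[key] = {c.name: list(c.device_ids) for c in pod.containers}
    return state
```

One limitation of this encoding is that a pod which never had any devices assigned looks the same as a deleted pod.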
also: yes, we can totally add configurable thresholds, I'll add to the KEP.
Thanks, that addresses my primary concern. I'm not entirely sold on "pod without resources" indicating deletion being the best way to represent it, but as long as we consider some alternatives and still prefer it, it is reasonable.
No problem regarding better ways to represent deletions: I'm very open to alternatives.
To elaborate on my rationale, I considered this approach because it required no extra changes to the API: the diff is minimal and the semantics seemed clear enough.
> @dashpole the intention is totally to stream only individual pod changes. The API will return:
> - initially, the state of all the pods (exactly the same output as the current List() API)
I would like to avoid it
> - then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
> Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.
Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED
> So the monitoring app observes
> - message which contains pod P1, with container C1 (with devices d1, d2) and container C2 (with devices d3)
> (some time passes, pod gets deleted)
> - a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)
> So the monitor app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".
> Makes sense?
> Also, I believe this should be documented, processwise is ok to add this in the KEP, or should be added as comment to the .proto file? or both?
>> @dashpole the intention is totally to stream only individual pod changes. The API will return:
>> - initially, the state of all the pods (exactly the same output as the current List() API)
> I would like to avoid it
Which issues do you see with this approach?
>> - then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
>> Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.
> Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED
If we send the complete allocation with each message, besides the DELETED case we are still discussing, I don't really see the benefit of a separate action field: could you please elaborate on this?
I see some benefit if each message provides the resource allocation delta (changes from the previous message), but I'd like to avoid sending deltas; it seems more robust (and not much less efficient) to send the total allocation each time.
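The robustness argument can be illustrated with a small sketch. Both helpers below are hypothetical client-side code, and "missing a message" is only a stand-in for any client-side failure to apply one update: with full-state messages the next message repairs the client's view, while with deltas the error persists.

```python
def apply_full(state, pod_key, devices):
    """Full-state message: replace whatever the client knew about the pod."""
    state[pod_key] = set(devices)


def apply_delta(state, pod_key, added, removed):
    """Delta message: patch the client's previous view of the pod."""
    current = state.setdefault(pod_key, set())
    current |= set(added)
    current -= set(removed)


# The pod goes through {d1} -> {d1, d2} -> {d2}; the client misses step 2.
full_messages = [["d1"], ["d1", "d2"], ["d2"]]
delta_messages = [(["d1"], []), (["d2"], []), ([], ["d1"])]

full_state, delta_state = {}, {}
for step in (0, 2):  # step 1 is skipped on purpose
    apply_full(full_state, "P1", full_messages[step])
    apply_delta(delta_state, "P1", *delta_messages[step])
```

After the loop, `full_state["P1"]` is `{"d2"}`, which matches the real allocation, while `delta_state["P1"]` is empty because the `d2` addition was never seen.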
Re serving initial state as part of "watch": kubernetes/kubernetes#13969
This is causing us a bunch of burden and is misleading to users.
I would really prefer those two to be separate calls for list vs watch (stream). There is a question how you can ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in watch and also weren't in list yet).
> Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED
That would be consistent with k8s watch, so I would really support this one.
> Re serving initial state as part of "watch": kubernetes/kubernetes#13969
> This is causing us a bunch of burden and is misleading to users.
> I would really prefer those two to be separate calls for list vs watch (stream). There is a question how you can ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in watch and also weren't in list yet).
>> Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED
> That would be consistent with k8s watch, so I would really support this one.
Ok, all of those are good points, and I was not aware of the issue you mentioned. I will update the KEP.
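A minimal sketch of a client consuming events with an explicit action field. The enum values mirror the names proposed above (ADDED, UPDATED, DELETED); the numeric values, `apply_event` and the dictionary-based state are assumptions made for illustration only.

```python
from enum import Enum


class WatchPodAction(Enum):
    # Names follow the proposal in this thread; numbers are arbitrary here.
    ADDED = 0
    UPDATED = 1
    DELETED = 2


def apply_event(state, action, pod_key, devices):
    """Apply one watch event to the client's view of pod allocations."""
    if action is WatchPodAction.DELETED:
        state.pop(pod_key, None)
    else:
        # ADDED and UPDATED both carry the pod's complete allocation.
        state[pod_key] = list(devices)
    return state
```

Compared with the "pod without resources" encoding, an explicit action lets the client tell a pod that currently holds no devices apart from a pod that has been deleted.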
Address reviewers' comments:
1. Add explicit Watch endpoint so APIs are composable (not bundled in ListAndWatch)
2. Add explicit action field in the Watch() endpoint response
Signed-off-by: Francesco Romani <[email protected]>
Thanks for all the comments. I think I addressed all of them but the consistency concern between List and Watch, which I'm still thinking about. I will update again once I'm happy with a proposal. Last but not least: I'll squash commits when we reach agreement.
// WatchPodResourcesResponse is the response returned by Watch function
message WatchPodResourcesResponse {
    WatchPodAction action = 1;
I wonder if just exposing the pod resourceVersion here is a good way forward
* Missed reference to "ListAndWatch", now replaced by "Watch"
* Renamed UPDATED->MODIFIED to be more compliant with kube naming standards (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes)
Signed-off-by: Francesco Romani <[email protected]>
PR with the implementation: kubernetes/kubernetes#94612
I have a few questions, both related to this KEP, and how it relates to closing out the KubeletPodResources feature that is presently in beta.
Please clarify the following:
- The PodResources message does not include the uuid, this means we are not able to differentiate across space and time as two different pods with same name/same namespace appear the same. I think we should emit the uid especially when adding watch.
- The grpc interface for PodResources is v1alpha1, but the feature is beta. Should we just move this to v1 now and commit to backward compatibility on this interface? I do not see the need for a new feature gate related to this new operation, but would like to clean up our grpc surfaces to stable fwd compatible boundaries.
- How is resource version derived? Is it the value on the pod itself?
- Does the watch guarantee ordering? For example, if a pod A is deleted, device D is reclaimed, and pod C is started and assigned device D, am I guaranteed to get those events emitted in order? If so, how and where are you integrating emitting the watch events within the kubelet subsystems?
- How do you intend to test the watch semantics in e2e?
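The first question can be made concrete with a small sketch. The event dictionaries and the `group_by` helper are hypothetical; the point is only that two pods created one after the other with the same name and namespace collapse into one entry unless the uid is part of the key.

```python
def group_by(events, key_fn):
    """Group the device lists seen in watch events by a client-chosen key."""
    groups = {}
    for event in events:
        groups.setdefault(key_fn(event), []).append(event["devices"])
    return groups


# Two different pods that reuse the same name and namespace over time.
events = [
    {"uid": "uid-1", "namespace": "ns", "name": "web", "devices": ["d1"]},
    {"uid": "uid-2", "namespace": "ns", "name": "web", "devices": ["d2"]},
]

by_name = group_by(events, lambda e: (e["namespace"], e["name"]))
by_uid = group_by(events, lambda e: e["uid"])
```

Here `by_name` holds a single entry mixing both pods, while `by_uid` keeps them apart.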
// PodResources contains information about the node resources assigned to a pod
message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
    int64 resource_version = 4;
is this the pod resource version as stored in etcd or something else?
yes, that's the intention, in order to enable client code to reconcile data from Watch() with the data they get from List().
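A minimal sketch of how client code might use the field to reconcile the two sources. It assumes that `resource_version` values are integers where a larger value means a newer state of the same pod (the proto above declares the field as `int64`, but the ordering semantics are an assumption here), and `reconcile` is a hypothetical helper.

```python
def reconcile(listed, watch_events):
    """Merge a List() snapshot with Watch() events, dropping stale events.

    `listed` maps pod_key -> (resource_version, devices).
    `watch_events` yields (pod_key, resource_version, devices) tuples.
    """
    state = dict(listed)
    for pod_key, resource_version, devices in watch_events:
        known = state.get(pod_key)
        if known is not None and resource_version <= known[0]:
            # The snapshot already reflects this change (or a newer one).
            continue
        state[pod_key] = (resource_version, list(devices))
    return state
```

An event that predates the snapshot is ignored, so the client can start watching before it lists without applying the same change twice.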
// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
message WatchPodResourcesRequest {}

enum WatchPodAction {
When is each action emitted? Can you clarify when MODIFIED would be used in the life of a pod?
Actions should be emitted:
- ADDED: when resources are assigned to the pod (I'm thinking about HintProvider's Allocate())
- DELETED: when resources are claimed back (I'm thinking about UpdateAllocatedDevices())
I'll document better in the KEP text.
In hindsight we most likely don't need MODIFIED; I will just remove it.
To keep the implementation simple as possible, the kubelet does *not* store any historical list of changes.

In order to make sure not to miss any updates, client application can:
1. call the `Watch` endpoint to get a stream of changes.
What triggers watch events to be emitted in the kubelet code flows?
After some experiments, I think ADDED should be triggered after successful allocation from the topology manager (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/topology_manager.go#L232)
while DELETED should be triggered once devices are claimed back (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L1034)
/assign @dashpole @derekwaynecarr
/ok-to-test
I see this feature as an evolution of the existing beta feature for To track this properly in the project, we should do the following:
Do you agree with above @dashpole ?
Good point, needs to be added.
I'm not sure what the question is here: surely I want to meet the beta graduation criteria (good e2e tests, for example).
Yes. This just seemed the simplest and safest approach but I'm open to alternatives here.
Initially I thought ordering was not needed because this interface wants to enable resource accounting, but thinking about it some ordering guarantees are indeed needed to avoid some misreporting in corner cases (pod deletion/addition when at full resource capacity comes to mind, but there is likely more). So we need to guarantee this, and I'll document this in the KEP. About emitting event I think we covered in inline comments.
I'll elaborate test plans (does this need to be in the KEP? asking because the original one doesn't mention that) but I think we'll start with
e2e tests are an integral part of the prototype implementation already in this WIP PR: kubernetes/kubernetes@8c0e617
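The ordering concern raised in the review (pod A deleted, device D reclaimed, pod C started and assigned device D) can be illustrated with a small accounting sketch. The event tuples and the `account` helper are hypothetical; they only show what a resource-accounting client would observe if the two events were delivered out of order.

```python
def account(events):
    """Track device ownership and record any apparent double assignment."""
    owners = {}
    conflicts = []
    for action, pod, device in events:
        if action == "ADDED":
            if device in owners and owners[device] != pod:
                # The device still appears to belong to another pod.
                conflicts.append((device, owners[device], pod))
            owners[device] = pod
        elif action == "DELETED":
            if owners.get(device) == pod:
                del owners[device]
    return owners, conflicts


in_order = [("ADDED", "A", "D"), ("DELETED", "A", "D"), ("ADDED", "C", "D")]
reordered = [("ADDED", "A", "D"), ("ADDED", "C", "D"), ("DELETED", "A", "D")]
```

With `in_order` the client sees a clean hand-over of D from A to C. With `reordered` it records a conflict, because D appears to be assigned to both pods until the late DELETED event arrives.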
Next step for me is to reflect the review comments in the KEP PR, then I'll consolidate this PR into the existing #1884; lastly I'll close this one. This is meant to simplify the process and make review easier.
Add new field in the API responses objects to allow client applications to consume both `List` and `Watch` endpoints.
The issue here is enabling client applications to not lose any updates when both APIs are used. The straightforward option is to follow the generic k8s approach (see link below) and let the kubelet keep a historical window of the most recent changes, so client applications have the chance to issue `List` and shortly after `Watch`, starting from the resourceVersion returned in `List`. The underlying assumption is that `Watch` happens "shortly" after `List`; otherwise the system cannot guarantee the lack of gaps.
Implementing this support requires keeping the aforementioned sliding window of changes, which in turn requires careful implementation to address scalability and safety guarantees. The `podresources` API, however, is a specific API, so, while it is good to follow the generic API concepts as much as possible, it also allows some small differences which can help keep the implementation simple and safe.
This patch proposes the simplest possible approach to reconcile the `List` and `Watch` responses, providing the `resource_version` field and suggesting a small change in the client applications' programming model.
Inspired by the concepts found on https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
Signed-off-by: Francesco Romani <[email protected]>
Signed-off-by: Francesco Romani <[email protected]>
Force-pushed from aea8a6e to 1346c2a.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: fromanirh
Comments addressed and updated content added on top of #1884 - closing this one to reduce the confusion.
Per comments during the review of the Watch endpoint proposal:
1. kubernetes#1926 (comment)
2. kubernetes#1926 (review)
The agreed semantics of a Watch() response message:
a. it refers to a single pod. The intention was always to stream only individual pod changes (kubernetes#1926 (comment))
b. it must allow the client to reconcile with the response of the List() endpoint, thus it must include a pod resource version.
This patch thus adds the missing resource version field and removes the `repeated` attribute from the `PodResources` field. Removing `repeated` is the simplest possible change that aligns the proposal with the intention. Alternatively, it is possible to change the proto so that a `WatchPodResourcesResponse` object can convey information about more pods; however, the performance and UX benefits of this more invasive change are unclear, so we avoid it at this moment.
This change was missing because it was lost in a rebase.
Signed-off-by: Francesco Romani <[email protected]>