KEP: efficient node heartbeats #2090
Conversation
keps/0008-node-heartbeat.md
Outdated
The underlying root cause is combination of:

- etcd keeping both current state and transaction log with copy-on-write
- node heartbeats being pontetially very large objects (even though
it seems like it might be worthwhile to investigate and correct this... this is likely the "images present on node" list in node status. would it make more sense to move that out of the object instead?
Images are only one potential problem (yes - we may consider moving them to a separate object, but that doesn't solve the generic problem of "heartbeats are expensive").
The second big problem is Volumes - we will soon support 100+ volumes per VM in GCE and that has exactly the same issue.
So we really should solve this problem in a more generic way.
I agree with Wojtek.
Also, the image info is unnecessarily large, regardless of where it is stored.
And I want to mitigate the proliferation of APIs.
ok
Incorporated into text
Basically the node object is unbounded today. Besides images and volumes, labels and annotations can also be added to that API object. And updating it periodically (now every 10s) makes it very expensive.
keps/0008-node-heartbeat.md
Outdated
We will use CRD (as opposed to built-in API), because:

- CRDs are `the` way to created new APIs
I don't think this is quite accurate. It doesn't seem proper for a core kubernetes component to depend on the existence of a "custom" resource, which would be required to exist in all clusters to be functional
I'm not the API expert. I've been discussing it with @bgrant0607 and he explicitly said that it should be done with a CRD.
Personally, I don't really have a preference here, but I trust that Brian knows what he is saying.
I'm not the API expert. I've been discussing it with @bgrant0607 and he explicitly said that it should be done with a CRD. Personally, I don't really have a preference here, but I trust that Brian knows what he is saying.
that doesn't seem proper... node heartbeating isn't an extension, add-on, or optional component... it's fundamental to a kubernetes cluster.
@liggitt - is your concern only about CRD vs built-in API, or more?
To cite Brian on CRDs:
1. CRD is a mechanism for building APIs
2. Most new APIs will at least start as CRDs
3. Most new APIs will stay CRDs forever
But I will let @bgrant0607 comment more on that.
is your concern only about CRD vs built-in API, or more?
That's my main concern. An API the kubelet depends on is an API the core repo has to define, document, build clients for, and support. Most of those are easier and far more reliable for the kube-apiserver to do as a compiled-in type, rather than a CRD. Given that the schema stability and iteration time is tied to kube releases, I don't see many benefits to building these as a CRD, and see plenty of downsides.
The existing leader election code is in client-go. It could go there, or elsewhere in k/k.
I initially wanted to have this somewhere in k/k. But it's not a strong preference - client-go also sounds fine to me.
Clarified where to put that code in the doc.
@wojtek-t Can you include your experimental benchmark on CRD vs. built-in in this KEP? I am fine with CRD; at least we can start with CRD first. If the performance concern remains, we can graduate it to built-in later.
Technically, so far I only did a json vs protobuf benchmark on this particular object with some ~real data. I can create a better benchmark when I'm back from leave in a couple of days.
@dchen1107 that said - how exactly do you want this to be incorporated into this KEP? Just the results? Or maybe a link to a PR to the main repo? Or something else?
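For illustration, a micro-benchmark of the kind being discussed could look roughly like the sketch below; the `fakeStatus` type and the image count are made-up stand-ins for real Node status data, not the benchmark actually used.

```go
// heartbeat_bench_test.go - a minimal, illustrative sketch only (run with `go test -bench=.`).
package heartbeat

import (
	"encoding/json"
	"testing"
)

// fakeStatus is a made-up stand-in for a heartbeat-sized object.
type fakeStatus struct {
	Images []string `json:"images"`
}

func BenchmarkHeartbeatJSON(b *testing.B) {
	s := fakeStatus{Images: make([]string, 100)}
	for i := range s.Images {
		s.Images[i] = "registry.example.com/some/image:tag"
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Measure only the serialization cost of the object.
		if _, err := json.Marshal(&s); err != nil {
			b.Fatal(err)
		}
	}
}
```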
keps/0008-node-heartbeat.md
Outdated
This proposal presents a proper solution for that problem.

Note that currently:
Note that currently by default?
Done
keps/0008-node-heartbeat.md
Outdated
Other notes:

1. Additional advantage of using LeaderElection for that purpose would be the ability
exclude?
Done
keps/0008-node-heartbeat.md
Outdated
- there was a meaningful change in it (initially we can probably assume that every
  change is meaningful, including e.g. images on the node)
- or it didn’t report it over last `node-status-update-period` seconds
1. Kubelet creates and periodically updates its own LeaderElection object and frequency
its own LeaderElection object
what name/pattern is reserved for kubelets to use? presumably we don't want to manually configure 5000 kubelets with distinct heartbeat object names
I agree that should be explicitly specified in the proposal. It should be the node name, or possibly trivially derived from the node name, but isolating the resources by namespace should be good enough to not require prefixing IMO.
Added to the doc above
keps/0008-node-heartbeat.md
Outdated
## Proposal

We propose introducing a new `LeaderElection` CRD with the following schema:
is this a namespaced or cluster-scoped object?
I think it should be namespaced, so that people would also be able to take advantage of it in their applications.
In that case, I would say that the object corresponding to a given Kubelet should be:
namespace: kube-system
name: kubelet-<node name>
Does that sound ok?
Yes, it needs to be namespaced.
I'd put them in a new namespace and skip the prefix.
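For illustration only, the resulting addressing scheme might look like the sketch below; the namespace name is a placeholder (the doc leaves it TBD) and the helper function is hypothetical.

```go
package main

import "fmt"

// nodeLeaseNamespace is a placeholder; the KEP leaves the actual namespace name TBD.
const nodeLeaseNamespace = "node-leases"

// leaseKeyForNode returns the namespace/name under which a given Kubelet's
// heartbeat object would live, assuming the "name == node name" scheme discussed above.
func leaseKeyForNode(nodeName string) (namespace, name string) {
	return nodeLeaseNamespace, nodeName
}

func main() {
	ns, name := leaseKeyForNode("node-1")
	fmt.Printf("%s/%s\n", ns, name) // node-leases/node-1
}
```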
Added to doc.
What would the new namespace be called?
I assume the entire namespace is just scoped to node leases?
I guess we should add it to the set of immortal namespaces.
What would the new namespace be called?
TBD
I assume the entire namespace is just scoped to node leases?
Yes.
I guess we should add it to the set of immortal namespaces.
Yes
keps/0008-node-heartbeat.md
Outdated
Cons:

- Introduces a new, narrow-purpose API. Leader election is already used by other
  components, implemented using annotations on Endpoints and ConfigMaps.
I don't see this as a con. Using a leader election API object for a single actor which doesn't make use of any of the "leader" aspects of it is pretty confusing.
The leader election API works well when you have functionally identical components sharing a single leader election lock object to determine who is the leader. It makes little sense when you have distinct components each given their own object.
I think only the name "leader election" is confusing. This is a Lease API, which is what I'd call it.
The other actor is the node controller, which takes action when the Kubelet doesn't renew its lease in time. It just doesn't acquire the lease.
Also note that my opinion on a "node heartbeat" API hasn't changed since the original proposal (kubernetes/kubernetes#14735) in 2015. In fact, with ~60 APIs now it's even more critical that we introduce new ones with great care.
keps/0008-node-heartbeat.md
Outdated
// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
// +optional
ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"`
what indicates the object is connected to a particular node? the name? the spec.holderIdentity?
I think the name. As I mentioned above, the object corresponding to a given Kubelet should be:
namespace: kube-system
name: kubelet-<node name>
[holderIdentity will probably be the same as name though]
The name
Added that explanation below.
keps/0008-node-heartbeat.md
Outdated
In the meantime, we will change `NodeController` to treat both updates of NodeStatus
object as well as updates of the new `LeaderElection` object corresponding to a given
node as healthiness signal from a given Kubelet. This will make it work for both old
how will the node controller watch/filter to objects that are related to nodes?
I would say we should be watching kube-system namespace.
Or a dedicated namespace
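A minimal sketch of that combined health check (illustrative names, not the actual NodeController code): whichever of the two signals is newer counts as the node's last probe time.

```go
package main

import (
	"fmt"
	"time"
)

// nodeProbeTime treats both the last NodeStatus heartbeat and the last Lease/LeaderElection
// renewal as healthiness signals and returns the more recent of the two.
func nodeProbeTime(lastStatusHeartbeat, leaseRenewTime time.Time) time.Time {
	if leaseRenewTime.After(lastStatusHeartbeat) {
		return leaseRenewTime
	}
	return lastStatusHeartbeat
}

// nodeHealthy reports whether either signal was seen within the grace period.
func nodeHealthy(lastStatusHeartbeat, leaseRenewTime time.Time, gracePeriod time.Duration, now time.Time) bool {
	return now.Sub(nodeProbeTime(lastStatusHeartbeat, leaseRenewTime)) < gracePeriod
}

func main() {
	now := time.Now()
	// NodeStatus is stale, but the lease was renewed recently, so the node still counts as healthy.
	fmt.Println(nodeHealthy(now.Add(-50*time.Second), now.Add(-5*time.Second), 40*time.Second, now))
}
```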
keps/0008-node-heartbeat.md
Outdated
of those updates is independent from NodeStatus update frequency.

In the meantime, we will change `NodeController` to treat both updates of NodeStatus
object as well as updates of the new `LeaderElection` object corresponding to a given
This breaks backwards compatibility for any integration anyone has built by watching node objects for changes. That’s any custom scheduler, any alerting-based function, and probably a non-trivial number of custom node integrations built by users.
Treating both as a healthiness signal? Why? Can you clarify that?
We'll still write the status, just less frequently. It's effectively a configuration change.
Also, schedulers shouldn't be watching elapsed time since last heartbeat directly. The node controller encodes cordoning and eviction in taints and conditions.
@smarterclayton We thought about such compatibility issues and this is why we still update lastHeartbeatTime periodically, but with an increasing period.
keps/0008-node-heartbeat.md
Outdated
object (if they relied on frequent NodeStatus changes)
1. further reduce NodeStatus updates frequency to not less often than once per
   1 minute.
We can’t stop periodically updating NodeStatus as it would be API breaking change,
Just to confirm, today's NodeStatus update will be kept, right?
In that case, isn't it true that old consumers may get a stale Node status?
I don't really understand the question here.
We will be updating NodeStatus less frequently (say once every 1m instead of every 10s). So NodeStatus will potentially be 1 minute old instead of 10s old.
But if something changes in the status, we will update it (please read the proposal carefully).
So the status itself won't be more stale than it is now - it's only "lastHeartbeatTimestamp" that will potentially be older, but users shouldn't really rely on that...
If something material changes in the node status, it will be updated by kubelet, as it is now.
If the node is deemed MIA, the node controller will update the node status and taints, as it does now.
Thanks, I am aware of the status updating design in this doc, but then what would be the API breaking change if we stopped the periodic NodeStatus update? It seems harmless.
keps/0008-node-heartbeat.md
Outdated
Note that currently (by default):

- Lack of NodeStatus update for 40s results in NodeController marking node
Actually, if some other component changed the Ready condition to Unknown, would node controller then set the taint correctly?
If not, it might be useful to decouple that code in node controller and create a way to disable the node controller's status checking so that someone could implement their own probing system if they wanted to.
why 40s and 5min, and are they configurable? I can see use cases in IoT where devices may be disconnected for minutes at a time, but had pods scheduled to them that should remain running.
@PatrickLang it's configurable via the node-monitor-grace-period and pod-eviction-timeout arguments to the node lifecycle controller. It's probably best to refer to these timings generically and just reference their defaults.
Changed to mention the appropriate settings mentioned by Derek.
Answering Brian's comment - yes, we should probably decouple that logic. But that's independent from this proposal (we should probably do it even without this proposal), so I will not mention it here to avoid making it too wide/complex.
keps/0008-node-heartbeat.md
Outdated
We propose introducing a new `LeaderElection` CRD with the following schema:

type LeaderElection struct {
I would call this Lease.
It's fine that the leader-election library would use it, since that's what that library does
Switched to Lease - that sounds fine to me too.
owning-sig: sig-node
participating-sigs:
  - sig-scalability
  - sig-apimachinery
I would add sig-scheduling, due to the image issue
done
keps/0008-node-heartbeat.md
Outdated
### Risks and Mitigations

Increasing default frequency of of NodeStatus updates may potential break customers
s/customers/clients/g
done
Thanks for comments. PTAL
The proposal looks good to me, but I hope @derekwaynecarr can take a look before I apply /lgtm label. He is out for OpenShift Summit this week.
keps/0008-node-heartbeat.md
Outdated
- CRDs are `the` way to created new APIs
- even though CRDs aren't very efficient due to lack of protobuf support now:
  - performance should be acceptable (because Lease object will be small)
    <br> TODO(wojtek-t): Working on microbenchmark to prove with data.
Can we simulate the problem in kubemark?
We can simulate the solution (once implemented) with kubemark, but we are not really measuring things like serialization time there, which is quite crucial from this perspective.
But yes - from Kubemark we will know whether it regressed our SLOs or not.
keps/0008-node-heartbeat.md
Outdated
- even though CRDs aren't very efficient due to lack of protobuf support now:
  - performance should be acceptable (because Lease object will be small)
    <br> TODO(wojtek-t): Working on microbenchmark to prove with data.
  - protobuf support for CRDs is on the roadmap (though not near term)
Added.
keps/0008-node-heartbeat.md
Outdated
HolderIdentity string `json:"holderIdentity"`
LeaseDurationSeconds int32 `json:"leaseDurationSeconds"`
AcquireTime metav1.Time `json:"acquireTime"`
RenewTime metav1.Time `json:"renewTime"`
Probably should be microtime (higher precision) except we haven’t gotten that in.
Actually, we have that already in:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/micro_time.go#L33
And it's already used in new events and in audit logs:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/apis/audit/v1alpha1/types.go#L131
And I agree we should use it here (probably for both acquire and renew times). Changed.
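To make the change concrete, the spec fields quoted above would then look roughly like this; the surrounding `LeaseSpec` wrapper name is assumed here, not taken from the diff.

```go
package coordination

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LeaseSpec mirrors the fields quoted above, with MicroTime used for the
// timestamps as agreed in this thread. Field set and wrapper name are illustrative.
type LeaseSpec struct {
	HolderIdentity       string           `json:"holderIdentity"`
	LeaseDurationSeconds int32            `json:"leaseDurationSeconds"`
	AcquireTime          metav1.MicroTime `json:"acquireTime"`
	RenewTime            metav1.MicroTime `json:"renewTime"`
}
```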
keps/0008-node-heartbeat.md
Outdated
## Proposal

We propose introducing a new `Lease` CRD. To make it easily
given where the discussion in #2090 (comment) ended up, I was surprised to see this still proposed as a CRD. Isn't it odd to build core function on top of extension APIs? The things CRDs let you do (easily disable, change schema, validation, etc) are all things I would not want any chance of happening to the resource my nodes used for heartbeating.
@bgrant0607 for more comments.
Personally my slight preference was also towards a built-in API, but Brian has a very strong opinion on using a CRD for it and ultimately he is the API reviewer.
That said, as long as this is not yet enabled by default, we will be free to change that. And purely from a code perspective, using a CRD vs a built-in API doesn't change much.
I agree with @liggitt that using a CRD for this does not seem right. If it's core to the function of Kubernetes, it should be core to the API.
I put this on the agenda for SIG Architecture for tomorrow (the 17th).
Sorry I missed the arch meeting, but 💯 to being in the built-in API
We have just discussed this during SIG Architecture.
Long story short, it seems that we want to pursue using CRDs even for important APIs, but given that CRDs are not yet feature-complete and given how core this object will be (necessary for Kubelets to work correctly), there seems to be a decision to use a built-in API.
The question I have now is: what API group should it belong to then? IIUC, we don't want to add anything new to the "core" api group. And I don't really see any other api group being a good match. So should we create a new one?
@bgrant0607 @smarterclayton @liggitt @lavalamp - I would like to hear your suggestions here.
The convincing argument was about risk.
We could (and should) benchmark CRDs, but since they don't have versioning support yet and since we don't know whether we could seamlessly transition storage from a CRD to a built-in API, if we ran into a performance or other problem, it may not be easy to resolve if we started as a CRD.
Yes, don't add any new APIs to the core API group. Besides, we don't want to add a beta-ish API to a v1 group.
We'd need to choose a group name even if it were a CRD.
Conceptually, I'd categorize it as "coordination".
And I believe we've agreed that new groups will be suffixed with .k8s.io.
So that would suggest coordination.k8s.io.
Thanks Brian!
I changed the doc to talk about the built-in API (and the coordination.k8s.io api group). And added some explanation for future reference that we also considered CRDs and why we decided on a built-in API.
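For reference, a sketch of what registering the new group could look like; this is illustrative only, and the `v1beta1` version is an assumption based on the plan to go straight to Beta.

```go
package coordination

import "k8s.io/apimachinery/pkg/runtime/schema"

// GroupName follows the agreement above that new groups are suffixed with .k8s.io.
const GroupName = "coordination.k8s.io"

// SchemeGroupVersion is illustrative; the actual initial version is up to the implementation.
var SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: "v1beta1"}
```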
Thanks for sharing the detailed explanation on etcd limits. Just a note: delta compression in etcd would be interesting, but I'm not sure how it would be implemented when etcd also stores Protocol Buffer encoded values. Same for other compression algorithms. We once evaluated transport-layer compression etcd-io/etcd#5186, but without much benefit. Would love to experiment again! For the near future, I would look into periodic compaction on etcd storage and running defragmentation to reclaim the disk space. There are plans to make this better as well etcd-io/etcd#9222. /cc @jpbetz @joelegasse Update: didn't mean to trigger the request review bot :0
@gyuho: GitHub didn't allow me to request PR reviews from the following users: joelegasse. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
I have a number of questions that would help to clarify.
It is not clear to me if the kubelet updates node conditions at a frequency less than the node status update frequency in the case of a meaningful change. If so, at what frequency?
It's not clear why this is not behind a feature gate of some kind as it's evaluated, rather than immediately moving to beta.
I also prefer a built-in API because I don't know if there is sufficient validation in place server side in the proposed solution. I don't always trust the kubelet.
EDITED - I prefer a built-in API over CRD
The Spec is effectively of already existing (and thus proved) [LeaderElectionRecord][].
The only difference is using `MicroTime` instead of `Time` for better precision.
That would hopefully allow us go get directly to Beta.
I think whether the API goes directly from inception to beta is orthogonal to the design, but seems to possibly influence the usage of CRDs versus built-in API.
I think there's a question of how we validate the new API other than in Kubemark and other synthetic tests if it starts as alpha and disabled by default.
I think whether the API goes directly from inception to beta is orthogonal to the design,
It's orthogonal to the design, it influences timeline when we can start using it.
but seems to possibly influence the usage of CRDs versus built-in API.
Why? I consider CRD vs built-in API a separate thing from versioning.
CRDs only currently support a single version per api group.
Got it.
That said, with a decision to proceed with built-in api, this part is no longer valid.
keps/0008-node-heartbeat.md
Outdated
We will use that object to represent node heartbeat - for each Node
there will be corresponding `Lease` object with Name equal
to Node name in a newly created dedicated namespace (we considered
using `kube-system` namespace but decided that it's already too overloaded).
I agree a net new namespace is preferred if the node name to lease name mapping is required.
Is the controller responsible for creating the namespace if needed so nodes do not require that permission?
I would prefer the namespace is immortal similar to default, kube-system, etc.
Yes - I completely agree - added to the doc.
1. Kubelet creates and periodically updates its own Lease object and frequency
   of those updates is independent from NodeStatus update frequency.

In the meantime, we will change `NodeController` to treat both updates of NodeStatus
to clarify, non-renewal of a lease corresponds to NodeReady=ConditionUnknown, but for tainting purposes we still require reading the conditions?
Not sure if I understand - can you please clarify?
i think the updated text lets me answer my own question. non-renewal of a lease will cause the node controller to mark the node ready condition as condition unknown. any other conditions sourced from the node will still be set when there is meaningful change, and the node controller will convert those conditions to taints as usual.
Yes - that is correct.
1. announce that we are going to reduce frequency of NodeStatus updates further
   and give people 1-2 releases to switch their code to use `Lease`
   object (if they relied on frequent NodeStatus changes)
1. further reduce NodeStatus updates frequency to not less often than once per
doesn't this impact how frequently the node can set conditions that inform scheduling?
if a node is not ready, has disk/memory pressure, etc. will a node report that condition at the normal 10s window, and only if there is no "meaningful" change it will report every 1m? i think 1m is far too long, as there are other services that would want to monitor node health status conditions separate from the lease.
That sentence from above is supposed to explain that:
Kubelet is reporting NodeStatus if:
- there was a meaningful change in it (initially we can probably assume that every
  change is meaningful, including e.g. images on the node)
- or it didn’t report it over last `node-status-update-period` seconds
So if any condition changed (no matter if the node is ready or not), Kubelet should report its status every 10s.
Only when all conditions (and other stuff) didn't change over the last 10s (or so), are we allowed to not report it for up to 1m (or whatever period we decide on).
@derekwaynecarr Is that clearer now? Do you have any suggestion to make that initial sentence I pasted above cleaner?
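For what it's worth, the rule described above can be sketched as follows; the helper is hypothetical, not actual Kubelet code.

```go
package main

import (
	"fmt"
	"time"
)

// shouldReportNodeStatus mirrors the rule above: report when something meaningful
// changed, or when the configured period has elapsed since the last report.
// The Lease object is renewed on its own, independent schedule.
func shouldReportNodeStatus(changedMeaningfully bool, lastReport time.Time, nodeStatusUpdatePeriod time.Duration, now time.Time) bool {
	return changedMeaningfully || now.Sub(lastReport) >= nodeStatusUpdatePeriod
}

func main() {
	now := time.Now()
	// Nothing changed and only 30s have elapsed with a 1m period: no report yet.
	fmt.Println(shouldReportNodeStatus(false, now.Add(-30*time.Second), time.Minute, now))
}
```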
further iterations of this proposal will need to debate what is meaningful or not, but yes, this is clear enough for now.
1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes
   (instead of 1 minute). That would help with performance/scalability even more.
   Caveats:
   - NodeProblemDetector is currently updating (some) node conditions every 1 minute
NPD is one consumer, but the other node conditions (disk pressure, memory pressure) that relate to eviction pressure transition period setting are also important. it seems to me we need the ability to report those condition changes at a shorter interval than how you propose to scale node status update frequency. maybe we need a separate value for condition changes versus other changes.
keps/0008-node-heartbeat.md
Outdated
Note that if we keep adding node conditions owned by other components, the
number of writes of Node object will go up. But that issue is separate from
that proposal.
1. Increasing default frequency of NodeStatus updates may potentially break clients
this is repeated later in the risks section. would just remove it.
done
keps/0008-node-heartbeat.md
Outdated
## Graduation Criteria

This can be immediately promoted to Beta, as the API is effectively a copy of
This seems odd. Are there reasons a feature gate is not chosen as well?
It feels like there are two distinct topics:
- graduation criteria of a `Lease` API
- modification of the kubelet and related components
Why is this different than TaintNodesByCondition or TaintBasedEvictions?
Regarding the feature gate: I agree that the kubelet and node controller behavior should be configurable. That should be decoupled from the API.
The API itself is effectively just data. It should be largely equivalent to the long-standing annotation-based implementation on ConfigMap and other resources. There isn't a controller side to the API implementation.
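For comparison, the record that the existing leader-election library stores as an annotation on ConfigMap/Endpoints objects looks roughly like the sketch below; treat the exact field set as approximate rather than a verbatim copy of client-go.

```go
package resourcelock

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LeaderElectionRecord is (approximately) the long-standing annotation-based record
// referenced above; the proposed Lease spec is essentially this data as a first-class object.
type LeaderElectionRecord struct {
	HolderIdentity       string      `json:"holderIdentity"`
	LeaseDurationSeconds int         `json:"leaseDurationSeconds"`
	AcquireTime          metav1.Time `json:"acquireTime"`
	RenewTime            metav1.Time `json:"renewTime"`
	LeaderTransitions    int         `json:"leaderTransitions"`
}
```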
Yeah - that's a good point.
I made it explicit that I was thinking about the API (to promote it directly to Beta).
I added a point to put the changes to kubelet/NodeController logic behind a feature gate. I think it should be enabled by default once done, but that's probably a separate discussion (though I put that into the doc).
@derekwaynecarr I clarified those in the comments. TL;DR:
Yes - if something changes it will be updated (with the same frequency as it is today - 10s).
Yes - it should be behind a feature gate - I added that to the doc. I think it should be enabled by default once implemented (I mean the feature gate for it), but that's a separate discussion.
PTAL
a few minor nits, but this is good for me.
/lgtm
/approve
As a result, we decided to proceed with built-in API.

We this new API in place, we will change Kubelet so that:
nit: With
Fixed in #2172
Once all the code changes are done, we will:

1. start updating `Lease` object every 10s by default, at the same time
nit: i know this is clarified earlier, but this is just changing the default frequency, but does not impact the frequency by which the node sends "meaningful" changes (i.e. changes in node condition) to the apiserver. it only impacts the frequency of non-meaningful changes being updated.
Done in follow up PR.
1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes
   (instead of 1 minute). That would help with performance/scalability even more.
   Caveats:
   - NodeProblemDetector is currently updating (some) node conditions every 1 minute
sgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: derekwaynecarr
Follow up PR: #2175
KEP: efficient node heartbeats
/assign @dchen1107 @derekwaynecarr
/reviewers @lavalamp @deads2k
/cc @bgrant0607 @yujuhong @liggitt @smarterclayton
@kubernetes/sig-node-api-reviews @kubernetes/sig-api-machinery-api-reviews