
KEP: efficient node heartbeats #2090

Merged: 1 commit merged into kubernetes:master from the node_heartbeat_kep branch on May 21, 2018

Conversation

wojtek-t (Member)

/assign @dchen1107 @derekwaynecarr
/reviewers @lavalamp @deads2k

/cc @bgrant0607 @yujuhong @liggitt @smarterclayton
@kubernetes/sig-node-api-reviews @kubernetes/sig-api-machinery-api-reviews

@k8s-ci-robot k8s-ci-robot added the sig/node, size/L, kind/api-change, sig/api-machinery, sig/architecture, and cncf-cla: yes labels on Apr 30, 2018
The underlying root cause is a combination of:

- etcd keeping both current state and transaction log with copy-on-write
- node heartbeats being potentially very large objects (even though
Member

it seems like it might be worthwhile to investigate and correct this... this is likely the "images present on node" list in node status. would it make more sense to move that out of the object instead?

Member Author

Images are only one potential problem (yes - we may consider moving them to a separate object, but that doesn't solve the generic problem of "heartbeats are expensive").

The second big problem is Volumes - we will soon support 100+ volumes per VM in GCE, and that has exactly the same issue.

So we really should solve this problem in a more generic way.

Member

I agree with Wojtek.

Also, the image info is unnecessarily large, regardless of where it is stored.

And I want to mitigate the proliferation of APIs.

Member

ok

Member Author

Incorporated into text

Member

Basically the node object is unbounded today. Besides images and volumes, there can be labels and annotations added to that API object. And updating it periodically (every 10s now) makes it very expensive.


We will use CRD (as opposed to built-in API), because:

- CRDs are `the` way to create new APIs
Member

I don't think this is quite accurate. It doesn't seem proper for a core kubernetes component to depend on the existence of a "custom" resource, which would be required to exist in all clusters for them to be functional.

Member Author

I'm not an API expert. I've been discussing it with @bgrant0607 and he explicitly said that it should be done with a CRD.
Personally, I don't really have a preference here, but I trust that Brian knows what he is saying.

Member

I'm not an API expert. I've been discussing it with @bgrant0607 and he explicitly said that it should be done with a CRD. Personally, I don't really have a preference here, but I trust that Brian knows what he is saying.

that doesn't seem proper... node heartbeating isn't an extension, add-on, or optional component... it's fundamental to a kubernetes cluster.

Member Author

@liggitt - is your concern only about CRD vs built-in API, or more?

To cite Brian on CRDs:

1. CRD is a mechanism for building APIs
2. Most new APIs will at least start as CRDs
3. Most new APIs will stay CRDs forever

But I will let @bgrant0607 comment more on that.

Member (@liggitt, May 1, 2018)

is your concern only about CRD vs built-in API, or more?

That's my main concern. An API the kubelet depends on is an API the core repo has to define, document, build clients for, and support. Most of those are easier and far more reliable for the kube-apiserver to do as a compiled-in type, rather than a CRD. Given that the schema stability and iteration time is tied to kube releases, I don't see many benefits to building these as a CRD, and see plenty of downsides.

Member

The existing leader election code is in client-go. It could go there, or elsewhere in k/k.

Member Author

I initially wanted to have this somewhere in k/k. But it's not a strong preference - client-go also sounds fine to me.

Member Author

Clarified where to put that code in the doc.
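
For context, the record that client-go's existing leader election library maintains today (serialized into an annotation on the lock object) looks roughly like the sketch below. This is paraphrased from k8s.io/client-go/tools/leaderelection/resourcelock, so treat the exact types as approximate:

```go
package heartbeat

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LeaderElectionRecord approximates the record that the client-go leader
// election library stores as an annotation on its lock object (an Endpoints
// or ConfigMap) today. Exact field types may differ slightly.
type LeaderElectionRecord struct {
	HolderIdentity       string      `json:"holderIdentity"`
	LeaseDurationSeconds int         `json:"leaseDurationSeconds"`
	AcquireTime          metav1.Time `json:"acquireTime"`
	RenewTime            metav1.Time `json:"renewTime"`
	LeaderTransitions    int         `json:"leaderTransitions"`
}
```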

Member

@wojtek-t Can you include your experimental benchmark on CRD vs. built-in in this KEP? I am fine with CRD; at least we can start with CRD first. If the performance concern remains, we can graduate that to built-in later.

Member Author

Technically, so far I only did a json vs protobuf benchmark on this particular object with some ~real data. I can create a better benchmark when I'm back from leave in a couple of days.

@dchen1107 that said - how exactly do you want this to be incorporated into this KEP? Just the results? Or maybe a link to a PR to the main repo? Or something else?
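
For illustration only, a microbenchmark of this shape could compare the cost of serializing a large status-like object against a small heartbeat-like one. The sketch below uses plain encoding/json and placeholder structs; it is not the benchmark wojtek-t ran, and a real comparison would use the actual Node/Lease types and the protobuf serializer as well:

```go
package heartbeat

import (
	"encoding/json"
	"testing"
)

// Placeholder stand-ins for a large NodeStatus-like object and a small
// heartbeat-like object; these are illustrative only.
type bigStatus struct {
	Images  []string `json:"images"`
	Volumes []string `json:"volumes"`
}

type smallHeartbeat struct {
	HolderIdentity       string `json:"holderIdentity"`
	LeaseDurationSeconds int32  `json:"leaseDurationSeconds"`
}

func benchmarkMarshal(b *testing.B, obj interface{}) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(obj); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkBigStatusJSON(b *testing.B) {
	// Roughly mimics a node reporting many images and volumes in its status.
	benchmarkMarshal(b, bigStatus{Images: make([]string, 500), Volumes: make([]string, 100)})
}

func BenchmarkSmallHeartbeatJSON(b *testing.B) {
	benchmarkMarshal(b, smallHeartbeat{HolderIdentity: "node-1", LeaseDurationSeconds: 40})
}
```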

This proposal presents a proper solution for that problem.


Note that currently:


Note that currently by default?

Member Author

Done


Other notes:

1. Additional advantage of using LeaderElection for that purpose would be the ability


exclude?

Member Author

Done

- there was a meaningful change in it (initially we can probably assume that every
change is meaningful, including e.g. images on the node)
- or it didn’t report it over last `node-status-update-period` seconds
1. Kubelet creates and periodically updates its own LeaderElection object and frequency
Member (@liggitt, May 1, 2018)

its own LeaderElection object

what name/pattern is reserved for kubelets to use? presumably we don't want to manually configure 5000 kubelets with distinct heartbeat object names

Member

I agree that should be explicitly specified in the proposal. It should be the node name, or possibly trivially derived from the node name, but isolating the resources by namespace should be good enough to not require prefixing IMO.

Member Author

Added to the doc above
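
To make the mechanics concrete, here is a minimal sketch of the kubelet-side loop described above, using the kube-system / kubelet-<node name> naming proposed at this point in the thread (later in the discussion this changes to a dedicated namespace without the prefix). The updateLease parameter is a placeholder for whatever client call ends up being used, not a real API:

```go
package heartbeat

import (
	"context"
	"log"
	"time"
)

// renewLoop sketches the kubelet-side heartbeat: update the per-node object
// on its own timer, independent of NodeStatus updates. updateLease is a
// placeholder for the real client call.
func renewLoop(ctx context.Context, nodeName string, period time.Duration,
	updateLease func(namespace, name string) error) {
	// Naming convention proposed at this point in the thread; later in the
	// discussion it changes to a dedicated namespace without the prefix.
	namespace, name := "kube-system", "kubelet-"+nodeName

	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := updateLease(namespace, name); err != nil {
				// A failed renewal is simply retried on the next tick.
				log.Printf("failed to renew heartbeat %s/%s: %v", namespace, name, err)
			}
		}
	}
}
```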


## Proposal

We propose introducing a new `LeaderElection` CRD with the following schema:
Member

is this a namespaced or cluster-scoped object?

Member Author

I think it should be namespaced, so that people would also be able to take advantage of it in their applications.

In that case, I would say that the object corresponding to a given Kubelet should be:
namespace: kube-system
name: kubelet-<node name>

Does that sound ok?

Member

Yes, it needs to be namespaced.

I'd put them in a new namespace and skip the prefix.

Member Author

Added to doc.

Contributor

What would the new namespace be called?
I assume the entire namespace is just scoped to node leases?
I guess we should add it to the set of immortal namespaces.

Member Author

What would the new namespace be called?

TBD

I assume the entire namespace is just scoped to node leases?

Yes.

I guess we should add it to the set of immortal namespaces.

Yes

Cons:

- Introduces a new, narrow-purpose API. Leader election is already used by other
components, implemented using annotations on Endpoints and ConfigMaps.
Member (@liggitt, May 1, 2018)

I don't see this as a con. Using a leader election API object for a single actor which doesn't make use of any of the "leader" aspects of it is pretty confusing.

The leader election API works well when you have functionally identical components sharing a single leader election lock object to determine who is the leader. It makes little sense when you have distinct components each given their own object.

Member

I think only the name "leader election" is confusing. This is a Lease API, which is what I'd call it.

The other actor is the node controller, which takes action when the Kubelet doesn't renew its lease in time. It just doesn't acquire the lease.

Member

Also note that my opinion on a "node heartbeat" API hasn't changed since the original proposal (kubernetes/kubernetes#14735) in 2015. In fact, with ~60 APIs now it's even more critical that we introduce new ones with great care.

// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
// +optional
ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"`
Member

what indicates the object is connected to a particular node? the name? the spec.holderIdentity?

Member Author

I think the name. As I mentioned above, the object corresponding to a given Kubelet should be:
namespace: kube-system
name: kubelet-<node name>

[holderIdentity will probably be the same as the name though]

Member

The name

Member Author

Added that explanation below.


In the meantime, we will change `NodeController` to treat both updates of NodeStatus
object as well as updates of the new `LeaderElection` object corresponding to a given
node as healthiness signal from a given Kubelet. This will make it work for both old
Member

how will the node controller watch/filter to objects that are related to nodes?

Member Author

I would say we should be watching the kube-system namespace.

Member

Or a dedicated namespace
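
To make "treat both as a health signal" concrete, here is a rough sketch of how the controller could combine the two signals; the names are illustrative and this is not the actual node lifecycle controller code:

```go
package heartbeat

import "time"

// lastHeartbeat returns the most recent health signal the controller has seen
// for a node: either a NodeStatus heartbeat or a renewal of the node's
// heartbeat object. Whichever is newer wins.
func lastHeartbeat(statusHeartbeat, leaseRenew time.Time) time.Time {
	if leaseRenew.After(statusHeartbeat) {
		return leaseRenew
	}
	return statusHeartbeat
}

// nodeMissing reports whether neither signal was observed within the grace
// period (40s by default today), i.e. the node should be marked as Unknown.
func nodeMissing(statusHeartbeat, leaseRenew time.Time, gracePeriod time.Duration, now time.Time) bool {
	return now.Sub(lastHeartbeat(statusHeartbeat, leaseRenew)) > gracePeriod
}
```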

of those updates is independent from NodeStatus update frequency.

In the meantime, we will change `NodeController` to treat both updates of NodeStatus
object as well as updates of the new `LeaderElection` object corresponding to a given
Contributor

This breaks backwards compatibility for any integration anyone has built watching node objects for changes. That's any custom scheduler, any alerting-based function, and probably a non-trivial number of custom node integrations built by users.

Member Author

Treating both as a healthiness signal? Why? Can you clarify that?

Member

We'll still write the status, just less frequently. It's effectively a configuration change.

Member

Also, schedulers shouldn't be watching elapsed time since last heartbeat directly. The node controller encodes cordoning and eviction in taints and conditions.

Member

@smarterclayton We thought about such compatibility issues, and this is why we still update lastHeartbeatTime periodically, but with an increased period.

@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch from d867040 to 38bab24 on May 1, 2018 05:55
object (if they relied on frequent NodeStatus changes)
1. further reduce NodeStatus updates frequency to not less often than once per
1 minute.
We can’t stop periodically updating NodeStatus as it would be API breaking change,
Contributor (@resouer, May 2, 2018)

Just to confirm, today's NodeStatus update will be kept, right?

But in this case, isn't it true that old consumers may get a stale status of the Node?

Member Author

I don't really understand the question here.

We will be updating NodeStatus less frequently (say once every 1m instead of every 10s). So NodeStatus will potentially be 1 minute old instead of 10s old.
But if something changes in the status, we will update it (please read the proposal carefully).

So the status itself won't be more stale than it is now - it's only "lastHeartbeatTimestamp" that will potentially be older, but users shouldn't really rely on that...

Member

If something material changes in the node status, it will be updated by kubelet, as it is now.

If the node is deemed MIA, the node controller will update the node status and taints, as it does now.

Contributor (@resouer, May 4, 2018)

Thanks, I am aware of the status-updating design in this doc, but then what would be the API-breaking change if we stopped the periodic NodeStatus updates? It seems harmless.


Note that currently (by default):

- Lack of NodeStatus update for 40s results in NodeController marking node
Member

Actually, if some other component changed the Ready condition to Unknown, would the node controller then set the taint correctly?

If not, it might be useful to decouple that code in node controller and create a way to disable the node controller's status checking so that someone could implement their own probing system if they wanted to.

Contributor

why 40s and 5min, and are they configurable? I can see use cases in IoT where devices may be disconnected for minutes at a time, but have pods scheduled to them that should remain running.

Member

@PatrickLang it's configurable via the node-monitor-grace-period and pod-eviction-timeout arguments to the node lifecycle controller. it's probably best to refer to these timings generically and just reference their defaults.

Member Author

Changed to mention the appropriate settings mentioned by Derek.

Answering Brian's comment - yes, we should probably decouple that logic. But that's independent of this proposal (we should probably do it even without this proposal), so I will not mention it here to avoid making it too wide/complex.


We propose introducing a new `LeaderElection` CRD with the following schema:
```
type LeaderElection struct {
Member

I would call this Lease.

It's fine that the leader-election library would use it, since that's what that library does

Member Author

Switched to Lease - that sounds fine to me too.

@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch 2 times, most recently from 75e0666 to 35463b3 on May 7, 2018 12:30
owning-sig: sig-node
participating-sigs:
- sig-scalability
- sig-apimachinery
Member

I would add sig-scheduling, due to the image issue

Member Author

done


### Risks and Mitigations

Increasing default frequency of of NodeStatus updates may potential break customers
Member

s/customers/clients/g

Member Author

done

@wojtek-t (Member Author) left a comment

Thanks for comments. PTAL

@dchen1107 (Member)
The proposal looks good to me, but I hope @derekwaynecarr can take a look before I apply the /lgtm label. He is out for OpenShift Summit this week.

- CRDs are `the` way to create new APIs
- even though CRDs aren't very efficient due to lack of protobuf support now:
- performance should be acceptable (because Lease object will be small)
<br> TODO(wojtek-t): Working on microbenchmark to prove with data.
Member

Can we simulate the problem in kubemark?

Member Author

We can simulate the solution (once implemented) with kubemark, but we don't really measure things like serialization time there, which is quite crucial from this perspective.
But yes - from Kubemark we will know whether it regressed our SLOs or not.

- even though CRDs aren't very efficient due to lack of protobuf support now:
- performance should be acceptable (because Lease object will be small)
<br> TODO(wojtek-t): Working on microbenchmark to prove with data.
- protobuf support for CRDs is on the roadmap (though not near term)
Member Author

Added.

HolderIdentity string `json:"holderIdentity"`
LeaseDurationSeconds int32 `json:"leaseDurationSeconds"`
AcquireTime metav1.Time `json:"acquireTime"`
RenewTime metav1.Time `json:"renewTime"`
Contributor

Probably should be microtime (higher precision) except we haven’t gotten that in.

Member Author

Actually, we have that already in:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/micro_time.go#L33
And it's already used in new events and in audit logs:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/apis/audit/v1alpha1/types.go#L131

And I agree we should use it here (probably for both acquire and renew times). Changed.
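
For reference, a sketch of what the object could look like after the rename to Lease and the switch to MicroTime agreed above; the fields follow the schema quoted in this PR and are illustrative only, not necessarily the final API:

```go
package coordination

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Lease is a sketch of the heartbeat object discussed in this thread: the
// LeaderElection schema quoted above, renamed to Lease, with MicroTime used
// for the timestamps as agreed here.
type Lease struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec LeaseSpec `json:"spec,omitempty"`
}

// LeaseSpec mirrors the fields of client-go's LeaderElectionRecord.
type LeaseSpec struct {
	HolderIdentity       string           `json:"holderIdentity"`
	LeaseDurationSeconds int32            `json:"leaseDurationSeconds"`
	AcquireTime          metav1.MicroTime `json:"acquireTime"`
	RenewTime            metav1.MicroTime `json:"renewTime"`
}
```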

@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch from 7fa7866 to 02359f0 on May 14, 2018 09:08

## Proposal

We propose introducing a new `Lease` CRD. To make it easily
Member

given where the discussion in #2090 (comment) ended up, I was surprised to see this still proposed as a CRD. Isn't it odd to build core function on top of extension APIs? The things CRDs let you do (easily disable, change schema, validation, etc) are all things I would not want any chance of happening to the resource my nodes used for heartbeating.

Member Author

@bgrant0607 for more comments

Personally, my slight preference was also towards a built-in API, but Brian has a very strong opinion on using a CRD for it, and ultimately he is the API reviewer.
That said, as long as this is not yet enabled by default, we will be free to change that. And purely from a code perspective, using a CRD vs a built-in API doesn't change much.

Member

I agree with @liggitt that using a CRD for this does not seem right. If it's core to the function of Kubernetes, it should be core to the API.

Member

I put this on the agenda for SIG Architecture for tomorrow (the 17th).

Member

Sorry I missed the arch meeting, but 💯 to being in the built-in API

Member Author

We have just discussed this during SIG Architecture.
Long story short, it seems that we want to pursue using CRDs even for important APIs, but given that CRDs are not feature-complete yet, and given how core this object will be (necessary for Kubelets to work correctly), there seems to be a decision to use a built-in API.

The question that I have now is: what API group should it belong to, then? IIUC, we don't want to add anything new to the "core" api group. And I don't really see any other API group that would be a good match. So should we create a new one?

@bgrant0607 @smarterclayton @liggitt @lavalamp - I would like to hear your suggestions here.

Member

The convincing argument was about risk.

We could (and should) benchmark CRDs, but since they don't have versioning support yet and since we don't know whether we could seamlessly transition storage from a CRD to a built-in API, if we ran into a performance or other problem, it may not be easy to resolve if we started as a CRD.

Yes, don't add any new APIs to the core API group. Besides, we don't want to add a beta-ish API to a v1 group.

We'd need to choose a group name even if it were a CRD.

Conceptually, I'd categorize it as "coordination".

And I believe we've agreed that new groups will be suffixed with .k8s.io.

So that would suggest coordination.k8s.io.

Member Author

Thanks Brian!

I changed the doc to talk about a built-in API (and the coordination.k8s.io api group). And added some explanation for future reference that we also considered CRDs and why we decided on a built-in API.

@gyuho (Member) commented May 15, 2018

Thanks for sharing the detailed explanation of etcd limits.

Just a note,

Delta compression in etcd would be interesting, but I'm not sure how it would be implemented when etcd also stores Protocol Buffer encoded values. Same for other compression algorithms. We once evaluated transport-layer compression (etcd-io/etcd#5186), but there was not much benefit. Would love to experiment again!

For the near future, I would look into periodic compaction of etcd storage and running defragmentation to reclaim the disk space. There are plans to make this better as well (etcd-io/etcd#9222).

/cc @jpbetz @joelegasse

Update: didn't mean to trigger request review bot :0

@k8s-ci-robot (Contributor)

@gyuho: GitHub didn't allow me to request PR reviews from the following users: joelegasse.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Just note:

Delta compression in etcd would be interesting, but just not sure how it will be implemented when etcd also stores Protocol Buffer encoded values. Same for other compression algorithms. We once evaluated transport-layer compression etcd-io/etcd#5186, but not much benefit. Would love to experiment again!

For the near future, I would look into periodic compaction on etcd storage and run defragmentation to reclaim the disk space. There are plans to make this better as well (etcd-io/etcd#9222).

/cc @jpbetz @joelegasse

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz May 15, 2018 21:35
@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch from 02359f0 to 52a5ab8 on May 16, 2018 14:09
@derekwaynecarr (Member) left a comment

I have a number of questions that would help to clarify.

It is not clear to me if the kubelet updates node conditions at a frequency less than the node status update frequency in the case of meaningful change. If so, at what frequency?

It's not clear why this is not behind a feature gate of some kind as it's evaluated, rather than immediately moving to beta.

I also prefer a built-in API because I don't know if there is sufficient validation in place server-side in the proposed solution. I don't always trust the kubelet.

EDITED - I prefer a built-in API over a CRD




The Spec is effectively the already existing (and thus proven) [LeaderElectionRecord][].
The only difference is using `MicroTime` instead of `Time` for better precision.
That would hopefully allow us to go directly to Beta.
Member

I think whether the API goes directly from inception to beta is orthogonal to the design, but seems to possibly influence the usage of CRDs versus built-in API.

Member

I think there's a question of how we validate the new API other than in Kubemark and other synthetic tests if it starts as alpha and disabled by default.

Member Author

I think whether the API goes directly from inception to beta is orthogonal to the design,

It's orthogonal to the design, but it influences the timeline for when we can start using it.

but seems to possibly influence the usage of CRDs versus built-in API.

Why? I consider CRD vs built-in API a separate thing from versioning.

Member

CRDs only currently support a single version per api group.

Member Author

Got it.
That said, with the decision to proceed with a built-in API, this part is no longer valid.

We will use that object to represent node heartbeat - for each Node
there will be corresponding `Lease` object with Name equal
to Node name in a newly created dedicated namespace (we considered
using `kube-system` namespace but decided that it's already too overloaded).
Member

I agree a net new namespace is preferred if the node name to lease name mapping is required.
Is the controller responsible for creating the namespace if needed, so nodes do not require that permission?
I would prefer the namespace be immortal, similar to default, kube-system, etc.

Member Author

Yes - I completely agree - added to the doc.

1. Kubelet creates and periodically updates its own Lease object and frequency
of those updates is independent from NodeStatus update frequency.

In the meantime, we will change `NodeController` to treat both updates of NodeStatus
Member

to clarify, non-renewal of a lease corresponds to NodeReady=ConditionUnknown, but for tainting purposes we still require reading the conditions?

Member Author

Not sure if I understand - can you please clarify?

Member

i think the updated text lets me answer my own question. non-renewal of a lease will cause the node controller to mark the node ready condition as condition unknown. any other conditions sourced from the node will still be set when there is a meaningful change, and the node controller will convert those conditions to taints as usual.

Member Author

Yes - that is correct.

1. announce that we are going to reduce frequency of NodeStatus updates further
and give people 1-2 releases to switch their code to use `Lease`
object (if they relied on frequent NodeStatus changes)
1. further reduce NodeStatus updates frequency to not less often than once per
Member

doesn't this impact how frequently the node can set conditions that inform scheduling?

if a node is not ready, has disk/memory pressure, etc., will the node report that condition at the normal 10s window, and only if there is no "meaningful" change will it report every 1m? i think 1m is far too long, as there are other services that would want to monitor node health status conditions separately from the lease.

Member Author

That sentence from above is supposed to explain that:

Kubelet is reporting NodeStatus if:
  - there was a meaningful change in it (initially we can probably assume that every
    change is meaningful, including e.g. images on the node)
  - or it didn’t report it over last `node-status-update-period` seconds

So if any condition changed (no matter whether the node is ready or not), Kubelet should report its status every 10s.
Only when all conditions (and other stuff) didn't change over the last 10s (or so) are we allowed to not report it for up to 1m (or whatever period we decide on).

@derekwaynecarr Is that more clear now? Do you have any suggestion to make that initial sentence I pasted above clearer?

Member

further iterations of this proposal will need to debate what is meaningful or not, but yes, this is clear enough for now.
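
A minimal sketch of the reporting rule quoted above, with placeholder types (this is not the actual kubelet code):

```go
package heartbeat

import (
	"reflect"
	"time"
)

// shouldReportNodeStatus sketches the rule quoted above: the kubelet sends a
// NodeStatus update if anything in it changed (initially every change counts
// as meaningful, including images), or if it has not reported for
// node-status-update-period seconds. The status arguments are placeholders.
func shouldReportNodeStatus(lastReported, current interface{},
	lastReportTime time.Time, updatePeriod time.Duration, now time.Time) bool {
	if !reflect.DeepEqual(lastReported, current) {
		return true // meaningful change: report at the usual (today 10s) cadence
	}
	return now.Sub(lastReportTime) >= updatePeriod // otherwise fall back to the slower periodic update
}
```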

1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes
(instead of 1 minute). That would help with performance/scalability even more.
Caveats:
- NodeProblemDetector is currently updating (some) node conditions every 1 minute
Member

NPD is one consumer, but the other node conditions (disk pressure, memory pressure) that relate to the eviction pressure transition period setting are also important. it seems to me we need the ability to report those condition changes at a shorter interval than how you propose to scale node status update frequency. maybe we need a separate value for condition changes versus other changes.

Note that if we keep adding node conditions owned by other components, the
number of writes of Node object will go up. But that issue is separate from
that proposal.
1. Increasing default frequency of NodeStatus updates may potentially break clients
Member

this is repeated later in the risks section. would just remove it.

Member Author

done


## Graduation Criteria

This can be immediately promoted to Beta, as the API is effectively a copy of
Member

This seems odd. Are there reasons a feature gate is not chosen as well?

It feels like there are two distinct topics:

  1. graduation criteria of a Lease API
  2. modification of the kubelet and related components

Why is this different than TaintNodesByCondition or TaintBasedEvictions?

Member

Regarding the feature gate: I agree that the kubelet and node controller behavior should be configurable. That should be decoupled from the API.

The API itself is effectively just data. It should be largely equivalent to the long-standing annotation-based implementation on ConfigMap and other resources. There isn't a controller side to the API implementation.

Member Author

Yeah - that's a good point.
I made it explicit that I was thinking about the API (to promote it directly to Beta).

I added a point to put the changes to kubelet/NodeController logic behind a feature gate. I think it should be enabled by default once done, but that's probably a separate discussion (though I put that into the doc).

@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch from 52a5ab8 to cd37b1a on May 17, 2018 07:57
@wojtek-t (Member Author)

I have a number of questions that would help to clarify.

@derekwaynecarr I clarified those in the comments. TL;DR

It is not clear to me if the kubelet updates node conditions at a frequency less than the node status
update frequency in the case of meaningful change. If so, at what frequency?

Yes - if something changes it will be updated (with the same frequency as it is today - 10s).

It's not clear why this is not behind a feature gate of some kind as it's evaluated, rather than immediately
moving to beta.

Yes - it should be behind a feature gate - I added it to the doc. I think it should be enabled by default once implemented (I mean the feature gate for it), but that's a separate discussion.

@wojtek-t wojtek-t force-pushed the node_heartbeat_kep branch from cd37b1a to 5465e31 on May 18, 2018 08:08
@wojtek-t (Member Author) left a comment

PTAL



@derekwaynecarr (Member) left a comment

a few minor nits, but this is good for me.

/lgtm
/approve

As a result, we decided to proceed with built-in API.


We this new API in place, we will change Kubelet so that:
Member

nit: With

Member Author

Fixed in #2172



Once all the code changes are done, we will:

1. start updating `Lease` object every 10s by default, at the same time
Member

nit: i know this is clarified earlier, but this is just changing the default frequency; it does not impact the frequency at which the node sends "meaningful" changes (i.e. changes in node condition) to the apiserver. it only impacts the frequency of non-meaningful changes being updated.

Member Author

Done in follow up PR.


1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes
(instead of 1 minute). That would help with performance/scalability even more.
Caveats:
- NodeProblemDetector is currently updating (some) node conditions every 1 minute
Member

sgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label on May 21, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label on May 21, 2018
@k8s-ci-robot k8s-ci-robot merged commit 8f7aae4 into kubernetes:master May 21, 2018
@wojtek-t (Member Author) left a comment

Follow up PR: #2175

