diff --git a/keps/0008-node-heartbeat.md b/keps/0008-node-heartbeat.md
new file mode 100644
index 00000000000..59d824dd8a6
--- /dev/null
+++ b/keps/0008-node-heartbeat.md
@@ -0,0 +1,376 @@
+---
+kep-number: 8
+title: Efficient Node Heartbeat
+authors:
+ - "@wojtek-t"
+ - "with input from @bgrant0607, @dchen1107, @yujuhong, @lavalamp"
+owning-sig: sig-node
+participating-sigs:
+ - sig-scalability
+ - sig-apimachinery
+reviewers:
+ - "@deads2k"
+ - "@lavalamp"
+approvers:
+ - "@dchen1107"
+ - "@derekwaynecarr"
+editor: TBD
+creation-date: 2018-04-27
+last-updated: 2018-04-27
+status: provisional
+see-also:
+ - https://github.com/kubernetes/kubernetes/issues/14733
+ - https://github.com/kubernetes/kubernetes/pull/14735
+replaces:
+ - n/a
+superseded-by:
+ - n/a
+---
+
+# Efficient Node Heartbeats
+
+## Table of Contents
+
+* [Efficient Node Heartbeats](#efficient-node-heartbeats)
+ * [Table of Contents](#table-of-contents)
+ * [Summary](#summary)
+ * [Motivation](#motivation)
+ * [Goals](#goals)
+ * [Non-Goals](#non-goals)
+ * [Proposal](#proposal)
+ * [Risks and Mitigations](#risks-and-mitigations)
+ * [Graduation Criteria](#graduation-criteria)
+ * [Implementation History](#implementation-history)
+ * [Alternatives](#alternatives)
+ * [Dedicated “heartbeat” object instead of “leader election” one](#dedicated-heartbeat-object-instead-of-leader-election-one)
+ * [Events instead of dedicated heartbeat object](#events-instead-of-dedicated-heartbeat-object)
+ * [Reuse the Component Registration mechanisms](#reuse-the-component-registration-mechanisms)
+ * [Split Node object into two parts at etcd level](#split-node-object-into-two-parts-at-etcd-level)
+ * [Delta compression in etcd](#delta-compression-in-etcd)
+ * [Replace etcd with other database](#replace-etcd-with-other-database)
+
+## Summary
+
+Node heartbeats are necessary for the correct functioning of a Kubernetes cluster.
+This proposal makes them significantly cheaper from both a scalability and a
+performance perspective.
+
+## Motivation
+
+While running various scalability tests, we observed that in large enough clusters
+(more than 2000 nodes) with a non-trivial number of images used by pods on all
+nodes (10-15), we were hitting the etcd limit for its database size. That effectively
+means that etcd enters "alarm mode" and stops accepting all write requests.
+
+The underlying root cause is a combination of:
+
+- etcd keeping both the current state and the transaction log with copy-on-write
+- node heartbeats being potentially very large objects (note that images
+  are only one potential problem; the second is volumes, and customers
+  want to mount 100+ volumes to a single node) - they may easily exceed 15kB;
+  even though the patch sent over the network is small, etcd stores the
+  whole Node object
+- Kubelet sending heartbeats every 10s
+
+This proposal presents a proper solution to that problem.
+
+
+Note that currently (by default):
+
+- Lack of a NodeStatus update for 40s results in NodeController marking the node
+  as NotReady (pods are no longer scheduled onto that node)
+- Lack of NodeStatus updates for 5m results in NodeController starting pod
+  evictions from that node
+
+We would like to preserve that behavior.
+
+
+### Goals
+
+- Reduce size of etcd by making node heartbeats cheaper
+
+### Non-Goals
+
+The following are nice-to-haves, but not primary goals:
+
+- Reduce resource usage (cpu/memory) of control plane (e.g. due to processing
+ less and/or smaller objects)
+- Reduce watch-related load on Node objects
+
+## Proposal
+
+We propose introducing a new `LeaderElection` CRD with the following schema:
+```go
+type LeaderElection struct {
+  metav1.TypeMeta `json:",inline"`
+  // Standard object's metadata.
+  // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
+  // +optional
+  ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"`
+
+  // Specification of the LeaderElection.
+  // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status
+  // +optional
+  Spec LeaderElectionSpec `json:"spec,omitempty"`
+}
+
+type LeaderElectionSpec struct {
+  // holderIdentity is the identity of the holder of the current lease.
+  HolderIdentity string `json:"holderIdentity"`
+  // leaseDurationSeconds is the duration that candidates for the lease need
+  // to wait to force acquire it.
+  LeaseDurationSeconds int32 `json:"leaseDurationSeconds"`
+  // acquireTime is the time when the current lease was acquired.
+  AcquireTime metav1.Time `json:"acquireTime"`
+  // renewTime is the time when the current holder of the lease last renewed it.
+  RenewTime metav1.Time `json:"renewTime"`
+  // leaderTransitions is the number of transitions of the lease between holders.
+  LeaderTransitions int32 `json:"leaderTransitions"`
+}
+```
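+
+For illustration only, here is a sketch of the object a Kubelet might maintain for
+its node (assuming the types above plus `metav1` from `k8s.io/apimachinery`; the
+helper name, object name, and field values are illustrative, not prescribed by this
+proposal):
+
+```go
+// newNodeHeartbeat is a hypothetical helper: one LeaderElection object per node,
+// created once and then renewed periodically by the Kubelet.
+func newNodeHeartbeat(nodeName string) *LeaderElection {
+  now := metav1.Now()
+  return &LeaderElection{
+    ObjectMeta: metav1.ObjectMeta{
+      Name: nodeName, // assumption: the object is named after the node
+    },
+    Spec: LeaderElectionSpec{
+      HolderIdentity:       nodeName,
+      LeaseDurationSeconds: 40,  // illustrative; e.g. matching the 40s NotReady grace period
+      AcquireTime:          now, // set once, on creation
+      RenewTime:            now, // bumped by the Kubelet on every subsequent heartbeat
+      LeaderTransitions:    0,
+    },
+  }
+}
+```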
+
+The Spec is effectively a copy of the already existing (and thus proven) [LeaderElectionRecord][].
+That would hopefully allow us to go directly to Beta.
+
+We will use a CRD (as opposed to a built-in API), because:
+
+- CRDs are `the` way to create new APIs
+- even though CRDs aren't very efficient now due to lack of protobuf support:
+  - performance should be acceptable (because the LeaderElection object will be small)
+
+    TODO(wojtek-t): Working on a microbenchmark to prove this with data.
+
+  - protobuf support for CRDs is on the roadmap (though not near term)
+
+We will change Kubelet so that:
+
+1. Kubelet keeps computing NodeStatus every 10s (as it does now), but that will
+   be independent of reporting the status
+1. Kubelet reports NodeStatus if:
+   - there was a meaningful change in it (initially we can probably assume that every
+     change is meaningful, including e.g. images on the node)
+   - or it didn't report it over the last `node-status-update-period` seconds
+1. Kubelet creates and periodically updates its own LeaderElection object, and the
+   frequency of those updates is independent of the NodeStatus update frequency
+   (see the sketch below).
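+
+Below is a minimal sketch of this Kubelet-side logic. It is an illustration only:
+`NodeStatus`, `Kubelet`, and the helper methods are stand-ins introduced just for
+this sketch, and the periods are illustrative defaults, not final values.
+
+```go
+package main
+
+import (
+  "reflect"
+  "time"
+)
+
+type NodeStatus struct{} // stand-in for v1.NodeStatus (conditions, images, volumes, ...)
+
+type Kubelet struct {
+  lastReported   NodeStatus
+  lastReportTime time.Time
+}
+
+func (kl *Kubelet) computeNodeStatus() NodeStatus { return NodeStatus{} } // stub
+func (kl *Kubelet) reportNodeStatus(NodeStatus)   {}                      // stub: PATCH the Node object
+func (kl *Kubelet) renewHeartbeat()               {}                      // stub: bump LeaderElection Spec.RenewTime
+
+const (
+  statusComputePeriod    = 10 * time.Second // keep computing NodeStatus every 10s, as today
+  nodeStatusUpdatePeriod = 1 * time.Minute  // illustrative: report an unchanged status at most this rarely
+  heartbeatRenewPeriod   = 10 * time.Second // renew the heartbeat object independently
+)
+
+// statusLoop computes NodeStatus on every tick but reports it only when it
+// meaningfully changed or when node-status-update-period has elapsed.
+func (kl *Kubelet) statusLoop() {
+  for range time.Tick(statusComputePeriod) {
+    status := kl.computeNodeStatus()
+    if !reflect.DeepEqual(status, kl.lastReported) ||
+      time.Since(kl.lastReportTime) >= nodeStatusUpdatePeriod {
+      kl.reportNodeStatus(status)
+      kl.lastReported, kl.lastReportTime = status, time.Now()
+    }
+  }
+}
+
+// heartbeatLoop renews the node's LeaderElection object on its own schedule,
+// independently of NodeStatus reporting.
+func (kl *Kubelet) heartbeatLoop() {
+  for range time.Tick(heartbeatRenewPeriod) {
+    kl.renewHeartbeat()
+  }
+}
+
+func main() {
+  kl := &Kubelet{}
+  go kl.heartbeatLoop()
+  kl.statusLoop()
+}
+```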
+
+In the meantime, we will change `NodeController` to treat both updates of the NodeStatus
+object and updates of the new `LeaderElection` object corresponding to a given
+node as a healthiness signal from that node's Kubelet. This will make it work for both
+old and new Kubelets.
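+
+For clarity, a hedged sketch of how `NodeController` could combine the two signals;
+the function names are hypothetical, and 40s is the existing default grace period
+mentioned above:
+
+```go
+package nodecontroller
+
+import "time"
+
+// nodeMonitorGracePeriod mirrors the existing 40s default after which a node
+// without any heartbeat is considered NotReady.
+const nodeMonitorGracePeriod = 40 * time.Second
+
+// lastHeartbeat picks the most recent health signal, whether it came from a
+// NodeStatus update (old Kubelets) or a LeaderElection renewal (new Kubelets).
+func lastHeartbeat(lastStatusUpdate, lastLeaderElectionRenew time.Time) time.Time {
+  if lastLeaderElectionRenew.After(lastStatusUpdate) {
+    return lastLeaderElectionRenew
+  }
+  return lastStatusUpdate
+}
+
+// nodeIsHealthy reports whether the node produced any heartbeat within the grace period.
+func nodeIsHealthy(lastStatusUpdate, lastLeaderElectionRenew, now time.Time) bool {
+  return now.Sub(lastHeartbeat(lastStatusUpdate, lastLeaderElectionRenew)) < nodeMonitorGracePeriod
+}
+```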
+
+We should also:
+
+1. audit all other existing core controllers to verify whether they require
+   similar changes in their logic ([ttl controller][] being one example)
+1. change controller manager to auto-register the `LeaderElection` CRD
+1. ensure that the `LeaderElection` object is deleted when the corresponding node is
+   deleted (probably via owner references; see the sketch after this list)
+1. [out-of-scope] migrate all existing leader election code to use that CRD
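+
+To illustrate the owner-reference point above, a sketch (assuming the hypothetical
+`LeaderElection` types from the earlier examples, plus `metav1` and
+`v1 "k8s.io/api/core/v1"`; the helper name is made up for this illustration):
+
+```go
+// setNodeOwnerReference makes the per-node LeaderElection object owned by its
+// Node so that the garbage collector deletes it together with the Node.
+func setNodeOwnerReference(heartbeat *LeaderElection, node *v1.Node) {
+  heartbeat.ObjectMeta.OwnerReferences = []metav1.OwnerReference{{
+    APIVersion: "v1",
+    Kind:       "Node",
+    Name:       node.Name,
+    UID:        node.UID,
+  }}
+}
+```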
+
+Once all the code changes are done, we will:
+
+1. start updating the `LeaderElection` object every 10s by default, at the same time
+   reducing the frequency of NodeStatus updates, initially to 20s by default.
+   We will reduce it further later.
+
+   TODO: That still results in a higher average QPS. It should be acceptable but
+   needs to be verified.
+
+1. announce that we are going to reduce the frequency of NodeStatus updates further
+   and give people 1-2 releases to switch their code to use the `LeaderElection`
+   object (if they relied on frequent NodeStatus changes)
+1. further reduce the frequency of NodeStatus updates, to no less often than
+   once per minute.
+   We can't stop updating NodeStatus periodically, as that would be an API-breaking
+   change, but it's fine to reduce its frequency (though we should continue writing it
+   at least once per eviction period).
+
+
+To be considered:
+
+1. We may consider reducing the frequency of NodeStatus updates to once every 5 minutes
+   (instead of 1 minute). That would help with performance/scalability even more.
+   Caveats:
+   - NodeProblemDetector is currently updating (some) node conditions every 1 minute
+     (unconditionally, because lastHeartbeatTime always changes). To make the reduction
+     of NodeStatus update frequency really useful, we should also change NPD to
+     work in a similar mode (check periodically whether a condition changed, but report
+     only when something changed or no status was reported for a given time) and decrease
+     its reporting frequency too.
+   - In general, we recommend keeping the frequencies of NodeStatus reporting in both
+     Kubelet and NodeProblemDetector in sync (once all changes are done), and
+     that should be reflected in the [NPD documentation][].
+   - Note that reducing the frequency to 1 minute already gives us an almost 6x improvement.
+     That seems more than enough for any foreseeable future, assuming we won't
+     significantly increase the size of the Node object.
+     Note that if we keep adding node conditions owned by other components, the
+     number of writes of the Node object will go up. But that issue is separate from
+     this proposal.
+1. Reducing the default frequency of NodeStatus updates may potentially break customers
+   relying on frequent Node object updates. However, in non-managed solutions,
+   customers will still be able to restore the previous behavior by setting appropriate
+   flag values. Thus, changing the defaults to what we recommend is the way to go.
+
+Other notes:
+
+1. An additional advantage of using LeaderElection for that purpose would be the
+   ability to exclude it from the audit profile and thus reduce the audit log footprint.
+
+[LeaderElectionRecord]: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L37
+[ttl controller]: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/ttl/ttl_controller.go#L155
+[NPD documentation]: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/
+
+### Risks and Mitigations
+
+Reducing the default frequency of NodeStatus updates may potentially break customers
+relying on frequent Node object updates. However, in non-managed solutions, customers
+will still be able to restore the previous behavior by setting appropriate flag values.
+
+## Graduation Criteria
+
+This can be immediately promoted to Beta, as the API is effectively a copy of the
+already existing LeaderElectionRecord.
+
+This will be promoted to GA once it has spent a sufficient amount of time as Beta
+with no changes.
+
+## Implementation History
+
+- YYYY-MM-DD: KEP Summary, Motivation and Proposal merged
+
+## Alternatives
+
+We considered a number of alternatives; the most important ones are described below.
+
+### Dedicated “heartbeat” object instead of “leader election” one
+
+Instead of introducing and using a “leader election” object, we considered
+introducing a dedicated “heartbeat” object for that purpose. Apart from that,
+all the details of the solution remain pretty much the same.
+
+Pros:
+
+- Conceptually easier to understand what the object is for
+
+Cons:
+
+- Introduces a new, narrow-purpose API. Leader election is already used by other
+ components, implemented using annotations on Endpoints and ConfigMaps.
+
+### Events instead of dedicated heartbeat object
+
+Instead of introducing a dedicated object, we considered using the “Event” object
+for that purpose. At a high level the solution looks very similar.
+The differences from the initial proposal are:
+
+- we use the existing “Event” API instead of introducing a new one
+- we create a dedicated namespace; events that should be treated as a healthiness
+  signal by NodeController will be written by Kubelets (unconditionally) to that
+  namespace
+- NodeController will watch only Events from that namespace to avoid
+  processing all events in the system (the volume of all events will be huge)
+- the dedicated namespace also helps with security - we can grant write access to
+  that namespace only to Kubelets
+
+Pros:
+
+- No need to introduce a new API
+  - Due to that, we could use this approach much earlier.
+- We already need to optimize event throughput - the separate etcd instance we have
+  for them may help with tuning
+- Low-risk roll-forward/roll-back: no new object is involved (node controller
+  starts watching events, Kubelet just reduces the frequency of heartbeats)
+
+Cons:
+
+- Events are conceptually “best-effort” in the system:
+  - they may be silently dropped in case of problems in the system (the event recorder
+    library doesn't retry on errors, e.g. so as not to make things worse when the
+    control plane is starved)
+  - currently, components reporting events don't even know whether reporting succeeded
+    (the library is built in a way that you throw an event into it and are not notified
+    whether it was successfully submitted).
+    A Kubelet sending any other kind of update has full control over how/whether to
+    retry on errors.
+  - lack of fairness mechanisms means that even when some events are being successfully
+    sent, there is no guarantee that any event from a given Kubelet will be submitted
+    over a given time period
+
+  So this would require a different mechanism for reporting those “heartbeat” events.
+- Once we have a “request priority” concept, events should probably have the lowest
+  priority. However, node heartbeats are one of the most important signals in the
+  system: even though no single heartbeat is important, the guarantee that some
+  heartbeats will be successfully delivered is crucial (not delivering any of them
+  results in unnecessary evictions or in not scheduling pods to a given node).
+  So heartbeats should have the highest priority.
+- No core component in the system is currently watching events
+  - it would make the system's operation harder to explain
+- Users watch Node objects for heartbeats (even though we didn't recommend it).
+  Introducing a new object for the purpose of heartbeats allows those users to
+  migrate, while using events for that purpose breaks that ability. (Watching events
+  may also put us in a tough situation for performance reasons.)
+- Deleting all events (e.g. events etcd failure + playbook response) should continue
+  not to cause a catastrophic failure, and the design will need to account for this.
+
+### Reuse the Component Registration mechanisms
+
+Kubelet is one of the control-plane components (a shared controller). Some time ago, the
+Component Registration proposal converged into three parts:
+
+- Introducing an API for registering non-pod endpoints, including readiness information: #18610
+- Changing endpoints controller to also watch those endpoints
+- Identifying some of those endpoints as “components”
+
+We could reuse that mechanism to represent Kubelets via the non-pod endpoints API.
+
+Pros:
+
+- Utilizes an API that is desired anyway
+
+Cons:
+
+- Requires introducing that new API
+- Stabilizing the API would take some time
+- Implementing that API requires multiple changes in different components
+
+### Split Node object into two parts at etcd level
+
+We could stick to the existing Node API and solve the problem at the storage layer.
+At a high level, this means splitting the Node object into two parts in etcd (the
+frequently modified one and the rest).
+
+Pros:
+
+- No need to introduce new API
+- No need to change any components other than kube-apiserver
+
+Cons:
+
+- Very complicated to support watch
+- Not very generic (e.g. splitting Spec and Status doesn't help; it would need to be
+  exactly the heartbeat part)
+- [minor] Doesn't reduce the amount of data that has to be processed in the system
+  (writes, reads, watches, …)
+
+### Delta compression in etcd
+
+An alternative to the above would be to solve this entirely at the etcd layer. To
+achieve that, instead of storing full updates in the etcd transaction log, we would
+store only “deltas” and snapshot the whole object only every X seconds/minutes.
+
+Pros:
+
+- Doesn’t require any changes to any Kubernetes components
+
+Cons:
+
+- Computing deltas is tricky (etcd doesn't understand the Kubernetes data model, and
+  the delta between two protobuf-encoded objects is not necessarily small)
+- May require a major rewrite of etcd code and might not even be accepted by its maintainers
+- More expensive computationally to get an object at a given resource version (which
+  is what e.g. watch is doing)
+
+### Replace etcd with other database
+
+Instead of using etcd, we may also consider using some other open-source solution.
+
+Pros:
+
+- Doesn’t require a new API
+
+Cons:
+
+- We don’t even know whether a solution exists that solves our problems and can be used.
+- A migration would take us years.
diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER
index 45a4fb75db8..ec635144f60 100644
--- a/keps/NEXT_KEP_NUMBER
+++ b/keps/NEXT_KEP_NUMBER
@@ -1 +1 @@
-8
+9