Commit 36daee4: Add KEP for avoid serializing the same object multiple times

wojtek-t committed Mar 27, 2019 (1 parent 47249c4)

1 changed file: keps/sig-api-machinery/20190329-less-object-serializations.md (258 additions)

---
title: Less object serializations
authors:
- "@wojtek-t"
owning-sig: sig-apimachinery
participating-sigs:
- sig-scalability
reviewers:
- "@jpbetz"
- "@justinsb"
- "@smarterclayton"
approvers:
- "@deads2k"
- "@lavalamp"
creation-date: 2019-03-27
last-updated: 2019-03-27
status: provisional
see-also:
- TODO
replaces:
- n/a
superseded-by:
- n/a
---

# Less object serializations

## Table of Contents

* [Less object serializations](#less-object-serializations)
* [Table of Contents](#table-of-contents)
* [Release Signoff Checklist](#release-signoff-checklist)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
* [Version Skew Strategy](#version-skew-strategy)
* [Implementation History](#implementation-history)
* [Drawbacks](#drawbacks)
* [Alternatives](#alternatives)

## Release Signoff Checklist

**ACTION REQUIRED:** In order to merge code into a release, there must be an issue in [kubernetes/enhancements] referencing this KEP and targeting a release milestone **before [Enhancement Freeze](https://github.com/kubernetes/sig-release/tree/master/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core Kubernetes i.e., [kubernetes/kubernetes], we require the following Release Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released.

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

Scalability and performance of kube-apiserver is crucial for the scalability
of the whole Kubernetes cluster. Given that kube-apiserver is a cpu-intensive
process, scaling a single instance of it translates to optimizing the amount
of work needed to process a request (both cpu cycles and the amount of
allocated memory, as memory management is a significant part of the work done
by kube-apiserver).

This proposal aims to significantly reduce the amount of work spent on
serializing objects, as well as the amount of memory allocated while doing so.

## Motivation

Running different types of scalability tests and analyzing large production
clusters proves that a large number of watchers watching the same set of
objects may put significant load on kube-apiserver. An extreme example of it is
[#75294][], where creation of a single large Endpoints object (almost 1MB in
size, due to the 5k pods backing it) in a 5k-node cluster can completely
overload kube-apiserver for 5 seconds.

The main reason for that is that for every watcher (Endpoints are watched by
the kube-proxy running on every node), kube-apiserver independently serializes
(which also requires a deep copy of) every single object sent via that watch.

While this problem is most visible for watch, the situation is the same
for regular GET/LIST operations - reading the same object N times results
in serializing it N times independently.

This proposal presents a solution for that problem.

[#75294]: https://github.com/kubernetes/kubernetes/issues/75294

### Goals

- Reduce load on kube-apiserver, by not serializing the same object multiple
times wherever it is possible.

### Non-Goals

- Change overall architecture of the system, by changing what data is being
read/watched by different components.

## Proposal

This proposal does not introduce any user-visible changes - the proposed
changes are purely an implementation detail of kube-apiserver.

We propose introducing an internal type implementing the `runtime.Object`
interface, that encapsulates the original object and additionally accumulates
its serialized versions. It will look like this:
```go
// TODO: Better name is welcome :)
type CachingObject struct {
	// Object is the object (potentially in the internal version)
	// for which serializations should be cached for future reuse.
	Object Object

	// Versioned is a list of cached versioned serializations.
	// Note that a list is needed, because a given object can be serialized
	// to different external versions.
	Versioned []*CachingVersionedObject
}

// TODO: Better name is welcome :)
type CachingVersionedObject struct {
	// Object is the versioned object.
	Object Object

	// Serializations is a list of cached serializations of Object.
	// Note that a list is needed, because a given versioned object can be
	// serialized to different media types.
	//
	// TODO: Maybe we should be more explicit and just store one for json
	// and one for protobuf?
	Serializations []*serializedObject
}

// TODO: Better name is welcome :)
type serializedObject struct {
	once sync.Once

	// mediaType to which the object is serialized.
	mediaType string
	// raw is the serialized object.
	raw []byte
	// err is the error from serialization.
	err error
}
```
Note that both `CachingObject` and `CachingVersionedObject` should implement
`runtime.Object` interface.

For those two objects, we will implement `Marshal()` and `MarshalTo()` (for
protobuf) and `MarshalJSON()` (for json) methods that:
- first check whether such a serialization has already been done, and if so
  return the cached result
- otherwise perform the serialization, cache the result, and return it
  to the caller

With that, we will need to modify `versioning.codec` so that when encoding
a `CachingObject`, it will perform the necessary conversion, if needed, to the
`CachingVersionedObject`.

The remaining question is what is responsible for creating the
`CachingObject`. We will put that responsibility into `watchCache` - when
receiving an event via watch from etcd, it will wrap the object into a
`CachingObject` and operate on objects of that type from then on.

Note that since objects are already wrapped in `watchCache`, we also
avoid unnecessary serialization when GET/LIST requests are served
from the cache.

One open question is how we decide what exactly to cache. Caching
everything stored in `watchCache` seems expensive from a memory-usage
point of view. What we suggest is to allow configuring individual
`registries` with the exact versions we want to cache, and to propagate
that information down to `Cacher`. That would give us the ability to switch
it on for the most demanding resources (e.g. only Node, Endpoints and Pod
from the core/v1 group).

Note that based on a [POC][] (slightly different than the above design),
the gains of implementing it include:
- eliminating kube-apiserver unresponsiveness in case of a write of
a single huge Endpoints object: [#75294#comment-472728088][]
- ~7% lower cpu usage
- XX% fewer memory allocations

[POC]: https://github.com/kubernetes/kubernetes/pull/60067
[#75294#comment-472728088]: https://github.com/kubernetes/kubernetes/issues/75294#issuecomment-472728088

### Risks and Mitigations

Given that this proposal doesn't introduce any user-visible change, the only
risk is bugs in the implementation. Fortunately, the serialization code is
widely exercised by all end-to-end tests, so any bugs should be caught by
those or by unit tests of the newly added logic.

## Design Details

### Test Plan

* Unit tests covering all corner cases.
* Regular e2e tests are passing.

### Graduation Criteria

* All existing e2e tests are passing.
* Scalability tests confirm gains of that change.

Given that this is purely an implementation detail, we are not going to
expose it to users via a flag or feature gate (though the implementation
will be done behind some kind of hard-coded guard).

### Upgrade / Downgrade Strategy

Not applicable

### Version Skew Strategy

Not applicable

## Implementation History

- 2019-03-27: Created

## Drawbacks

This proposal doesn't improve performance of read requests served
from etcd. See the [Alternatives](#alternatives) section for more
details on why this approach was chosen.

## Alternatives

The main alternative we considered was using a simple LRU cache
to store serialized objects.

Pros:
- performance gains also for reads served from etcd

Cons:
- worse encapsulation of logic and less flexibility; given that we may
want to use caching only for some resources, that information would
have to be passed to serializers or the codec
- potentially a significant point of contention
- gains from handling read requests served from etcd aren't huge,
according to our experiments
