Add KEP for avoid serializing the same object multiple times #924

---
title: Less object serializations
authors:
  - "@wojtek-t"
owning-sig: sig-apimachinery
participating-sigs:
  - sig-scalability
reviewers:
  - "@jpbetz"
  - "@justinsb"
  - "@smarterclayton"
approvers:
  - "@deads2k"
  - "@lavalamp"
creation-date: 2019-03-27
last-updated: 2019-03-27
status: provisional
see-also:
  - TODO
replaces:
  - n/a
superseded-by:
  - n/a
---

# Less object serializations

## Table of Contents

* [Less object serializations](#less-object-serializations)
  * [Table of Contents](#table-of-contents)
  * [Release Signoff Checklist](#release-signoff-checklist)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Design Details](#design-details)
    * [Test Plan](#test-plan)
    * [Graduation Criteria](#graduation-criteria)
    * [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
    * [Version Skew Strategy](#version-skew-strategy)
  * [Implementation History](#implementation-history)
  * [Drawbacks](#drawbacks)
  * [Alternatives](#alternatives)

## Release Signoff Checklist

**ACTION REQUIRED:** In order to merge code into a release, there must be an issue in [kubernetes/enhancements] referencing this KEP and targeting a release milestone **before [Enhancement Freeze](https://github.com/kubernetes/sig-release/tree/master/releases) of the targeted release**.

For enhancements that make changes to code or processes/procedures in core Kubernetes, i.e. [kubernetes/kubernetes], we require the following Release Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released.

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria are in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation, e.g. additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable`, or significant changes once it is marked `implementable`, should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

Scalability and performance of kube-apiserver are crucial for the scalability
of the whole Kubernetes cluster. Given that kube-apiserver is a cpu-intensive
process, scaling a single instance of it translates to optimizing the amount
of work needed to process a request, both in cpu cycles and in allocated
memory (memory management is a significant part of the work done by
kube-apiserver).

This proposal aims to significantly reduce the amount of work spent on
serializing objects, as well as the amount of memory allocated while doing so.

## Motivation

Running different types of scalability tests and analyzing large production
clusters proves that a large number of watchers of the same set of objects
may cause significant load on kube-apiserver. An extreme example is
[#75294][], where the creation of a single large Endpoints object (almost 1MB
in size, due to the 5k pods backing it) in a 5k-node cluster can completely
overload kube-apiserver for 5 seconds.

The main reason is that for every watcher (Endpoints are watched by the
kube-proxy running on every node), kube-apiserver independently serializes
(which also requires a deep copy of) every single object sent via that watch.

While this problem is most visible for watch, the situation is the same for
regular GET/LIST operations - reading the same object N times results in
serializing it N times independently.

This proposal presents a solution for that problem.

[#75294]: https://github.com/kubernetes/kubernetes/issues/75294

### Goals

- Reduce the load on kube-apiserver and the number of memory allocations by
  avoiding serializing the same object multiple times for different watchers.

### Non-Goals

- Changing the overall architecture of the system, i.e. changing what data is
  read/watched by different components.

## Proposal

This proposal does not introduce any user-visible changes - the proposed
changes are purely implementation details of kube-apiserver.

The first observation is that a given object may be serialized to multiple
different formats, based on its:
- group and version
- subresource
- output media type

However, group, version and subresource are reflected in the `SelfLink` of the
returned object. As a result, for a given (potentially unversioned) object,
we can identify all its possible serializations by (SelfLink, media-type)
pairs, and that is what we will do below.

> **Comment:** Removing self link completely would actually probably remove several allocations from each object, which should not be discounted. I don't want to get tied into having to return self link when it is used at most in 1% of client interactions (if that).
>
> **Comment:** Yeah - I wanted to restore that discussion about removing SelfLink. But that will not happen within a quarter even if we make the decision about it quickly, due to the deprecation policy, right?

We propose to extend [WithVersionEncoder][] by adding ObjectConvertor to it:
```
type WithVersionEncoder struct {
  Version GroupVersioner
  Encoder
  ObjectConvertor
  ObjectTyper
}
```

On top of that, we propose introducing a `CustomEncoder` interface:
```
type CustomEncoder interface {
  InterceptEncode(encoder WithVersionEncoder, w io.Writer) error
}
```

With that, we will change the existing serializers (json, protobuf and
versioning) to check whether the to-be-serialized object implements that
interface, and if so, simply call its `InterceptEncode()` method instead of
using the existing logic.

[WithVersionEncoder]: https://github.com/kubernetes/kubernetes/blob/990ee3c09c0104cc1045b343040fe76082862d73/staging/src/k8s.io/apimachinery/pkg/runtime/helper.go#L215
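
To make the intended serializer change concrete, here is a minimal sketch of
what that check could look like. The `exampleSerializer` type and its
`version`, `convertor`, `typer` and `encodeObject` members are hypothetical
placeholders, not the actual json/protobuf serializer code; only the
interface check itself is what the proposal requires.
```
// Sketch only: exampleSerializer stands in for an existing serializer
// (json, protobuf or versioning); field and helper names are assumptions.
func (s *exampleSerializer) Encode(obj runtime.Object, w io.Writer) error {
  // If the object can cache/replay its own serialization, hand it the
  // full encoder context and let it drive the encoding.
  if custom, ok := obj.(CustomEncoder); ok {
    return custom.InterceptEncode(WithVersionEncoder{
      Version:         s.version,
      Encoder:         s,
      ObjectConvertor: s.convertor,
      ObjectTyper:     s.typer,
    }, w)
  }
  // Otherwise fall back to the existing serialization logic.
  return s.encodeObject(obj, w)
}
```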

With those (very local and small) changes, we will introduce an internal type
in package `cacher` implementing both the `runtime.Object` and `CustomEncoder`
interfaces. The idea behind it is that it will encapsulate the original
object and, in addition, will be able to accumulate its serialized versions.
It will look like this:
```
// TODO: Better name is welcome :)
type CachingObject struct {
  // Object is the object (potentially in the internal version)
  // for which serializations should be cached for future reuse.
  Object runtime.Object

  // FIXME: We may want to change that during performance experiments
  // e.g. to use sync.Map or slice or something different, also to
  // allow some fast-path.
  lock sync.Mutex
  Versioned map[runtime.GroupVersioner]*CachingVersionedObject
}

// TODO: Better name is welcome :)
type CachingVersionedObject struct {
  // Object is the versioned object for which serializations
  // should be cached.
  Object runtime.Object

  // FIXME: We may want to change that during performance experiments
  // e.g. to use sync.Map or slice or something different, also to
  // allow some fast-path.
  lock sync.Mutex
  serialization map[cachingKey]*serializationResult
}

type cachingKey struct {
  // encoder is a proxy for mediaType - given we don't have access to it
  // (Serializer interface doesn't expose it) and we want to minimize
  // changes to apimachinery, we take runtime.Encoder which is a singleton
  // for a given mediaType in apiserver.
  encoder runtime.Encoder
  // selfLink of the serialized object identifying endpoint
  // (e.g. group, version, subresource)
  selfLink string
}

// TODO: Better name is welcome :)
type serializationResult struct {
  once sync.Once

  // raw is the serialized object.
  raw []byte
  // err is the error from serialization.
  err error
}
```

> **Comment:** Given that (at least initially) we will use that only for watch (which doesn't support subresources), cachingKey can probably be simply replaced by encoder...
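
To illustrate how these pieces could fit together, below is a minimal sketch
of a possible `InterceptEncode` implementation on `CachingObject`. The
`getOrCreateVersioned`, `selfLink` and `getOrCreateResult` helpers are
assumptions introduced only for illustration; the point is that the first
caller needing a given (encoder, SelfLink) pair pays for conversion and
serialization, while every subsequent caller only writes the already-cached
bytes.
```
// Sketch only: helper names and error handling are illustrative.
func (o *CachingObject) InterceptEncode(e WithVersionEncoder, w io.Writer) error {
  // Convert (at most once per GroupVersioner) and remember the result.
  versioned, err := o.getOrCreateVersioned(e.Version, e.ObjectConvertor)
  if err != nil {
    return err
  }
  // One cache entry per (media type, endpoint) combination.
  key := cachingKey{encoder: e.Encoder, selfLink: versioned.selfLink()}
  result := versioned.getOrCreateResult(key)
  // Serialize once; concurrent watchers wait on the same sync.Once.
  result.once.Do(func() {
    var buf bytes.Buffer
    result.err = e.Encoder.Encode(versioned.Object, &buf)
    result.raw = buf.Bytes()
  })
  if result.err != nil {
    return result.err
  }
  _, err = w.Write(result.raw)
  return err
}
```
Using a `sync.Once` per `serializationResult` means concurrent watchers block
only on the first serialization of a given format, rather than on a global
lock.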

In the initial attempt, the watchCache, when receiving an event via watch from
etcd, will wrap it into a `CachingObject` and operate on an object of that
type from then on, as sketched below.

That means we won't get gains from avoided serializations for GET/LIST
requests served from the cache, nor for the `init events` that we process when
initializing a new watch, but that seems good enough for the initial attempt.
The obvious gain is that the memory used for caching is held only for a very
short period of time (while delivering the event to watchers) and is quickly
released, which means we don't need to worry about increased memory usage.
We may want to revisit that decision later if we need more gains from avoiding
serializations and deep-copies of objects in the watchCache.
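
A minimal sketch of that wrapping step, assuming a hypothetical
`processEvent` hook (the actual function and field names in the `cacher`
package will differ):
```
// Sketch only: wrap the object coming from etcd so that all watchers
// dispatching this event share a single CachingObject, and therefore a
// single serialization per (encoder, SelfLink) pair.
func (w *watchCache) processEvent(event watch.Event) {
  event.Object = &CachingObject{
    Object:    event.Object,
    Versioned: map[runtime.GroupVersioner]*CachingVersionedObject{},
  }
  w.dispatch(event)
}
```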

Note that based on a [POC][] (slightly different than the above design),
the gains of implementing it include:
- eliminating kube-apiserver unresponsiveness in case of a write of
  a single huge Endpoints object: [#75294#comment-472728088][]
- ~7% lower cpu-usage
- XX% less memory allocations

> **Comment:** It seems to me that most of these gains are from preventing multiple serializations for separate watchers of the same resource, right? What if we constrained the serialization caching just to those goroutines, so that the cached serialization can be dumped as soon as the object has been sent to the last watcher? I feel like that makes more local changes (complexity is contained within the watch cache code, the rest of the stack never runs the risk of getting a CachingObject, etc), doesn't suffer from excessive memory usage, and captures most of the benefit.
>
> **Comment:** This is an interesting observation. There is also an aspect of restarting watches after timeout, where we potentially send a bunch of "older" cache events. But once we have watch bookmarks, that seems at least like something worth considering. We probably would need to gather a bit more data about this - I will see if I can find a bit of time to do that. So let's for now assume that we can get 90%+ savings from it (which sounds like a reasonable assumption to me).
>
> **Comment:** That sounds reasonable. I was thinking that this way the CachingObject can be private to the watch cache (or in an internal/ package). But, on further thought, maybe that was true anyway? If so I agree this doesn't help with complexity too much. I do think it is probably good to either always do it this way, or be able to switch (ideally automatically) to doing it this way for really big collections, to avoid memory problems. If the complexity here can be contained in the watch cache (sub-) packages and doesn't need to e.g. change anything in the runtime package, I'm much happier overall.
>
> **Comment:** I agree. In the current proposal it's not - it requires some (small but still) changes in serializers.
>
> **Comment:** What I started thinking about (though I'm wondering if it's not overkill) is that we may: (...) I know it's a lot of hand-waving, but maybe it's something worth exploring.
>
> **Comment:** The current proposal doesn't require any changes (neither mechanical nor non-mechanical) for any usage of existing libraries - only minor changes to runtime/serializers. But I started wondering if what I described above actually won't be less complicated (or at least, the complication will be much more separated from anything else). I will try to think about it a bit more.
>
> **Comment:** Thanks, I made a couple of concrete suggestions in the PR for if you want to do it that way. I'm (predictably?) not too excited about solving this with more autogenerated code :)
>
> **Comment:** This took a long time, but I finally got back to it. I have created a new version of the POC: https://github.com/kubernetes/kubernetes/pull/76942/files that should be addressing those comments.
>
> **Comment:** Thanks, I spent a while reading that code. It's definitely a lot more palatable than the first take. I'll leave some comments/questions there. Have you considered trying to cache the serialized form of the entire watch event, maybe even post-compression (#1115)?
>
> **Comment:** Not sure I understand this - this is what I'm doing, right? Or maybe what you're saying is that I'm just caching the "runtime.Object" instead of the "watch.Event" (the difference is wrapping it into another object and adding a type field - so I don't think it will change much). Re compression - I didn't try it. But IIUC what we're now proceeding with is that we will only be compressing large enough list requests and not do that for watch.

[POC]: https://github.com/kubernetes/kubernetes/pull/60067
[#75294#comment-472728088]: https://github.com/kubernetes/kubernetes/issues/75294#issuecomment-472728088

### Risks and Mitigations

Given that this proposal doesn't introduce any user-visible change, the only
risks are bugs in the implementation. Fortunately, the serialization code is
widely exercised by all end-to-end tests, so any bugs should be caught by
those or by unit tests of the newly added logic.

> **Comment:** Sorry, I need to disagree here. I think that our experience with the watch cache is that subtle bugs can exist for a long time across multiple releases in spite of usage in tests and by end users.
>
> **Comment:** I have to agree with David here :/
>
> **Comment:** Let's leave that discussion for a bit later...

## Design Details

### Test Plan

* Unit tests covering all corner cases.
* Regular e2e tests are passing.

### Graduation Criteria

* All existing e2e tests are passing.
* Scalability tests confirm the gains from this change.

Given it is purely an implementation detail, we are not going to expose it
to users via a flag or feature gate (though the implementation will be done
behind some kind of hard-coded guard).

### Upgrade / Downgrade Strategy

Not applicable.

### Version Skew Strategy

Not applicable.

## Implementation History

- 2019-03-27: Created

## Alternatives

### Bake caching objects into apimachinery
We considered making the objects described above part of apimachinery.

Pros:
- Exposes the ability for others to use them

Cons:
- Complicated code ends up in apimachinery

### LRU cache
We considered using a simple LRU cache to store serialized objects.

Pros:
- performance gains also for reads served from etcd (though these
  don't seem to be huge based on experiments)

Cons:
- potentially a significant point of contention
- no control over what is still cached (e.g. for frequently changing
  resources, we would keep them in the cache even if they will never be
  served again)

> **Comment:** Really? Isn't there a significant time/space locality to serializations? Couldn't we be exploiting this affinity between multiple watches and the serialization of an object better?
>
> **Comment:** I did a quick and super dirty poc for that a year ago, and the contention was quite huge actually...
>
> **Comment:** I can imagine an implementation with this problem but it doesn't seem like we'd have to do it that badly - the LRU I suggested above doesn't have this problem.

### Smart objects
We also considered using `smart objects` - objects that carry their
serialized format from etcd with them.

Pros:
- very clear encapsulation

> **Comment:** Also helps the direct read path.

Cons:
- We need the ability to add fields (i.e. SelfLink) in-place to the
  serialized object - very tricky and error-prone
- This doesn't work across different (group, version) pairs. As
  an example, if at some point we migrate the `Endpoints` object
  to a new API, this will stop working for the whole migration
  period (i.e. at least one release).

> **Comment:** Do we have evidence that we have multiple watchers at different versions? I assume 99% of watchers are from the same api version, so even a very dumb smart object (last serialization to a version is preserved) could keep that locality.
>
> **Comment:** That depends a lot on how you do it, I think. If we e.g. annotate the fields we're interested in, I bet we can make the proto encoding library emit functions which safely do these things. @jennybuckley and @apelisse have just obtained some relevant experience.
>
> **Comment:** Again, only if we do it wrong. We could e.g. convert, reserialize, and cache the reserialized forms.

> **Comment:** Protobuf encoding should only result in 1 or 2 memory allocations. Removing that is unlikely to show any significant benefit. Note that the protobuf encoding changes proposed by @apelisse should give a significant speed-up in protobuf encoding, which means that we should really consider the impact of this change AFTER we get those changes into mainline. It may improve things - or it may have much less impact now that protobuf encoding is significantly faster.
>
> **Comment:** There is still the deep-copy issue: we are doing a deep copy for every single watch event. So that helps, but there is still a lot of room for improvement.