-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KEP for avoid serializing the same object multiple times #924
Add KEP for avoid serializing the same object multiple times #924
Conversation
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver, by not serializing the same object multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a big assumption about our code, which is that our serialization is idempotent. In particular I can imagine some paths getting e.g. a self link wrong or different. What can we do to ensure that all serialization entrypoints produce the exact same serialized bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular I can imagine some paths getting e.g. a self link wrong or different
Not to derail your general point, but do we think anyone has actually needed and used the self-link. They were broken for many release in a row a while back....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I'm open to removing it, but it would be a significant undertaking...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a big assumption about our code, which is that our serialization is idempotent.
I'm not sure I understand this question - object once serialized, can't be serialized again (it's no longer object, but []byte). Can you clarify?
Regarding selflinks - they are quite painful indeed (it's the only think we set further in the stack), so if we could get rid of those, it would make things easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm concerned that objects egressing apiserver via different routes might have been serialized differently, and that would stop being true. It sounds like a bug when I phrase it like that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subresource is the only one that comes to mind, but it's more a matter of the information only being available at output time (in transformObject)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is problem for my proposal - where serializing is done at the very end (when self-links are already set) [so I'm proposing to do serialization (or get data from cache] when regular serialization is done].
What I just realized is that actually, given that SelfLink reflects both group0-version and subresource, maybe we can actually do caching based on SelfLink, which would solve this particular issue (caching is done when all the information is already set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using selfLink as the key for different serializations of a given object sounds clever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about it, the more I'm convinced that something like that will be needed in every single approach I can imagine.
I will try to update the proposal in the next couple days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the proposal to reflect the "SelfLink"-related discussion.
|
||
We propose introducing an internal type implementing `runtime.Object` interface, | ||
that is encapsulating the original object and is additionally able to accumulate | ||
its serialized versions. It will look like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea for an alternative: modify / add to the serialization code such that
- proto objects keep the raw bytes; a metadata accessor is generated that either reads the raw proto bytes or deserializes only metadata.
- Setting self link and RV shuffles bytes around rather than decoding and re-encoding.
This would be more work but I'm willing to bet that the performance gains would be 2x as big. Additionally, this should save memory overall rather than consume additional memory, since the deserialized form never needs to be produced.
I also feel like this would be less invasive--I can probably live with the design currently written in the doc, but I'm not excited about the distributed implicit cache, it adds a lot of non-local state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea for an alternative: modify / add to the serialization code such that
- proto objects keep the raw bytes; a metadata accessor is generated that either reads the raw proto bytes or deserializes only metadata.
- Setting self link and RV shuffles bytes around rather than decoding and re-encoding.
This sounds similar to the smart objects discussion we've had in the past. I'm interested in the idea of smart objects that only serialize or deserialize on demand, but the effort is significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proto objects keep the raw bytes; a metadata accessor is generated that either reads the raw proto bytes or deserializes only metadata.
Metadata is not enough - as an example for pods, we are filtering by NodeName.
Setting self link and RV shuffles bytes around rather than decoding and re-encoding.
That's quite tricky and error prone for protobufs, I'm afraid.
One thing that we're keep forgetting are different object versions - from etcd we get object in version X, it doesn't mean that it is being served to watcher also in version X. I think making that assumption may cause us significant issues, when we e.g. will decide to migrate to new version of Endpoints object (which is under discussion).
So I'm quite reluctant to this option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point. The proto munging would only be available in one case, although it's likely to be a common case.
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver, by not serializing the same object multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojtek-t how come we made a deserialization cache, but never a serialization cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deserialization cachen was added purely in etcd v2 code in the very early days, when there is no watchcache, which meant that every most of watches (like kubelets watching their pods) was watching huge stream of objects, deserializing all of them independently and rejecting majority of events as non-interested. And it was added when the whole code was much easier - serialization cache is more tricky in my opinion.
|
||
Given this proposal doesn't introduce any user visible change, the only risk | ||
are bugs in implementation. Fortunately, the serialization code is widely | ||
used by all end-to-end tests, so any bugs should be catched by those or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I need to disagree here. I think that our experience with the watch cache is that subtle bugs can exist for a long time across multiple releases in spite of usage in tests and by end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to agree with David here :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave that discussion for a bit later...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments - I responded to most of those.
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver, by not serializing the same object multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a big assumption about our code, which is that our serialization is idempotent.
I'm not sure I understand this question - object once serialized, can't be serialized again (it's no longer object, but []byte). Can you clarify?
Regarding selflinks - they are quite painful indeed (it's the only think we set further in the stack), so if we could get rid of those, it would make things easier.
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver, by not serializing the same object multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deserialization cachen was added purely in etcd v2 code in the very early days, when there is no watchcache, which meant that every most of watches (like kubelets watching their pods) was watching huge stream of objects, deserializing all of them independently and rejecting majority of events as non-interested. And it was added when the whole code was much easier - serialization cache is more tricky in my opinion.
|
||
We propose introducing an internal type implementing `runtime.Object` interface, | ||
that is encapsulating the original object and is additionally able to accumulate | ||
its serialized versions. It will look like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proto objects keep the raw bytes; a metadata accessor is generated that either reads the raw proto bytes or deserializes only metadata.
Metadata is not enough - as an example for pods, we are filtering by NodeName.
Setting self link and RV shuffles bytes around rather than decoding and re-encoding.
That's quite tricky and error prone for protobufs, I'm afraid.
One thing that we're keep forgetting are different object versions - from etcd we get object in version X, it doesn't mean that it is being served to watcher also in version X. I think making that assumption may cause us significant issues, when we e.g. will decide to migrate to new version of Endpoints object (which is under discussion).
So I'm quite reluctant to this option.
|
||
Given this proposal doesn't introduce any user visible change, the only risk | ||
are bugs in implementation. Fortunately, the serialization code is widely | ||
used by all end-to-end tests, so any bugs should be catched by those or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave that discussion for a bit later...
36daee4
to
93eb16e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the proposal to reflect the comment threads here - PTAL
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver, by not serializing the same object multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the proposal to reflect the "SelfLink"-related discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerned at the (likely) implementation complexity. This also locks us in to the watch cache even more. I left another alternative idea in a comment.
|
||
Now the question is what is responsible for creating `CachingObject`. | ||
In the first attempty, we will put that responsibility into `watchCache`: | ||
when receiving an event via watch from etcd it will be opaquing it into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opaquing? Maybe "wrapping" is more usual in this context?
|
||
One open questions is how do we decide what exactly we want to cache. | ||
Caching everything stored in `watchCache` seems expensive from memory | ||
usage point of view. What we suggest, is allow configuring individual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of work, and at the end you leave it on for the most hungry resources anyway. Perhaps modify this to "we'll measure and see if it's a problem"?
One open questions is how do we decide what exactly we want to cache. | ||
Caching everything stored in `watchCache` seems expensive from memory | ||
usage point of view. What we suggest, is allow configuring individual | ||
`registries` with exact versions that we want to cache and propagate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea: each caching object, on construction, is registered with a per-resource LRU. The LRU deletes cached serializations when it evicts objects. Objects also remove themselves when they are dropped from the watch cache. Something like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let me clarify one thing - the "registering" part is used to ensure that we will really serialize object once, right?
The non-obvious part here is that, given our identifier is selfLink (or at cache level, probably (selfLink, RV) pair), we don't know the selfLink at the point of creation. So it seems that we would conceptually register a tree with root being RV, to which we can add selfLink-identified children, and media-type below those.
But at the high-level, I think it's basically variation about LRU option, which I think is a useful variation, though I'm not 100% sure it will actually help that much with complexity of the solution. I will think about it a bit more.
Cons: | ||
- potentially significant point of contention | ||
- no-control over what is still cached (e.g. for frequently changing | ||
resources, we still keep them in cache, even if they will never be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine an implementation with this problem but it doesn't seem like we'd have to do it that badly--the LRU I suggested above doesn't have this problem.
serialized format of object from etcd with itself. | ||
|
||
Pros: | ||
- very clear encapsulation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also helps the direct read path.
|
||
Cons: | ||
- We need an ability to in-place add fields to serialized object | ||
(i.e. SelfLink) - very tricky and error-prone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends a lot on how you do it, I think. If we e.g. annotate the fields we're interested in, I bet we can make the proto encoding library emit functions which safely do these things. @jennybuckley and @apelisse have just obtained some relevant experience.
Cons: | ||
- We need an ability to in-place add fields to serialized object | ||
(i.e. SelfLink) - very tricky and error-prone | ||
- This doesn't work across different (group, version) pairs. As |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, only if we do it wrong. We could e.g. convert, reserialize, and cache the reserialized forms.
- eliminating kube-apiserver unresponsiveness in case of write of | ||
a single huge Endpoints object: [#75294#comment-472728088][] | ||
- ~7% lower cpu-usage | ||
- XX% less memory allocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that most of these gains are from preventing multiple serializations for separate watchers of the same resource, right?
What if we constrained the serialization caching just to those goroutines, so that the cached serialization can be dumped as soon as the object has been sent to the last watcher?
I feel like that makes more local changes (complexity is contained within the watch cache code, the rest of the stack never runs the risk of getting a CachingObject, etc), doesn't suffer from excessive memory usage, and captures most of the benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that most of these gains are from preventing multiple serializations for separate watchers of the same resource, right?
This is an interesting observation. There is also an aspect of restarting watches after timeout, where we potentially send a bunch of "older" cache events. But once we have watch bookmarks, that seems at least like something worth considering. We probably would need to gather a bit more data about this - I will see if I can find a bit of time to do that.
So let's for now assume that we can get 90%+ savings from it (which sounds like a reasonable assumption to me).
What if we constrained the serialization caching just to those goroutines, so that the cached serialization can be dumped as soon as the object has been sent to the last watcher?
That sounds reasonable.
But I'm not sure I really understand how that really helps with complexity. Yes - it helps with bounding memory increase etc., but all the mechanics will still be there (we will need this object that is serialized differently, right?)
Or an I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but all the mechanics will still be there (we will need this object that is serialized differently, right?)
I was thinking that this way the CachingObject can be private to the watch cache (or in an internal/ package). But, on further thought, maybe that was true anyway? If so I agree this doesn't help with complexity too much.
I do think it is probably good to either always do it this way, or be able to switch (ideally automatically) to doing it this way for really big collections, to avoid memory problems.
If the complexity here can be contained in the watch cache (sub-) packages and doesn't need to e.g. change anything in the runtime package, I'm much happier overall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it is probably good to either always do it this way, or be able to switch (ideally automatically) to doing it this way for really big collections, to avoid memory problem
I agree.
If the complexity here can be contained in the watch cache (sub-) packages and doesn't need to e.g. change anything in the runtime package, I'm much happier overall.
In the current proposal it's not - it requires some (small but still) changes in serializers.
While I believe that it should be possible to get rid of potential changes in json/proto serializers, I don't yet know how to solve the versioning problem.
Encode we're doing in apiserver is doing "versioning+serialization", and I don't know how we can make this versioning independently from the whole apimachinery infrastructure purely in watchcache.
But if you have any suggestions, I'm more than happy to hear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I started thinking about (though I'm wondering if it's not overkill) is that we may:
- implement some autogeneration of "WrappedObject" for each object that is worth caching serialization (in every single GV)
- in cacher, just wrap an object into its wrapper
- (if needed) autogenerate custom conversions for these objects
I think that with something like that we would be able to solve the versioning. If we would solve the marshaling similarly, maybe that has some potential to work?
I know it's a lot of hand-waving, but maybe it's something worth exploring.
I can try digging deeper in the next couple weeks, if you think this option is worth exploring further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is making them more flexible/reusable, then I think I can live with that, as long as existing usage sites don't have to be changed in non-mechanical ways.
The current proposal doesn't require any changes (neither mechanical nor non-mechanical) for any usage of existing libraries - only minor changes to runtime/serializers.
FTR the POC is here: kubernetes/kubernetes#60067
But I started wondering, if what I described above actually won't be less complicated (or at least, the complication will be much more separated from anything else). I will try to think about it a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I made a couple of concrete suggestions in the PR for if you want to do it that way.
I'm (predictably?) not too excited about solving this with more autogenerated code :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took a long of time, but I finally got back to it. I have created a new version of POC: https://github.com/kubernetes/kubernetes/pull/76942/files that should be addressing those comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I spent a while reading that code. It's definitely a lot more palatable than the first take. I'll leave some comments/questions there.
Have you considered trying to cache the serialized form of the entire watch event, maybe even post-compression (#1115)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered trying to cache the serialized form of the entire watch event, maybe even post-compression (#1115)?
Not sure I understand this - this is what I'm doing right? Or maybe what you're saying is that I'm just caching the "runtime.Object" instead of "watch.Event" (the different is opaquing it into another object and adding type field - so I don't think it will change much).
Re compression - I didn't try it. But IIUC what we're now proceeding with is that we will be only compressing large enough list requests and not do that for watch.
93eb16e
to
92f3106
Compare
92f3106
to
83fd9e3
Compare
Changed names according to suggestions from review of POC. |
|
||
### Goals | ||
|
||
- Reduce load on kube-apiserver and number of memory allocations, by avoiding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protobuf encoding should only result in 1 or 2 memory allocations. Removing that is unlikely to show any significant benefit.
Note that the protobuf encoding changes proposed by @apelisse should have a significant speed up in protobuf encoding, which means that we should really consider the impact of this change AFTER we get those changes in mainline. It may improve things - or it may have much less impact now that protobuf encoding is significantly faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still a deep-copy issue that we're doing for every single watch event. So that helps, but there is still a lot of place for improvement.
- group and version | ||
- subresource | ||
- output media type | ||
However, group, version and subresource are reflected in the `SelfLink` of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing self link completely would actually probably remove several allocations from each object, which should not be discounted. I don't want to get tied into having to return self link when it is used at most in 1% of client interactions (if that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I wanted to restore that discussion about removing SelfLink. But that will not happen within a quarter even if we make the decision about it quickly due to deprecation policy, right?
doesn't seem to be huge based on experiments) | ||
|
||
Cons: | ||
- potentially significant point of contention |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? Isn't there a significant time/space locality to serializations? Couldn't we be exploiting this affinity between multiple watches and the serialization of an object better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick and super dirty poc for that an year ago, and the contention was quite huge actually...
- very clear encapsulation | ||
|
||
Cons: | ||
- We need an ability to in-place add fields to serialized object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have evidence that we have multiple watchers at different versions? I assume 99% of watchers are from the same api version, so even a very dumb smart object (last serialization to a version is preserved) could keep that locality.
I really would want to see prototype numbers this after the protobuf marshal efficiency stuff is landed. A significant speed up there may significantly reduce the benefit of this approach. Note that we have a very high per GET / watch event fixed allocations. Right now each GET to the API server does about 120 allocations, of which at least a hundred are things besides serialization. Each watch is ~20 allocations - only 2 of which are serialization. I would wager we are in the point where we need to reduce fixed cost rather than variable cost (in encoding). |
Protobuf is only one part of this problem. Another one is the problem of deep-copying the whole object for every watch-event that is happening (because we're modifying it for conversions, selflinks, etc.)
Cost of deep-copies is exactly that cost. I wouldn't say it's fixed, but it's huge. |
// e.g. to use sync.Map or slice or something different, also to | ||
// allow some fast-path. | ||
lock sync.Mutex | ||
serialization map[cachingKey]*serializationResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that (at least initially) we will use that only for watch (which doesn't support subresources), cachingKey can probably be simply replaced by encoder...
We discussed at the last sig and no one had specific concerns, just a vague sense of uneasiness over complexity. so... /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lavalamp, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks Daniel - I will open an issue in this repo and would like to target Alpha (disabled by default) for 1.16. Will be sending some smaller updated to the KEP in the meantime. |
No description provided.