
Design for consistent API chunking in Kubernetes #896

Merged · 1 commit · Aug 29, 2017

Conversation

@smarterclayton (Contributor) commented Aug 11, 2017:

In order to reduce the memory allocation impact of very large LIST operations against the Kubernetes apiserver, it should be possible to receive a single large resource list as many individual page requests to the server.

Part of kubernetes/enhancements#365. Taken from kubernetes/kubernetes#49338. Original discussion in kubernetes/kubernetes#2349

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 11, 2017
@smarterclayton smarterclayton added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Aug 11, 2017
@smarterclayton (Contributor, Author):

@kubernetes/sig-api-machinery-api-reviews @kubernetes/sig-scalability-api-reviews


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Member:

can you clarify the interaction of the continue parameter with the other parameters (resourceVersion, labelSelector/fieldSelector) passed to list calls? Is the expectation that all parameters present on the initial request are also passed in subsequent requests with identical values? Is the server responsible for ensuring that is the case? Paged list results with resourceVersion=0 stands out as having potentially odd implications

Contributor (Author):

So etcd3 does not require that any of the parameters stay the same - it'll work regardless. I'm slightly inclined to describe changing field/label selectors as undefined behavior, and that the server may return a 400 bad request if it is not allowed. I think in general the storage structure required to do paging in a useful way to clients is going to be based on being able to range over an ordered list with snapshot isolation of versions, and if people are implementing paging with things like "holding open database cursors for 5 minutes" they're going to come to tears anyway.

Specifying resourceVersion inconsistent with continue should be a 400 bad request. Will comment.


#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
@liggitt (Member) commented Aug 11, 2017:

and the start key for the next set of results

this has the potential to return information in the continue token that the user would not have received in a filtered list API call

if we gain the ability to authorize clients to only do LIST calls filtered to specific label selectors (c.f. kubernetes/kubernetes#40403, kubernetes/kubernetes#44703), this could expose etcd key names the user doesn't actually have permission to know about, containing namespace and object name info

Contributor (Author):

Yeah. Namespace names and object names (for those reading at home). We could encrypt and sign the continuation token, but that's another value we have to distribute. It would also be vulnerable to replay attacks. A hash is out because the value has to be reversible.

Contributor (Author):

This concern could be handled per resource type with the following possibilities:

  1. disallow paging on certain resources (a resource whose name was a secret)
  2. disallow paging when performing certain filtering queries that would otherwise expose info the user could not see
  3. continue paging until at least one visible key is found, and return that as the next key to continue on (i.e. users only ever see keys they would see normally)
  4. encrypt the key name using a pre shared secret across masters

I think all of these are possible options and easy to implement. Will add this to the proposal.
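Option 3 can be sketched as follows. This is an illustrative Python sketch, not Kubernetes code; `visible` stands in for whatever authorization or filter check decides which keys a user may see. The point is that the continue token is always a key the user is allowed to observe.

```python
# Hypothetical sketch of option 3: keep scanning until a visible key is found,
# so the continue token never leaks a key the user cannot see.

def page_visible(keys, visible, start, limit):
    """Return (page, continue_key) over sorted `keys`, exposing only keys for
    which visible(key) is True. `start` is the inclusive start key; the
    returned continue_key is the next *visible* key, or None if exhausted."""
    page = []
    idx = 0
    # advance to the start key
    while idx < len(keys) and keys[idx] < start:
        idx += 1
    while idx < len(keys):
        key = keys[idx]
        if visible(key):
            if len(page) == limit:
                # the next visible key becomes the continue token
                return page, key
            page.append(key)
        idx += 1
    return page, None
```

Note that if a long run of keys is filtered out, this scan keeps going past the limit internally, which is the progress concern raised later in the thread.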

Contributor:

continue paging until at least one visible key is found, and return that as the next key to continue on (i.e. users only ever see keys they would see normally)

This seems like the most reasonable.

Comment:

I'm wondering whether setting "continue" to the "next key" is the best approach. Not only because of a possible security implication, but if we get to the point where we're paging over data that is not immediately available to us right away, and the server wants to return "what it has as of now", then it may not know the "next key". I think it would be better to return some "token" that's really only known to the paging implementation of that resource so it can figure out how best to represent "where to continue".

Comment:

we could set it as the last key, then increase it for the next range call.

Contributor (Author):

There's no guarantee the user should be able to see that last key, and if large chunks of entries are being filtered out there's no longer a guarantee that the paging can make progress.

where we're paging over data that is not immediately available to us right away

That would explicitly not be supported - this is about returning the consistent read in multiple steps, so there is no "data we don't know about yet".


#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
Member:

limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client)

When filtering (via labelSelector/fieldSelector) is done after obtaining the results, there is no guarantee the number of items matching the filter will equal the number retrieved from etcd. the number of matching results returned to the client may be much smaller (or in pathological cases, even 0). Would you expect the server to continue paging internally until it had obtained enough filtered results to satisfy the client's requested limit?

Contributor (Author):

I think for a first approach no, but leave it up to the impl to be improved.

Member:

that's fine, as long as the documentation is clear that when a continue token is returned, it must be used to ensure obtaining a complete list (even if fewer results were requested, even possibly 0)

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
* The data from etcd has to be transferred to the apiserver in one large chunk
* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
* Much of the decoded etcd memory is copied into the struct used to serialize to the client
* A client like `kubectl get` will then decode the response from JSON or protobuf
Contributor:

"etcd client", "kube client" would make it easier to read. Many clients involved here.

* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
* Much of the decoded etcd memory is copied into the struct used to serialize to the client
* A client like `kubectl get` will then decode the response from JSON or protobuf
* A client with a slow connection may not be able to receive the entire response body within the default 60s timeout
Contributor:

and those connections taking >60s pile up due to retries and non-adaptive back pressure, easily leading to congestion where the apiserver exceeds max-inflight all the time.


The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging. By efficiently splitting large list ranges from clients to etcd into many smaller list ranges, we can reduce the maximum memory allocation on etcd and the apiserver.

Our primary consistent store etcd3 offers support for efficient paging with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of paging.
Contributor:

missing the word "consistent" before "paging mechanism", i.e. a transactional view


## Proposed change:

Expose a simple paging mechanism to allow large API responses to be broken into consistent partial responses. Clients would indicate a tolerance for paging (opt-in) by specifying a desired maximum number of results to return in a `LIST` call. The server would return up to that amount of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (may not implement `limit` without `continue`).
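The opt-in flow described above can be sketched as a small pseudo-client. This is an illustrative Python sketch, not real client code: `list_chunk` stands in for a call such as `GET /api/v1/pods?limit=N&continue=TOKEN`, and the integer-based token here is purely for simulation; real continue tokens are opaque.

```python
# Simulated server: honors `limit` and returns an opaque continue token
# (here just the next index) while more results remain.
def make_server(items):
    def list_chunk(limit=None, continue_=None):
        start = int(continue_) if continue_ else 0
        end = len(items) if limit is None else min(start + limit, len(items))
        token = str(end) if end < len(items) else None
        return items[start:end], token
    return list_chunk

def list_all(list_chunk, limit):
    """Client loop: pass the returned continue token back until it is absent."""
    results, token = [], None
    while True:
        chunk, token = list_chunk(limit=limit, continue_=token)
        results.extend(chunk)
        if token is None:  # no continue token => the list is complete
            return results
```

Because termination depends only on the absence of a continue token, this loop is also correct against an old server that ignores `limit` and returns everything in one response, which is the backward-compatibility property the proposal relies on.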
Contributor:

is there a life-time of such a continue parameter? Can I come back tomorrow and continue a paged list from today?

Comment:

If the client asked for just 5 because it can't handle more, I'm not sure the server should be allowed to ignore the request and flood them with a massive response. I'd prefer an error that tells the client that the server doesn't support paging and then let the client decide what to do next.

@smarterclayton (Contributor, Author) commented Aug 16, 2017:

is there a life-time of such a continue parameter?

It's up to the backing store to have the historical data available. So for our default config is 5m for etcd3.

If the client asked for just 5 because it can't handle more, I'm not sure the server should be allowed to ignore the request and flood them with a massive response. I'd prefer an error that tells the client that the server doesn't support paging and then let the client decide what to do next.

That is not backwards compatible - old servers wouldn't return an error. We generally err on the side of not introducing new incompatibilities. Paging is a way to get fewer results per chunk, not a way to slowly browse through the data store (slow being on the timescale of a human). So naive web clients that want to display a chunk of results at a time need something more involved than simple continuations, and that would be an additional set of parameters.

}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.
Contributor:

why is the initially requested limit important? Sounds like an implementation detail if at all.

Contributor (Author):

It's not actually important, and hadn't planned on implementing that. You still have to specify limit on each call.


The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.
Contributor:

and be valid in an HA environment on different servers.

Contributor:

and be valid in an HA environment on different servers.

Seems like this may actually be a must. Otherwise the feature isn't as useful.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

in other words: either the result is empty, we get a full bucket with continue param, or we get the remaining items without continue param. But non-empty, len != limit plus continue is forbidden.

Contributor (Author):

Writing x for the number of items returned:

Continue AND 0 < x < limit = allowed
Continue AND x == 0 = allowed
Continue AND x == limit = allowed
No continue AND x > limit = allowed (backwards compatibility)
Continue AND x > limit = not allowed
  • A non-empty continue means there are more results.
  • The server may not return more than limit if it supports limit.
  • The server may always return less than or equal to limit.
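These combinations collapse into a small predicate. Illustrative Python only: `x` is the number of items returned and `cont` is whether a continue token accompanies the response.

```python
def chunk_response_allowed(x, limit, cont):
    """Check whether a chunked LIST response is valid under the rules above."""
    if cont:
        # with a continue token, a limit-supporting server may return 0..limit items
        return 0 <= x <= limit
    # without a continue token any count is allowed: the server is returning the
    # remainder, and old servers that ignore `limit` may exceed it (backwards compat)
    return x >= 0
```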


### Specific Implementations

Etcd2 would not be supported as it has no options to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either chose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
Contributor:

s/chose/choose/

Comment:

Etcd2 -> etcd2


Etcd2 would not be supported as it has no options to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either chose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
Contributor:

missing some words about discovery of this feature above. Do we want this being part of discovery info?

Contributor (Author):

This version of paging is for processes that expected to get a full list but want it in chunks. I think the design is biased towards that being an optional implementation. A future improvement to paging might make it usable for end users who want to walk the tree (supporting ?fromKey=) but that is not in scope for this design.

// no more than 500 items
]
}
GET /api/v1/pods?limit=500&continue=ABC...
Contributor:

from the "token encodes limit" below I conclude that we can also continue without passing the limit value again?

Contributor (Author):

You can continue without passing limit, but the intent is that limit is applied per call. If you continue without specifying limit you'll get the rest. Limit should be able to be changed so that a client can potentially perform a slow start (reserving the right to do so). The text about token encodes limit is wrong and will be removed.


For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
Contributor:

does this mean we encode the token cryptographically (e.g. via JSON token standard) ?

Contributor:

does this mean we encode the token cryptographically (e.g. via JSON token standard) ?

I think you could just validate it.

Comment:

using something other than "the next key" would solve this issue too.
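One possible token shape that supports this kind of defensive checking is a versioned, base64-encoded JSON payload whose start key is validated before use. This is only an illustrative sketch; the proposal deliberately leaves the actual serialization to the storage layer.

```python
import base64
import json

# Hypothetical continue-token shape: {version, snapshot resourceVersion, start key}.
# The decoder rejects malformed input, unknown versions, and path traversal.

def encode_continue(resource_version, start_key):
    payload = {"v": 1, "rv": resource_version, "start": start_key}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_continue(token, key_prefix):
    try:
        payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    except Exception:
        raise ValueError("malformed continue token")
    if payload.get("v") != 1:
        raise ValueError("unrecognized continue token version")
    start = payload.get("start", "")
    # defensive checks: the start key must stay inside the resource's key range
    if ".." in start or start.startswith("/"):
        raise ValueError("invalid start key")
    return payload["rv"], key_prefix + start
```

Anchoring the decoded key under a server-chosen `key_prefix` is what prevents a crafted token from pointing the range read outside the resource being listed.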


Some clients may wish to follow a failed paged list with a full list attempt.

The 5 minute default compaction interval for etcd3 bounds how long a list can run. Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters.
Contributor:

would a non-consistent continuation make sense? Are there clients which might be happy with this? Would it lessen the burden of the apiserver to always return consistent views?

Contributor:

discussed below....

Contributor:

Shedding some thoughts here :)

Aren't most paged-list items immutable? For example, names? If the use cases are well understood, there are good alternatives to listing non-consistently.

In a resource there are frequently updated parts (status) and largely unchanged parts (spec, metadata). If they were decoupled and kept different versions, then it would be easy to list pages of names without losing consistency. Right?

}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.
Contributor:

Why would it include the initially requested limit? That doesn't sound very opaque if we're getting into the details of what's in it.

Comment:

while I agree that to the client its opaque, as part of this design I think it would be good to know how we expect kube servers expect to populate it. For example, do we expect different data in there based on the resource type?

Contributor (Author):

Doesn't need the limit, that was part of an earlier draft. Will remove.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

If the user requests 20 items and the server finds a list to return with 50 items, can the server choose to only send 10 items?

Contributor (Author):

The server may return any number of results <= limit at any time, as long as a continue token is also returned. If the continue token is not set, the server MUST return all remaining results.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

How does the server indicate that the continue token is no longer valid?

Contributor (Author):

410 Resource Expired

* Controllers with full caches
* Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
* Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
Contributor:

How would such a client indicate that?

Contributor (Author):

Will clarify; it would probably be a statusDetails sub-field. We probably want to allow opaque data to be returned on Status.

* Migration style commands
* Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the storage *may* return a `continue` value with the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
Contributor:

may, not must? Seems generous, but ok. Move this up closer to where you describe the behavior.
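The two client strategies here — restart the full list for consistency, or resume inconsistently from the continue token attached to the error — can be sketched as follows. Illustrative Python only: `ResourceExpired` and its attached resume token model the `410` behavior described above; none of these names are real client-go API.

```python
class ResourceExpired(Exception):
    """Models a 410 ResourceExpired; may carry a 'resume at latest version' token."""
    def __init__(self, continue_token=None):
        self.continue_token = continue_token

def list_all_tolerant(list_chunk, limit, tolerate_inconsistent=False):
    results, token = [], None
    while True:
        try:
            chunk, token = list_chunk(limit=limit, continue_=token)
        except ResourceExpired as e:
            if tolerate_inconsistent and e.continue_token is not None:
                # resume from the same key at the latest resource version;
                # the overall result is no longer a single consistent snapshot
                token = e.continue_token
                continue
            # consistency-requiring client: discard and restart the whole list
            results, token = [], None
            continue
        results.extend(chunk)
        if token is None:
            return results
```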

@smarterclayton (Contributor, Author):

To avoid confusion I am renaming this as "API chunking" since "API paging" is more broad and can cover use cases that we do not intend to support in the near term. Chunking and paging would reuse similar mechanisms, but chunking is more specific. All review comments addressed.


The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists.
@liggitt (Member) commented Aug 28, 2017:

if the behavior is undefined and the server may return a "bad request" error, I'd characterize this as "clients should not alter their filter parameters on subsequent chunk reads"


The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists.

If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
@liggitt (Member) commented Aug 28, 2017:

is limit/continue supported with resourceVersion=0? also, can the resourceVersion parameter be unspecified and rely on the continue token for consistency? (most list requests do not specify a resource version, and the examples above do not)

Contributor (Author):

Continue + resourceVersion=0 should probably return a bad request. The continue token should be the only field required to get consistent paging. resourceVersion != 0 should probably be accepted with continuation. resourceVersion=0 probably should as well, since chunking should be orthogonal. We might have to say that it's unsupported in the alpha release though, if it involves mucking with the watch cache.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results.
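The client-side contract in the paragraph above — drive the loop off the presence of the `continue` token, never off the number of items returned — can be sketched as follows. `list_page` is a hypothetical stand-in for a LIST request, not a real client API:

```python
def paginate(list_page, limit=500):
    """Yield every item from a chunked LIST.

    `list_page(limit, continue_token)` stands in for a LIST request
    carrying `?limit=` and `?continue=`; it returns a tuple of
    (items, continue_token). Per the rules above, the loop terminates
    only when the server omits the continue token: an empty page that
    still carries a token just means the whole chunk was filtered out
    server-side, so the client must keep going.
    """
    token = None
    while True:
        items, token = list_page(limit, token)
        for item in items:
            yield item
        if not token:  # no token => the server has returned all results
            return
```

For example, a server returning the pages `(["a", "b"], "t1")`, `([], "t2")`, `(["c"], None)` yields `a`, `b`, `c`: the empty middle page does not end the iteration.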
Member

If the entire result set is filtered, the server may return zero results with a valid continue token.

On the final set, the server may also return zero results with no continue token

@smarterclayton

@liggitt addressed your final comments. Any other comments from anyone?

@lpabon

lpabon commented Aug 29, 2017

@smarterclayton LGTM, just one quick question. If the apiserver notices it has hit OOM, or is close to it, what is the expected result at the client?

@liggitt

liggitt commented Aug 29, 2017

LGTM as a first step we can iterate on. The exposed API surface is minimal, largely opaque, and aligns well with existing chunking mechanisms in other APIs.

@liggitt

liggitt commented Aug 29, 2017

If the apiserver notices it has hit OOM, or is close to it, what is the expected result at the client?

Unchanged by this proposal. API server OOM during request handling is considered abnormal, and if it occurs, the connection to the client is terminated.

@smarterclayton

Squashing

In order to reduce the memory allocation impact of very large LIST
operations against the Kubernetes apiserver, it should be possible to
receive a single large list as many individual requests to the
server.
@liggitt

liggitt commented Aug 29, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 29, 2017
@smarterclayton smarterclayton changed the title Design for consistent API paging in Kubernetes Design for consistent API chunking in Kubernetes Aug 29, 2017
@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 1886e7a into kubernetes:master Aug 29, 2017
@lavalamp lavalamp left a comment

Sorry, late to the party. A few thoughts. No major complaints.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
Member

I understand why you have it this way, but does it prevent us from moving the filtering step into the db layer in the future? I guess not. But it is an implementation detail leaking through.

Contributor Author

No it doesn't. In fact, filtering at the DB level should be opaque and possible - the continue token is generated by storage and surfaced all the way up, and filtering at the DB would simply identify the appropriate next key.


The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.
Member

And storage backend restarts.


Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.

The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.
Member

Servers must give a clear error on expired continue tokens.

What happens if the collection is large enough that the last page has always expired by the time the client gets there?

Contributor Author

They do: 410 Resource Expired.

For that case the default pager implementation falls back to an unpaged list. On our largest pathological data sets (1.2 GB at rest, 20 GB in memory) I was able to read 100k secrets (500 MB of data) in chunks of 500 in under 4 seconds.
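The fallback described here — page until the token expires, then retry once as a single unpaged LIST — can be sketched like this. All names are illustrative stand-ins, not the actual pager implementation:

```python
class ResourceExpired(Exception):
    """Illustrative stand-in for an HTTP 410 Resource Expired response."""


def list_with_fallback(list_page, limit=500):
    """Page through a LIST; if the continue token expires mid-way,
    fall back to a single unpaged LIST, trading the memory savings
    of chunking for a complete, consistent result.

    `list_page(limit, continue_token)` is a hypothetical LIST call
    returning (items, continue_token); limit=None means no paging.
    """
    items, token = [], None
    while True:
        try:
            chunk, token = list_page(limit, token)
        except ResourceExpired:
            full, _ = list_page(None, None)  # retry without paging
            return full
        items.extend(chunk)
        if not token:  # last page reached before any expiration
            return items
```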


etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
Member

+1, we should come up with a stress test for this.


#### Possible SQL database implementation

A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
Member

Wow, do you really expect anyone to try this? Does it really need to be in this proposal?

Contributor Author

I just want to be really clear on the equivalence between an MVCC store like etcd and someone who really believes they want to have an alternate backend. At our scales it doesn't look like it's necessary anymore, but I want to be able to point back to a clear consideration.


1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
Member

Generate a random string as the continuation token; a shared map maps from that string to the actual data structure.

Contributor Author

Wouldn't work across API servers unless persisted to etcd.

1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.
Member

If you're worried only about construction and not about making it opaque, you could sign or HMAC the key to prevent users from constructing them.
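A minimal sketch of that suggestion: HMAC the token with a key shared by all API servers, so clients can neither construct nor alter one. The key and the base64 framing below are illustrative assumptions, not the actual apiserver scheme:

```python
import base64
import hashlib
import hmac

SHARED_KEY = b"rotate-me-in-production"  # hypothetical key shared by all API servers
MAC_LEN = hashlib.sha256().digest_size   # 32-byte tag appended to the payload


def sign_token(payload: bytes) -> str:
    """Append an HMAC-SHA256 tag so the continue token is tamper-evident."""
    mac = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload + mac).decode()


def verify_token(token: str) -> bytes:
    """Recover the payload, rejecting forged or altered tokens."""
    raw = base64.urlsafe_b64decode(token.encode())
    payload, mac = raw[:-MAC_LEN], raw[-MAC_LEN:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("invalid continue token")
    return payload
```

Note this prevents construction, not inspection: the payload is still readable, so making the token fully opaque would additionally require encryption.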

* Controllers with full caches
* Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
* Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
Member

Doesn't etcd's ordering potentially change with each update? I would expect lots of duplicates / omissions in this case.

Contributor Author

It does not. Etcd guarantees key order.

* Migration style commands
* Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.
Member

Are etcd responses sorted by key? For some reason I thought it was by position in the etcd db file. @jpbetz

Contributor Author

Etcd is a sorted map in lexicographic key order.


### Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.
Member

Why can't apiserver do stream processing?

Contributor Author

Depends on how you define a stream. It can. Would require a dramatic change to the API shape of LIST.

@bgrant0607

Speaking of dramatic changes to APIs, we need to address the Endpoints API eventually.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 50832, 51119, 51636, 48921, 51712)

Alpha list paging implementation

Design in kubernetes/community#896

Support `?limit=NUMBER`, `?continue=CONTINUATIONTOKEN`, and a `continue` field
on ListMeta and pass through to etcd. Perform minor validation as an example.

```
# first out of three
$ curl http://127.0.0.1:8080/api/v1/namespaces?limit=1
{
  "kind": "NamespaceList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/namespaces",
    "resourceVersion": "146",
    "next": "ZGVmYXVsdA"
  },
  "items": [
    {
      "metadata": {
        "name": "default",
        "selfLink": "/api/v1/namespaces/default",
        "uid": "f95e1390-6852-11e7-ab03-7831c1b76042",
        "resourceVersion": "4",
        "creationTimestamp": "2017-07-14T05:12:03Z"
      },
      "spec": {
        "finalizers": [
          "kubernetes"
        ]
      },
      "status": {
        "phase": "Active"
      }
    }
  ]
}
...
# last
$ curl "http://127.0.0.1:8080/api/v1/namespaces?limit=1&continue=a3ViZS1wdWJsaWM"
{
  "kind": "NamespaceList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/namespaces",
    "resourceVersion": "145"
  },
  "items": [
    {
      "metadata": {
        "name": "kube-system",
        "selfLink": "/api/v1/namespaces/kube-system",
        "uid": "f95e9484-6852-11e7-ab03-7831c1b76042",
        "resourceVersion": "5",
        "creationTimestamp": "2017-07-14T05:12:03Z"
      },
      "spec": {
        "finalizers": [
          "kubernetes"
        ]
      },
      "status": {
        "phase": "Active"
      }
    }
  ]
}
```
sttts pushed a commit to sttts/code-generator that referenced this pull request Sep 22, 2017
Kubernetes-commit: 35ffb5c6cf70974c0a571cd1ebdc72ad8d0f8332
tamalsaha pushed a commit to kmodules/shared-informer that referenced this pull request Aug 13, 2020
akhilerm pushed a commit to akhilerm/apimachinery that referenced this pull request Sep 20, 2022
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
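A sketch of what such defensive decoding could look like: a versioned, base64-encoded JSON payload that is checked before use. The exact encoding here is an assumption for illustration, not the actual apiserver token format:

```python
import base64
import json


def encode_continue(resource_version: int, start_key: str) -> str:
    """Pack the snapshot revision and next start key into an opaque token.

    The "v" field gives the token stable versioning, so old tokens can be
    recognized (and rejected or translated) across server upgrades.
    """
    body = json.dumps({"v": 1, "rv": resource_version, "start": start_key})
    return base64.urlsafe_b64encode(body.encode()).decode()


def decode_continue(token: str):
    """Defensively unpack a token: reject malformed input, unknown
    versions, and start keys that traverse out of the collection's range."""
    try:
        body = json.loads(base64.urlsafe_b64decode(token.encode()))
    except Exception:
        raise ValueError("malformed continue token")
    if body.get("v") != 1:
        raise ValueError("unrecognized continue token version")
    start = body["start"]
    if ".." in start or start.startswith("/"):
        raise ValueError("continue token start key escapes the collection")
    return body["rv"], start
```

As the review comment below notes, etcd keys are opaque blobs with no inherent path semantics; the traversal check matters only because Kubernetes itself builds path-like keys.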
Contributor

This cannot happen, as etcd has no notion of a path. A key is just a blob; it is Kubernetes that puts paths in keys.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.