
Design for consistent API chunking in Kubernetes #896

Merged · 1 commit · Aug 29, 2017

Conversation

@smarterclayton (Contributor) commented Aug 11, 2017:

In order to reduce the memory allocation impact of very large LIST operations against the Kubernetes apiserver, it should be possible to receive a single large resource list as many individual page requests to the server.

Part of kubernetes/enhancements#365. Taken from kubernetes/kubernetes#49338. Original discussion in kubernetes/kubernetes#2349

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 11, 2017
@smarterclayton smarterclayton added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Aug 11, 2017
@smarterclayton (Contributor, Author):

@kubernetes/sig-api-machinery-api-reviews @kubernetes/sig-scalability-api-reviews


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Member:

can you clarify the interaction of the continue parameter with the other parameters (resourceVersion, labelSelector/fieldSelector) passed to list calls? Is the expectation that all parameters present on the initial request are also passed in subsequent requests with identical values? Is the server responsible for ensuring that is the case? Paged list results with resourceVersion=0 stands out as having potentially odd implications

Contributor (Author):

So etcd3 does not require that any of the parameters stay the same - it'll work regardless. I'm slightly inclined to describe changing field/label selectors as undefined behavior, and that the server may return a 400 bad request if it is not allowed. I think in general the storage structure required to do paging in a useful way to clients is going to be based on being able to range over an ordered list with snapshot isolation of versions, and if people are implementing paging with things like "holding open database cursors for 5 minutes" they're going to come to tears anyway.

Specifying resourceVersion inconsistent with continue should be a 400 bad request. Will comment.


#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
@liggitt (Member) commented Aug 11, 2017:

and the start key for the next set of results

this has the potential to return information in the continue token that the user would not have received in a filtered list API call

if we gain the ability to authorize clients to only do LIST calls filtered to specific label selectors (c.f. kubernetes/kubernetes#40403, kubernetes/kubernetes#44703), this could expose etcd key names the user doesn't actually have permission to know about, containing namespace and object name info

Contributor (Author):

Yeah. Namespace names and object names (for those reading at home). We could encrypt and sign the continuation token, but that's another value we have to distribute. It would also be vulnerable to replay attacks. A hash is out because the value has to be reversible.

Contributor (Author):

This concern could be handled per resource type with the following possibilities:

  1. disallow paging on certain resources (a resource whose name was a secret)
  2. disallow paging when performing certain filtering queries that would otherwise expose info the user could not see
  3. continue paging until at least one visible key is found, and return that as the next key to continue on (i.e. users only ever see keys they would see normally)
  4. encrypt the key name using a pre shared secret across masters

I think all of these are possible options and easy to implement. Will add this to the proposal.
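Option 3 can be sketched as follows. This is an illustrative Python sketch, not Kubernetes code; `visible` stands in for whatever authorization or filter check decides which keys a user may see. The point is that the continue token is always a key the user is allowed to observe.

```python
# Hypothetical sketch of option 3: keep scanning until a visible key is found,
# so the continue token never leaks a key the user cannot see.

def page_visible(keys, visible, start, limit):
    """Return (page, continue_key) over sorted `keys`, exposing only keys for
    which visible(key) is True. `start` is the inclusive start key; the
    returned continue_key is the next *visible* key, or None if exhausted."""
    page = []
    idx = 0
    # advance to the start key
    while idx < len(keys) and keys[idx] < start:
        idx += 1
    while idx < len(keys):
        key = keys[idx]
        if visible(key):
            if len(page) == limit:
                # the next visible key becomes the continue token
                return page, key
            page.append(key)
        idx += 1
    return page, None
```

Note that if a long run of keys is filtered out, this scan keeps going past the limit internally, which is the progress concern raised later in the thread.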

Contributor:

continue paging until at least one visible key is found, and return that as the next key to continue on (i.e. users only ever see keys they would see normally)

This seems like the most reasonable.

Comment:

I'm wondering whether setting "continue" to the "next key" is the best approach. Not only because of a possible security implication, but if we get to the point where we're paging over data that is not immediately available to us right away, and the server wants to return "what it has as of now", then it may not know the "next key". I think it would be better to return some "token" that's really only known to the paging implementation of that resource so it can figure out how best to represent "where to continue".

Comment:

we could set it as the last key, then increase it for the next range call.

Contributor (Author):

There's no guarantee the user should be able to see that last key, and if large chunks of entries are being filtered out there's no longer a guarantee that the paging can make progress.

where we're paging over data that is not immediately available to us right away

That would explicitly not be supported - this is about returning the consistent read in multiple steps, so there is no "data we don't know about yet".


#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
Member:

limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client)

When filtering (via labelSelector/fieldSelector) is done after obtaining the results, there is no guarantee the number of items matching the filter will equal the number retrieved from etcd. the number of matching results returned to the client may be much smaller (or in pathological cases, even 0). Would you expect the server to continue paging internally until it had obtained enough filtered results to satisfy the client's requested limit?

Contributor (Author):

I think for a first approach no, but leave it up to the impl to be improved.

Member:

that's fine, as long as the documentation is clear that when a continue token is returned, it must be used to ensure obtaining a complete list (even if fewer results were requested, even possibly 0)

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
* The data from etcd has to be transferred to the apiserver in one large chunk
* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
* Much of the decoded etcd memory is copied into the struct used to serialize to the client
* A client like `kubectl get` will then decode the response from JSON or protobuf
Contributor:

"etcd client", "kube client" would make it easier to read. Many clients involved here.

* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
* Much of the decoded etcd memory is copied into the struct used to serialize to the client
* A client like `kubectl get` will then decode the response from JSON or protobuf
* A client with a slow connection may not be able to receive the entire response body within the default 60s timeout
Contributor:

and those connections taking >60s pile up due to retries and non-adaptive back pressure, easily leading to congestion where the apiserver exceeds max-inflight all the time.


The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging. By efficiently splitting large list ranges from clients to etcd into many smaller list ranges, we can reduce the maximum memory allocation on etcd and the apiserver.

Our primary consistent store etcd3 offers support for efficient paging with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of paging.
Contributor:

missing the word "consistent" before "paging mechanism", i.e. a transactional view


## Proposed change:

Expose a simple paging mechanism to allow large API responses to be broken into consistent partial responses. Clients would indicate a tolerance for paging (opt-in) by specifying a desired maximum number of results to return in a `LIST` call. The server would return up to that amount of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (may not implement `limit` without `continue`).
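The opt-in flow described above can be sketched as a small pseudo-client. This is an illustrative Python sketch, not real client code: `list_chunk` stands in for a call such as `GET /api/v1/pods?limit=N&continue=TOKEN`, and the integer-based token here is purely for simulation; real continue tokens are opaque.

```python
# Simulated server: honors `limit` and returns an opaque continue token
# (here just the next index) while more results remain.
def make_server(items):
    def list_chunk(limit=None, continue_=None):
        start = int(continue_) if continue_ else 0
        end = len(items) if limit is None else min(start + limit, len(items))
        token = str(end) if end < len(items) else None
        return items[start:end], token
    return list_chunk

def list_all(list_chunk, limit):
    """Client loop: pass the returned continue token back until it is absent."""
    results, token = [], None
    while True:
        chunk, token = list_chunk(limit=limit, continue_=token)
        results.extend(chunk)
        if token is None:  # no continue token => the list is complete
            return results
```

Because termination depends only on the absence of a continue token, this loop is also correct against an old server that ignores `limit` and returns everything in one response, which is the backward-compatibility property the proposal relies on.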
Contributor:

is there a life-time of such a continue parameter? Can I come back tomorrow and continue a paged list from today?

Comment:

If the client asked for just 5 because it can't handle more, I'm not sure the server should be allowed to ignore the request and flood them with a massive response. I'd prefer an error that tells the client that the server doesn't support paging and then let the client decide what to do next.

@smarterclayton (Contributor, Author) commented Aug 16, 2017:

is there a life-time of such a continue parameter?

It's up to the backing store to have the historical data available. So for our default config is 5m for etcd3.

If the client asked for just 5 because it can't handle more, I'm not sure the server should be allowed to ignore the request and flood them with a massive response. I'd prefer an error that tells the client that the server doesn't support paging and then let the client decide what to do next.

That is not backwards compatible - old servers wouldn't return an error. We generally err on the side of not introducing new incompatibilities. Paging is a way to get fewer results per chunk, not a way to slowly browse through the data store (slow being on the timescale of a human). So naive web clients that want to display a chunk of results at a time need something more involved than simple continuations, and that would be an additional set of parameters.

}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.
Contributor:

why is the initially requested limit important? Sounds like an implementation detail if at all.

Contributor (Author):

It's not actually important, and hadn't planned on implementing that. You still have to specify limit on each call.


The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.
Contributor:

and be valid in an HA environment on different servers.

Contributor:

and be valid in an HA environment on different servers.

Seems like this may actually be a must. Otherwise the feature isn't as useful.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

in other words: either the result is empty, we get a full bucket with continue param, or we get the remaining items without continue param. But non-empty, len != limit plus continue is forbidden.

Contributor (Author):

Writing x for the number of items returned:

Continue AND 0 < x < limit = allowed
Continue AND x == 0 = allowed
Continue AND x == limit = allowed
No continue AND x > limit = allowed (backwards compatibility)
Continue AND x > limit = not allowed
  • A non-empty continue means there are more results.
  • The server may not return more than limit if it supports limit.
  • The server may always return less than or equal to limit.
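These combinations collapse into a small predicate. Illustrative Python only: `x` is the number of items returned and `cont` is whether a continue token accompanies the response.

```python
def chunk_response_allowed(x, limit, cont):
    """Check whether a chunked LIST response is valid under the rules above."""
    if cont:
        # with a continue token, a limit-supporting server may return 0..limit items
        return 0 <= x <= limit
    # without a continue token any count is allowed: the server is returning the
    # remainder, and old servers that ignore `limit` may exceed it (backwards compat)
    return x >= 0
```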


### Specific Implementations

Etcd2 would not be supported as it has no options to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either chose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
Contributor:

s/chose/choose/

Comment:

Etcd2 -> etcd2


Etcd2 would not be supported as it has no options to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either chose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
Contributor:

missing some words about discovery of this feature above. Do we want this being part of discovery info?

Contributor (Author):

This version of paging is for processes that expected to get a full list but want it in chunks. I think the design is biased towards that being an optional implementation. A future improvement to paging might make it usable for end users who want to walk the tree (supporting ?fromKey=) but that is not in scope for this design.

// no more than 500 items
]
}
GET /api/v1/pods?limit=500&continue=ABC...
Contributor:

from the "token encodes limit" below I conclude that we can also continue without passing the limit value again?

Contributor (Author):

You can continue without passing limit, but the intent is that limit is applied per call. If you continue without specifying limit you'll get the rest. Limit should be able to be changed so that a client can potentially perform a slow start (reserving the right to do so). The text about token encodes limit is wrong and will be removed.


For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
Contributor:

does this mean we encode the token cryptographically (e.g. via JSON token standard) ?

Contributor:

does this mean we encode the token cryptographically (e.g. via JSON token standard) ?

I think you could just validate it.

Comment:

using something other than "the next key" would solve this issue too.
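One possible token shape that supports this kind of defensive checking is a versioned, base64-encoded JSON payload whose start key is validated before use. This is only an illustrative sketch; the proposal deliberately leaves the actual serialization to the storage layer.

```python
import base64
import json

# Hypothetical continue-token shape: {version, snapshot resourceVersion, start key}.
# The decoder rejects malformed input, unknown versions, and path traversal.

def encode_continue(resource_version, start_key):
    payload = {"v": 1, "rv": resource_version, "start": start_key}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_continue(token, key_prefix):
    try:
        payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    except Exception:
        raise ValueError("malformed continue token")
    if payload.get("v") != 1:
        raise ValueError("unrecognized continue token version")
    start = payload.get("start", "")
    # defensive checks: the start key must stay inside the resource's key range
    if ".." in start or start.startswith("/"):
        raise ValueError("invalid start key")
    return payload["rv"], key_prefix + start
```

Anchoring the decoded key under a server-chosen `key_prefix` is what prevents a crafted token from pointing the range read outside the resource being listed.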


Some clients may wish to follow a failed paged list with a full list attempt.

The 5 minute default compaction interval for etcd3 bounds how long a list can run. Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters.
Contributor:

would a non-consistent continuation make sense? Are there clients which might be happy with this? Would it lessen the burden of the apiserver to always return consistent views?

Contributor:

discussed below....

Contributor:

Shedding some thoughts here :)

Aren't most paged-list items immutable? For example, names? If the use cases are well understood, there are good alternatives to listing non-consistently.

In a resource there are frequently updated parts (status) and largely unchanged parts (spec, metadata). If they were decoupled and kept different versions, then it would be easy to list pages of names without losing consistency. Right?

}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of the initially requested limit, a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.
Contributor:

Why would it include the initially requested limit? That doesn't sound very opaque if we're getting into the details of what's in it.

Comment:

while I agree that to the client its opaque, as part of this design I think it would be good to know how we expect kube servers expect to populate it. For example, do we expect different data in there based on the resource type?

Contributor (Author):

Doesn't need the limit, that was part of an earlier draft. Will remove.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

If the user requests 20 items and the server finds a list to return with 50 items, can the server choose to only send 10 items?

Contributor (Author):

The server may return any number of results <= limit at any time, as long as a continue token is also returned. If the continue token is not set, the server MUST return all remaining results.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients *must* consider the continue token as opaque. Server implementations *should* ensure that continue tokens can persist across server restarts and across upgrades.

Servers *may* return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server *may* return zero results with a valid `continue` token.
Contributor:

How does the server indicate that the continue token is no longer valid?

Contributor (Author):

410 Resource Expired

* Controllers with full caches
* Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
* Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
Contributor:

How would such a client indicate that?

Contributor (Author):

Will clarify; it would probably be a statusDetails sub-field. We probably want to allow opaque data to be returned on Status.

* Migration style commands
* Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the storage *may* return a `continue` value with the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
Contributor:

may, not must? Seems generous, but ok. Move this up closer to where you describe the behavior.
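The two client strategies here — restart the full list for consistency, or resume inconsistently from the continue token attached to the error — can be sketched as follows. Illustrative Python only: `ResourceExpired` and its attached resume token model the `410` behavior described above; none of these names are real client-go API.

```python
class ResourceExpired(Exception):
    """Models a 410 ResourceExpired; may carry a 'resume at latest version' token."""
    def __init__(self, continue_token=None):
        self.continue_token = continue_token

def list_all_tolerant(list_chunk, limit, tolerate_inconsistent=False):
    results, token = [], None
    while True:
        try:
            chunk, token = list_chunk(limit=limit, continue_=token)
        except ResourceExpired as e:
            if tolerate_inconsistent and e.continue_token is not None:
                # resume from the same key at the latest resource version;
                # the overall result is no longer a single consistent snapshot
                token = e.continue_token
                continue
            # consistency-requiring client: discard and restart the whole list
            results, token = [], None
            continue
        results.extend(chunk)
        if token is None:
            return results
```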

@smarterclayton (Contributor, Author):

To avoid confusion I am renaming this as "API chunking" since "API paging" is more broad and can cover use cases that we do not intend to support in the near term. Chunking and paging would reuse similar mechanisms, but chunking is more specific. All review comments addressed.


The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.

The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists.
@liggitt (Member) commented Aug 28, 2017:

if the behavior is undefined and the server may return a "bad request" error, I'd characterize this as "clients should not alter their filter parameters on subsequent chunk reads"


The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists.

If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
@liggitt (Member) commented Aug 28, 2017:

is limit/continue supported with resourceVersion=0? also, can the resourceVersion parameter be unspecified and rely on the continue token for consistency? (most list requests do not specify a resource version, and the examples above do not)

Contributor (Author):

Continue + resourceVersion=0 should probably return a bad request. The continue token should be the only field required to get consistent paging. resourceVersion != 0 should probably be accepted with continuation. resourceVersion=0 probably should as well, since chunking should be orthogonal. We might have to say that it's unsupported in the alpha release though, if it involves mucking with the watch cache.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results.
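The client-side contract in the paragraph above — drive the loop off the presence of the `continue` token, never off the number of items returned — can be sketched as follows. `list_page` is a hypothetical stand-in for a LIST request, not a real client API:

```python
def paginate(list_page, limit=500):
    """Yield every item from a chunked LIST.

    `list_page(limit, continue_token)` stands in for a LIST request
    carrying `?limit=` and `?continue=`; it returns a tuple of
    (items, continue_token). Per the rules above, the loop terminates
    only when the server omits the continue token: an empty page that
    still carries a token just means the whole chunk was filtered out
    server-side, so the client must keep going.
    """
    token = None
    while True:
        items, token = list_page(limit, token)
        for item in items:
            yield item
        if not token:  # no token => the server has returned all results
            return
```

For example, a server returning the pages `(["a", "b"], "t1")`, `([], "t2")`, `(["c"], None)` yields `a`, `b`, `c`: the empty middle page does not end the iteration.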
Member

If the entire result set is filtered, the server may return zero results with a valid continue token.

On the final set, the server may also return zero results with no continue token

@smarterclayton

@liggitt addressed your final comments. Any other comments from anyone?

@lpabon

lpabon commented Aug 29, 2017

@smarterclayton LGTM, just one quick question. If the apiserver notices it has hit OOM, or is close to it, what is the expected result at the client?

@liggitt

liggitt commented Aug 29, 2017

LGTM as a first step we can iterate on. The exposed API surface is minimal, largely opaque, and aligns well with existing chunking mechanisms in other APIs.

@liggitt

liggitt commented Aug 29, 2017

If the apiserver notices it has hit OOM, or is close to it, what is the expected result at the client?

Unchanged by this proposal. API server OOM during request handling is considered abnormal, and if it occurs, the connection to the client is terminated.

@smarterclayton

Squashing

In order to reduce the memory allocation impact of very large LIST
operations against the Kubernetes apiserver, it should be possible to
receive a single large list as many individual requests to the
server.
@liggitt

liggitt commented Aug 29, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 29, 2017
@smarterclayton smarterclayton changed the title Design for consistent API paging in Kubernetes Design for consistent API chunking in Kubernetes Aug 29, 2017
@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 1886e7a into kubernetes:master Aug 29, 2017
@lavalamp lavalamp left a comment

Sorry, late to the party. A few thoughts. No major complaints.


The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.

Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
Member

I understand why you have it this way, but does it prevent us from moving the filtering step into the db layer in the future? I guess not. But it is an implementation detail leaking through.

Contributor Author

No it doesn't. In fact, filtering at the DB level should be opaque and possible - the continue token is generated by storage and surfaced all the way up, and filtering at the DB would simply identify the appropriate next key.


The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.
Member

And storage backend restarts.


Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.

The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.
Member

Servers must give a clear error on expired continue tokens.

What happens if the collection is large enough that the last page has always expired by the time the client gets there?

Contributor Author

They do: 410 Resource Expired.

For that case the default pager implementation falls back to an unpaged list. On our largest pathological data sets (1.2 GB at rest, 20 GB in memory) I was able to read 100k secrets (500 MB of data) in chunks of 500 in under 4 seconds.
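The fallback described here — page until the token expires, then retry once as a single unpaged LIST — can be sketched like this. All names are illustrative stand-ins, not the actual pager implementation:

```python
class ResourceExpired(Exception):
    """Illustrative stand-in for an HTTP 410 Resource Expired response."""


def list_with_fallback(list_page, limit=500):
    """Page through a LIST; if the continue token expires mid-way,
    fall back to a single unpaged LIST, trading the memory savings
    of chunking for a complete, consistent result.

    `list_page(limit, continue_token)` is a hypothetical LIST call
    returning (items, continue_token); limit=None means no paging.
    """
    items, token = [], None
    while True:
        try:
            chunk, token = list_page(limit, token)
        except ResourceExpired:
            full, _ = list_page(None, None)  # retry without paging
            return full
        items.extend(chunk)
        if not token:  # last page reached before any expiration
            return items
```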


etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
Member

+1, we should come up with a stress test for this.


#### Possible SQL database implementation

A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
Member

Wow, do you really expect anyone to try this? Does it really need to be in this proposal?

Contributor Author

I just want to be really clear on the equivalence between an MVCC store like etcd and someone who really believes they want to have an alternate backend. At our scales it doesn't look like it's necessary anymore, but I want to be able to point back to a clear consideration.


1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
Member

Generate a random string as the continuation token; a shared map maps from that string to the actual data structure.

Contributor Author

Wouldn't work across API servers unless persisted to etcd.

1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.
Member

If you're worried only about construction and not about making it opaque, you could sign or HMAC the key to prevent users from constructing them.
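A minimal sketch of that suggestion: HMAC the token with a key shared by all API servers, so clients can neither construct nor alter one. The key and the base64 framing below are illustrative assumptions, not the actual apiserver scheme:

```python
import base64
import hashlib
import hmac

SHARED_KEY = b"rotate-me-in-production"  # hypothetical key shared by all API servers
MAC_LEN = hashlib.sha256().digest_size   # 32-byte tag appended to the payload


def sign_token(payload: bytes) -> str:
    """Append an HMAC-SHA256 tag so the continue token is tamper-evident."""
    mac = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload + mac).decode()


def verify_token(token: str) -> bytes:
    """Recover the payload, rejecting forged or altered tokens."""
    raw = base64.urlsafe_b64decode(token.encode())
    payload, mac = raw[:-MAC_LEN], raw[-MAC_LEN:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("invalid continue token")
    return payload
```

Note this prevents construction, not inspection: the payload is still readable, so making the token fully opaque would additionally require encryption.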

* Controllers with full caches
* Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
* Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
Member

Doesn't etcd's ordering potentially change with each update? I would expect lots of duplicates / omissions in this case.

Contributor Author

It does not. Etcd guarantees key order.

* Migration style commands
* Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.
Member

Are etcd responses sorted by key? For some reason I thought it was by position in the etcd db file. @jpbetz

Contributor Author

Etcd is a sorted map in lexicographic key order.


### Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.
Member

Why can't apiserver do stream processing?

Contributor Author

Depends on how you define a stream. It can. Would require a dramatic change to the API shape of LIST.

@bgrant0607

Speaking of dramatic changes to APIs, we need to address the Endpoints API eventually.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 50832, 51119, 51636, 48921, 51712)

Alpha list paging implementation

Design in kubernetes/community#896

Support `?limit=NUMBER`, `?continue=CONTINUATIONTOKEN`, and a `continue` field
on ListMeta and pass through to etcd. Perform minor validation as an example.

```
# first out of three
$ curl http://127.0.0.1:8080/api/v1/namespaces?limit=1
{
  "kind": "NamespaceList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/namespaces",
    "resourceVersion": "146",
    "next": "ZGVmYXVsdA"
  },
  "items": [
    {
      "metadata": {
        "name": "default",
        "selfLink": "/api/v1/namespaces/default",
        "uid": "f95e1390-6852-11e7-ab03-7831c1b76042",
        "resourceVersion": "4",
        "creationTimestamp": "2017-07-14T05:12:03Z"
      },
      "spec": {
        "finalizers": [
          "kubernetes"
        ]
      },
      "status": {
        "phase": "Active"
      }
    }
  ]
}
...
# last
$ curl "http://127.0.0.1:8080/api/v1/namespaces?limit=1&continue=a3ViZS1wdWJsaWM"
{
  "kind": "NamespaceList",
  "apiVersion": "v1",
  "metadata": {
    "selfLink": "/api/v1/namespaces",
    "resourceVersion": "145"
  },
  "items": [
    {
      "metadata": {
        "name": "kube-system",
        "selfLink": "/api/v1/namespaces/kube-system",
        "uid": "f95e9484-6852-11e7-ab03-7831c1b76042",
        "resourceVersion": "5",
        "creationTimestamp": "2017-07-14T05:12:03Z"
      },
      "spec": {
        "finalizers": [
          "kubernetes"
        ]
      },
      "status": {
        "phase": "Active"
      }
    }
  ]
}
```
sttts pushed a commit to sttts/code-generator that referenced this pull request Sep 22, 2017
Kubernetes-commit: 35ffb5c6cf70974c0a571cd1ebdc72ad8d0f8332
tamalsaha pushed a commit to kmodules/shared-informer that referenced this pull request Aug 13, 2020
akhilerm pushed a commit to akhilerm/apimachinery that referenced this pull request Sep 20, 2022
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
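A sketch of what such defensive decoding could look like: a versioned, base64-encoded JSON payload that is checked before use. The exact encoding here is an assumption for illustration, not the actual apiserver token format:

```python
import base64
import json


def encode_continue(resource_version: int, start_key: str) -> str:
    """Pack the snapshot revision and next start key into an opaque token.

    The "v" field gives the token stable versioning, so old tokens can be
    recognized (and rejected or translated) across server upgrades.
    """
    body = json.dumps({"v": 1, "rv": resource_version, "start": start_key})
    return base64.urlsafe_b64encode(body.encode()).decode()


def decode_continue(token: str):
    """Defensively unpack a token: reject malformed input, unknown
    versions, and start keys that traverse out of the collection's range."""
    try:
        body = json.loads(base64.urlsafe_b64decode(token.encode()))
    except Exception:
        raise ValueError("malformed continue token")
    if body.get("v") != 1:
        raise ValueError("unrecognized continue token version")
    start = body["start"]
    if ".." in start or start.startswith("/"):
        raise ValueError("continue token start key escapes the collection")
    return body["rv"], start
```

As the review comment below notes, etcd keys are opaque blobs with no inherent path semantics; the traversal check matters only because Kubernetes itself builds path-like keys.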
Contributor

This cannot happen, as etcd has no notion of a path. A key is just a blob; it is Kubernetes that puts paths in keys.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.