Repository analysis timeout should apply to register operations #101182
Labels
>bug
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
DaveCTurner added the >bug and :Distributed Coordination/Snapshot/Restore labels on Oct 20, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Oct 21, 2023
Replaces the transport-level timeout with an overall timeout on the whole repository analysis task to ensure that all child tasks terminate promptly. Relates elastic#66992 Closes elastic#101182
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Oct 21, 2023
Pinging @elastic/es-distributed (Team:Distributed)
elasticsearchmachine added the Team:Distributed (Obsolete) label on Oct 21, 2023
elasticsearchmachine pushed a commit that referenced this issue on Oct 23, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Oct 23, 2023
Replaces the transport-level timeout with an overall timeout on the whole repository analysis task to ensure that all child tasks terminate promptly. Relates elastic#66992 Closes elastic#101182
elasticsearchmachine pushed a commit that referenced this issue on Oct 23, 2023
Today the `?timeout=` query parameter to the repository analysis API applies to the regular blob operations, but not to the linearizable register operations. The assumption here was that the register operations simply increment a counter once per node, which should take almost no time at all. In practice, however, we've seen a couple of S3-like repositories with incomplete or incorrect support for the multipart APIs which underpin the linearizable register implementation, giving spurious responses that cause endless retries. Specifically, the S3 list multipart uploads API returns "all in-progress uploads", but some repositories claiming to be S3-compatible incorrectly omit recently-started uploads from responses to this API.
We should apply the timeout to both kinds of operation so that these repository implementations can fail the analysis at the timeout instead of waiting forever.
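To illustrate the difference an overall deadline makes, here is a minimal Python sketch of a retry loop that gives up once a deadline passes. It is not the Elasticsearch implementation: `retry_with_deadline`, `AnalysisTimeout` and the `operation` callback are hypothetical names, and the convention that `operation` returns `None` when it needs another retry is an assumption made for this example.

```python
import time


class AnalysisTimeout(Exception):
    """Raised when the overall deadline expires before the operation succeeds."""


def retry_with_deadline(operation, timeout_s, retry_delay_s=0.01,
                        clock=time.monotonic, sleep=time.sleep):
    # Hypothetical helper: `operation` returns None to request another retry.
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        result = operation()
        if result is not None:
            return result
        # Without this check, a repository that always gives spurious
        # responses would keep the loop retrying forever.
        if clock() >= deadline:
            raise AnalysisTimeout(f"gave up after {attempts} attempts")
        sleep(retry_delay_s)
```

With a deadline in place, an operation that never succeeds fails with `AnalysisTimeout` shortly after `timeout_s` elapses, while a well-behaved operation returns as soon as it succeeds.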
Relates #101185, which adds verification for uncontended register operations. These need no retries, so that verification will allow us to distinguish this incorrect behaviour from other reasons for an analysis timeout.
Workaround
To work around this issue, implement a client-side timeout when requesting a repository analysis, using a timeout value a few seconds longer than the server-side timeout specified with the `?timeout=` query parameter. Treat the expiry of the client-side timeout as indicative of a repository incompatibility which you should work with your storage supplier to address.
Test your repository's behaviour with linearizable registers first by setting the query parameters `?blob_count=1&max_blob_size=1b`. If this analysis takes more than a few seconds to complete, it is likely that your repository behaves incorrectly in a manner that will cause Elasticsearch to retry endlessly.
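The workaround might be sketched in Python as follows, using only the standard library. The helper names `analysis_url` and `analyse_repository`, the 30-second server-side timeout and the 5-second margin are assumptions made for illustration. Authentication and TLS settings are omitted, and the socket timeout used here bounds each blocking wait rather than the request as a whole, which should be close enough when the server sends nothing until the analysis finishes.

```python
import socket
import urllib.parse
import urllib.request


def analysis_url(base_url, repository, server_timeout_s, **params):
    """Build the analysis URL, including the server-side ?timeout= value."""
    query = dict(params, timeout=f"{server_timeout_s}s")
    path = "/_snapshot/" + urllib.parse.quote(repository, safe="") + "/_analyze"
    return base_url.rstrip("/") + path + "?" + urllib.parse.urlencode(query)


def analyse_repository(base_url, repository, server_timeout_s=30, margin_s=5, **params):
    """POST the analysis request, giving up shortly after the server-side timeout."""
    request = urllib.request.Request(
        analysis_url(base_url, repository, server_timeout_s, **params),
        method="POST",
    )
    try:
        # Client-side timeout: a few seconds longer than the server-side one.
        with urllib.request.urlopen(request, timeout=server_timeout_s + margin_s) as response:
            return response.read()
    except socket.timeout as e:
        # Expiry here suggests a repository incompatibility to raise with the supplier.
        raise RuntimeError(
            "repository analysis exceeded the client-side timeout; "
            "the repository may not handle register operations correctly"
        ) from e
```

For the preliminary register check, a call such as `analyse_repository("http://localhost:9200", "my_repo", blob_count=1, max_blob_size="1b")` sends the `?blob_count=1&max_blob_size=1b` parameters described above. The address and repository name are placeholders.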