Repository analysis timeout should apply to register operations #101182

Closed
DaveCTurner opened this issue Oct 20, 2023 · 1 comment · Fixed by #101184
Labels
`>bug`
`:Distributed Coordination/Snapshot/Restore`: Anything directly related to the `_snapshot/*` APIs
`Team:Distributed (Obsolete)`: Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@DaveCTurner
Contributor

DaveCTurner commented Oct 20, 2023

Today the `?timeout=` query parameter to the repository analysis API applies to the regular blob operations, but not to the linearizable register operations. The assumption here was that the register operations simply increment a counter once per node, which should take almost no time at all. In practice, however, we've seen a couple of S3-like repositories with incomplete or incorrect support for the multipart upload APIs which underpin the S3 linearizable register implementation, giving spurious responses that cause endless retries. Specifically, the S3 list multipart uploads API returns "all in-progress uploads", but some repositories claiming to be S3-compatible incorrectly omit recently started uploads from responses to this API.

We should apply the timeout to both kinds of operation so that these repository implementations can fail the analysis at the timeout instead of waiting forever.

Relates #101185, which adds verification for uncontended register operations. These operations need no retries, so this verification will make it possible to distinguish this incorrect behaviour from other reasons for an analysis timeout.


Workaround

To work around this issue, implement a client-side timeout when requesting a repository analysis, using a timeout value a few seconds longer than the server-side timeout specified with the ?timeout= query parameter. Treat the expiry of the client-side timeout as indicative of a repository incompatibility which you should work with your storage supplier to address.
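As a rough illustration of this workaround, the following Python sketch calls the repository analysis endpoint (`POST /_snapshot/<repository>/_analyze`) with a client-side timeout a few seconds longer than the server-side `?timeout=` value. The helper names, the 5-second margin, the exception type and the base URL are assumptions made for the example; authentication and TLS settings are omitted and would need adding for a real cluster.

```python
import json
import socket
import urllib.error
import urllib.parse
import urllib.request


class RepositoryAnalysisTimeout(Exception):
    """Hypothetical error type: the client-side timeout expired first."""


def client_timeout_seconds(server_timeout_seconds, margin_seconds=5):
    # "A few seconds longer" than the server-side timeout; 5s is an assumed margin.
    return server_timeout_seconds + margin_seconds


def analyze_url(base_url, repository, server_timeout_seconds, **params):
    # Build the analysis URL, passing the server-side limit as ?timeout=<n>s.
    query = {"timeout": f"{server_timeout_seconds}s", **params}
    repo = urllib.parse.quote(repository, safe="")
    return f"{base_url.rstrip('/')}/_snapshot/{repo}/_analyze?{urllib.parse.urlencode(query)}"


def analyze_repository(base_url, repository, server_timeout_seconds=30):
    url = analyze_url(base_url, repository, server_timeout_seconds)
    request = urllib.request.Request(url, method="POST")
    # Note: urlopen's timeout bounds each blocking socket operation, not the
    # whole exchange, so this only approximates an overall deadline.
    limit = client_timeout_seconds(server_timeout_seconds)
    try:
        with urllib.request.urlopen(request, timeout=limit) as response:
            return json.load(response)
    except urllib.error.HTTPError:
        raise  # the server did answer, so this is not a client-side timeout
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            raise RepositoryAnalysisTimeout(url) from e
        raise
    except socket.timeout as e:
        raise RepositoryAnalysisTimeout(url) from e
```

A caller would treat `RepositoryAnalysisTimeout` as the signal described above, for example by catching it around `analyze_repository("http://localhost:9200", "my_repo")`, where both arguments are placeholders.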

Test your repository's behaviour with linearizable registers first by setting the query parameters ?blob_count=1&max_blob_size=1b. If this analysis takes more than a few seconds to complete, it is likely that your repository behaves incorrectly in a manner that will cause Elasticsearch to retry endlessly.
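The register-focused check can be sketched in Python along the following lines: it builds a request with `?blob_count=1&max_blob_size=1b`, times it, and flags runs longer than a few seconds. The function names, the 5-second threshold and the 35-second client-side limit are assumptions chosen for illustration, not values taken from Elasticsearch, and authentication is again left out.

```python
import time
import urllib.parse
import urllib.request


def register_check_url(base_url, repository):
    # Minimal blob workload, so the run is dominated by the register operations.
    query = urllib.parse.urlencode({"blob_count": 1, "max_blob_size": "1b"})
    repo = urllib.parse.quote(repository, safe="")
    return f"{base_url.rstrip('/')}/_snapshot/{repo}/_analyze?{query}"


def looks_suspicious(elapsed_seconds, threshold_seconds=5.0):
    # "More than a few seconds" is interpreted here as an assumed 5s threshold.
    return elapsed_seconds > threshold_seconds


def timed_register_check(base_url, repository, client_timeout=35.0):
    # client_timeout is an assumed client-side limit; pick one slightly longer
    # than whatever server-side timeout applies to the request.
    request = urllib.request.Request(register_check_url(base_url, repository), method="POST")
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        response.read()
    return time.monotonic() - start
```

For example, `looks_suspicious(timed_register_check("http://localhost:9200", "my_repo"))` (placeholder URL and repository name) returning `True`, or the call timing out, would suggest the incorrect behaviour described above.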

DaveCTurner added the `>bug` and `:Distributed Coordination/Snapshot/Restore` labels Oct 20, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 21, 2023
Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates elastic#66992
Closes elastic#101182
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 21, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the `Team:Distributed (Obsolete)` label Oct 21, 2023
elasticsearchmachine pushed a commit that referenced this issue Oct 23, 2023
Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates #66992
Closes #101182
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 23, 2023
elasticsearchmachine pushed a commit that referenced this issue Oct 23, 2023