Stricter failure handling in TransportGetSnapshotsAction
#107191
Today, if there's a failure during a multi-repo get-snapshots request, we record a per-repository failure but allow the rest of the request to proceed. This is trappy for clients: they must always remember to check the `failures` response field or else risk missing some results. It's also a pain for the implementation, because we have to collect each repository's results separately before adding them to the final result set, just in case the last one triggers a failure. This commit drops this leniency and bubbles all failures straight up to the top level.
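To illustrate the client-side trap described above, here is a minimal sketch of the check a caller had to remember under the lenient behaviour. The response shape is an assumption for illustration (a `snapshots` list plus a `failures` map keyed by repository name), and `snapshots_or_raise`, `PartialSnapshotFailure`, and the sample data are hypothetical names, not part of any client library.

```python
from typing import Any


class PartialSnapshotFailure(Exception):
    """Raised when a get-snapshots response reports per-repository failures."""


def snapshots_or_raise(response: dict[str, Any]) -> list[dict[str, Any]]:
    # Assumption: `failures` maps repository name -> error details.
    failures = response.get("failures") or {}
    if failures:
        # Without this check, results from the failed repositories are
        # silently missing from the list returned below.
        raise PartialSnapshotFailure(f"repositories failed: {sorted(failures)}")
    return response.get("snapshots", [])


# Hypothetical lenient response: repo_b failed, yet the request "succeeded".
lenient_response = {
    "snapshots": [{"snapshot": "snap-1", "repository": "repo_a"}],
    "failures": {"repo_b": {"reason": "illustrative error"}},
}
```

A caller that reads only `lenient_response["snapshots"]` sees one snapshot and no hint that `repo_b` was skipped. With failures bubbled up to the top level, the request itself fails and this extra check becomes unnecessary.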
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled |
I think this is a design bug, an artefact of the original multi-repo implementation in #42090 that we've preserved through subsequent changes. AFAICT there's no good reason for being so lenient here. Moreover, today's behaviour is an obstacle to improving the performance (#95345) and memory usage (#104607) of this API in clusters with high snapshot counts. To be precise, this changes a few cases from partial failures into total failures:
I'm marking this as |
We (the @elastic/es-distributed coordination subteam) discussed this today and broadly agreed with this direction, pending confirmation from Kibana and Cloud callers that this won't cause them any problems, and then agreement with the breaking changes committee that this change is acceptable.
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled |
This reverts commit 52665dc.
Bad bot 🤖
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled |
To clarify, today a partial failure (with
Note that we get no results from
Note that this is already the behaviour if the request targets a single repository; only the multi-repo case supports partial failures.
Thanks a lot for the example, @DaveCTurner! I had a look at the Kibana code where the snapshots list is being retrieved for the UI and it uses the parameter |
Ah that sounds promising thanks @yuliacech. How does Kibana determine which repositories to request? Does it make a list of exact repository names or does it use |
Thanks, that means there would be a very slight change in behaviour here: if a repository is removed from the cluster in between listing the repos and the user selecting the removed repo (plus at least one other repo) then with this PR we'd return a |
I think the UI should be able to handle the |
So does it mean that the Get snapshot request won't have |
That's correct, you'll only see the |
Hi @DaveCTurner, I've updated the changelog YAML for you. |
With elastic#107191 we can now safely accumulate results from all targeted repositories as they're built, rather than staging each repository's results in intermediate lists in case of failure.
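The difference between staging and direct accumulation can be sketched as follows. This is an illustrative model only, not the Java code in `TransportGetSnapshotsAction`; `collect_lenient`, `collect_strict`, and `demo_load` are hypothetical names, and the loader is assumed to yield snapshot names for a repository and possibly fail part-way through.

```python
def collect_lenient(repos, load):
    """Lenient sketch: stage each repository's results so that a late
    failure in that repository can discard them."""
    results, failures = [], {}
    for repo in repos:
        staged = []  # intermediate list, needed only because of leniency
        try:
            for name in load(repo):
                staged.append((repo, name))
        except Exception as exc:
            failures[repo] = exc  # record the failure and drop staged results
            continue
        results.extend(staged)
    return results, failures


def collect_strict(repos, load):
    """Strict sketch: append directly; any failure propagates to the caller."""
    results = []
    for repo in repos:
        for name in load(repo):
            results.append((repo, name))
    return results


def demo_load(repo):
    # Hypothetical loader: repo_b fails after yielding its first snapshot.
    yield f"{repo}-snap-1"
    if repo == "repo_b":
        raise RuntimeError("illustrative failure while reading repo_b")
    yield f"{repo}-snap-2"
```

In the lenient version the intermediate `staged` list exists only so that a repository's partial results can be thrown away when it fails late. Once every failure aborts the whole request, the strict version can append to the final list as it goes.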
Failure handling for snapshots was made stricter in elastic#107191 (8.15), so this field has always been empty since then. Clients no longer need to check it for failure handling, so we can remove it from API responses in 9.0.