
Stricter failure handling in TransportGetSnapshotsAction #107191

Conversation

DaveCTurner
Contributor

Today if there's a failure during a multi-repo get-snapshots request
then we record a per-repository failure but allow the rest of the
request to proceed. This is trappy for clients: they must always
remember to check the `failures` response field or else risk missing
some results. It's also a pain for the implementation, because it means
we have to collect the per-repository results separately first, before
adding them to the final result set, in case the last one triggers a
failure.

This commit drops this leniency and bubbles all failures straight up to
the top level.
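
From a client's perspective, the difference looks roughly like this minimal sketch (not part of this PR; Python with the `requests` library, hypothetical host and repository names):

import requests

resp = requests.get("http://localhost:9200/_snapshot/repo0,repo1/_all")
resp.raise_for_status()  # only catches total failures
body = resp.json()

# Before this change: a 200 OK response could still carry per-repository
# failures, so a careful client also had to check the `failures` field.
if body.get("failures"):
    raise RuntimeError(f"partial failure: {body['failures']}")

# After this change: any failure produces a non-2xx response, so the
# raise_for_status() call above is enough and `snapshots` is complete.
snapshots = body["snapshots"]
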
@elasticsearchmachine added the Team:Distributed (Obsolete) label on Apr 8, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 8, 2024

I think this is a design bug: an artefact of the original multi-repo implementation in #42090 that we've preserved through every subsequent change. AFAICT there's no good reason to be so lenient here. Moreover, today's behaviour is an obstacle to improving the performance (#95345) and memory usage (#104607) of this API in clusters with high snapshot counts.

To be precise, this changes a few cases from partial failures into total failures:

  1. the user requests snapshots in some collection of repositories which includes a specific named repository that is not registered with the cluster. Previously we'd list all the other repositories and mark the specific named repository with a RepositoryMissingException. Now the whole request fails with a RepositoryMissingException. This makes sense to me: the user should use the get-repositories API to determine which repositories are registered with the cluster rather than using this API.

  2. one of the repositories we're listing is so broken that we cannot read its RepositoryData. Previously we'd list all the other repositories and skip the broken one. Now the whole request fails. Again, this makes sense to me: the user can exclude the broken repository from the request if desired.

  3. one of the target snapshots has an unreadable SnapshotInfo blob. Previously, with ?ignore_unavailable=false, we'd skip the whole repository and return incomplete results, whereas now we fail the whole request with a SnapshotMissingException. With ?ignore_unavailable=true we skip the missing snapshot as before (see the sketch after this list).
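
A hypothetical sketch of case 3 (Python with the `requests` library; host, repository and snapshot names are invented):

import requests

BASE = "http://localhost:9200"

# With ?ignore_unavailable=true the snapshot with the unreadable
# SnapshotInfo blob is skipped, both before and after this change.
lenient = requests.get(f"{BASE}/_snapshot/repo0,repo1/*",
                       params={"ignore_unavailable": "true"})
assert lenient.status_code == 200

# With ?ignore_unavailable=false the whole request now fails with a
# snapshot_missing_exception instead of returning incomplete results.
strict = requests.get(f"{BASE}/_snapshot/repo0,repo1/*",
                      params={"ignore_unavailable": "false"})
if strict.status_code != 200:
    print(strict.json()["error"]["type"])  # snapshot_missing_exception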

I'm marking this as >breaking since it changes the behaviour of certain failure cases, even though I think we should go ahead with it. I'm also marking it as team-discuss so we remember to talk about it. If the team is ok with the idea then I'll formally propose it as a breaking change.

@DaveCTurner
Contributor Author

We (the @elastic/es-distributed coordination subteam) discussed this today and broadly agreed with this direction, pending confirmation from Kibana and Cloud callers that this won't cause them any problems, and then agreement with the breaking changes committee that this change is acceptable.

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

> Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Bad bot 🤖

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 22, 2024

To clarify, today a partial failure (with ?ignore_unavailable=false) when retrieving snapshots from multiple repositories looks like this:

200 OK
{
  "snapshots" : [
    {
      "snapshot" : "cfmqmfrheg",
      "uuid" : "Wb7b0_zVRH207wPtuMOy-w",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.205Z",
      "start_time_in_millis" : 1713779076205,
      "end_time" : "2024-04-22T09:44:36.256Z",
      "end_time_in_millis" : 1713779076256,
      "duration" : "51ms",
      "duration_in_millis" : 51,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "lhbqjriekp",
      "uuid" : "R5_AuiaHSEi4J0X2RvTBSw",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.346Z",
      "start_time_in_millis" : 1713779076346,
      "end_time" : "2024-04-22T09:44:36.370Z",
      "end_time_in_millis" : 1713779076370,
      "duration" : "24ms",
      "duration_in_millis" : 24,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "kqxbwkttzr",
      "uuid" : "0xa8mF3tQDaBe77-56jvqg",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.462Z",
      "start_time_in_millis" : 1713779076462,
      "end_time" : "2024-04-22T09:44:36.495Z",
      "end_time_in_millis" : 1713779076495,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "jfljnrqutw",
      "uuid" : "G7zg1e6JQDKRTib6wO4erQ",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.674Z",
      "start_time_in_millis" : 1713779076674,
      "end_time" : "2024-04-22T09:44:36.707Z",
      "end_time_in_millis" : 1713779076707,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "dnmxscesew",
      "uuid" : "_ZIcQBRaSMinWxa0pX8llA",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.791Z",
      "start_time_in_millis" : 1713779076791,
      "end_time" : "2024-04-22T09:44:36.807Z",
      "end_time_in_millis" : 1713779076807,
      "duration" : "16ms",
      "duration_in_millis" : 16,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    }
  ],
  "failures" : {
    "repo3" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo3:jfljnrqutw] is missing"
    },
    "repo4" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo4:lhbqjriekp] is missing"
    }
  },
  "total" : 5,
  "remaining" : 0
}

Note that we get no results from repo3 or repo4 in this case. With this change, those snapshot_missing_exception exceptions will be pulled up to the top level:

404 Not Found
{
  "error" : {
    "root_cause" : [
      {
        "type" : "snapshot_missing_exception",
        "reason" : "[repo3:jfljnrqutw] is missing"
      }
    ],
    "type" : "snapshot_missing_exception",
    "reason" : "[repo3:jfljnrqutw] is missing",
    "caused_by" : {
      "type" : "some_inner_exception",
      "reason" : "File [foo/bar/baz.dat] not found"
    }
  },
  "status" : 404
}

Note that this is already the behaviour if the request targets a single repository; it's only the multi-repo case that supports partial failures.
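
If a caller still wants best-effort, per-repository results after this change, it can send one request per repository and aggregate the successes itself. A minimal sketch (Python with the `requests` library; host is hypothetical, repository names are just the ones from the example above):

import requests

BASE = "http://localhost:9200"
repos = ["repo0", "repo1", "repo3", "repo4"]

snapshots, failures = [], {}
for repo in repos:
    r = requests.get(f"{BASE}/_snapshot/{repo}/*",
                     params={"ignore_unavailable": "false"})
    if r.status_code == 200:
        snapshots.extend(r.json()["snapshots"])
    else:
        # Failures now surface at the top level, so the client records
        # them itself instead of reading the old `failures` field.
        failures[repo] = r.json()["error"]["type"]
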

@yuliacech
Contributor

Thanks a lot for the example, @DaveCTurner! I had a look at the Kibana code that retrieves the snapshots list for the UI, and it uses the parameter ignore_unavailable: true. Does this mean the response won't fail if some repos are missing?

@DaveCTurner
Contributor Author

Ah, that sounds promising, thanks @yuliacech. How does Kibana determine which repositories to request? Does it make a list of exact repository names, or does it use _all or *?

@yuliacech
Contributor

For the UI we first get the list of existing repos with a Get repositories request using _all, and then the user can select some repositories; for those we get the snapshots (see the filter button in the screenshot below). If the user doesn't select a specific repo, we use _all for the repository name on the Get snapshots request.
[Screenshot from 2024-04-26: Kibana snapshots UI showing the repository filter button]
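
Roughly, that flow looks like the following sketch (not the actual Kibana code; host is hypothetical):

import requests

BASE = "http://localhost:9200"

# Step 1: list the registered repositories.
repo_names = list(requests.get(f"{BASE}/_snapshot/_all").json().keys())

# Step 2: fetch snapshots for the user's selection, or for _all if the
# user hasn't filtered by repository.
selected = []  # e.g. names chosen via the filter button
target = ",".join(selected) if selected else "_all"
resp = requests.get(f"{BASE}/_snapshot/{target}/_all",
                    params={"ignore_unavailable": "true"})
snapshots = resp.json()["snapshots"]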

@DaveCTurner
Contributor Author

Thanks, that means there would be a very slight change in behaviour here: if a repository is removed from the cluster in between listing the repos and the user selecting the removed repo (plus at least one other repo), then with this PR we'd return a RepositoryMissingException, whereas previously they'd get a list of all the snapshots in the other repositories. IMO that's preferable, since it directs the user back to select a different collection of repositories, but we could also go back to something closer to the old behaviour without losing most of the benefits of this PR. WDYT?

@yuliacech
Contributor

I think the UI should be able to handle the RepositoryMissingException by default, and this is probably very unlikely to happen anyway, since the list of repos is re-fetched from ES on every page load. So I don't think we need to keep the old behaviour in this case.

@yuliacech
Contributor

So does that mean the Get snapshots request won't have a failures property at all anymore, or are there still other cases where this property is used?

@DaveCTurner
Contributor Author

That's correct: you'll only see the failures field in a mixed-cluster situation, and we'll drop it entirely in v9.
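
For completeness, a client that talks to mixed clusters during the transition could keep a cheap guard like the following sketch (the helper name is invented):

def failures_or_empty(body: dict) -> dict:
    # The field can only be populated in a mixed-cluster situation; from 8.15
    # onwards it is otherwise empty, and it is removed entirely in v9.
    return body.get("failures") or {}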

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

@DaveCTurner DaveCTurner merged commit c5e8173 into elastic:main Jul 3, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/04/08/TransportGetSnapshotsAction-no-per-repo-failures branch July 3, 2024 13:14
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
DaveCTurner added a commit that referenced this pull request Jul 18, 2024
With #107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
ioanatia pushed a commit to ioanatia/elasticsearch that referenced this pull request Jul 22, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Jul 23, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
arteam added a commit to arteam/elasticsearch that referenced this pull request Oct 10, 2024
Failure handling for snapshots was made stricter in elastic#107191 (8.15),
so this field has been empty ever since. Clients no longer need to check it
for failure handling, so we can remove it from API responses in 9.0
Labels: >bug, :Distributed Coordination/Snapshot/Restore, release highlight, Team:Distributed (Obsolete), v8.15.0