
Stricter failure handling in TransportGetSnapshotsAction #107191

Conversation

DaveCTurner
Contributor

Today if there's a failure during a multi-repo get-snapshots request
then we record a per-repository failure but allow the rest of the
request to proceed. This is trappy for clients: they must always
remember to check the `failures` response field or else risk missing
some results. It's also a pain for the implementation, because it means
we have to collect the per-repository results separately first, before
adding them to the final result set, in case the last one triggers a
failure.

This commit drops this leniency and bubbles all failures straight up to
the top level.
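
From a client's perspective, the difference looks roughly like this minimal sketch (not part of this PR; Python with the `requests` library, hypothetical host and repository names):

import requests

resp = requests.get("http://localhost:9200/_snapshot/repo0,repo1/_all")
resp.raise_for_status()  # only catches total failures
body = resp.json()

# Before this change: a 200 OK response could still carry per-repository
# failures, so a careful client also had to check the `failures` field.
if body.get("failures"):
    raise RuntimeError(f"partial failure: {body['failures']}")

# After this change: any failure produces a non-2xx response, so the
# raise_for_status() call above is enough and `snapshots` is complete.
snapshots = body["snapshots"]
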
@elasticsearchmachine added the Team:Distributed (Obsolete) label on Apr 8, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 8, 2024

I think this is a design bug: an artefact of the original multi-repo implementation in #42090 that we've preserved through every subsequent change. AFAICT there's no good reason to be so lenient here. Moreover, today's behaviour is an obstacle to improving the performance (#95345) and memory usage (#104607) of this API in clusters with high snapshot counts.

To be precise, this changes a few cases from partial failures into total failures:

  1. the user requests snapshots in some collection of repositories which includes a specific named repository that is not registered with the cluster. Previously we'd list all the other repositories and mark the specific named repository with a RepositoryMissingException. Now the whole request fails with a RepositoryMissingException. This makes sense to me: the user should use the get-repositories API to determine which repositories are registered with the cluster rather than using this API.

  2. one of the repositories we're listing is so broken that we cannot read its RepositoryData. Previously we'd list all the other repositories and skip the broken one. Now the whole request fails. Again, this makes sense to me: the user can exclude the broken repository from the request if desired.

  3. one of the target snapshots has an unreadable SnapshotInfo blob. Previously, with ?ignore_unavailable=false, we'd skip the whole repository and return incomplete results, whereas now we fail the whole request with a SnapshotMissingException. With ?ignore_unavailable=true we skip the missing snapshot as before (see the sketch after this list).
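
A hypothetical sketch of case 3 (Python with the `requests` library; host, repository and snapshot names are invented):

import requests

BASE = "http://localhost:9200"

# With ?ignore_unavailable=true the snapshot with the unreadable
# SnapshotInfo blob is skipped, both before and after this change.
lenient = requests.get(f"{BASE}/_snapshot/repo0,repo1/*",
                       params={"ignore_unavailable": "true"})
assert lenient.status_code == 200

# With ?ignore_unavailable=false the whole request now fails with a
# snapshot_missing_exception instead of returning incomplete results.
strict = requests.get(f"{BASE}/_snapshot/repo0,repo1/*",
                      params={"ignore_unavailable": "false"})
if strict.status_code != 200:
    print(strict.json()["error"]["type"])  # snapshot_missing_exception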

I'm marking this as >breaking since it changes the behaviour of certain failure cases, even though I think we should go ahead with it. I'm also marking it as team-discuss so we remember to talk about it. If the team is ok with the idea then I'll formally propose it as a breaking change.

@DaveCTurner
Contributor Author

We (the @elastic/es-distributed coordination subteam) discussed this today and broadly agreed with this direction, pending confirmation from Kibana and Cloud callers that this won't cause them any problems, and then agreement with the breaking changes committee that this change is acceptable.

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

> Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Bad bot 🤖

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 22, 2024

To clarify, today a partial failure (with ?ignore_unavailable=false) when retrieving snapshots from multiple repositories looks like this:

200 OK
{
  "snapshots" : [
    {
      "snapshot" : "cfmqmfrheg",
      "uuid" : "Wb7b0_zVRH207wPtuMOy-w",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.205Z",
      "start_time_in_millis" : 1713779076205,
      "end_time" : "2024-04-22T09:44:36.256Z",
      "end_time_in_millis" : 1713779076256,
      "duration" : "51ms",
      "duration_in_millis" : 51,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "lhbqjriekp",
      "uuid" : "R5_AuiaHSEi4J0X2RvTBSw",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.346Z",
      "start_time_in_millis" : 1713779076346,
      "end_time" : "2024-04-22T09:44:36.370Z",
      "end_time_in_millis" : 1713779076370,
      "duration" : "24ms",
      "duration_in_millis" : 24,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "kqxbwkttzr",
      "uuid" : "0xa8mF3tQDaBe77-56jvqg",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.462Z",
      "start_time_in_millis" : 1713779076462,
      "end_time" : "2024-04-22T09:44:36.495Z",
      "end_time_in_millis" : 1713779076495,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "jfljnrqutw",
      "uuid" : "G7zg1e6JQDKRTib6wO4erQ",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.674Z",
      "start_time_in_millis" : 1713779076674,
      "end_time" : "2024-04-22T09:44:36.707Z",
      "end_time_in_millis" : 1713779076707,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "dnmxscesew",
      "uuid" : "_ZIcQBRaSMinWxa0pX8llA",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.791Z",
      "start_time_in_millis" : 1713779076791,
      "end_time" : "2024-04-22T09:44:36.807Z",
      "end_time_in_millis" : 1713779076807,
      "duration" : "16ms",
      "duration_in_millis" : 16,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    }
  ],
  "failures" : {
    "repo3" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo3:jfljnrqutw] is missing"
    },
    "repo4" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo4:lhbqjriekp] is missing"
    }
  },
  "total" : 5,
  "remaining" : 0
}

Note that we get no results from repo3 or repo4 in this case. With this change, those snapshot_missing_exception exceptions will be pulled up to the top level:

404 Not Found
{
  "error" : {
    "root_cause" : [
      {
        "type" : "snapshot_missing_exception",
        "reason" : "[repo3:jfljnrqutw] is missing"
      }
    ],
    "type" : "snapshot_missing_exception",
    "reason" : "[repo3:jfljnrqutw] is missing",
    "caused_by" : {
      "type" : "some_inner_exception",
      "reason" : "File [foo/bar/baz.dat] not found"
    }
  },
  "status" : 404
}

Note that this is already the behaviour if the request targets a single repository; it's only the multi-repo case that supports partial failures.
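
If a caller still wants best-effort, per-repository results after this change, it can send one request per repository and aggregate the successes itself. A minimal sketch (Python with the `requests` library; host is hypothetical, repository names are just the ones from the example above):

import requests

BASE = "http://localhost:9200"
repos = ["repo0", "repo1", "repo3", "repo4"]

snapshots, failures = [], {}
for repo in repos:
    r = requests.get(f"{BASE}/_snapshot/{repo}/*",
                     params={"ignore_unavailable": "false"})
    if r.status_code == 200:
        snapshots.extend(r.json()["snapshots"])
    else:
        # Failures now surface at the top level, so the client records
        # them itself instead of reading the old `failures` field.
        failures[repo] = r.json()["error"]["type"]
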

@yuliacech
Contributor

Thanks a lot for the example, @DaveCTurner! I had a look at the Kibana code that retrieves the snapshots list for the UI, and it uses the parameter ignore_unavailable: true. Does this mean the response won't fail if some repos are missing?

@DaveCTurner
Contributor Author

Ah, that sounds promising, thanks @yuliacech. How does Kibana determine which repositories to request? Does it make a list of exact repository names, or does it use _all or *?

@yuliacech
Contributor

For the UI we first get the list of existing repos with a Get repositories request using _all, and then the user can select some repositories; for those we get the snapshots (see the filter button in the screenshot below). If the user doesn't select a specific repo, we use _all for the repository name on the Get snapshots request.
[Screenshot from 2024-04-26: Kibana snapshots UI showing the repository filter button]
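
Roughly, that flow looks like the following sketch (not the actual Kibana code; host is hypothetical):

import requests

BASE = "http://localhost:9200"

# Step 1: list the registered repositories.
repo_names = list(requests.get(f"{BASE}/_snapshot/_all").json().keys())

# Step 2: fetch snapshots for the user's selection, or for _all if the
# user hasn't filtered by repository.
selected = []  # e.g. names chosen via the filter button
target = ",".join(selected) if selected else "_all"
resp = requests.get(f"{BASE}/_snapshot/{target}/_all",
                    params={"ignore_unavailable": "true"})
snapshots = resp.json()["snapshots"]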

@DaveCTurner
Contributor Author

Thanks, that means there would be a very slight change in behaviour here: if a repository is removed from the cluster in between listing the repos and the user selecting the removed repo (plus at least one other repo), then with this PR we'd return a RepositoryMissingException, whereas previously they'd get a list of all the snapshots in the other repositories. IMO that's preferable, since it directs the user back to select a different collection of repositories, but we could also go back to something closer to the old behaviour without losing most of the benefits of this PR. WDYT?

@yuliacech
Contributor

I think the UI should be able to handle the RepositoryMissingException by default, and this is probably very unlikely to happen anyway, since the list of repos is re-fetched from ES on every page load. So I don't think we need to keep the old behaviour in this case.

@yuliacech
Contributor

So does that mean the Get snapshots request won't have a failures property at all anymore, or are there still other cases where this property is used?

@DaveCTurner
Contributor Author

That's correct: you'll only see the failures field in a mixed-cluster situation, and we'll drop it entirely in v9.
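
For completeness, a client that talks to mixed clusters during the transition could keep a cheap guard like the following sketch (the helper name is invented):

def failures_or_empty(body: dict) -> dict:
    # The field can only be populated in a mixed-cluster situation; from 8.15
    # onwards it is otherwise empty, and it is removed entirely in v9.
    return body.get("failures") or {}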

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

@DaveCTurner DaveCTurner merged commit c5e8173 into elastic:main Jul 3, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/04/08/TransportGetSnapshotsAction-no-per-repo-failures branch July 3, 2024 13:14
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 18, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
DaveCTurner added a commit that referenced this pull request Jul 18, 2024
With #107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
ioanatia pushed a commit to ioanatia/elasticsearch that referenced this pull request Jul 22, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Jul 23, 2024
With elastic#107191 we can now safely accumulate results from all targeted
repositories as they're built, rather than staging each repository's
results in intermediate lists in case of failure.
arteam added a commit to arteam/elasticsearch that referenced this pull request Oct 10, 2024
Failure handling for snapshots was made stricter in elastic#107191 (8.15),
so this field has been empty ever since. Clients no longer need to check it
for failure handling, so we can remove it from API responses in 9.0
Labels: >bug, :Distributed Coordination/Snapshot/Restore, release highlight, Team:Distributed (Obsolete), v8.15.0