Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get repository metadata from the cluster state doesn't throw an exception if a repo is missing #92914

Merged

Conversation

gmarouli
Copy link
Contributor

@gmarouli gmarouli commented Jan 13, 2023

Expected situation

When retrieving snapshots by giving the names of multiple repositories, if a repository is missing we would like to retrieve the snapshots of the repositories that were found and report the other repositories as failures.

The bug

The bug was witnessed during the execution of SLM policies of which one referred to a missing repository. See #92849.

How we fixed it

In this PR we propose to change the function of org.elasticsearch.action.admin.cluster.repositories.get.TransportGetRepositoriesAction#getRepositories to return the found repositories and the missing ones instead of throwing an exception when it encounters a missing repository.

Based on this change we adjusted the logic in the following places:

  • The GetRepositoriesAction now mentions in the exception all the missing repositories. For example:
{
  "error": {
    "root_cause": [
      {
        "type": "repository_missing_exception",
        "reason": "[missing_1, missing_2] missing"
      }
    ],
    "type": "repository_missing_exception",
    "reason": "[missing_1, missing_2] missing"
  },
  "status": 404
}

Instead of only one.

  • The SnapshotRetentionTask will now receive the snapshots from the repositories that can be retrieved and it will log in the debug level the missing repos.

Fixes: #92849

@gmarouli gmarouli changed the title Fix/ get snapshot fails when repo is missing Get repository metadata from the cluster state doesn't throw an exception if a repo is missing Jan 13, 2023
@gmarouli gmarouli requested a review from dakrone January 13, 2023 18:34
@gmarouli gmarouli added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.6.1 v7.17.9 labels Jan 13, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @gmarouli, I've created a changelog YAML for you.

@gmarouli gmarouli added the auto-backport Automatically create backport pull requests when merged label Jan 13, 2023
@gmarouli
Copy link
Contributor Author

gmarouli commented Jan 16, 2023

@elasticmachine run elasticsearch-ci/part-1

1 similar comment
@gmarouli
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

@gmarouli gmarouli marked this pull request as ready for review January 16, 2023 12:37
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jan 16, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone
Copy link
Member

dakrone commented Jan 18, 2023

Since this is currently mainly touching the snapshots side of things, I think someone on the distributed team should take a look (they own snapshots). I do think though that we should be defensive on the SLM side and elide the missing repos from the fetching action, we can use the same check we use in SnapshotLifecycleService.validateRepositoryExists. What do you think Mary?

@gmarouli
Copy link
Contributor Author

@elasticmachine update branch

@gmarouli
Copy link
Contributor Author

I see your point. For me it's not that clear which one is best. I think it depends how we define TransportGetRepositoriesAction#getRepositories. I see (at least) two options:

A] Redefine it to TransportGetRepositoriesAction#getRepositoriesMetadata, then it would make sense to return the metadata and the repositories it wasn't able to retrieve and let the caller decide how to act with that information. The benefits of this path, I think is that we are able to give to the caller a more complete response. For example, now we can inform the user exactly which repositories do not exist instead of reporting only the first one. As an API user I do appreciate when I get all the info so I can fix my request in one go, that's the benefit of this approach I think.

B] The TransportGetRepositoriesAction#getRepositories expects that all the repositories must exist and if any of them doesn't it throws an exception, just like it works now. Then I completely agree, we can use SnapshotLifecycleService.validateRepositoryExists to filter out the non existing paths and only fetch those. It's a simpler change with a shorter scope.

I think it depends on how we want to define the scope of TransportGetRepositoriesAction#getRepositories. @elastic/es-distributed do you think there is value in implementing A or should do I just revert TransportGetRepositoriesAction#getRepositories and filter the repos in SLM side?

@DaveCTurner
Copy link
Contributor

I think option A (as implemented here) is the right approach. When we extended the get-snapshots action to work across repositories we designed it to report failures on a repository-by-repository basis. "Does not exist" is just another per-repository failure, but it's one that was missed in the implementation because it happens much earlier in the process than the other kinds of failure.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (one nit)

@gmarouli
Copy link
Contributor Author

@elasticmachine update branch

@gmarouli
Copy link
Contributor Author

@dakrone Given that the TransportGetRepositoriesAction#getRepositories is returning a complete response with the repository metadata of the found repos and the missing repos. Do you agree that we can keep SLM as it is? I think it's not unreasonable to expect getRepositories to do the filtering of the existing and non existing repos. It feels a bit of double work to have SLM check that first and the request them.

@dakrone
Copy link
Member

dakrone commented Jan 19, 2023

Thanks for following up on this Mary, I'm happy to leave this on the snapshot side. As something so we don't hit the bug again (perhaps in a different place), could you add an integration test that has two SLM policies, where the repository for one of the policies doesn't exist, and then run retention and ensure that snapshots are deleted from the repository that does exist? I don't mind if we want to add it either here or in a subsequent PR (up to you).

@gmarouli
Copy link
Contributor Author

Theoretically, it should be an easy test to write (famous last words). If that is indeed the case I would prefer to wrap it up here ;). I will give it a try.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left one really minor comment, thanks for adding the test Mary!

Comment on lines 648 to 657
// Run retention every second
ClusterUpdateSettingsRequest req = new ClusterUpdateSettingsRequest();
req.persistentSettings(Settings.builder().put(LifecycleSettings.SLM_RETENTION_SCHEDULE, "*/1 * * * * ?"));
try (XContentBuilder builder = jsonBuilder()) {
req.toXContent(builder, ToXContent.EMPTY_PARAMS);
Request r = new Request("PUT", "/_cluster/settings");
r.setJsonEntity(Strings.toString(builder));
Response updateSettingsResp = client().performRequest(r);
assertAcked(updateSettingsResp);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than run retention every second, you could manually invoke the retention (we have the execute retention API)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was doubting about this, but in second thought it will be better and more efficient. I will change it :)

@gmarouli gmarouli merged commit 9cd0b38 into elastic:main Jan 19, 2023
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.6 Commit could not be cherrypicked due to conflicts
7.17 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 92914

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Jan 19, 2023
…tion if a repo is missing (elastic#92914)

Change the implementation of `TransportGetRepositoriesAction#getRepositories` to report the found and missing repositories instead of throwing an exception when a repository is missing. When the get-snapshots action was extended to work across repositories, it was designed to report failures on a repository-by-repository basis. A missing repository is just another per-repository failure, so we report it but it doesn't interrupt the normal flow of the method.

(cherry picked from commit 9cd0b38)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Jan 19, 2023
…tion if a repo is missing (elastic#92914)

Change the implementation of `TransportGetRepositoriesAction#getRepositories` to report the found and missing repositories instead of throwing an exception when a repository is missing. When the get-snapshots action was extended to work across repositories, it was designed to report failures on a repository-by-repository basis. A missing repository is just another per-repository failure, so we report it but it doesn't interrupt the normal flow of the method.

(cherry picked from commit 9cd0b38)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
elasticsearchmachine pushed a commit that referenced this pull request Jan 19, 2023
…tion if a repo is missing (#92914) (#93106)

Change the implementation of `TransportGetRepositoriesAction#getRepositories` to report the found and missing repositories instead of throwing an exception when a repository is missing. When the get-snapshots action was extended to work across repositories, it was designed to report failures on a repository-by-repository basis. A missing repository is just another per-repository failure, so we report it but it doesn't interrupt the normal flow of the method.

(cherry picked from commit 9cd0b38)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
@gmarouli
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.6
7.17

Questions ?

Please refer to the Backport tool documentation

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Jan 19, 2023
…tion if a repo is missing (elastic#92914)

Change the implementation of `TransportGetRepositoriesAction#getRepositories` to report the found and missing repositories instead of throwing an exception when a repository is missing. When the get-snapshots action was extended to work across repositories, it was designed to report failures on a repository-by-repository basis. A missing repository is just another per-repository failure, so we report it but it doesn't interrupt the normal flow of the method.

(cherry picked from commit 9cd0b38)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/admin/cluster/repositories/get/TransportGetRepositoriesAction.java
#	server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotRetentionTask.java
gmarouli added a commit that referenced this pull request Jan 20, 2023
…tion if a repo is missing (#92914) (#93108)

Change the implementation of `TransportGetRepositoriesAction#getRepositories` to report the found and missing repositories instead of throwing an exception when a repository is missing. When the get-snapshots action was extended to work across repositories, it was designed to report failures on a repository-by-repository basis. A missing repository is just another per-repository failure, so we report it but it doesn't interrupt the normal flow of the method.

(cherry picked from commit 9cd0b38)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/admin/cluster/repositories/get/TransportGetRepositoriesAction.java
#	server/src/main/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotRetentionTask.java
@gmarouli gmarouli deleted the fix/82849-get-snapshot-fails-when-repo-is-missing branch January 20, 2023 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v7.17.9 v8.6.1 v8.7.0
Projects
None yet
5 participants