Async Snapshot Repository Deletes #40144
Conversation
Pinging @elastic/es-distributed
When a snapshot deletion is running, it takes over the snapshot threadpool on the master, blocking "create snapshot" requests, which TransportCreateSnapshotAction also dispatches to the snapshot threadpool. I think we should have these requests come in on the generic threadpool and only move to the snapshot threadpool after actually checking, on the cluster state update thread, whether there are any ongoing deletes.
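As a rough illustration of that dispatch pattern, here is a minimal, self-contained sketch in plain Java (not Elasticsearch code; the two pools and the deleteInProgress flag are stand-ins for the real thread pools and the cluster-state check):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: accept the request on a "generic" pool, check whether a delete is
    // already running, and only then hand the real work to the dedicated "snapshot"
    // pool so it cannot be starved by a long-running delete.
    public class SnapshotDispatchSketch {
        private final ExecutorService genericPool = Executors.newCachedThreadPool();
        private final ExecutorService snapshotPool = Executors.newFixedThreadPool(5);
        private volatile boolean deleteInProgress = false; // stands in for the cluster-state check

        public void createSnapshot(Runnable snapshotWork, Runnable onRejected) {
            genericPool.execute(() -> {
                if (deleteInProgress) {
                    onRejected.run(); // fail fast instead of queueing behind the delete
                } else {
                    snapshotPool.execute(snapshotWork); // fork onto the snapshot pool only when safe
                }
            });
        }
    }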
Also, while looking at SnapshotsService.deleteSnapshot I've noticed that we have callers of that method that just swallow the returned exception instead of calling the listener (grep e.g. for deleteSnapshot(snapshot.getRepository(), snapshot.getSnapshotId().getName(), listener, true);). Also worth fixing in a separate PR.
        }
        final AtomicInteger outstandingIndices = new AtomicInteger(indices.size());
        for (String index : indices) {
            threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
ActionRunnable
We can further parallelize this to the shard level I think (can be a follow-up)
"but failed to clean up its index folder due to the directory not being empty.", metadata.name(), indexId), dnee); | ||
} catch (IOException ioe) { | ||
} catch (Exception e) { |
revert this change?
This was a conscious choice; I'd actually rather log all exceptions here directly, and it simplifies things since we don't need to use the ActionRunnable here if we catch everything. I adjusted the logic a little though and invoked the grouped listener in a finally block to make things a little clearer (and avoid a potential odd block from a future assertion tripping somewhere).
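Roughly the shape being described, as a self-contained sketch (plain Java with illustrative names, not the PR's exact code):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.atomic.AtomicInteger;

    // Each per-index task catches all exceptions itself and counts down in a finally
    // block, so a failing (or asserting) task can never leave the overall completion
    // callback hanging.
    final class GroupedDeleteSketch {
        static void deleteIndices(ExecutorService snapshotPool, List<String> indices, Runnable onAllDone) {
            final AtomicInteger outstandingIndices = new AtomicInteger(indices.size());
            for (String index : indices) {
                snapshotPool.execute(() -> {
                    try {
                        deleteIndexFolder(index); // may throw; we only log
                    } catch (Exception e) {
                        System.err.println("failed to delete index [" + index + "]: " + e);
                    } finally {
                        if (outstandingIndices.decrementAndGet() == 0) {
                            onAllDone.run(); // always reached, even if every task failed
                        }
                    }
                });
            }
        }

        private static void deleteIndexFolder(String index) {
            // placeholder for the actual blob deletions
        }
    }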
@ywelsch all points addressed, left ~2 questions though.
Seems that we need to switch TransportRestoreSnapshotAction to the generic thread pool as well and then, once the cluster state is checked, switch to the "snapshot" thread pool. @ywelsch do you agree?
            listener.onFailure(new RepositoryException(metadata.name(), "failed to delete snapshot [" + snapshotId + "]", ex));
            return;
        }
        deleteSnapshotBlobs(snapshot, snapshotId, repositoryData, updatedRepositoryData, listener);
As far as I can see, there are three steps: deleteSnapshotBlobs, deleteIndices, deleteUnreferencedIndices. It would be nice to see these function calls in this method. Something like this:
deleteSnapshotBlobs(...,
ActionListener.wrap(() -> deleteIndices(
ActionListener.wrap(() -> deleteUnreferencedIndices(..., listener))
)));
I'm not sure I understand your comment about "used in multiple places".
@andrershov comments addressed :)
Jenkins run elasticsearch-ci/bwc
LGTM
@andrershov thanks!
I've left some smaller questions, looking good otherwise.
    @@ -49,7 +49,7 @@ public TransportCreateSnapshotAction(TransportService transportService, ClusterS

        @Override
        protected String executor() {
    -       return ThreadPool.Names.SNAPSHOT;
    +       return ThreadPool.Names.GENERIC;
can you add a comment as to why we use generic instead of snapshot here? (same for TransportRestoreSnapshotAction). Otherwise, someone in the future might just set this back to snapshot.
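One possible wording for such a comment (illustrative only, not the text that ended up in the PR):

        @Override
        protected String executor() {
            // Use GENERIC rather than SNAPSHOT: a long-running delete can occupy the
            // SNAPSHOT pool on the master, and we only fork onto that pool after the
            // cluster state has been checked for in-flight snapshot operations.
            return ThreadPool.Names.GENERIC;
        }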
    @@ -161,6 +166,10 @@

        protected final RepositoryMetaData metadata;

    +   protected final NamedXContentRegistry namedXContentRegistry;
this field is not used anywhere AFAICS
my bad ... already killed it in a previous PR but it snuck back in here from a merge mistake :)
            blobContainer().deleteBlobsIgnoringIfNotExists(
                Arrays.asList(snapshotFormat.blobName(snapshotId.getUUID()), globalMetaDataFormat.blobName(snapshotId.getUUID())));
        } catch (IOException e) {
            logger.warn(() -> new ParameterizedMessage("[{}] Unable to delete global metadata files", snapshotId), e);
Something we should not change in this PR, but I wonder whether we should really be lenient here when it comes to IOException. What we're effectively doing here is creating the possibility of a top-level entry that references stuff that has been deleted :(
Sort of, we have https://github.com/elastic/elasticsearch/pull/40144/files/36b2e3888b18745edf4bb4d6e4af4c5b3165f174#diff-83e9ae765eb2a80bbbedd251b686cc10R440 writing the latest global meta-data before this happens, so we don't have a chain of references from the root to these blobs anymore regardless. But I agree, this should be retried (and as I linked in my update, that's incoming shortly too :))
    -   deleteIndexMetaDataBlobIgnoringErrors(snapshot, indexId);
    +   deleteIndexMetaDataBlobIgnoringErrors(snapshotId, indexId);
again, unrelated to the PR, but we could think in the future of moving this to a later phase, as deleting this here makes the rest unretriable.
Yea, that's incoming in master...original-brownbear:delete-lock-via-cs shortly :)
            snapshotId,
            ActionListener.wrap(v -> {
                try {
                    blobStore().blobContainer(basePath().add("indices")).deleteBlobsIgnoringIfNotExists(
do we have tests that check that the indices folders are cleaned up for FSRepository? Might be good to add some if we don't, so that we don't inadvertently break this.
We have a few that check that the path is gone, but I'll also open a PR that has more comprehensive tests for the state of the repository from https://github.com/elastic/elasticsearch/compare/master...original-brownbear:resilient-deletes-test?expand=1 (50%ish done) very shortly :)
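For reference, the kind of check being asked for might look roughly like this (plain java.nio sketch with hypothetical names, not one of the existing tests):

    import java.nio.file.Files;
    import java.nio.file.Path;

    // After deleting the last snapshot that references an index, the fs repository's
    // indices/<indexId> folder should no longer exist on disk.
    final class IndicesFolderCleanupCheck {
        static void assertIndexFolderRemoved(Path repoPath, String indexUuid) {
            Path indexFolder = repoPath.resolve("indices").resolve(indexUuid);
            if (Files.exists(indexFolder)) {
                throw new AssertionError("stale index folder left behind: " + indexFolder);
            }
        }
    }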
                .map(info -> info.indices().stream().map(repositoryData::resolveIndexId).collect(Collectors.toList()))
                .orElse(Collections.emptyList()),
            snapshotId,
            ActionListener.wrap(v -> {
Is this not equivalent to the simpler ActionListener.map(listener, v -> { ... })?
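For context, the suggestion is to derive the listener via ActionListener.map instead of wrapping manually; roughly (sketch with the surrounding code elided, and cleanUpIndicesFolders as a hypothetical stand-in for the body of the try block):

    // instead of ActionListener.wrap(v -> { ...; listener.onResponse(null); }, listener::onFailure)
    ActionListener<Void> mapped = ActionListener.map(listener, v -> {
        cleanUpIndicesFolders(); // hypothetical helper standing in for the try block above
        return null;             // the mapped result is forwarded to the delegate listener
    });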
@ywelsch all points addressed now I think :)
LGTM
Thanks @andrershov and @ywelsch!
* elastic/master: (36 commits)
  * Remove unneded cluster config from test (elastic#40856)
  * Make Fuzziness reject illegal values earlier (elastic#33511)
  * Remove test-only customisation from TransReplAct (elastic#40863)
  * Fix dense/sparse vector limit documentation (elastic#40852)
  * Make -try xlint warning disabled by default. (elastic#40833)
  * Async Snapshot Repository Deletes (elastic#40144)
  * Revert "Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564)"
  * Init global checkpoint after copy commit in peer recovery (elastic#40823)
  * Replace usages RandomizedTestingTask with built-in Gradle Test (elastic#40564)
  * [DOCS] Removed redundant (not quite right) information about upgrades.
  * Remove string usages of old transport settings (elastic#40818)
  * Rollup ignores time_zone on date histogram (elastic#40844)
  * HLRC: fix uri encode bug when url path starts with '/' (elastic#34436)
  * Mutes GatewayIndexStateIT.testRecoverBrokenIndexMetadata
  * Docs: Pin two IDs in the rest client (elastic#40785)
  * Adds version 6.7.2
  * [DOCS] Remind users to include @ symbol when applying license file (elastic#40688)
  * HLRC: Convert xpack methods to client side objects (elastic#40705)
  * Allow ILM to stop if indices have nonexistent policies (elastic#40820)
  * Add an OpenID Connect authentication realm (elastic#40674)
  * ...
Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete.

* Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread pool
* I did not take the biggest potential speedup step here and parallelize the shard file deletion, because that's probably better handled by moving to bulk deletes where possible (and it can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable.
* See #39656 (comment)
* Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible, because a delete isn't a single task anymore).
* By adding a `ThreadPool` reference to the repository this also lays the groundwork for parallelizing shard snapshot uploads to improve the situation reported in #39657
Backport of elastic/elasticsearch#40144 and related elastic/elasticsearch#36140