Fix SnapshotShardStatus Reporting for Failed Shard #48556

original-brownbear · 2019-10-27T14:48:19Z

Fixes the shard snapshot status reporting for failed shards in the corner
case where the shard fails because of an exception thrown in
`SnapshotShardsService` rather than in the repository.
In that case we never updated the `snapshotStatus` instance, so the
transport APIs that read this field reported an incorrect status.
The fix moves the failure handling to `SnapshotShardsService` for all
cases. This also simplifies the code: wrapping the exception in the
repository was pointless, since we only used the exception's stack trace
upstream anyway.
Also added an assertion to an existing test that already exercises this
failure situation (an exception thrown in `SnapshotShardsService`).

Closes #48526
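
For readers unfamiliar with the code, the idea of the fix can be illustrated with a small, self-contained sketch. This is not the actual Elasticsearch code: `ShardStatusSketch`, `Status`, `Listener`, `Repository` and `snapshot` are hypothetical, heavily simplified stand-ins. It only assumes what the description states: the status object is marked as failed in one place (the service-side listener), whether the exception comes from the repository or from the service itself.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical, simplified stand-ins; none of these are the real Elasticsearch classes.
final class ShardStatusSketch {

    enum Stage { STARTED, DONE, FAILURE }

    /** Stand-in for the per-shard status object that the status APIs read. */
    static final class Status {
        private Stage stage = Stage.STARTED;
        private String failure;

        synchronized void moveToDone() { stage = Stage.DONE; }

        synchronized void moveToFailed(String failure) {
            this.stage = Stage.FAILURE;
            this.failure = failure;
        }

        synchronized Stage stage() { return stage; }

        synchronized String failure() { return failure; }
    }

    interface Listener {
        void onResponse(String generation);
        void onFailure(Exception e);
    }

    /** Stand-in repository: reports failures to the listener but does not touch the status. */
    interface Repository {
        void snapshotShard(Status status, Listener listener);
    }

    /** Renders an exception's stack trace as a String. */
    static String stackTrace(Exception e) {
        StringWriter writer = new StringWriter();
        e.printStackTrace(new PrintWriter(writer, true));
        return writer.toString();
    }

    /**
     * Stand-in for the service side. The listener is the single place where the
     * status is moved to FAILURE, so both failure sources are handled the same way.
     */
    static Status snapshot(Repository repository, boolean failInService) {
        final Status status = new Status();
        final Listener listener = new Listener() {
            @Override
            public void onResponse(String generation) {
                // Success: in this sketch the repository already marked the status as done.
            }

            @Override
            public void onFailure(Exception e) {
                status.moveToFailed(stackTrace(e));
            }
        };
        try {
            if (failInService) {
                // Simulates an exception thrown before the repository is ever called.
                throw new IllegalStateException("shard closed before snapshot started");
            }
            repository.snapshotShard(status, listener);
        } catch (Exception e) {
            listener.onFailure(e);
        }
        return status;
    }
}
```

In this sketch, calling `ShardStatusSketch.snapshot(repo, true)` leaves the status in `FAILURE` even though the repository was never invoked, which is the case the status APIs previously reported incorrectly.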

elasticmachine · 2019-10-27T14:48:21Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

DaveCTurner

I left a few comments.

DaveCTurner · 2019-10-27T20:22:25Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java

@@ -298,8 +298,10 @@ public void onResponse(String newGeneration) {

                    @Override
                    public void onFailure(Exception e) {
+                        final String failure = ExceptionsHelper.stackTrace(e);


Ew. Can we follow up with a change that keeps the exception as an exception rather than converting it to a String here and in a few other places? Looks nontrivial because BWC, of course.

Jup, I'm happy to try :)

DaveCTurner · 2019-10-27T20:26:27Z

server/src/test/java/org/elasticsearch/snapshots/DedicatedClusterSnapshotRestoreIT.java

@@ -1219,6 +1219,12 @@ public void testDataNodeRestartWithBusyMasterDuringSnapshot() throws Exception {
        disruption.startDisrupting();
        logger.info("-->  restarting data node, which should cause primary shards to be failed");
        internalCluster().restartNode(dataNode, InternalTestCluster.EMPTY_CALLBACK);
+
+        logger.info("-->  wait for shard snapshots to show as failed");
+        assertBusy(() -> assertThat(


Before this change we would sometimes unblock the node and stop the disruption before the first shard failure. I think this change makes the test weaker. I'm guessing it's invalid to do this after disruption.stopDisrupting()? If so, can we for instance only do it sometimes (with a comment saying why we don't always do it)?

> Before this change we would sometimes unblock the node and stop the disruption before the first shard failure.

I'd argue that's a good thing :) <= The whole point of this test was to test this situation (failure on the data node before CS updates resume). The case where we stop disrupting before anything fails is probably practically impossible, and even if it weren't, it's something that's covered in SnapshotResiliencyTests anyway (where we want that kind of randomness because we can reproduce things).
WDYT?

I'm not convinced yet. Practically impossible is not impossible enough for me :)

Do you think that the failure in #48526 is also captured, rarely, by the SnapshotResiliencyTests? Can we make a more focussed test there?

The situation we're running into is perfectly covered by SnapshotResiliencyTests but those tests don't use the status API (which is the only thing that's functionally broken here) so they don't fail.
We could maybe add a test that involves the status APIs to SnapshotResiliencyTests to cover these things there as well if we want more randomness. I think that would be better than purposely making this test run into a, let's say, almost impossible situation at some point that isn't reproducible anyway?

I see, yes, SnapshotResiliencyTests sounds like a better place for this.

I'd still rather adjust this test to have some reproducible testing for the concrete bug here and then enhance the SnapshotResiliencyTests down the line. That's basically what has been the strategy for other tests as well: make the ITs deterministic to reproduce known issues and keep the randomness in SnapshotResiliencyTests.
Everything covered by this test is covered by SnapshotResiliencyTests anyway and this test was just added for 6.x coverage, so I don't see us losing any coverage here by making this one more deterministic? :)
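
For context on the `assertBusy(...)` call added in the hunk above: it relies on retrying an assertion until it passes or a timeout expires. The following is a minimal sketch of that polling pattern only. It is not the test framework's actual `assertBusy` implementation, and `BusyAssertSketch`, its signature and its backoff values are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating "retry an assertion until it passes or times out".
final class BusyAssertSketch {

    /**
     * Runs the assertion repeatedly, sleeping between attempts with a capped
     * exponential backoff. Rethrows the last AssertionError once the deadline passes.
     */
    static void assertBusy(Runnable assertion, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for assertion", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }
}
```

Waiting like this for the shard snapshots to show as failed before the disruption stops is what makes the integration test deterministic with respect to the bug discussed in this thread.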

DaveCTurner · 2019-10-27T20:27:11Z

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java

@@ -1042,10 +1041,6 @@ public void snapshotShard(Store store, MapperService mapperService, SnapshotId s
                              ActionListener<String> listener) {
        final ShardId shardId = store.shardId();
        final long startTime = threadPool.absoluteTimeInMillis();
-        final ActionListener<String> snapshotDoneListener = ActionListener.wrap(listener::onResponse, e -> {
-            snapshotStatus.moveToFailed(threadPool.absoluteTimeInMillis(), ExceptionsHelper.stackTrace(e));
-            listener.onFailure(e instanceof IndexShardSnapshotFailedException ? e : new IndexShardSnapshotFailedException(shardId, e));


One fewer instanceof in the world ❤️

ywelsch

LGTM

original-brownbear · 2019-10-29T12:30:09Z

Thanks Yannick and David! I'll open a follow-up with resiliency tests for the status APIs as discussed above :)

original-brownbear added the >bug, :Distributed Coordination/Snapshot/Restore, v8.0.0 and v7.6.0 labels Oct 27, 2019

original-brownbear mentioned this pull request Oct 27, 2019

testDataNodeRestartAfterShardSnapshotFailure fails by leaking a shard snapshot on a failed node #48526

Closed

original-brownbear added v7.4.2 v7.5.0 labels Oct 27, 2019

original-brownbear requested review from ywelsch, tlrx and DaveCTurner October 27, 2019 18:49

DaveCTurner reviewed Oct 27, 2019

original-brownbear requested a review from DaveCTurner October 28, 2019 06:35

ywelsch approved these changes Oct 29, 2019

original-brownbear merged commit 752fa87 into elastic:master Oct 29, 2019

original-brownbear deleted the 48526 branch October 29, 2019 12:30

original-brownbear added the backport pending label Oct 29, 2019

original-brownbear mentioned this pull request Oct 30, 2019

Fix SnapshotShardStatus Reporting for Failed Shard (#48556) #48687

Merged

original-brownbear mentioned this pull request Oct 30, 2019

Fix SnapshotShardStatus Reporting for Failed Shard (#48556) #48689

Merged

original-brownbear removed backport pending v7.4.2 labels Oct 30, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

original-brownbear restored the 48526 branch August 6, 2020 18:25

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
