Fix SnapshotShardStatus Reporting for Failed Shard #48556
Conversation
Fixes the shard snapshot status reporting for failed shards in the corner case where the shard fails because of an exception thrown in `SnapshotShardsService` rather than in the repository. In that case we were missing the update on the `snapshotStatus` instance, which made the transport APIs that use this field report back an incorrect status. Fixed by moving the failure handling to the `SnapshotShardsService` for all cases (which also simplifies the code; the exception wrapping in the repository was pointless since we only used the exception's stack trace upstream anyway). Also added an assertion to another test that already exercises this failure situation (an exception in the `SnapshotShardsService`) explicitly. Closes elastic#48526
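For context, a minimal sketch of the idea (hypothetical method and parameter names, not the exact production code): the failure handling now lives in `SnapshotShardsService`, which marks the local `IndexShardSnapshotStatus` as failed before notifying the master, so the status transport APIs see the failure regardless of where the exception was thrown.

```java
// Hypothetical sketch only; names and signatures are illustrative.
private void onShardSnapshotFailure(Snapshot snapshot, ShardId shardId,
                                    IndexShardSnapshotStatus snapshotStatus, Exception e) {
    // Update the local status first so the status APIs report FAILED even when
    // the exception came from SnapshotShardsService itself rather than the repository.
    final String failure = ExceptionsHelper.stackTrace(e);
    snapshotStatus.moveToFailed(threadPool.absoluteTimeInMillis(), failure);
    // Then tell the master that this shard's snapshot failed.
    notifyFailedSnapshotShard(snapshot, shardId, failure);
}
```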
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
I left a few comments.
@@ -298,8 +298,10 @@ public void onResponse(String newGeneration) {

    @Override
    public void onFailure(Exception e) {
        final String failure = ExceptionsHelper.stackTrace(e);
Ew. Can we follow up with a change that keeps the exception as an exception rather than converting it to a `String` here and in a few other places? Looks nontrivial because of BWC, of course.
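A rough sketch of the kind of follow-up meant here (illustrative only; it glosses over the BWC/serialization work that makes it nontrivial): carry the `Exception` through instead of eagerly stringifying it, and only render the stack trace where a `String` is actually needed.

```java
// Illustrative only: keep the exception and defer stack-trace rendering.
@Override
public void onFailure(Exception e) {
    // Instead of `final String failure = ExceptionsHelper.stackTrace(e);` here,
    // pass the exception along and stringify only at the serialization boundary.
    handleShardSnapshotFailure(shardId, e); // hypothetical helper taking an Exception
}
```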
Jup, I'm happy to try :)
@@ -1219,6 +1219,12 @@ public void testDataNodeRestartWithBusyMasterDuringSnapshot() throws Exception {
        disruption.startDisrupting();
        logger.info("--> restarting data node, which should cause primary shards to be failed");
        internalCluster().restartNode(dataNode, InternalTestCluster.EMPTY_CALLBACK);

        logger.info("--> wait for shard snapshots to show as failed");
        assertBusy(() -> assertThat(
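(The diff above is cut off at the review-comment anchor; a hedged sketch of what such an assertion might look like, with the repository and snapshot names assumed, follows. The exact request and matcher in the PR may differ.)

```java
// Illustrative sketch: wait until the snapshot status API reports a failed shard.
assertBusy(() -> assertThat(
    client().admin().cluster().prepareSnapshotStatus("test-repo").setSnapshots("test-snap")
        .get().getSnapshots().get(0).getShardsStats().getFailedShards(),
    greaterThanOrEqualTo(1)), 60L, TimeUnit.SECONDS);
```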
Before this change we would sometimes unblock the node and stop the disruption before the first shard failure. I think this change makes the test weaker. I'm guessing it's invalid to do this after `disruption.stopDisrupting()`? If so, can we for instance only do it sometimes (with a comment saying why we don't always do it)?
> Before this change we would sometimes unblock the node and stop the disruption before the first shard failure.

I'd argue that's a good thing :) <= The whole point of this test was to test this situation (failure on the data node before CS updates resume). The case where we stop disrupting before anything fails is probably practically impossible, and even if it weren't, it's something that's covered in `SnapshotResiliencyTests` (where we want that kind of randomness because we can reproduce things there) anyway.
WDYT?
I'm not convinced yet. Practically impossible is not impossible enough for me :)
Do you think that the failure in #48526 is also captured, rarely, by the `SnapshotResiliencyTests`? Can we make a more focussed test there?
The situation we're running into is perfectly covered by `SnapshotResiliencyTests`, but those tests don't use the status API (which is the only thing that's functionally broken here), so they don't fail.
We could maybe add a test that involves the status APIs to `SnapshotResiliencyTests` to cover these things there as well if we want more randomness. I think that would be better than purposely making this test run into a, let's say, almost impossible situation at some point that wouldn't be reproducible anyway?
I see, yes, `SnapshotResiliencyTests` sounds like a better place for this.
I'd still rather adjust this test to have some reproducible testing for the concrete bug here and then enhance the `SnapshotResiliencyTests` down the line. That's basically what the strategy has been for other tests as well: make the ITs deterministic to reproduce known issues and keep the randomness in `SnapshotResiliencyTests`.
Everything covered by this test is covered by `SnapshotResiliencyTests` anyway, and this test was just added for 6.x coverage, so I don't see us losing any coverage here by making this one more deterministic? :)
@@ -1042,10 +1041,6 @@ public void snapshotShard(Store store, MapperService mapperService, SnapshotId s
                              ActionListener<String> listener) {
        final ShardId shardId = store.shardId();
        final long startTime = threadPool.absoluteTimeInMillis();
        final ActionListener<String> snapshotDoneListener = ActionListener.wrap(listener::onResponse, e -> {
            snapshotStatus.moveToFailed(threadPool.absoluteTimeInMillis(), ExceptionsHelper.stackTrace(e));
            listener.onFailure(e instanceof IndexShardSnapshotFailedException ? e : new IndexShardSnapshotFailedException(shardId, e));
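A hedged sketch of what the repository side looks like with this wrapping listener removed (illustrative, not the full method): the repository no longer touches `snapshotStatus` or wraps the exception; it just completes the caller's listener and leaves the failure bookkeeping to `SnapshotShardsService`.

```java
// Illustrative sketch only: the repository reports the raw outcome and nothing else.
try {
    // ... write the shard snapshot and compute the new shard generation ...
    listener.onResponse(newShardGeneration); // hypothetical local variable
} catch (Exception e) {
    // No moveToFailed(...) and no IndexShardSnapshotFailedException wrapping here anymore;
    // SnapshotShardsService updates the status and reports the failure upstream.
    listener.onFailure(e);
}
```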
One fewer `instanceof` in the world ❤️
LGTM
Thanks Yannick and David! I'll open a follow-up with resiliency tests for the status APIs as discussed above :)