-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS #25941
Snapshot/Restore: Ensure that shard failure reasons are correctly stored in CS #25941
Conversation
…red in CS The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commits ensures that the snapshot shard failure reason is preserved properly and adds workaround for reading old snapshot files where this information might not have been preserved. Closes elastic#25878
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some minor suggestions and a question around JSON deserialization. Thanks @imotov
} else { | ||
String nodeId = in.readOptionalString(); | ||
State shardState = State.fromValue(in.readByte()); | ||
String reason = shardState.failed() ? "" : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here saying why we set reason
to ""
?
// Workaround for https://github.com/elastic/elasticsearch/issues/25878 | ||
// Some old snapshot might still have null in shard failure reasons | ||
if (snapshotShardFailure.reason == null) { | ||
snapshotShardFailure.reason = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I don't quite understand: Why will it happily parse the reason
field if it is null? Currently we parse it using text()
, shouldn't that fail and should we use textOrNull()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it should be textOrNull()
.
@@ -1128,7 +1128,7 @@ public ClusterState execute(ClusterState currentState) throws Exception { | |||
for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shardEntry : snapshotEntry.shards()) { | |||
ShardSnapshotStatus status = shardEntry.value; | |||
if (!status.state().completed()) { | |||
shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED)); | |||
shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED, "aborted")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe extend the message to "aborted by snapshot deletion"
@@ -135,6 +135,13 @@ public static String blockMasterFromFinalizingSnapshot(final String repositoryNa | |||
return masterName; | |||
} | |||
|
|||
public static String blockMasterFromCreatingSnapshot(final String repositoryName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method name made me think that it would prevent the master from creating a snapshot at all. Maybe we can call it something along the lines of "blockMasterFromFinalizingSnapshot"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ywelsch thanks a lot for review and help! I will let it "cook" in CI on the master branch over the weekend and then I will backport it to 5.6 and update the version in writeTo and readFrom if everything goes well. |
…red in CS (#25941) The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commits ensures that the snapshot shard failure reason is preserved properly and adds workaround for reading old snapshot files where this information might not have been preserved. Closes #25878
…red in CS (#25941) The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commits ensures that the snapshot shard failure reason is preserved properly and adds workaround for reading old snapshot files where this information might not have been preserved. Closes #25878
…red in CS (#26127) The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commits adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null if it gets cluster state from another master. This is a partial backport of #25941 to 5.6. Closes #25878
The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commits ensures that the snapshot shard failure reason is preserved properly and adds workaround for reading old snapshot files where this information might not have been preserved.
Closes #25878