-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix recovery stage transition with sync_id #57754
Conversation
Pinging @elastic/es-distributed (:Distributed/Recovery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assertion valid? If so, can we add it?
diff --git a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
index 33139912920..8000bbabc53 100644
--- a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
+++ b/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
@@ -542,6 +542,8 @@ public class RecoverySourceHandler {
phase1ExistingFileSizes, existingTotalSize, took));
}, listener::onFailure);
} else {
+ assert shard.indexSettings().getIndexVersionCreated().before(Version.V_7_2_0) ||
+ request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO;
logger.trace("skipping [phase1] since source and target have identical sync id [{}]", recoverySourceMetadata.getSyncId());
// but we must still create a retention lease
LGTM apart from that and one other request.
...er/src/internalClusterTest/java/org/elasticsearch/gateway/ReplicaShardAllocatorSyncIdIT.java
Outdated
Show resolved
Hide resolved
Hi @dnhatn , based on this comment:
That means in recovering shard from pre-7.2 version to 7.2+ version, only if the recovery fails at the first time and retry again, then it cannot move to TRANSLOG stage? Is that possible it cannot move to TRANSLOG stage in a normal recovery process without any fail at the first time? Since in our case, 6.8 add 7.5 nodes to recovery #57708, we didn't see fails except the wrong stage exception. Also why after performing |
By the way, in #57708, both replica and primary are on 7.5.1 node. The flow is:
|
Thanks @howardhuanghua. I am trying to reproduce your situation. |
@dnhatn Thanks, there is a point that in our situation, we could see the blocked index is empty, no docs contained. |
Hi @howardhuanghua, @hubbleview was kind to help me to reproduce the scenario that you provided. It matches what we outline in the PR. When migrating to new nodes, we first synced flush an index, then we exclude the old nodes in the allocation filter. ES will relocate both the primary and replica to the new nodes at the same time. If phase2 of the recovery of the replica starts after the relocation of the primary completes, then it will hit an IllegalIndexShardStateException. This exception is retry-able, hence we log it at the trace level and retry another recovery. At this point, the primary is on the new node, and the replica has an index commit with a sync_id, but that commit is not safe (because we do not have the global checkpoint in the clean_files step). The first retry recovery will fail due to the improper transition, but the subsequent recovery will succeed as we should have the global checkpoint in the finalize_step. Thank you for reporting the issue. |
@DaveCTurner Thanks for reviewing. The assertion is great, but we will trip it if we hit a simulated I/O exception while we are recovering locally. |
Hi @dnhatn, thanks for the explanation, now I understand the issue. Just want to confirm the relocation scenarios,
|
Hi @howardhuanghua,
Yes, that's correct.
A replica always recovers from its primary.
A replica always recovers from its primary, and the recovery tries to reuse the existing data when possible. |
Is that ok? If we couldn't recover the shard locally does it make sense to proceed with the rest of the recovery like that? |
When the local translog is corrupted, we won't be able to recover locally up to the global checkpoint. In this case, we will try to reuse the existing index commit, which can have a sync_id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, since the whole sync marker thing is going away soon we don't need to dwell on this further so this LGTM.
Thanks David! |
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG. Relates ##7187 Closes #57708
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG. Relates ##7187 Closes #57708
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG.
This issue was addressed in #57187, but not forward-ported to 8.0. I think we should do it as this issue can occur in 8.0. (requires a full cluster restart to 8.0 after a peer recovery on 7.1 fails after it has completed phase 1).
Closes #57708