Recover peers using history from Lucene #44853
Conversation
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery.

Reverts elastic#38904 and elastic#42211. Relates elastic#41536.
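As a rough orientation, here is a minimal sketch of the decision this change introduces, using hypothetical stand-in types rather than the real Elasticsearch classes (RetentionLease here is a toy record, and the constant merely mirrors SequenceNumbers.UNASSIGNED_SEQ_NO): an operations-based recovery is only attempted when the target supplies a starting sequence number and a retention lease vouches that the source still retains history from that point.

// Illustrative stand-ins only; not the actual Elasticsearch API.
final class RecoveryModeSketch {

    static final long UNASSIGNED_SEQ_NO = -2L; // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    // A retention lease reduced to what this sketch needs: the sequence number
    // from which history is retained for the recovering peer.
    record RetentionLease(String id, long retainingSeqNo) {}

    // Operations-based recovery is only safe if the target names a starting point
    // and the lease retains history from at or below that point.
    static boolean canUseOperationsBasedRecovery(long requestedStartingSeqNo, RetentionLease lease) {
        return requestedStartingSeqNo != UNASSIGNED_SEQ_NO
            && lease != null
            && lease.retainingSeqNo() <= requestedStartingSeqNo;
    }

    public static void main(String[] args) {
        RetentionLease lease = new RetentionLease("peer_recovery/node-1", 100L);
        System.out.println(canUseOperationsBasedRecovery(150L, lease));             // true: history is retained
        System.out.println(canUseOperationsBasedRecovery(50L, lease));              // false: history before the lease may be gone
        System.out.println(canUseOperationsBasedRecovery(UNASSIGNED_SEQ_NO, null)); // false: fall back to file-based recovery
    }
}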
Pinging @elastic/es-distributed
Marked as WIP as there are a few small points still to discuss (highlighted below).
server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
assertThat(shard.refreshStats().getTotal(), equalTo(2L)); // refresh on: finalize and end of recovery
// refresh on: finalize and end of recovery
// finalizing a replica involves two refreshes with soft deletes because of estimateNumberOfHistoryOperations()
final long initialRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 2L : 3L;
I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?
I think it's okay to initialize initialRefreshes from shard.refreshStats().getTotal() and remove this assertion.
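The suggestion above, as a tiny self-contained illustration (the AtomicLong is a stand-in for shard.refreshStats().getTotal(), which this sketch does not have access to): capture the post-recovery count as a baseline and assert only the delta the test itself causes, so the test no longer cares whether soft deletes added an extra internal refresh.

import java.util.concurrent.atomic.AtomicLong;

public class BaselineCounterExample {
    // Stand-in for shard.refreshStats().getTotal(); purely illustrative.
    static final AtomicLong refreshCount = new AtomicLong();

    static void refresh() { refreshCount.incrementAndGet(); }

    public static void main(String[] args) {
        refresh(); refresh(); refresh(); // recovery may refresh 2 or 3 times; the test no longer cares which
        long initialRefreshes = refreshCount.get(); // baseline, per the review suggestion

        refresh(); // the one refresh the test itself triggers

        if (refreshCount.get() != initialRefreshes + 1) {
            throw new AssertionError("expected exactly one additional refresh");
        }
        System.out.println("ok: " + refreshCount.get() + " refreshes total");
    }
}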
final long externalRefreshCount = shard.refreshStats().getExternalTotal();
final long extraInternalRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 0 : 1;
I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?
I did an initial pass and left some comments. This looks good.
@@ -255,6 +291,11 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
        }, onFailure);

        establishRetentionLeaseStep.whenComplete(r -> {
            if (useRetentionLeases) {
Maybe just always close once at the end, as it's a no-op here.
It's not a no-op if we are doing a file-based recovery. In that case we have a proper retention lock at this point and should close it now so that we can start discarding history during phase 2. I clarified the condition to useRetentionLeases && isSequenceNumberBasedRecovery == false in 190649d.
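For clarity, a sketch of the resulting control flow with stand-in types (Releasable and the method name here are illustrative, not the real API): after the lease is established, a file-based recovery releases the retention lock early so history can be discarded again during phase 2, while a sequence-number-based recovery keeps holding it.

final class RetentionLockFlowSketch {
    interface Releasable extends AutoCloseable { @Override void close(); }

    static void afterLeaseEstablished(boolean useRetentionLeases,
                                      boolean isSequenceNumberBasedRecovery,
                                      Releasable retentionLock) {
        if (useRetentionLeases && isSequenceNumberBasedRecovery == false) {
            // File-based recovery: the lease now retains everything phase 2 needs,
            // so the stronger "retain all history" lock can be released here.
            retentionLock.close();
        }
        // Otherwise the lock stays held and is closed once at the end of recovery.
    }

    public static void main(String[] args) {
        Releasable lock = () -> System.out.println("retention lock released");
        afterLeaseEstablished(true, false, lock); // prints: retention lock released
        afterLeaseEstablished(true, true, lock);  // keeps holding the lock
    }
}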
server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
// temporarily prevent any history from being discarded, and do this before acquiring the safe commit so that we can
// be certain that all operations after the safe commit's local checkpoint will be retained for the duration of this
// recovery.
retentionLock = shard.acquireRetentionLock();
We need to acquire the retention lock before calling hasCompleteHistoryOperations.
Oof good catch, thanks. Addressed in 190649d.
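The ordering that the fix restores, in miniature (HistorySource is a stand-in, not the real IndexShard interface): take the retention lock first, so the answer from the history check cannot be invalidated by a concurrent merge discarding soft-deleted operations in between.

final class LockBeforeCheckSketch {
    interface HistorySource {
        AutoCloseable acquireRetentionLock();
        boolean hasCompleteHistoryOperations(String reason, long startingSeqNo);
    }

    static boolean decideOpsBasedRecovery(HistorySource shard, long startingSeqNo) throws Exception {
        // Correct order: lock first, then check. The real handler holds the lock
        // much longer; try-with-resources just keeps this sketch self-contained.
        try (AutoCloseable retentionLock = shard.acquireRetentionLock()) {
            return shard.hasCompleteHistoryOperations("peer-recovery", startingSeqNo);
        }
    }

    public static void main(String[] args) throws Exception {
        HistorySource shard = new HistorySource() {
            public AutoCloseable acquireRetentionLock() { return () -> {}; }
            public boolean hasCompleteHistoryOperations(String reason, long startingSeqNo) {
                return startingSeqNo >= 0; // toy answer for the demo
            }
        };
        System.out.println(decideOpsBasedRecovery(shard, 42L)); // true
    }
}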
final boolean isSequenceNumberBasedRecovery
    = request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
    && isTargetSameHistory()
    && shard.hasCompleteHistoryOperations("peer-recovery", request.startingSeqNo())
And if we rely on "peer recovery leases" to retain the history (when soft-deletes enabled), we might not need to check hasCompleteHistoryOperations.
@@ -158,13 +164,32 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
            throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node");
        }
        assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting;
        retentionLeaseRef.set(useRetentionLeases ? shard.getRetentionLeases().get(
Is there any issue if we use any existing retention lease for the peer recovery purpose? I mean, to rely on hasCompleteHistoryOperations.
Yes, the issue is that we do not guarantee that the primary retains every operation required by every retention lease, so we must use hasCompleteHistoryOperations to ensure this. The problem is that in a file-based recovery we create a retention lease at the local checkpoint of this shard's safe commit, which may be behind every other lease, so we cannot be certain that every other peer is also able to respect this lease; if this primary were to fail then another primary may be elected without all the history needed for all its leases.
On reflection, it might actually be possible to do this: we'd need to keep the retention lock open until we have at least replayed history up to the global checkpoint. I'll think about this a bit more.
Ok, Yannick and I discussed this and you're right, there's no real drawback to creating the new replica's retention lease according to the current (persisted) global checkpoint rather than the local checkpoint of the safe commit. This means we have a much greater chance that this retention lease is satisfied on every in-sync shard copy, at which point we should not need to call hasCompleteHistoryOperations in many cases.

It is, however, not totally watertight, because today we sometimes create unsatisfied leases for BWC reasons. I haven't gone through the details to see if there are other cases too, but given that it's not something we're asserting today I have concerns that it might be possible. Also it's cheap to call hasCompleteHistoryOperations and it will save us from disaster, so let's keep it in.
I've adjusted the creation logic in 5fb8bda.
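The adjustment in 5fb8bda, sketched with toy types and illustrative names: seed the new replica's lease from the persisted global checkpoint instead of the safe commit's local checkpoint, since every in-sync copy has by definition processed everything up to the global checkpoint.

final class LeaseSeedSketch {
    record RetentionLease(String id, long retainingSeqNo) {}

    static RetentionLease newPeerRecoveryLease(String targetNodeId, long persistedGlobalCheckpoint) {
        // Retain from just above the global checkpoint: far more likely to be
        // satisfiable on every in-sync shard copy than the safe commit's local
        // checkpoint, which may lag every other lease.
        return new RetentionLease("peer_recovery/" + targetNodeId, persistedGlobalCheckpoint + 1);
    }

    public static void main(String[] args) {
        System.out.println(newPeerRecoveryLease("node-2", 120L));
        // RetentionLease[id=peer_recovery/node-2, retainingSeqNo=121]
    }
}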
server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
Thanks @dnhatn, comments addressed or responded to.
server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
By cloning the primary's lease we need not worry about creating a lease that retains history which has already been discarded.
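A minimal sketch of the cloning idea, again with stand-in types: the new copy's lease takes the primary's own retained sequence number, so by construction it can never demand history that has already been discarded.

final class LeaseCloneSketch {
    record RetentionLease(String id, long retainingSeqNo) {}

    static RetentionLease cloneForNewCopy(RetentionLease primaryLease, String targetNodeId) {
        // Same retaining sequence number, new owner: the clone is satisfiable
        // wherever the primary's own lease is satisfiable.
        return new RetentionLease("peer_recovery/" + targetNodeId, primaryLease.retainingSeqNo());
    }

    public static void main(String[] args) {
        RetentionLease primary = new RetentionLease("peer_recovery/node-1", 87L);
        System.out.println(cloneForNewCopy(primary, "node-3"));
        // RetentionLease[id=peer_recovery/node-3, retainingSeqNo=87]
    }
}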
I've left one question. Looking good otherwise.
server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java
server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java
LGTM. Thanks @DaveCTurner.
LGTM
The test failure is meaningful: https://scans.gradle.com/s/bcfrvej2xrypy/console-log?task=:x-pack:plugin:ccr:internalClusterTest

What happened is that there was ongoing indexing while a replica was recovering, which plays out like this:
The replica's retention lease is now ahead of its GCP, and this trips an assertion at renewal time. Yannick and I discussed a couple of options:
@dnhatn WDYT?
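To illustrate the failure mode (the invariant below is a paraphrase of the kind of renewal-time assertion being tripped, not the literal one in the codebase): ongoing indexing advances the lease past the replica's persisted global checkpoint, and the renewal check rejects a lease that retains beyond what the copy has durably processed.

final class RenewalInvariantSketch {
    static void renew(long retainingSeqNo, long persistedGlobalCheckpoint) {
        if (retainingSeqNo > persistedGlobalCheckpoint + 1) {
            throw new AssertionError("lease retains from " + retainingSeqNo
                + " but persisted global checkpoint is only " + persistedGlobalCheckpoint);
        }
        System.out.println("renewed at " + retainingSeqNo);
    }

    public static void main(String[] args) {
        renew(100L, 100L); // fine: at the checkpoint
        renew(150L, 100L); // throws: the lease ran ahead of the GCP
    }
}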
Test failure will require some rework.
@DaveCTurner Thanks for the ping. Both options are good to me.
@elasticmachine please run elasticsearch-ci/docs (failure looks unrelated and bogus)
@elasticmachine please run elasticsearch-ci/1
Ok, I'm happy with this and it's passed a bunch of runs. Changes since the last reviews:
LGTM