Recover peers using history from Lucene #44853

Conversation

DaveCTurner
Contributor

Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
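
For orientation, here is a minimal sketch of the decision described above, with the inputs boiled down to booleans. The helper and its parameter names are hypothetical illustrations, not the actual RecoverySourceHandler code.

// Hypothetical sketch: with retention leases in use, the existence of a peer recovery
// retention lease for the target (rather than translog state) gates whether an
// operations-based recovery is attempted at all.
static boolean attemptOperationsBasedRecovery(boolean useRetentionLeases,
                                              boolean targetHasRetentionLease,
                                              boolean startingSeqNoAssigned,
                                              boolean targetSharesHistory,
                                              boolean historyCompleteSinceStartingSeqNo) {
    if (useRetentionLeases && targetHasRetentionLease == false) {
        return false; // no lease retaining history for this peer: fall back to a file-based recovery
    }
    return startingSeqNoAssigned && targetSharesHistory && historyCompleteSinceStartingSeqNo;
}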

@DaveCTurner DaveCTurner added the >enhancement and :Distributed Indexing/Recovery labels Jul 25, 2019
@DaveCTurner DaveCTurner requested review from ywelsch and dnhatn July 25, 2019 12:10
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner DaveCTurner left a comment

Marked as WIP as there are a few small points still to discuss (highlighted below).

assertThat(shard.refreshStats().getTotal(), equalTo(2L)); // refresh on: finalize and end of recovery
// refresh on: finalize and end of recovery
// finalizing a replica involves two refreshes with soft deletes because of estimateNumberOfHistoryOperations()
final long initialRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 2L : 3L;
Contributor Author

I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?

Member

I think it's okay to initialize initialRefreshes from shard.refreshStats().getTotal() and remove this assertion.
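
A rough sketch of this suggestion, reusing only the accessors quoted in the snippet above; the surrounding test structure is assumed rather than shown.

// Sketch only: record the running refresh total before the recovery under test instead
// of hard-coding 2L or 3L, and assert on the delta afterwards.
final long refreshesBeforeRecovery = shard.refreshStats().getTotal();
// ... run the peer recovery under test ...
final long refreshesDuringRecovery = shard.refreshStats().getTotal() - refreshesBeforeRecovery;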

long externalRefreshCount = shard.refreshStats().getExternalTotal();

final long externalRefreshCount = shard.refreshStats().getExternalTotal();
final long extraInternalRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 0 : 1;
Contributor Author

I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?

@dnhatn dnhatn left a comment

I did an initial pass and left some comments. This looks good.

@@ -255,6 +291,11 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
}, onFailure);

establishRetentionLeaseStep.whenComplete(r -> {
if (useRetentionLeases) {
Member

Maybe just always close it once at the end, as it's a no-op here.

Contributor Author

It's not a no-op if we are doing a file-based recovery. In that case we have a proper retention lock at this point and should close it now so that we can start discarding history during phase 2. I clarified the condition to useRetentionLeases && isSequenceNumberBasedRecovery == false in 190649d.
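
For clarity, here is a hedged sketch of that early release, with a java.io.Closeable standing in for the retention lock; this is a simplification, not the actual 190649d change.

// Sketch only: for a file-based recovery we hold a real retention lock at this point,
// so release it now and let the primary resume discarding history during phase 2;
// for a sequence-number-based recovery the retention lease already retains the history.
static void maybeReleaseRetentionLockEarly(boolean useRetentionLeases,
                                           boolean isSequenceNumberBasedRecovery,
                                           java.io.Closeable retentionLock) throws java.io.IOException {
    if (useRetentionLeases && isSequenceNumberBasedRecovery == false) {
        retentionLock.close();
    }
}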


// temporarily prevent any history from being discarded, and do this before acquiring the safe commit so that we can
// be certain that all operations after the safe commit's local checkpoint will be retained for the duration of this
// recovery.
retentionLock = shard.acquireRetentionLock();
Member

We need to acquire the retention lock before calling hasCompleteHistoryOperations.

@DaveCTurner DaveCTurner Jul 27, 2019

Oof good catch, thanks. Addressed in 190649d.
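
A minimal sketch of the corrected ordering, assembled only from the lines quoted in this review (the declaration of retentionLock is elided, and this is not the actual 190649d diff): take the retention lock first, then check history completeness.

// Acquire the retention lock before deciding on an operations-based recovery, so that no
// history can be discarded between the completeness check and the start of phase 2.
retentionLock = shard.acquireRetentionLock();
final boolean isSequenceNumberBasedRecovery
    = request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
    && isTargetSameHistory()
    && shard.hasCompleteHistoryOperations("peer-recovery", request.startingSeqNo());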

final boolean isSequenceNumberBasedRecovery
= request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
&& isTargetSameHistory()
&& shard.hasCompleteHistoryOperations("peer-recovery", request.startingSeqNo())
Member

And if we rely on "peer recovery leases" to retain the history (when soft-deletes enabled), we might not need to check hasCompleteHistoryOperations.

@@ -158,13 +164,32 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node");
}
assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting;
retentionLeaseRef.set(useRetentionLeases ? shard.getRetentionLeases().get(
@dnhatn dnhatn Jul 26, 2019

Is there any issue if we use any existing retention lease for peer recovery purposes? I mean, relying on hasCompleteHistoryOperations.

Contributor Author

Yes, the issue is that we do not guarantee that the primary retains every operation required by every retention lease, so we must use hasCompleteHistoryOperations to ensure this. The problem is that in a file-based recovery we create a retention lease at the local checkpoint of this shard's safe commit, which may be behind every other lease, so we cannot be certain that every other peer is also able to respect this lease; if this primary were to fail then another primary may be elected without all the history needed for all its leases.

Contributor Author

On reflection, it might actually be possible to do this: we'd need to keep the retention lock open until at least having replayed history to the global checkpoint. I'll think about this a bit more.

Contributor Author

Ok, Yannick and I discussed this and you're right, there's no real drawback to creating the new replica's retention lease according to the current (persisted) global checkpoint rather than the local checkpoint of the safe commit. This means we have a much better chance that this retention lease is satisfied on every in-sync shard copy, at which point in many cases there is no obvious need to call hasCompleteHistoryOperations.

It is, however, not totally watertight, because today we sometimes create unsatisfied leases for BWC reasons. I haven't gone through the details to see whether there are other such cases too, but given that it's not something we assert today I'm concerned there might be. Also, calling hasCompleteHistoryOperations is cheap and will save us from disaster, so let's keep it in.

I've adjusted the creation logic in 5fb8bda.
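
As a rough illustration of the adjusted creation logic (the method name is a hypothetical stand-in, not the 5fb8bda code), under the assumption that a lease retains all operations at or above its retaining sequence number:

// Hypothetical sketch: base the new replica's retention lease on the persisted global
// checkpoint rather than on the local checkpoint of the safe commit, so the lease is far
// more likely to be satisfiable on every in-sync copy.
static long retainingSeqNoForNewReplicaLease(long persistedGlobalCheckpoint) {
    return persistedGlobalCheckpoint + 1; // retain everything above the global checkpoint
}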

@DaveCTurner DaveCTurner left a comment

Thanks @dnhatn, comments addressed or responded to.

@DaveCTurner DaveCTurner requested a review from dnhatn July 30, 2019 09:15
@ywelsch ywelsch left a comment

I've left one question. Looking good otherwise.

dnhatn previously approved these changes Jul 30, 2019
@dnhatn dnhatn left a comment

LGTM. Thanks @DaveCTurner.

ywelsch previously approved these changes Jul 30, 2019
@ywelsch ywelsch left a comment

LGTM

@DaveCTurner
Contributor Author

The test failure is meaningful: https://scans.gradle.com/s/bcfrvej2xrypy/console-log?task=:x-pack:plugin:ccr:internalClusterTest

What happened is that there was ongoing indexing while a replica was recovering, which plays out like this:

  • we capture the GCP of the primary and use that in the cleanFiles() step when creating the new translog.

  • the GCP of the primary advances, as does its retention lease.

  • we clone the primary's (newly-advanced) retention lease for the replica.

  • the replica's GCP does not advance.

The replica's retention lease is now ahead of its GCP, and this trips an assertion at renewal time. Yannick and I discussed a couple of options:

  • weaken the assertion to ignore cases where the shard copy is not in sync. This makes it quite a bit weaker, but I can't see any terrible consequences of this.

  • create the retention lease slightly earlier, within phase 1, and base the GCP of the replica's new translog on the retained seqno of this new retention lease.

@dnhatn WDYT?
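
For reference, a sketch of the shape of the invariant that trips here; the real assertion lives elsewhere and is phrased differently, so treat these names as hypothetical.

// Sketch of the invariant only: a copy's peer recovery retention lease should not retain
// from beyond what that copy has persisted, so cloning the primary's freshly-advanced
// lease while the replica's GCP lags behind violates this at renewal time.
static boolean leaseConsistentWithCheckpoint(long leaseRetainingSeqNo, long persistedGlobalCheckpoint) {
    return leaseRetainingSeqNo <= persistedGlobalCheckpoint + 1;
}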

@DaveCTurner DaveCTurner dismissed stale reviews from dnhatn and ywelsch July 30, 2019 16:32

test failure will require some rework

@DaveCTurner DaveCTurner requested a review from dnhatn July 30, 2019 16:33
@dnhatn
Member

dnhatn commented Jul 30, 2019

@DaveCTurner Thanks for the ping. Both options work for me.

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/docs (failure looks unrelated and bogus)

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/1


@DaveCTurner DaveCTurner removed the WIP label Aug 1, 2019
@DaveCTurner
Contributor Author

Ok I'm happy with this and it's passed a bunch of runs through :server:test and :server:integtest overnight. Worth a final pass.

Changes since the last reviews:

  • d63e777 to go back to today's behaviour of using the GCP of the primary as the starting GCP of the replica, but adjusting this to be sampled after copying the files over and cloning the primary's lease so we can be sure that it's ahead of the leased checkpoint.

  • f47e56e to also create the lease when phase 1 is a no-op thanks to a synced flush marker (caught by BWC tests, but I added a proper test for it too).

@DaveCTurner DaveCTurner requested a review from ywelsch August 1, 2019 08:20
@ywelsch ywelsch left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 5322b00 into elastic:peer-recovery-retention-leases Aug 1, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-25-prrl-integrate-recovery branch August 1, 2019 12:19
DaveCTurner added a commit that referenced this pull request Aug 1, 2019